dataset
Data objects
dataset.xr_utils
Useful functions for xarray objects
- joining data arrays to datasets
- a pydantic extension model of xr.Dataset
join_list_data_array_to_batch_dataset
def join_list_data_array_to_batch_dataset(data_arrays: List[xr.DataArray]) -> xr.Dataset
Join a list of xr.DataArrays into an xr.Dataset by concatenating on the example dim.
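A minimal sketch of what this join does, using plain xarray (the variable name `data` and the `example` dimension name are assumptions for illustration):

```python
import numpy as np
import xarray as xr

# Three per-example DataArrays with identical dims.
arrays = [
    xr.DataArray(np.random.rand(2, 2), dims=("x", "y"), name="data")
    for _ in range(3)
]

# Concatenating along a new "example" dim stacks the arrays into a batch;
# promoting to a Dataset yields one data variable named after the array.
batch = xr.concat(arrays, dim="example").to_dataset()
```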
join_list_dataset_to_batch_dataset
def join_list_dataset_to_batch_dataset(datasets: list[xr.Dataset]) -> xr.Dataset
Join a list of xr.Datasets into a single dataset by expanding dims
convert_data_array_to_dataset
def convert_data_array_to_dataset(data_xarray: xr.DataArray) -> xr.Dataset
Convert a data array to a dataset, reindexing the dims so that it can be merged into a batch
make_dim_index
def make_dim_index(dataset: xr.Dataset) -> xr.Dataset
Reindex dims so that it can be merged with batch.
For each dimension in the dataset, change the coords to 0..len(original_coords),
append "_index" to the dimension name, and save the original coordinates in
original_dim_name.
This is useful to align multiple examples into a single batch.
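A hedged sketch of the reindexing described above, for a single `time` dimension (the exact naming in the real function may differ):

```python
import numpy as np
import pandas as pd
import xarray as xr

# A small dataset with real datetime coordinates.
ds = xr.Dataset(
    {"power": (("time",), np.arange(3.0))},
    coords={"time": pd.date_range("2021-01-01", periods=3, freq="h")},
)

# Replace the coords with 0..N-1, rename the dim with an "_index"
# suffix, and keep the original coordinates under the original name.
original_coords = ds["time"].values
ds = ds.assign_coords(time=np.arange(len(original_coords)))
ds = ds.rename({"time": "time_index"})
ds["time"] = ("time_index", original_coords)  # original coords preserved
```

After this, two examples with different timestamps share the integer `time_index` coordinate and can be concatenated into one batch.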
PydanticXArrayDataSet Objects
class PydanticXArrayDataSet(xr.Dataset)
Pydantic Xarray Dataset Class
Adapted from https://pydantic-docs.helpmanual.io/usage/types/#classes-with-get_validators
model_validation
@classmethod
def model_validation(cls, v)
Specific model validation, to be overridden by subclasses
__get_validators__
@classmethod
def __get_validators__(cls)
Get validators
validate
@classmethod
def validate(cls, v: Any) -> Any
Do validation
validate_dims
@classmethod
def validate_dims(cls, v: Any) -> Any
Validate the dims
validate_coords
@classmethod
def validate_coords(cls, v: Any) -> Any
Validate the coords
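The class follows the pydantic v1 custom-type pattern linked above. A sketch of that pattern (the class name and validator bodies here are illustrative placeholders, not the library's real checks):

```python
import xarray as xr

class MyXArrayDataSet(xr.Dataset):
    __slots__ = ()  # xarray subclasses must declare __slots__

    @classmethod
    def model_validation(cls, v):
        # Placeholder: subclasses override this with source-specific checks.
        return v

    @classmethod
    def __get_validators__(cls):
        # pydantic v1 calls this to collect the validators to run in order.
        yield cls.validate

    @classmethod
    def validate(cls, v):
        # Run the subclass-specific validation before accepting the value.
        return cls.model_validation(v)
```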
dataset.batch
batch functions
Batch Objects
class Batch(BaseModel)
Batch data object
Contains the following data sources: gsp, satellite, topographic, sun, pv, nwp and datetime. Also contains metadata of the class.
All data sources are xr.Datasets
data_sources
@property
def data_sources()
The different data sources
fake
@staticmethod
def fake(configuration: Configuration)
Make fake batch object
save_netcdf
def save_netcdf(batch_i: int, path: Path)
Save batch to netcdf file
Arguments:
batch_i
- the batch id, used to make the filename
path
- the path where it will be saved. This can be local or in the cloud.
load_netcdf
@staticmethod
def load_netcdf(local_netcdf_path: Union[Path, str], batch_idx: int)
Load batch from netcdf file
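A plain-xarray illustration of the save/load round trip these two methods wrap; the zero-padded `"{batch_i}.nc"` filename is an assumption, not necessarily the library's real naming scheme:

```python
import tempfile
from pathlib import Path

import numpy as np
import xarray as xr

batch_i = 0
ds = xr.Dataset({"gsp": (("example",), np.arange(4.0))})

with tempfile.TemporaryDirectory() as tmp:
    # Assumed filename pattern: batch id zero-padded, e.g. "000000.nc".
    path = Path(tmp) / f"{batch_i:06d}.nc"
    ds.to_netcdf(path)                     # save_netcdf does something similar
    loaded = xr.open_dataset(path).load()  # load_netcdf does the reverse
```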
Example Objects
class Example(BaseModel)
Single Data item
Note that this class is currently not used
data_sources
@property
def data_sources()
The different data sources
dataset.split
split functions
dataset.split.split
Function to split datasets up
SplitMethod Objects
class SplitMethod(Enum)
Different split methods
SplitName Objects
class SplitName(Enum)
The name for each data split.
split_data
def split_data(datetimes: Union[List[pd.Timestamp], pd.DatetimeIndex], method: SplitMethod, train_test_validation_split: Tuple[int] = (3, 1, 1), train_test_validation_specific: TrainValidationTestSpecific = default_train_test_validation_specific, train_validation_test_datetime_split: Optional[List[pd.Timestamp]] = None, seed: int = 1234) -> SplitDateTimes
Split the datetimes using various methods
Arguments:
datetimes
- the datetimes to be split
method
- the method to be used
train_test_validation_split
- ratios of how the split is made
seed
- random seed used to permutate the data for the 'random' method
train_test_validation_specific
- pydantic class of 'train', 'validation' and 'test'. These specify which data goes into which dataset.
train_validation_test_datetime_split
- split train, validation and test based on specific dates
Returns
- train, validation and test datetimes
dataset.split.model
Model for splitting data
TrainValidationTestSpecific Objects
class TrainValidationTestSpecific(BaseModel)
Class on how to specifically split the data into train, validation and test.
train_validation_test
@validator("train")
def train_validation_test(cls, v, values)
Make sure there is no overlap for the train data
validation_overlap
@validator("validation")
def validation_overlap(cls, v, values)
Make sure there is no overlap for the validation data
test_overlap
@validator("test")
def test_overlap(cls, v, values)
Make sure there is no overlap for the test data
dataset.split.method
Methods for splitting data into train, validation and test
split_method
def split_method(datetimes: pd.DatetimeIndex, train_test_validation_split: Tuple[int] = (3, 1, 1), train_test_validation_specific: TrainValidationTestSpecific = default_train_test_validation_specific, method: str = "modulo", freq: str = "D", seed: int = 1234) -> (List[pd.Timestamp], List[pd.Timestamp], List[pd.Timestamp])
Split the data into train, test and (optionally) validation sets.
method: modulo
If the split is (3, 1, 1) then, taking all the days in the dataset:
- train data will have all days that are modulo 5 remainder 0, 1 or 2, i.e. the 1st, 2nd, 3rd, 6th, 7th, 8th, 11th, ... of the whole dataset
- validation data will have all days that are modulo 5 remainder 3, i.e. the 4th, 9th, ...
- test data will have all days that are modulo 5 remainder 4, i.e. the 5th, 10th, ...
method: random
If the split is (3, 1, 1) then:
- train data will have 60% of the data
- validation data will have 20% of the data
- test data will have 20% of the data
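The modulo rule above can be sketched as follows; this is a simplified illustration over individual days, not the library's implementation (which also groups by the `freq` period):

```python
import pandas as pd

# Ten consecutive days, split (3, 1, 1) -> period of 5.
datetimes = pd.date_range("2021-01-01", "2021-01-10", freq="D")
train_n, validation_n, test_n = 3, 1, 1
total = train_n + validation_n + test_n  # 5

# Day i goes to train if i % 5 in {0, 1, 2}, validation if 3, test if 4.
train = [d for i, d in enumerate(datetimes) if i % total < train_n]
validation = [
    d for i, d in enumerate(datetimes)
    if train_n <= i % total < train_n + validation_n
]
test = [d for i, d in enumerate(datetimes) if i % total >= train_n + validation_n]
```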
Arguments:
datetimes
- list of datetimes
train_test_validation_split
- how the split is made
method
- which method to use. Can be modulo or random
freq
- the period the data is divided up by. This can be D=day, W=week, M=month or Y=year.
seed
- random seed used to permutate the data for the 'random' method
train_test_validation_specific
- pydantic class of 'train', 'validation' and 'test'. These specify which data goes into which dataset.
Returns
- train, validation and test datetimes
split_by_dates
def split_by_dates(datetimes: pd.DatetimeIndex, train_validation_datetime_split: pd.Timestamp, validation_test_datetime_split: pd.Timestamp) -> (List[pd.Timestamp], List[pd.Timestamp], List[pd.Timestamp])
Split datetimes into train, validation and test by two specific datetime splits
Note that the 'train_validation_datetime_split' should be less than the 'validation_test_datetime_split'
Arguments:
datetimes
- list of datetimes
train_validation_datetime_split
- the datetime which splits the train and validation datetimes. For example, if this is '2021-01-01' then the train datetimes will end by '2021-01-01' and the validation datetimes will start at '2021-01-01'.
validation_test_datetime_split
- the datetime which splits the validation and test datetimes
Returns
- train, validation and test datetimes
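A sketch of this date-boundary split; the boundary handling (the split datetime itself going to the later set) is an assumption taken from the example in the docstring above:

```python
import pandas as pd

datetimes = pd.date_range("2020-12-30", "2021-01-04", freq="D")
train_validation_datetime_split = pd.Timestamp("2021-01-01")
validation_test_datetime_split = pd.Timestamp("2021-01-03")

# Train: strictly before the first boundary.
train = [d for d in datetimes if d < train_validation_datetime_split]
# Validation: from the first boundary up to (not including) the second.
validation = [
    d for d in datetimes
    if train_validation_datetime_split <= d < validation_test_datetime_split
]
# Test: at or after the second boundary.
test = [d for d in datetimes if d >= validation_test_datetime_split]
```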