dataset
Data objects
dataset.split.method
Methods for splitting data into train, validation and test
split_method
def split_method(
datetimes: pd.DatetimeIndex,
train_test_validation_split: Tuple[int] = (3, 1, 1),
train_test_validation_specific: TrainValidationTestSpecific = (
default_train_test_validation_specific),
method: str = "modulo",
freq: str = "D",
seed: int = 1234
) -> (List[pd.Timestamp], List[pd.Timestamp], List[pd.Timestamp])
Split the data into train, test and (optionally) validation sets.
method: modulo
If the split is (3, 1, 1) then, taking all the days in the dataset:
- train data will have all days that are modulo 5 remainder 0, 1 or 2, i.e. the 1st, 2nd, 3rd, 6th, 7th, 8th, 11th, ... days of the whole dataset
- validation data will have all days that are modulo 5 remainder 3, i.e. the 4th, 9th, ...
- test data will have all days that are modulo 5 remainder 4, i.e. the 5th, 10th, ...
method: random
If the split is (3, 1, 1) then:
- train data will have 60% of the data
- validation data will have 20% of the data
- test data will have 20% of the data
Arguments:
datetimes
- list of datetimes
train_test_validation_split
- how the split is made
method
- which method to use. Can be "modulo" or "random"
freq
- the period the data is divided into before splitting. This can be D=day, W=week, M=month or Y=year
seed
- random seed used to permute the data for the "random" method
train_test_validation_specific
- pydantic class of 'train', 'validation' and 'test'. These specify which data goes into which dataset
Returns:
- train, validation and test datetimes
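The "modulo" method above can be sketched as follows, assuming daily frequency (the real split_method also supports `freq`, the "random" method and specific train/validation/test datetimes; `modulo_split` here is an illustrative name, not the library API):

```python
import numpy as np
import pandas as pd

def modulo_split(datetimes: pd.DatetimeIndex, split=(3, 1, 1)):
    """Assign each day to train/validation/test by its index modulo sum(split)."""
    remainders = np.arange(len(datetimes)) % sum(split)
    train = datetimes[remainders < split[0]]
    validation = datetimes[(remainders >= split[0]) & (remainders < split[0] + split[1])]
    test = datetimes[remainders >= split[0] + split[1]]
    return train, validation, test

days = pd.date_range("2021-01-01", periods=10, freq="D")
train, validation, test = modulo_split(days)
# train gets the 1st, 2nd, 3rd, 6th, 7th and 8th days; validation the 4th and
# 9th; test the 5th and 10th — matching the (3, 1, 1) pattern described above.
```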
split_by_dates
def split_by_dates(
datetimes: pd.DatetimeIndex, train_validation_datetime_split: pd.Timestamp,
validation_test_datetime_split: pd.Timestamp
) -> (List[pd.Timestamp], List[pd.Timestamp], List[pd.Timestamp])
Split datetimes into train, validation and test by two specific datetime splits
Note that the 'train_validation_datetime_split' should be less than the 'validation_test_datetime_split'
Arguments:
datetimes
- list of datetimes
train_validation_datetime_split
- the datetime which splits the train and validation datetimes. For example, if this is '2021-01-01' then the train datetimes will end by '2021-01-01' and the validation datetimes will start at '2021-01-01'.
validation_test_datetime_split
- the datetime which splits the validation and test datetimes
Returns:
- train, validation and test datetimes
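A minimal sketch of this date-based split, assuming the boundaries behave as in the docstring (train strictly before the first split, validation from the first split up to the second, test from the second onwards; the real implementation's boundary inclusivity may differ, and the function name here is illustrative):

```python
import pandas as pd

def split_by_dates_sketch(datetimes, train_validation_split, validation_test_split):
    # the first boundary must come before the second
    assert train_validation_split < validation_test_split
    train = datetimes[datetimes < train_validation_split]
    validation = datetimes[(datetimes >= train_validation_split)
                           & (datetimes < validation_test_split)]
    test = datetimes[datetimes >= validation_test_split]
    return train, validation, test

days = pd.date_range("2020-12-30", periods=6, freq="D")
train, validation, test = split_by_dates_sketch(
    days, pd.Timestamp("2021-01-01"), pd.Timestamp("2021-01-03")
)
```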
dataset.split.split
Function to split datasets up
SplitMethod Objects
class SplitMethod(Enum)
Different split methods
SplitName Objects
class SplitName(Enum)
The name for each data split.
split_data
def split_data(datetimes: Union[List[pd.Timestamp], pd.DatetimeIndex],
method: SplitMethod,
train_test_validation_split: Tuple[int, int, int] = (3, 1, 1),
train_test_validation_specific: TrainValidationTestSpecific = (
default_train_test_validation_specific),
train_validation_test_datetime_split: Optional[List[
pd.Timestamp]] = None,
seed: int = 1234) -> SplitDateTimes
Split the data using various methods
Arguments:
datetimes
- the datetimes to be split
method
- the method to be used
train_test_validation_split
- ratios of how the split is made
seed
- random seed used to permute the data for the 'random' method
train_test_validation_specific
- pydantic class of 'train', 'validation' and 'test'. These specify which data goes into which dataset.
train_validation_test_datetime_split
- split train, validation and test based on specific dates.
Returns:
- train, validation and test datetimes
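The "random" method can be sketched as permuting the datetimes with the given seed and cutting by the split ratio (illustrative names; the real split_data returns a SplitDateTimes object and dispatches on the SplitMethod enum):

```python
import numpy as np
import pandas as pd

def random_split(datetimes, split=(3, 1, 1), seed=1234):
    """Permute the datetimes reproducibly, then cut by the (3, 1, 1) ratio."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(datetimes))
    n_train = len(datetimes) * split[0] // sum(split)
    n_validation = len(datetimes) * split[1] // sum(split)
    # sort each slice so the returned datetimes stay in chronological order
    train = datetimes[np.sort(order[:n_train])]
    validation = datetimes[np.sort(order[n_train:n_train + n_validation])]
    test = datetimes[np.sort(order[n_train + n_validation:])]
    return train, validation, test

days = pd.date_range("2021-01-01", periods=10, freq="D")
train, validation, test = random_split(days)
# a (3, 1, 1) split of 10 days yields 6 train, 2 validation and 2 test days
```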
dataset.split
split functions
dataset.split.model
Model for splitting data
TrainValidationTestSpecific Objects
class TrainValidationTestSpecific(BaseModel)
Class on how to specifically split the data into train, validation and test.
train_validation_test
@validator("train")
def train_validation_test(cls, v, values)
Make sure there is no overlap for the train data
validation_overlap
@validator("validation")
def validation_overlap(cls, v, values)
Make sure there is no overlap for the validation data
test_overlap
@validator("test")
def test_overlap(cls, v, values)
Make sure there is no overlap for the test data
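The three validators above all enforce the same invariant: no period may appear in more than one split. A plain-Python sketch of that overlap check (the real class implements it with pydantic @validator hooks on each field):

```python
def check_no_overlap(train, validation, test):
    """Raise ValueError if any item appears in more than one split."""
    splits = {"train": set(train), "validation": set(validation), "test": set(test)}
    names = list(splits)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            overlap = splits[a] & splits[b]
            if overlap:
                raise ValueError(f"overlap between {a} and {b}: {sorted(overlap)}")

# disjoint splits pass silently; a shared item raises
check_no_overlap([2015, 2016, 2017], [2018], [2019])
```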
dataset.batch
batch functions
Batch Objects
class Batch(BaseModel)
Batch data object
Contains the following data sources - gsp, satellite, topographic, sun, pv, nwp and datetime.
All data sources are xr.Datasets
data_sources
@property
def data_sources()
The different data sources
fake
@staticmethod
def fake(configuration: Configuration,
temporally_align_examples: bool = False)
Make fake batch object
Arguments:
configuration
- configuration of dataset
temporally_align_examples
- option to align examples (within the batch) in time
Returns:
- batch object
save_netcdf
def save_netcdf(batch_i: int, path: Path)
Save batch to netcdf file
Arguments:
batch_i
- the batch id, used to make the filename
path
- the path where it will be saved. This can be local or in the cloud.
load_netcdf
@staticmethod
def load_netcdf(local_netcdf_path: Union[Path, str],
batch_idx: int,
data_sources_names: Optional[list[str]] = None) -> Batch
Load batch from netcdf file
download_batch_and_load_batch
@staticmethod
def download_batch_and_load_batch(
batch_idx,
tmp_path: str,
src_path: str,
data_sources_names: Optional[List[str]] = None) -> Batch
Download batch from src to temp
Arguments:
batch_idx
- which batch index to download and load
data_sources_names
- list of data source names
tmp_path
- the temporary path, where files are downloaded to
src_path
- the path where files are downloaded from
Returns:
- batch object
Example Objects
class Example(BaseModel)
Single Data item
Note that this is currently not really used
data_sources
@property
def data_sources()
The different data sources
join_two_batches
def join_two_batches(
batches: List[Batch],
data_sources_names: Optional[List[str]] = None,
first_batch_examples: Optional[List[int]] = None,
second_batch_examples: Optional[List[int]] = None) -> Batch
Join two batches
Arguments:
batches
- list of batches to be mixed
data_sources_names
- list of data source names
first_batch_examples
- list of indexes that we should use for the first batch
second_batch_examples
- list of indexes that we should use for the second batch
Returns:
- batch object, mixture of the two given
dataset.xr_utils
Useful functions for xarray objects
- joining data arrays to datasets
- pydantic extension model of xr.Dataset
join_list_dataset_to_batch_dataset
def join_list_dataset_to_batch_dataset(
datasets: list[xr.Dataset]) -> xr.Dataset
Join a list of datasets into a single batch dataset by expanding dims
convert_coordinates_to_indexes_for_list_datasets
def convert_coordinates_to_indexes_for_list_datasets(
examples: List[xr.Dataset]) -> List[xr.Dataset]
Set the coords to be indices before joining into a batch
convert_coordinates_to_indexes
def convert_coordinates_to_indexes(dataset: xr.Dataset) -> xr.Dataset
Reindex dims so that the dataset can be merged into a batch.
For each dimension in the dataset, change the coords to 0..len(original_coords),
append "_index" to the dimension name,
and save the original coordinates in original_dim_name.
This is useful to align multiple examples into a single batch.
PydanticXArrayDataSet Objects
class PydanticXArrayDataSet(xr.Dataset)
Pydantic Xarray Dataset Class
Adapted from https://pydantic-docs.helpmanual.io/usage/types/#classes-with-get_validators
model_validation
@classmethod
def model_validation(cls, v)
Specific model validation, to be overwritten by class
__get_validators__
@classmethod
def __get_validators__(cls)
Get validators
validate
@classmethod
def validate(cls, v: Any) -> Any
Do validation
validate_dims
@classmethod
def validate_dims(cls, v: Any) -> Any
Validate the dims
validate_coords
@classmethod
def validate_coords(cls, v: Any) -> Any
Validate the coords
validate_data_vars
@classmethod
def validate_data_vars(cls, v: Any) -> Any
Validate the data vars
convert_arrays_to_uint8
def convert_arrays_to_uint8(*arrays: tuple[np.ndarray]) -> tuple[np.ndarray]
Convert multiple arrays to uint8, using the same min and max to scale all arrays.
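A minimal sketch of shared-scale uint8 conversion, assuming linear min-max scaling to 0..255 (the real function's exact scaling and rounding may differ; the function name here is illustrative):

```python
import numpy as np

def to_uint8_shared_scale(*arrays):
    """Scale all arrays with one shared min/max, then cast to uint8."""
    lo = min(float(a.min()) for a in arrays)
    hi = max(float(a.max()) for a in arrays)
    scale = 255.0 / (hi - lo) if hi > lo else 0.0
    return tuple(((a - lo) * scale).astype(np.uint8) for a in arrays)

a, b = np.array([0.0, 5.0]), np.array([10.0])
ua, ub = to_uint8_shared_scale(a, b)
# both arrays are scaled with the shared minimum 0 and maximum 10, so values
# in different arrays remain directly comparable after conversion
```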