dataset

Data objects

dataset.split.method

Methods for splitting data into train, validation and test

split_method

def split_method(
    datetimes: pd.DatetimeIndex,
    train_test_validation_split: Tuple[int, int, int] = (3, 1, 1),
    train_test_validation_specific: TrainValidationTestSpecific = (
        default_train_test_validation_specific),
    method: str = "modulo",
    freq: str = "D",
    seed: int = 1234
) -> Tuple[List[pd.Timestamp], List[pd.Timestamp], List[pd.Timestamp]]

Split the data into train, test and (optionally) validation sets.

method: modulo

If the split is (3, 1, 1) then, taking all the days in the dataset:

  • train data will have all days that are modulo 5 remainder 0, 1 or 2, i.e. the 1st, 2nd, 3rd, 6th, 7th, 8th, 11th, ... days of the whole dataset
  • validation data will have all days that are modulo 5 remainder 3, i.e. the 4th, 9th, ...
  • test data will have all days that are modulo 5 remainder 4, i.e. the 5th, 10th, ...

method: random

If the split is (3, 1, 1) then:

  • train data will have 60% of the data
  • validation data will have 20% of the data
  • test data will have 20% of the data

Arguments:

  • datetimes - list of datetimes
  • train_test_validation_split - ratios of how the split is made
  • method - which method to use. Can be 'modulo' or 'random'
  • freq - the period the data is divided into before splitting. Can be D (day), W (week), M (month) or Y (year)
  • seed - random seed used to permute the data for the 'random' method
  • train_test_validation_specific - pydantic class with 'train', 'validation' and 'test' fields. These specify which data goes into which dataset

  • Returns - train, validation and test datetimes
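
As a concrete illustration of the 'modulo' method with the default (3, 1, 1) split and freq="D", here is a minimal sketch in plain pandas rather than a call into the library itself:

import pandas as pd

datetimes = pd.date_range("2021-01-01", "2021-01-20", freq="D")
days = datetimes.floor("D").unique()

train, validation, test = [], [], []
for i, day in enumerate(days):
    remainder = i % 5          # 5 = 3 + 1 + 1
    if remainder < 3:          # remainders 0, 1, 2 -> train
        train.append(day)
    elif remainder == 3:       # remainder 3 -> validation
        validation.append(day)
    else:                      # remainder 4 -> test
        test.append(day)

print(len(train), len(validation), len(test))  # 12 4 4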

split_by_dates

def split_by_dates(
    datetimes: pd.DatetimeIndex, train_validation_datetime_split: pd.Timestamp,
    validation_test_datetime_split: pd.Timestamp
) -> Tuple[List[pd.Timestamp], List[pd.Timestamp], List[pd.Timestamp]]

Split datetimes into train, validation and test by two specific datetime splits

Note that 'train_validation_datetime_split' must be earlier than 'validation_test_datetime_split'

Arguments:

  • datetimes - list of datetimes
  • train_validation_datetime_split - the datetime which splits the train and validation datetimes. For example, if this is '2021-01-01' then the train datetimes will end at '2021-01-01' and the validation datetimes will start at '2021-01-01'.
  • validation_test_datetime_split - the datetime which will split the validation and test datetimes

  • Returns - train, validation and test datetimes
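
A sketch of the same idea in plain pandas; the exact boundary inclusivity in the library may differ slightly from the masks used here:

import pandas as pd

datetimes = pd.date_range("2020-01-01", "2021-12-31", freq="D")
train_validation_split = pd.Timestamp("2021-01-01")
validation_test_split = pd.Timestamp("2021-07-01")

train = datetimes[datetimes <= train_validation_split]
validation = datetimes[
    (datetimes > train_validation_split) & (datetimes <= validation_test_split)
]
test = datetimes[datetimes > validation_test_split]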

dataset.split.split

Function to split datasets up

SplitMethod Objects

class SplitMethod(Enum)

Different split methods

SplitName Objects

class SplitName(Enum)

The name for each data split.

split_data

def split_data(datetimes: Union[List[pd.Timestamp], pd.DatetimeIndex],
               method: SplitMethod,
               train_test_validation_split: Tuple[int, int, int] = (3, 1, 1),
               train_test_validation_specific: TrainValidationTestSpecific = (
                   default_train_test_validation_specific),
               train_validation_test_datetime_split: Optional[List[
                   pd.Timestamp]] = None,
               seed: int = 1234) -> SplitDateTimes

Split the datetimes using various methods

Arguments:

  • datetimes - The datetimes to be split
  • method - the method to be used
  • train_test_validation_split - ratios of how the split is made
  • seed - random seed used to permute the data for the 'random' method
  • train_test_validation_specific - pydantic class with 'train', 'validation' and 'test' fields. These specify which data goes into which dataset.
  • train_validation_test_datetime_split - split train, validation and test based on specific datetimes.

  • Returns - train, validation and test datetimes
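
For the 'random' method, the behaviour described above can be sketched in plain numpy/pandas; the library's actual shuffling and its SplitDateTimes return type may differ:

import numpy as np
import pandas as pd

datetimes = pd.date_range("2021-01-01", "2021-12-31", freq="D")
days = datetimes.floor("D").unique()

rng = np.random.default_rng(1234)            # the 'seed' argument
order = rng.permutation(len(days))

n_train = int(len(days) * 3 / 5)             # 60%
n_validation = int(len(days) * 1 / 5)        # 20%
train = days[order[:n_train]]
validation = days[order[n_train:n_train + n_validation]]
test = days[order[n_train + n_validation:]]  # remaining ~20%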

dataset.split

split functions

dataset.split.model

Model for splitting data

TrainValidationTestSpecific Objects

class TrainValidationTestSpecific(BaseModel)

Class that specifies exactly which data should go into the train, validation and test sets.

train_validation_test

@validator("train")
def train_validation_test(cls, v, values)

Make sure there is no overlap for the train data

validation_overlap

@validator("validation")
def validation_overlap(cls, v, values)

Make sure there is no overlap for the validation data

test_overlap

@validator("test")
def test_overlap(cls, v, values)

Make sure there is no overlap for the test data
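
The three validators above follow pydantic v1's @validator pattern. A minimal sketch of such an overlap check (the field types here are assumptions; only the validator signature matches the docs above):

from typing import List
from pydantic import BaseModel, validator

class TrainValidationTestSpecificSketch(BaseModel):
    train: List[str] = []
    validation: List[str] = []
    test: List[str] = []

    @validator("validation")
    def validation_overlap(cls, v, values):
        # reject any item that already appears in the train data
        if set(v) & set(values.get("train", [])):
            raise ValueError("validation data overlaps with train data")
        return v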

dataset.batch

batch functions

Batch Objects

class Batch(BaseModel)

Batch data object

Contains the following data sources: gsp, satellite, topographic, sun, pv, nwp and datetime.

All data sources are xr.Datasets

data_sources

@property
def data_sources()

The different data sources

fake

@staticmethod
def fake(configuration: Configuration,
         temporally_align_examples: bool = False)

Make fake batch object

Arguments:

  • configuration - configuration of dataset
  • temporally_align_examples - option to align examples (within the batch) in time

  • Returns - batch object

save_netcdf

def save_netcdf(batch_i: int, path: Path)

Save batch to netcdf file

Arguments:

  • batch_i - the batch id, used to make the filename
  • path - the path where it will be saved. This can be local or in the cloud.
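
The on-disk layout is not spelled out here; a plausible sketch of the pattern, saving each data source's xr.Dataset under its own directory with a zero-padded filename (both assumptions), would be:

import os
import xarray as xr

def save_batch_sketch(data_sources: dict, batch_i: int, path: str) -> None:
    # hypothetical layout: <path>/<source_name>/<batch_i>.nc
    for name, dataset in data_sources.items():
        os.makedirs(os.path.join(path, name), exist_ok=True)
        dataset.to_netcdf(os.path.join(path, name, f"{batch_i:06}.nc"))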

load_netcdf

@staticmethod
def load_netcdf(local_netcdf_path: Union[Path, str],
                batch_idx: int,
                data_sources_names: Optional[list[str]] = None) -> Batch

Load batch from netcdf file

download_batch_and_load_batch

@staticmethod
def download_batch_and_load_batch(
        batch_idx,
        tmp_path: str,
        src_path: str,
        data_sources_names: Optional[List[str]] = None) -> Batch

Download a batch from src_path to tmp_path, then load it

Arguments:

  • batch_idx - which batch index to download and load
  • data_sources_names - list of data source names
  • tmp_path - the temporary path, where files are downloaded to
  • src_path - the path where files are downloaded from

  • Returns - batch object

Example Objects

class Example(BaseModel)

Single Data item

Note that this is currently not really used

data_sources

@property
def data_sources()

The different data sources

join_two_batches

def join_two_batches(
        batches: List[Batch],
        data_sources_names: Optional[List[str]] = None,
        first_batch_examples: Optional[List[int]] = None,
        second_batch_examples: Optional[List[int]] = None) -> Batch

Join two batches

Arguments:

  • batches - list of batches to be mixed
  • data_sources_names - list of data source names
  • first_batch_examples - list of indexes that we should use for the first batch
  • second_batch_examples - list of indexes that we should use for the second batch

  • Returns - batch object, mixture of two given
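
Assuming each batch holds xr.Datasets with an 'example' dimension, the mixing described above can be sketched with xarray's isel and concat:

import numpy as np
import xarray as xr

batch_a = xr.Dataset({"power": (("example", "time_index"), np.zeros((4, 2)))})
batch_b = xr.Dataset({"power": (("example", "time_index"), np.ones((4, 2)))})

mixed = xr.concat(
    [batch_a.isel(example=[0, 1]), batch_b.isel(example=[2, 3])],
    dim="example",
)
assert mixed.sizes["example"] == 4  # two examples from each batch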

dataset.xr_utils

Useful functions for xarray objects

  1. joining data arrays to datasets
  2. pydantic extension model of xr.Dataset

join_list_dataset_to_batch_dataset

def join_list_dataset_to_batch_dataset(
        datasets: list[xr.Dataset]) -> xr.Dataset

Join a list of datasets into a single batch dataset by expanding dims
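
A minimal sketch of the expand-dims idea in plain xarray (the real function may name the new dimension differently):

import xarray as xr

examples = [
    xr.Dataset({"power": ("time_index", [1.0, 2.0])}),
    xr.Dataset({"power": ("time_index", [3.0, 4.0])}),
]
batch = xr.concat(
    [ds.expand_dims("example") for ds in examples], dim="example"
)
# batch["power"] now has dims ("example", "time_index") and shape (2, 2)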

convert_coordinates_to_indexes_for_list_datasets

def convert_coordinates_to_indexes_for_list_datasets(
        examples: List[xr.Dataset]) -> List[xr.Dataset]

Set the coords to be indices before joining into a batch

convert_coordinates_to_indexes

def convert_coordinates_to_indexes(dataset: xr.Dataset) -> xr.Dataset

Reindex dims so that the dataset can be merged into a batch.

For each dimension in the dataset, change the coords to 0..len(original_coords)-1 and append "_index" to the dimension name. The original coordinates are saved under the original dimension name.

This is useful to align multiple examples into a single batch.
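
A sketch of that reindexing in plain xarray (details such as where the original coordinates end up are assumptions based on the description above):

import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset(
    {"power": ("time", [1.0, 2.0, 3.0])},
    coords={"time": pd.date_range("2021-01-01", periods=3, freq="D")},
)

for dim in list(ds.dims):
    original = ds[dim].values
    ds = ds.rename({dim: f"{dim}_index"})        # time -> time_index
    ds = ds.assign_coords({f"{dim}_index": np.arange(len(original))})
    ds[dim] = (f"{dim}_index", original)         # keep the original coords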

PydanticXArrayDataSet Objects

class PydanticXArrayDataSet(xr.Dataset)

Pydantic Xarray Dataset Class

Adapted from https://pydantic-docs.helpmanual.io/usage/types/#classes-with-get_validators
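
The linked pydantic custom-type pattern looks roughly like the following sketch; the real class additionally validates dims, coords and data_vars:

from typing import Any
import xarray as xr

class PydanticXArrayDataSetSketch(xr.Dataset):
    __slots__ = ()  # xr.Dataset subclasses should declare empty __slots__

    @classmethod
    def __get_validators__(cls):
        # pydantic v1 collects the validators yielded here
        yield cls.validate

    @classmethod
    def validate(cls, v: Any) -> Any:
        if not isinstance(v, xr.Dataset):
            raise TypeError("expected an xr.Dataset")
        return v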

model_validation

@classmethod
def model_validation(cls, v)

Specific model validation, to be overwritten by class

__get_validators__

@classmethod
def __get_validators__(cls)

Get validators

validate

@classmethod
def validate(cls, v: Any) -> Any

Do validation

validate_dims

@classmethod
def validate_dims(cls, v: Any) -> Any

Validate the dims

validate_coords

@classmethod
def validate_coords(cls, v: Any) -> Any

Validate the coords

validate_data_vars

@classmethod
def validate_data_vars(cls, v: Any) -> Any

Validate the data vars

convert_arrays_to_uint8

def convert_arrays_to_uint8(*arrays: np.ndarray) -> tuple[np.ndarray, ...]

Convert multiple arrays to uint8, using the same min and max to scale all arrays.
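
A sketch of that shared-scale conversion (a hypothetical stand-in, not the library function itself):

import numpy as np

def to_uint8_shared_scale(*arrays: np.ndarray) -> tuple:
    # one min/max across all arrays so relative magnitudes are preserved
    lo = min(float(a.min()) for a in arrays)
    hi = max(float(a.max()) for a in arrays)
    scale = 255.0 / (hi - lo) if hi > lo else 0.0
    return tuple(((a - lo) * scale).astype(np.uint8) for a in arrays)

a, b = np.array([0.0, 1.0]), np.array([2.0, 4.0])
print(to_uint8_shared_scale(a, b))  # (array([0, 63], ...), array([127, 255], ...))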