dataset

Data objects

dataset.xr_utils

Useful functions for xarray objects

  1. joining data arrays to datasets
  2. a pydantic extension model of xr.Dataset

join_list_dataset_to_batch_dataset

def join_list_dataset_to_batch_dataset(datasets: list[xr.Dataset]) -> xr.Dataset

Join a list of datasets into a single dataset by expanding dims
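
A minimal sketch of this batching pattern (the dimension name `example`, the helper name, and the use of `xr.concat` are assumptions based on the description, not the library's confirmed implementation):

```python
import xarray as xr

def join_to_batch_sketch(datasets: list) -> xr.Dataset:
    # Give each example its own coordinate along a new (assumed) "example"
    # dim, then concatenate the expanded datasets into one batch dataset.
    expanded = [ds.expand_dims(example=[i]) for i, ds in enumerate(datasets)]
    return xr.concat(expanded, dim="example")

# two single-example datasets become one batch with an "example" dim of size 2
examples = [xr.Dataset({"x": ("time", [1.0, 2.0])}) for _ in range(2)]
batch = join_to_batch_sketch(examples)
```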

convert_coordinates_to_indexes_for_list_datasets

def convert_coordinates_to_indexes_for_list_datasets(examples: List[xr.Dataset]) -> List[xr.Dataset]

Set the coords to be indices before joining the examples into a batch

convert_coordinates_to_indexes

def convert_coordinates_to_indexes(dataset: xr.Dataset) -> xr.Dataset

Reindex dims so that the dataset can be merged into a batch.

For each dimension in the dataset, change the coords to 0..len(original_coords) - 1 and append "_index" to the dimension name. The original coordinates are saved under the original dimension name.

This is useful to align multiple examples into a single batch.
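
A hypothetical re-implementation of this reindexing step (the helper name and the exact storage layout are assumptions based on the description above):

```python
import numpy as np
import xarray as xr

def reindex_dims_sketch(ds: xr.Dataset) -> xr.Dataset:
    # For each dim with coords: keep the original coords as a data variable
    # under the original dim name, rename the dim to "<dim>_index", and set
    # the new index coords to 0..len-1.
    for dim in list(ds.dims):
        if dim not in ds.coords:
            continue
        original = xr.DataArray(ds[dim].values, dims=[f"{dim}_index"])
        ds = ds.drop_vars(dim).rename({dim: f"{dim}_index"})
        ds[dim] = original  # original coords saved as a data variable
        ds = ds.assign_coords({f"{dim}_index": np.arange(ds.sizes[f"{dim}_index"])})
    return ds

# examples whose "time" coords differ can now be aligned by integer index
ds = xr.Dataset({"x": ("time", [1.0, 2.0])}, coords={"time": [10, 20]})
out = reindex_dims_sketch(ds)
```

Because every example ends up with identical 0-based index coordinates, they align cleanly when joined into a batch.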

PydanticXArrayDataSet Objects

class PydanticXArrayDataSet(xr.Dataset)

Pydantic Xarray Dataset Class

Adapted from https://pydantic-docs.helpmanual.io/usage/types/#classes-with-get_validators

model_validation

@classmethod
def model_validation(cls, v)

Specific model validation, to be overridden by subclasses

__get_validators__

@classmethod
def __get_validators__(cls)

Get validators

validate

@classmethod
def validate(cls, v: Any) -> Any

Do validation

validate_dims

@classmethod
def validate_dims(cls, v: Any) -> Any

Validate the dims

validate_coords

@classmethod
def validate_coords(cls, v: Any) -> Any

Validate the coords

validate_data_vars

@classmethod
def validate_data_vars(cls, v: Any) -> Any

Validate the data vars
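
The hook pattern above can be sketched as follows. Pydantic v1 calls `__get_validators__` to collect validators for a custom field type; the class name and the type check below are illustrative assumptions, not the library's exact code:

```python
from typing import Any

import xarray as xr

class PydanticDatasetSketch(xr.Dataset):
    # xr.Dataset declares __slots__, so subclasses must also declare them
    __slots__ = ()

    @classmethod
    def model_validation(cls, v: Any) -> Any:
        # subclasses override this with source-specific checks
        return v

    @classmethod
    def __get_validators__(cls):
        # pydantic v1 iterates this generator to collect field validators
        yield cls.validate

    @classmethod
    def validate(cls, v: Any) -> Any:
        # reject anything that is not an xr.Dataset, then run the
        # subclass-specific checks
        if not isinstance(v, xr.Dataset):
            raise TypeError("value must be an xr.Dataset")
        return cls.model_validation(v)
```

When such a class is used as a field type on a pydantic v1 `BaseModel`, pydantic runs `validate` on assignment, which in turn dispatches to the subclass's `model_validation`.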

dataset.batch

batch functions

Batch Objects

class Batch(BaseModel)

Batch data object

Contains the following data sources: gsp, satellite, topographic, sun, pv, nwp and datetime.

All data sources are xr.Datasets

data_sources

@property
def data_sources()

The different data sources

fake

@staticmethod
def fake(configuration: Configuration)

Make a fake batch object

save_netcdf

def save_netcdf(batch_i: int, path: Path)

Save batch to netcdf file

Arguments:

  • batch_i - the batch id, used to make the filename
  • path - the path where it will be saved. This can be local or in the cloud.

load_netcdf

@staticmethod
def load_netcdf(local_netcdf_path: Union[Path, str], batch_idx: int)

Load batch from netcdf file

Example Objects

class Example(BaseModel)

Single data item

Note that this class is not currently used.

data_sources

@property
def data_sources()

The different data sources

dataset.split

split functions

dataset.split.split

Function to split datasets up

SplitMethod Objects

class SplitMethod(Enum)

Different split methods

SplitName Objects

class SplitName(Enum)

The name for each data split.

split_data

def split_data(datetimes: Union[List[pd.Timestamp], pd.DatetimeIndex], method: SplitMethod, train_test_validation_split: Tuple[int, int, int] = (3, 1, 1), train_test_validation_specific: TrainValidationTestSpecific = default_train_test_validation_specific, train_validation_test_datetime_split: Optional[List[pd.Timestamp]] = None, seed: int = 1234) -> SplitDateTimes

Split the datetimes using various methods

Arguments:

  • datetimes - The datetimes to be split
  • method - the method to be used
  • train_test_validation_split - ratios of how the split is made
  • seed - random seed used to permute the data for the 'random' method
  • train_test_validation_specific - pydantic class with 'train', 'validation' and 'test' fields. These specify which data goes into which dataset.
  • train_validation_test_datetime_split - split train, validation based on specific dates.

  • Returns - train, validation and test datetimes

dataset.split.model

Model for splitting data

TrainValidationTestSpecific Objects

class TrainValidationTestSpecific(BaseModel)

Class on how to specifically split the data into train, validation and test.

train_validation_test

@validator("train")
def train_validation_test(cls, v, values)

Make sure there is no overlap for the train data

validation_overlap

@validator("validation")
def validation_overlap(cls, v, values)

Make sure there is no overlap for the validation data

test_overlap

@validator("test")
def test_overlap(cls, v, values)

Make sure there is no overlap for the test data

dataset.split.method

Methods for splitting data into train, validation and test

split_method

def split_method(datetimes: pd.DatetimeIndex, train_test_validation_split: Tuple[int] = (3, 1, 1), train_test_validation_specific: TrainValidationTestSpecific = default_train_test_validation_specific, method: str = "modulo", freq: str = "D", seed: int = 1234) -> (List[pd.Timestamp], List[pd.Timestamp], List[pd.Timestamp])

Split the data into train, test and (optionally) validation sets.

method: modulo

If the split is (3, 1, 1) then, taking all the days in the dataset:

  • train data will have all days that are modulo 5 remainder 0, 1 or 2, i.e. the 1st, 2nd, 3rd, 6th, 7th, 8th, 11th, ... days of the whole dataset
  • validation data will have all days that are modulo 5 remainder 3, i.e. the 4th, 9th, ...
  • test data will have all days that are modulo 5 remainder 4, i.e. the 5th, 10th, ...

method: random

If the split is (3, 1, 1) then:

  • train data will have 60% of the data
  • validation data will have 20% of the data
  • test data will have 20% of the data

Arguments:

  • datetimes - list of datetimes
  • train_test_validation_split - how the split is made
  • method - which method to use. Can be modulo or random
  • freq - the period the data is divided into before splitting: D=day, W=week, M=month or Y=year
  • seed - random seed used to permute the data for the 'random' method
  • train_test_validation_specific - pydantic class with 'train', 'validation' and 'test' fields. These specify which data goes into which dataset

  • Returns - train, validation and test datetimes
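
The modulo method described above can be sketched in plain pandas. This is an illustrative re-implementation for `freq="D"`, not the library's code, and the function name is an assumption:

```python
import pandas as pd

def modulo_split_sketch(datetimes: pd.DatetimeIndex, split=(3, 1, 1)):
    # Assign whole days round-robin: with (3, 1, 1), day positions modulo 5
    # land in train for remainders 0-2, validation for 3, test for 4.
    days = pd.Series(datetimes.floor("D").unique())
    remainder = pd.Series(range(len(days))) % sum(split)
    train_days = set(days[remainder < split[0]])
    validation_days = set(days[(remainder >= split[0]) & (remainder < split[0] + split[1])])
    test_days = set(days[remainder >= split[0] + split[1]])
    # map every datetime back to its day's split
    day_of = pd.Series(datetimes.floor("D"))
    train = datetimes[day_of.isin(train_days).values]
    validation = datetimes[day_of.isin(validation_days).values]
    test = datetimes[day_of.isin(test_days).values]
    return train, validation, test
```

Every datetime from the same day lands in the same split, which avoids leakage between adjacent timestamps.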

split_by_dates

def split_by_dates(datetimes: pd.DatetimeIndex, train_validation_datetime_split: pd.Timestamp, validation_test_datetime_split: pd.Timestamp) -> (List[pd.Timestamp], List[pd.Timestamp], List[pd.Timestamp])

Split datetimes into train, validation and test by two specific datetime splits

Note that 'train_validation_datetime_split' should be earlier than 'validation_test_datetime_split'

Arguments:

  • datetimes - list of datetimes
  • train_validation_datetime_split - the datetime which will split the train and validation datetimes. For example if this is '2021-01-01' then the train datetimes will end by '2021-01-01' and the validation datetimes will start at '2021-01-01'.
  • validation_test_datetime_split - the datetime which will split the validation and test datetimes

  • Returns - train, validation and test datetimes
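
A sketch of this date-based split in plain pandas (an illustrative re-implementation; in particular, putting each split timestamp into the later set is an assumption, the real boundary handling may differ):

```python
import pandas as pd

def split_by_dates_sketch(
    datetimes: pd.DatetimeIndex,
    train_validation_split: pd.Timestamp,
    validation_test_split: pd.Timestamp,
):
    # Train comes before the first split, validation sits between the two
    # splits, and test comes after the second split.
    train = datetimes[datetimes < train_validation_split]
    validation = datetimes[
        (datetimes >= train_validation_split) & (datetimes < validation_test_split)
    ]
    test = datetimes[datetimes >= validation_test_split]
    return train, validation, test

# e.g. ten days split at 2021-01-04 and 2021-01-08
dates = pd.date_range("2021-01-01", "2021-01-10")
train, validation, test = split_by_dates_sketch(
    dates, pd.Timestamp("2021-01-04"), pd.Timestamp("2021-01-08")
)
```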