dataset

Data objects

dataset.datamodule

Data Modules

NowcastingDataModule Objects

@dataclass
class NowcastingDataModule(pl.LightningDataModule)

Nowcasting Data Module, used to make batches

Attributes (additional to the dataclass attributes):

  • pv_data_source: PVDataSource
  • sat_data_source: SatelliteDataSource
  • data_sources: List[DataSource]
  • train_t0_datetimes: pd.DatetimeIndex
  • val_t0_datetimes: pd.DatetimeIndex

__post_init__

def __post_init__()

Post Init

prepare_data

def prepare_data() -> None

Prepare all datasources

setup

def setup(stage="fit")

Split data, etc.

Arguments:

  • stage - one of {'fit', 'predict', 'test', 'validate'}. This argument is currently ignored.

## Selecting daytime data.

We're interested in forecasting solar power generation, so we don't care about nighttime data :)

In the UK in summer, the sun rises first in the north east, and sets last in the north west [1]. In summer, the north gets more hours of sunshine per day.

In the UK in winter, the sun rises first in the south east, and sets last in the south west [2]. In winter, the south gets more hours of sunshine per day.

|                        | Summer | Winter |
| ---------------------- | ------ | ------ |
| Sun rises first in     | N.E.   | S.E.   |
| Sun sets last in       | N.W.   | S.W.   |
| Most hours of sunlight | North  | South  |

Before training, we select timesteps which have at least some sunlight. We do this by computing the clearsky global horizontal irradiance (GHI) for the four corners of the satellite imagery, and for all the timesteps in the dataset. We only use timesteps where the maximum global horizontal irradiance across all four corners is above some threshold.

The 'clearsky solar irradiance' is the amount of sunlight we'd expect on a clear day at a specific time and location. The SI unit of irradiance is watts per square metre. The 'global horizontal irradiance' (GHI) is the total sunlight that would hit a horizontal surface at the Earth's surface. The GHI is the sum of the direct irradiance (sunlight which takes a direct path from the Sun to the Earth's surface) and the diffuse horizontal irradiance (sunlight scattered by the atmosphere). For more info, see: https://en.wikipedia.org/wiki/Solar_irradiance
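
As a rough illustration of this daytime filter, the sketch below computes clearsky GHI with pvlib for four example corner locations and keeps only timesteps whose maximum GHI exceeds a threshold. The corner coordinates, threshold, frequency and the 'haurwitz' clearsky model are assumptions made for the example, not the values used by this code.

```python
import pandas as pd
from pvlib.location import Location

# Illustrative 5-minutely timesteps for one summer day (UTC).
timesteps = pd.date_range("2021-06-01", "2021-06-02", freq="5min", tz="UTC")

# Approximate corners of the satellite imagery as (latitude, longitude) - assumed values.
corners = [(49.0, -6.0), (49.0, 2.0), (61.0, -6.0), (61.0, 2.0)]

GHI_THRESHOLD = 10  # W/m^2 - illustrative threshold

# Clearsky GHI at each corner for every timestep. The simple 'haurwitz' model is
# used here because it needs only the solar position.
ghi = pd.DataFrame(
    {
        i: Location(lat, lon).get_clearsky(timesteps, model="haurwitz")["ghi"]
        for i, (lat, lon) in enumerate(corners)
    }
)

# Keep timesteps where the maximum clearsky GHI across the four corners is above the threshold.
daytime_timesteps = timesteps[(ghi.max(axis=1) > GHI_THRESHOLD).to_numpy()]
```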

References:

  1. Video of June 2019
  2. Video of Jan 2019

train_dataloader

def train_dataloader() -> torch.utils.data.DataLoader

Train dataloader

val_dataloader

def val_dataloader() -> torch.utils.data.DataLoader

Validation dataloader

test_dataloader

def test_dataloader() -> torch.utils.data.DataLoader

Test dataloader

dataset.split

split functions

dataset.split.method

Methods for splitting data into train, validation and test

split_method

def split_method(datetimes: pd.DatetimeIndex, train_test_validation_split: Tuple[int] = (3, 1, 1), train_test_validation_specific: TrainValidationTestSpecific = default_train_test_validation_specific, method: str = "modulo", freq: str = "D", seed: int = 1234) -> (List[pd.Timestamp], List[pd.Timestamp], List[pd.Timestamp])

Split the data into train, test and (optionally) validation sets.

method: modulo

If the split is (3, 1, 1) then, taking all the days in the dataset:

  • train data will have all days that are modulo 5 remainder 0, 1 or 2, i.e. the 1st, 2nd, 3rd, 6th, 7th, 8th, 11th, ... days of the whole dataset
  • validation data will have all days that are modulo 5 remainder 3, i.e. the 4th, 9th, ... days
  • test data will have all days that are modulo 5 remainder 4, i.e. the 5th, 10th, ... days

method: random

If the split is (3, 1, 1) then:

  • train data will have 60% of the data
  • validation data will have 20% of the data
  • test data will have 20% of the data
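
As a minimal sketch of the 'modulo' method described above (illustrative only, not the actual split_method implementation), splitting whole days with a (3, 1, 1) ratio looks roughly like this:

```python
import numpy as np
import pandas as pd

days = pd.date_range("2021-01-01", periods=20, freq="D")
train_n, validation_n, test_n = 3, 1, 1
total = train_n + validation_n + test_n  # 5

remainder = np.arange(len(days)) % total
train_days = days[remainder < train_n]                                                  # remainders 0, 1, 2
validation_days = days[(remainder >= train_n) & (remainder < train_n + validation_n)]   # remainder 3
test_days = days[remainder >= train_n + validation_n]                                   # remainder 4
```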

Arguments:

  • datetimes - list of datetimes
  • train_test_validation_split - how the split is made
  • method - which method to use. Can be modulo or random
  • freq - the period the data is divided into before splitting. Can be D (day), W (week), M (month) or Y (year).
  • seed - random seed used to permute the data for the 'random' method
  • train_test_validation_specific - pydantic class with 'train', 'validation' and 'test' fields. These specify which data goes into which dataset.

  • Returns - train, validation and test datetimes

split_by_dates

def split_by_dates(datetimes: pd.DatetimeIndex, train_validation_datetime_split: pd.Timestamp, validation_test_datetime_split: pd.Timestamp) -> (List[pd.Timestamp], List[pd.Timestamp], List[pd.Timestamp])

Split datetimes into train, validation and test by two specific datetime splits

Note that the 'train_validation_datetime_split' should be less than the 'validation_test_datetime_split'

Arguments:

  • datetimes - list of datetimes
  • train_validation_datetime_split - the datetime which will split the train and validation datetimes. For example if this is '2021-01-01' then the train datetimes will end by '2021-01-01' and the validation datetimes will start at '2021-01-01'.
  • validation_test_datetime_split - the datetime which will split the validation and test datetimes

  • Returns - train, validation and test datetimes
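
A minimal sketch of the idea (illustrative only; exact boundary handling at the split datetimes may differ from the actual implementation):

```python
import pandas as pd

datetimes = pd.date_range("2020-01-01", "2021-06-30", freq="D")
train_validation_datetime_split = pd.Timestamp("2021-01-01")
validation_test_datetime_split = pd.Timestamp("2021-04-01")

train = datetimes[datetimes < train_validation_datetime_split]
validation = datetimes[
    (datetimes >= train_validation_datetime_split) & (datetimes < validation_test_datetime_split)
]
test = datetimes[datetimes >= validation_test_datetime_split]
```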

dataset.split.model

Model for splitting data

TrainValidationTestSpecific Objects

class TrainValidationTestSpecific(BaseModel)

Class specifying how to split the data into train, validation and test.

train_validation_test

@validator("train")
def train_validation_test(cls, v, values)

Make sure there is no overlap for the train data

validation_overlap

@validator("validation")
def validation_overlap(cls, v, values)

Make sure there is no overlap for the validation data

test_overlap

@validator("test")
def test_overlap(cls, v, values)

Make sure there is no overlap for the test data

dataset.split.split

Function to split datasets up

SplitMethod Objects

class SplitMethod(Enum)

Different split methods

split_data

def split_data(datetimes: Union[List[pd.Timestamp], pd.DatetimeIndex], method: SplitMethod, train_test_validation_split: Tuple[int] = (3, 1, 1), train_test_validation_specific: TrainValidationTestSpecific = default_train_test_validation_specific, train_validation_test_datetime_split: Optional[List[pd.Timestamp]] = None, seed: int = 1234) -> (List[pd.Timestamp], List[pd.Timestamp], List[pd.Timestamp])

Split the data using various methods

Arguments:

  • datetimes - The datetimes to be split
  • method - the method to be used
  • train_test_validation_split - ratios of how the split is made
  • seed - random seed used to permute the data for the 'random' method
  • train_test_validation_specific - pydantic class with 'train', 'validation' and 'test' fields. These specify which data goes into which dataset.
  • train_validation_test_datetime_split - split train, validation and test based on specific datetimes.

  • Returns - train, validation and test datetimes
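
A hypothetical call to split_data, using only the keyword names documented above. The import path and the SplitMethod member name are assumptions made for illustration and may not match the actual code.

```python
import pandas as pd
from dataset.split.split import SplitMethod, split_data  # assumed import path

datetimes = pd.date_range("2021-01-01", "2021-12-31", freq="30min")

train, validation, test = split_data(
    datetimes=datetimes,
    method=SplitMethod.MODULO,  # assumed member name for the 'modulo' method
    train_test_validation_split=(3, 1, 1),
)
```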

dataset.fake

A class to create a fake dataset

FakeDataset Objects

class FakeDataset(torch.utils.data.Dataset)

Fake dataset.

__init__

def __init__(configuration: Configuration, length: int = 10)

Init

Arguments:

  • configuration - configuration object
  • length - length of dataset

__len__

def __len__()

Number of pieces of data

per_worker_init

def per_worker_init(worker_id: int)

Nothing to do for FakeDataset

__getitem__

def __getitem__(idx)

Get item; used for the iter and next methods

Arguments:

  • idx - batch index

  • Returns - Dictionary of random data

dataset.datasets

Dataset and functions

logger

This file contains the following classes:

  • NetCDFDataset - torch.utils.data.Dataset: used for loading pre-made batches
  • NowcastingDataset - torch.utils.data.IterableDataset: dataset for making batches

NetCDFDataset Objects

class NetCDFDataset(torch.utils.data.Dataset)

Loads data saved by the prepare_ml_training_data.py script.

Moved from predict_pv_yield

__init__

def __init__(n_batches: int, src_path: str, tmp_path: str, configuration: Configuration, cloud: str = "gcp", required_keys: Union[Tuple[str], List[str]] = None, history_minutes: Optional[int] = None, forecast_minutes: Optional[int] = None)

Netcdf Dataset

Arguments:

  • n_batches - Number of batches available on disk.
  • src_path - The full path (including 'gs://') to the data on Google Cloud storage.
  • tmp_path - The full path to the local temporary directory (on a local filesystem).
  • required_keys - Tuple or list of keys required in the example for it to be considered usable
  • history_minutes - How many past minutes of data to use, if subsetting the batch
  • forecast_minutes - How many future minutes of data to use, if reducing the amount of forecast time
  • configuration - configuration object
  • cloud - which cloud is used, can be "gcp", "aws" or "local".
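
A hypothetical construction of NetCDFDataset using the arguments documented above; the import paths, file paths and batch count below are placeholders.

```python
import torch
from dataset.datasets import NetCDFDataset  # assumed import path
from config.model import Configuration      # assumed import path

dataset = NetCDFDataset(
    n_batches=100,                                            # placeholder
    src_path="gs://bucket/prepared_ML_training_data/train",   # placeholder path
    tmp_path="/tmp/nowcasting",                               # placeholder path
    configuration=Configuration(),
    cloud="gcp",
)

# __getitem__ already returns a whole batch, so disable the DataLoader's automatic batching.
dataloader = torch.utils.data.DataLoader(dataset, batch_size=None)
```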

per_worker_init

def per_worker_init(worker_id: int)

Function called by a worker

__len__

def __len__()

Length of dataset

__getitem__

def __getitem__(batch_idx: int) -> dict

Returns a whole batch at once.

Arguments:

  • batch_idx - The integer index of the batch. Must be in the range [0, self.n_batches).

Returns:

NamedDict where each value is a numpy array. The size of each array's first dimension is the batch size.

NowcastingDataset Objects

@dataclass
class NowcastingDataset(torch.utils.data.IterableDataset)

The first data_source will be used to select the geo locations of each batch.

n_samples_per_timestep

Number of times to re-use each timestep. Must exactly divide batch_size.

__post_init__

def __post_init__()

Post Init

per_worker_init

def per_worker_init(worker_id: int) -> None

Called by worker_init_fn on each copy of NowcastingDataset

This happens after the worker process has been spawned.

__iter__

def __iter__()

Yields a complete batch at a time.

worker_init_fn

def worker_init_fn(worker_id)

Configures each dataset worker process.

  1. Get fsspec ready for multi-process use.
  2. Call NowcastingDataset.per_worker_init().

dataset.xr_utils

Useful functions for xarray objects

  1. joining data arrays to datasets
  2. pydantic extension model of xr.Dataset
  3. functions to convert xr.DataArray and xr.Dataset objects to torch tensors

join_list_data_array_to_batch_dataset

def join_list_data_array_to_batch_dataset(image_data_arrays: List[xr.DataArray]) -> xr.Dataset

Join a list of data arrays to a dataset by expanding dims

join_dataset_to_batch_dataset

def join_dataset_to_batch_dataset(image_data_arrays: List[xr.Dataset]) -> xr.Dataset

Join a list of datasets to a batch dataset by expanding dims
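
A minimal sketch of the "expand dims and join" idea behind both functions, assuming the batch dimension is called "example" (the actual dimension name and variable naming may differ):

```python
import numpy as np
import xarray as xr

data_arrays = [
    xr.DataArray(np.random.rand(4, 4), dims=("x", "y")) for _ in range(8)
]

# Give each DataArray a new leading "example" dimension, then concatenate the
# list along it so the result's first dimension is the batch dimension.
batch = xr.concat(
    [da.expand_dims(example=[i]) for i, da in enumerate(data_arrays)],
    dim="example",
).to_dataset(name="data")
```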

convert_data_array_to_dataset

def convert_data_array_to_dataset(data_xarray)

Convert data array to dataset. Reindex dim so that it can be merged with batch

make_dim_index

def make_dim_index(data_xarray_dataset: xr.Dataset) -> xr.Dataset

Reindex dataset dims so that it can be merged with batch

PydanticXArrayDataSet Objects

class PydanticXArrayDataSet(xr.Dataset)

Pydantic Xarray Dataset Class

Adapted from https://pydantic-docs.helpmanual.io/usage/types/#classes-with-get_validators
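
For reference, the validator hook pattern from the linked pydantic docs looks roughly like the sketch below; the check shown is illustrative and not the exact validation performed by PydanticXArrayDataSet.

```python
from typing import Any

import xarray as xr


class ExampleXArrayDataSet(xr.Dataset):
    """Illustrative xr.Dataset subclass usable as a pydantic (v1) field type."""

    __slots__ = ()  # xr.Dataset subclasses are expected to declare __slots__

    @classmethod
    def __get_validators__(cls):
        # pydantic calls each validator yielded here, in order.
        yield cls.validate

    @classmethod
    def validate(cls, v: Any) -> Any:
        # Illustrative check only.
        assert isinstance(v, xr.Dataset), "value must be an xr.Dataset"
        return v
```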

model_validation

@classmethod
def model_validation(cls, v)

Model-specific validation, to be overridden by subclasses

__get_validators__

@classmethod
def __get_validators__(cls)

Get validators

validate

@classmethod
def validate(cls, v: Any) -> Any

Do validation

validate_dims

@classmethod
def validate_dims(cls, v: Any) -> Any

Validate the dims

validate_coords

@classmethod
def validate_coords(cls, v: Any) -> Any

Validate the coords

register_xr_data_array_to_tensor

def register_xr_data_array_to_tensor()

Add torch object to data array

register_xr_data_set_to_tensor

def register_xr_data_set_to_tensor()

Add torch object to dataset
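
One common way to attach torch helpers to xarray objects is accessor registration; the sketch below illustrates that general approach (the accessor name and behaviour here are assumptions, not necessarily what these register_* functions do).

```python
import torch
import xarray as xr


@xr.register_dataarray_accessor("torch")
class TorchDataArrayAccessor:
    """Illustrative accessor exposing a torch conversion on every DataArray."""

    def __init__(self, data_array: xr.DataArray):
        self._data_array = data_array

    def to_tensor(self) -> torch.Tensor:
        # Convert the underlying numpy values into a torch tensor.
        return torch.from_numpy(self._data_array.values)


# Usage: xr.DataArray([1.0, 2.0, 3.0]).torch.to_tensor()
```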

dataset.batch

batch functions

Batch Objects

class Batch(BaseModel)

Batch data object

Contains the following data sources - gsp, satellite, topographic, sun, pv, nwp and datetime. Also contains metadata of the class.

All data sources are xr.Datasets

data_sources

@property
def data_sources()

The different data sources

fake

@staticmethod
def fake(configuration: Configuration = Configuration())

Make fake batch object

save_netcdf

def save_netcdf(batch_i: int, path: Path)

Save batch to netcdf file

Arguments:

  • batch_i - the batch id, used to make the filename
  • path - the path where it will be saved. This can be local or in the cloud.

load_netcdf

@staticmethod
def load_netcdf(local_netcdf_path: Union[Path, str], batch_idx: int)

Load batch from netcdf file
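
A hypothetical save/load round trip using the methods documented above; the directory name is a placeholder and the import path is an assumption.

```python
from pathlib import Path

from dataset.batch import Batch  # assumed import path

batch = Batch.fake()
batch.save_netcdf(batch_i=0, path=Path("batches"))    # placeholder directory
reloaded = Batch.load_netcdf("batches", batch_idx=0)
```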

Example Objects

class Example(BaseModel)

Single Data item

Note that this is currently not really used

data_sources

@property
def data_sources()

The different data sources

BatchML Objects

class BatchML(Example)

Batch data object.

Contains the following data sources - gsp, satellite, topographic, sun, pv, nwp and datetime. Also contains metadata of the class.

fake

@staticmethod
def fake(configuration: Configuration = Configuration())

Create fake batch

from_batch

@staticmethod
def from_batch(batch: Batch) -> BatchML

Change batch to ML batch

dataset.subset

Take subsets of xr.datasets

subselect_data

def subselect_data(batch: Batch, history_minutes: int, forecast_minutes: int, current_timestep_index: Optional[int] = None) -> Batch

Subselects the data temporally. This function selects all data within the time range [t0 - history_minutes, t0 + forecast_minutes]

Arguments:

  • batch - Example dictionary containing at least the required_keys
  • required_keys - The required keys present in the dictionary to use
  • current_timestep_index - The index into either SATELLITE_DATETIME_INDEX or NWP_TARGET_TIME giving the current timestep
  • history_minutes - How many minutes of history to use
  • forecast_minutes - How many minutes of future data to use for forecasting

Returns:

Example with only data between [t0 - history_minutes, t0 + forecast_minutes] remaining
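
A minimal sketch of the time-window logic, shown on a toy dataset with a "time" coordinate (illustrative only; subselect_data operates on a whole Batch and its data sources):

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2021-06-01 10:00", periods=37, freq="5min")
data = xr.DataArray(np.random.rand(len(times)), coords={"time": times}, dims="time")

t0 = pd.Timestamp("2021-06-01 11:00")
history_minutes, forecast_minutes = 30, 60

# Keep only data in [t0 - history_minutes, t0 + forecast_minutes].
window = data.sel(
    time=slice(
        t0 - pd.Timedelta(minutes=history_minutes),
        t0 + pd.Timedelta(minutes=forecast_minutes),
    )
)
```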

select_time_period

def select_time_period(x, history_minutes: int, forecast_minutes: int, t0_dt_of_first_example: Union[datetime, pd.Timestamp])

Selects a subset of data between the indices of [start, end] for each key in keys

Note that the dataset is edited in place, so nothing is returned.

Arguments:

  • x - dataset that is to be reduced
  • t0_dt_of_first_example - datetime of the current time (t0) in the first example of the batch
  • history_minutes - How many minutes of history to use
  • forecast_minutes - How many minutes of future data to use for forecasting