This folder contains the code for the different data sources.
Data Sources
- gsp: Grid Supply Point data from Sheffield Solar (e.g. the estimated total solar PV power generation for each GSP region, and the geospatial shape of each GSP region).
- nwp: Numerical Weather predictions from UK Met Office
- pv: PV output data from pvoutput.org
- satellite: satellite data from ...
- sun: Sun position data (e.g. the estimated total solar PV power generation for each GSP region, and the geospatial shape of each GSP region).
- topographic: Topographic data e.g. the elevation of the land.
data_source.py
General class used for making a data source. It has the following functions
- get_batch: gets a whole batch of data for that data source. The list of 'xr.Dataset' examples are converted to
one xr.Dataset by changing the coordinates to indexes, and then joining the examples along an extra dimension.
- datetime_index: gets the all available datatimes of the source
- get_example: gets one "example" (a single consecutive sequence). Each batch is made up of multiple examples.
Each example is a 'xr.Dataset'
- get_locations_for_batch: Samples the geospatial x,y location for each example in a batch. This is useful because,
typically, we want a single DataSource to dictate the geospatial locations of the examples (for example,
we want each example to be centered on the centroid of the grid supply point region). All the other
DataSources
will use these same geospatial locations.
datasource_output.py
General pydantic model of output of the data source. Contains the following methods - save_netcdf: save to netcdf file - check_nan_and_inf: check if any values are nans or infinite - check_dataset_greater_than_or_equal_to: check values are >= a value - check_dataset_less_than_or_equal_to: check values are <= a value - check_dataset_not_equal: check values are !>= a value - check_data_var_dim: check the dimensions of a data variable
Data Source folder
Roughly each of the data source folders follows this pattern - A class which defines how to load the data source, how to select for batches etc. This inherits from 'data_source.DataSource', - A class which contains the output model of the data source, built from a xarray Dataset. This is the information used in the batches. This inherits from 'datasource_output.DataSourceOutput'.
fake
fake.batch.py
has several function to create fake Batch
data. This is useful for testing,
and hopefully useful outside this module too.
How to add a new data source
Below is a checklist of general things to do when creating a new data source. 1. Assuming that data cannot be made on the fly, create script to process data.
-
Create folder in nowcasting/data_sources with the name of the new data source
-
Create a file called
<name>_data_source.py
. This file should contain a class which inherits fromnowcasting_dataset.data_source.DataSource
. This class will need to implement theget_example
method. (there is also an option to use aget_batch
method instead)
def get_example(
self, t0_datetime_utc: pd.Timestamp, x_center_osgb: Number, y_center_osgb: Number
) -> NewDataSource:
"""
Get a single example
Args:
t0_datetime_utc: At inference time, t0 can be thought of as "now":
it is the time of the most recent observation.
Anything after t0 is "future". Anything before t0 is "history".
x_center_osgb: Center of the example in meters in the x direction in OSGB coordinates
y_center_osgb: Center of the example in meters in the y direction in OSGB coordinates
Returns:
Example containing xxx data for the selected area
"""
- Create a file called
<name>_model.py
which a class with the name of the data source. This class is an extension of an xr.Dataset with some pydantic validation
class NewDataSource(DataSourceOutput):
""" Class to store <name> data as a xr.Dataset with some validation """
# Use to store xr.Dataset data
__slots__ = ()
_expected_dimensions = ("x", "y")
@classmethod
def model_validation(cls, v):
""" Check that all values are not NaNs """
assert (v.data != np.nan).all(), "Some data values are NaNs"
return v
-
Add new data source to the
nowcasting_dataset.dataset.batch.Batch
class. -
Add new data source to
nowcasting.dataset.datamodule.NowcastingDataModule
. -
Add configuration data to configuration model (
nowcasting_dataset/config/model.py
), for example where the raw data is loaded from.
Testing
- Create a test to check that new data source is loaded correctly.
- Create a script to make test data in
scripts/generate_data_for_tests
- Create a function to make a randomly generated xr.Dataset for generating fake data in
nowcasting.dataset.fake.py
\ and to batch fake function