Design rationale¶
The netCDF STF 2.0 compliant format is such that a file loaded from Python via xarray
does not yield the most convenient data model for users.
This notebook interactively illustrates these behaviours, and informs the design choices made to reconcile the xarray
view with the on-disk representation.
Loading an existing reference netCDF file¶
A file was created using (probably) a Matlab implementation of STF data handling and I/O. Let's load it via xarray
as well as the netCDF4
package, as we are not yet sure which will be the more suitable basis for efts-io
saving/loading operations.
import xarray as xr
import netCDF4 as nc
xarray.open_dataset
has arguments to turn on/off the decoding of the Climate and Forecast (CF) and related conventions.
decode_times=False
is a must, otherwise the statement fails. Decoding would work for the time
dimension, but decoding lead_time
fails. decode_cf
seems to influence at least how the station_name variable appears, notably whether it ends up with dimensions (station, strLen)
if False, or (station,)
if True.
fn = "/home/per202/data/sf/sample/HT_swiftRain_daily_stfv2_2000111523+0000-2023111023+0000.nc"
rain_xr = xr.open_dataset(fn, decode_times=False, decode_cf=False)
rain_nc = nc.Dataset(fn)
xarray read¶
rain_xr
<xarray.Dataset> Size: 19MB Dimensions: (time: 8396, station: 563, lead_time: 1, strLen: 30, ens_member: 1) Coordinates: * time (time) int32 34kB 1 2 3 4 5 6 ... 8392 8393 8394 8395 8396 * station (station) int32 2kB 1 2 3 4 5 6 7 ... 558 559 560 561 562 563 * lead_time (lead_time) int32 4B 0 * ens_member (ens_member) int32 4B 1 Dimensions without coordinates: strLen Data variables: station_id (station) int32 2kB ... station_name (station, strLen) |S1 17kB ... lat (station) float32 2kB ... lon (station) float32 2kB ... area (station) float32 2kB ... rain_obs (time, ens_member, station, lead_time) float32 19MB ... Attributes: title: Precip from Hydro Tasmania's observation network... institution: CSIRO Land & Water source: catchment: Hydro Tas STF_convention_version: 2.0 STF_nc_spec: https://wiki.csiro.au/display/wirada/NetCDF+for+... comment: history: 2024-07-25 15:28:34 +10.0 - File created
If we use decode_cf=True
, we seem to get a one-dimensional array of byte strings (dtype |S30), rather than a two-dimensional matrix of single bytes (dtype |S1):
rain_cfdecode = xr.open_dataset(fn, decode_times=False, decode_cf=True)
rain_cfdecode.station_name
<xarray.DataArray 'station_name' (station: 563)> Size: 17kB [563 values with dtype=|S30] Coordinates: * station (station) int32 2kB 1 2 3 4 5 6 7 8 ... 557 558 559 560 561 562 563 Attributes: long_name: station or node name
rain_cfdecode.station_name.values[1]
np.bytes_(b'19629011')
netCDF4 read¶
rain_nc
<class 'netCDF4._netCDF4.Dataset'> root group (NETCDF3_CLASSIC data model, file format NETCDF3): title: Precip from Hydro Tasmania's observation network areally averaged with inverse distance squared weighting institution: CSIRO Land & Water source: catchment: Hydro Tas STF_convention_version: 2.0 STF_nc_spec: https://wiki.csiro.au/display/wirada/NetCDF+for+SWIFT comment: history: 2024-07-25 15:28:34 +10.0 - File created dimensions(sizes): time(8396), station(563), lead_time(1), strLen(30), ens_member(1) variables(dimensions): int32 time(time), int32 station(station), int32 lead_time(lead_time), int32 station_id(station), |S1 station_name(station, strLen), int32 ens_member(ens_member), float32 lat(station), float32 lon(station), float32 area(station), float32 rain_obs(time, ens_member, station, lead_time) groups:
Modulo the value of decode_cf
for xarray.open_dataset
, the shape of the data in memory appears consistent between xarray
and netCDF4.
Requirements¶
Desired in-memory representation¶
See this discussion for background.
We assume that an "intuitive" data representation in an xarray dataset would have the following characteristics:
- The time
coordinate has values with Python representations np.datetime64
or similar.
- A station_id
coordinate has values as strings rather than bytes, so that slicing can be done with statements such as data.sel(station_id="407113A")
. The STF representation is such that station
is a dimension/coordinate, not station_id
.
- In the example case loaded, the variable datatype is 32-bit np.float32
rather than 64-bit np.float64
. The latter is probably more convenient in most use cases we can anticipate. However, we may want to consider keeping a 32-bit representation: ensemble forecasting and modelling methods can be RAM-hungry even with 2024-typical machine setups.
- Coordinate data is of type int32
. Memory footprint is not a consideration; we may want to change it to 64 bits, or not, based on other factors.
- There should be a coordinate named "realisation" (or U.S. "realization"?) rather than "ens_member".
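A minimal sketch of a couple of these adjustments on synthetic data (the variable and coordinate names mirror the file; the values, and the choice to widen to float64, are assumptions for illustration):

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the loaded dataset (values are made up)
ds = xr.Dataset(
    {"rain_obs": (("time", "ens_member"), np.zeros((3, 1), dtype=np.float32))},
    coords={
        "time": np.array([1, 2, 3], dtype=np.int32),
        "ens_member": np.array([1], dtype=np.int32),
    },
)

# Rename the ensemble dimension and widen the data to float64;
# keeping float32 remains an option if memory becomes a concern.
intuitive = ds.rename({"ens_member": "realisation"})
intuitive["rain_obs"] = intuitive["rain_obs"].astype(np.float64)
```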
STF 2.0 compliance¶
It is imperative to be able to export the in-memory xarray representation to a netCDF
file that complies with documented conventions and is readable by existing toolsets in Matlab
, C++
or even Fortran
. A key question is whether we can use xarray.to_netcdf
or whether we need to use the lower-level package netCDF4
to achieve that.
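A sketch of the xarray.to_netcdf route, unverified against the existing Matlab/C++ toolsets: xarray can target the NETCDF3_CLASSIC data model that the reference file uses, with per-variable encoding to pin on-disk dtypes (a suitable backend, scipy or netCDF4, is assumed to be installed; the path and attributes are made up):

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Small synthetic dataset standing in for the STF content
ds = xr.Dataset(
    {"rain_obs": (("time", "station"), np.zeros((2, 2), dtype=np.float32))},
    coords={
        "time": np.array([1, 2], dtype=np.int32),
        "station": np.array([1, 2], dtype=np.int32),
    },
    attrs={"STF_convention_version": "2.0"},
)

path = os.path.join(tempfile.mkdtemp(), "stf_sketch.nc")
# format="NETCDF3_CLASSIC" matches the data model reported by netCDF4 above;
# the encoding dict forces the on-disk dtype per variable.
ds.to_netcdf(path, format="NETCDF3_CLASSIC",
             encoding={"rain_obs": {"dtype": "float32"}})
```

Whether this alone is enough for full STF 2.0 compliance (attribute set, string handling for station_name) remains to be established.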
rain_cfdecode
<xarray.Dataset> Size: 19MB Dimensions: (time: 8396, station: 563, lead_time: 1, ens_member: 1) Coordinates: * time (time) int32 34kB 1 2 3 4 5 6 ... 8392 8393 8394 8395 8396 * station (station) int32 2kB 1 2 3 4 5 6 7 ... 558 559 560 561 562 563 * lead_time (lead_time) int32 4B 0 * ens_member (ens_member) int32 4B 1 Data variables: station_id (station) int32 2kB ... station_name (station) |S30 17kB b'18594010' b'19629011' ... b'28294677' lat (station) float32 2kB ... lon (station) float32 2kB ... area (station) float32 2kB ... rain_obs (time, ens_member, station, lead_time) float32 19MB ... Attributes: title: Precip from Hydro Tasmania's observation network... institution: CSIRO Land & Water source: catchment: Hydro Tas STF_convention_version: 2.0 STF_nc_spec: https://wiki.csiro.au/display/wirada/NetCDF+for+... comment: history: 2024-07-25 15:28:34 +10.0 - File created
time¶
For background, see issue https://jira.csiro.au/browse/WIRADA-635. We cannot have xarray automagically decode this axis, so we need to do the work manually, reusing as much as possible of the work already done. Not sure how I had figured out about CFDatetimeCoder
, but:
from xarray.coding import times
decod = times.CFDatetimeCoder(use_cftime=True)
decod.decode?
Signature: decod.decode(variable: 'Variable', name: 'T_Name' = None) -> 'Variable' Docstring: Convert an decoded variable to a encoded variable File: ~/src/efts-io/.venv/lib/python3.12/site-packages/xarray/coding/times.py Type: method
We need to pass a "Variable", not a DataArray:
type(rain_cfdecode.coords['time'])
xarray.core.dataarray.DataArray
TIME_DIMNAME="time"
var = xr.as_variable(rain_cfdecode.coords[TIME_DIMNAME])
time_zone = var.attrs["time_standard"]
time_coords = decod.decode(var, name=TIME_DIMNAME)
time_zone
'UTC'
timestamp = time_coords.values[0]
timestamp
cftime.DatetimeGregorian(2000, 11, 15, 23, 0, 0, 0, has_year_zero=False)
Date/time, calendar and time zone handling are a topic of underappreciated complexity, to put it mildly. Let's look at what we get here.
This type of time stamp is unfamiliar. It seems not to carry a time zone after the decoding operation, but it can hold one:
timestamp.tzinfo is None
True
Should our new time
axis hold time zone info with each time stamp, or still rely on the coordinate attribute time_standard
?
from efts_io.wrapper import cftimes_to_pdtstamps
new_time_values = cftimes_to_pdtstamps(
time_coords.values,
time_zone,
)
new_time_values
array([Timestamp('2000-11-15 23:00:00+0000', tz='UTC'), Timestamp('2000-11-16 23:00:00+0000', tz='UTC'), Timestamp('2000-11-17 23:00:00+0000', tz='UTC'), ..., Timestamp('2023-11-08 23:00:00+0000', tz='UTC'), Timestamp('2023-11-09 23:00:00+0000', tz='UTC'), Timestamp('2023-11-10 23:00:00+0000', tz='UTC')], dtype=object)
This may be a suitable time axis, though depending on usage needs we may want to revisit it. In particular, users may create "naive" date time stamps from strings: how would ds.sel()
then behave if the time stamps have time zones?
import pandas as pd
pd.Timestamp('2000-11-15 23:00:00+0000')
Timestamp('2000-11-15 23:00:00+0000', tz='UTC')
pd.Timestamp('2000-11-15 23:00:00+0000') == new_time_values[0]
True
pd.Timestamp('2000-11-15 23:00:00')
Timestamp('2000-11-15 23:00:00')
pd.Timestamp('2000-11-15 23:00:00') == new_time_values[0]
False
As expected, the naive date time is not equal to the one with a time zone. Using time zones in the time stamps may be a fraught choice in practice. In particular, behaviours may be logical but unintuitive if we use a time slice
to subset data.
See also GitHub issue 3.
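To illustrate the kind of behaviour a time slice could run into, a pandas-only sketch (independent of the package):

```python
import pandas as pd

# A tz-aware index, like the UTC time axis built above
idx = pd.date_range("2000-11-15 23:00", periods=3, freq="D", tz="UTC")
s = pd.Series([1.0, 2.0, 3.0], index=idx)

# Partial-string slicing with naive strings works: pandas interprets the
# strings as wall time in the index's time zone...
subset = s.loc["2000-11-15":"2000-11-16"]

# ...but a naive Timestamp is not considered a member of the index.
naive_in_index = pd.Timestamp("2000-11-15 23:00") in idx
```

So string-based slices remain convenient with a tz-aware axis, but exact lookups with naive timestamps silently miss.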
new_time_values = cftimes_to_pdtstamps(
time_coords.values,
None,
)
new_time_values
array([Timestamp('2000-11-15 23:00:00'), Timestamp('2000-11-16 23:00:00'), Timestamp('2000-11-17 23:00:00'), ..., Timestamp('2023-11-08 23:00:00'), Timestamp('2023-11-09 23:00:00'), Timestamp('2023-11-10 23:00:00')], dtype=object)
station_id¶
station_ids = rain_cfdecode.station_id.values
station_ids.dtype
dtype('int32')
station_ids[:3]
array([18594010, 19629011, 18562015], dtype=int32)
STF conventions are such that the station ID can only be an integer. We want a str
in the in-memory model. This is easy going in one direction; going the other way (writing to STF 2.0) will be trickier.
station_ids_str = [str(x) for x in station_ids]
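Going back the other way, a string ID must be representable as an integer; an alphanumeric ID such as "407113A" would not be. A sketch of what a validating reverse conversion might look like (the function name and behaviour are hypothetical, not part of the package):

```python
def station_ids_to_int(ids):
    """Convert string station IDs back to the integer IDs STF 2.0 requires.

    Raises ValueError for IDs that are not purely numeric, rather than
    silently corrupting them on write.
    """
    out = []
    for s in ids:
        if not s.isdigit():
            raise ValueError(f"station ID {s!r} cannot be stored as an STF integer ID")
        out.append(int(s))
    return out

ids_back = station_ids_to_int(["18594010", "19629011"])
```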
rain_cfdecode.station_id
<xarray.DataArray 'station_id' (station: 563)> Size: 2kB array([18594010, 19629011, 18562015, ..., 28286670, 28294676, 28294677], dtype=int32) Coordinates: * station (station) int32 2kB 1 2 3 4 5 6 7 8 ... 557 558 559 560 561 562 563 Attributes: long_name: station or node identification code
type(rain_cfdecode.station_id)
xarray.core.dataarray.DataArray
rain_cfdecode.station
<xarray.DataArray 'station' (station: 563)> Size: 2kB array([ 1, 2, 3, ..., 561, 562, 563], dtype=int32) Coordinates: * station (station) int32 2kB 1 2 3 4 5 6 7 8 ... 557 558 559 560 561 562 563
type(rain_cfdecode.station)
xarray.core.dataarray.DataArray
A key point here is that we will promote "station_id", which is a data variable, to a coordinate, so we cannot just assign dimensions; we will need to reconstruct a new xarray Dataset.
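On a simplified synthetic example at least, xarray's set_coords and swap_dims can perform this promotion without a full rebuild; whether this carries over to the real dataset (with the renaming and value changes we also need) is not yet established:

```python
import xarray as xr

# Synthetic stand-in: 'station_id' starts as a data variable over 'station'
ds = xr.Dataset(
    {
        "station_id": ("station", ["18594010", "19629011"]),
        "value": ("station", [1.0, 2.0]),
    },
    coords={"station": [1, 2]},
)

# Promote station_id to a coordinate, then make it the indexing dimension
ds2 = ds.set_coords("station_id").swap_dims({"station": "station_id"})
picked = ds2.sel(station_id="19629011")
```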
station_name¶
rain_cfdecode.station_name
<xarray.DataArray 'station_name' (station: 563)> Size: 17kB array([b'18594010', b'19629011', b'18562015', ..., b'28286670', b'28294676', b'28294677'], dtype='|S30') Coordinates: * station (station) int32 2kB 1 2 3 4 5 6 7 8 ... 557 558 559 560 561 562 563 Attributes: long_name: station or node name
x = b'18594010'
str(x, encoding="UTF-8")
'18594010'
Using helper functions already included in the package at the time of writing:
from efts_io.wrapper import byte_stations_to_str
station_name_str = byte_stations_to_str(rain_cfdecode.station_name.values)
station_name_str[:3]
array(['18594010', '19629011', '18562015'], dtype='<U8')
creating a new dataset¶
The package already includes a function to create a high-level xarray dataset:
from efts_io import wrapper as w
rain_cfdecode.ens_member.values
array([1], dtype=int32)
issue_times = new_time_values
station_ids = station_ids_str
lead_times = rain_cfdecode.lead_time.values
lead_time_tstep = "days"
ensemble_size = len(rain_cfdecode.ens_member.values)
station_names= station_name_str
nc_attributes = None
latitudes = rain_cfdecode.lat.values
longitudes = rain_cfdecode.lon.values
areas = rain_cfdecode.area.values
d = w.xr_efts(
issue_times,
station_ids,
lead_times,
lead_time_tstep,
ensemble_size,
station_names,
latitudes,
longitudes,
areas,
nc_attributes,
)
d.sizes
Frozen({'station': 563, 'time': 8396, 'ens_member': 1, 'lead_time': 1})
d.sel(station_id="18594010", drop=True)
<xarray.Dataset> Size: 67kB Dimensions: (time: 8396, ens_member: 1, lead_time: 1) Coordinates: * time (time) datetime64[ns] 67kB 2000-11-15T23:00:00 ... 2023-11-... * ens_member (ens_member) int64 8B 1 * lead_time (lead_time) int32 4B 0 Data variables: station_name <U8 32B '18594010' lat float32 4B -41.63 lon float32 4B 146.3 area float32 4B 8.743 Attributes: title: not provided institution: not provided catchment: not provided source: not provided comment: not provided history: not provided STF_convention_version: 2.0 STF_nc_spec: https://github.com/csiro-hydroinformatics/efts/b...
set(d.variables.keys())
{'area', 'ens_member', 'lat', 'lead_time', 'lon', 'station', 'station_id', 'station_name', 'time'}
set(rain_cfdecode.variables.keys())
{'area', 'ens_member', 'lat', 'lead_time', 'lon', 'rain_obs', 'station', 'station_id', 'station_name', 'time'}
rain_cfdecode.rain_obs.dims
('time', 'ens_member', 'station', 'lead_time')
da = rain_cfdecode.rain_obs
Assigning the data variable directly is not possible due to the differing names for the station-id coordinate(s): we'd end up with 5 dimensions:
d
<xarray.Dataset> Size: 114kB Dimensions: (station: 563, time: 8396, ens_member: 1, lead_time: 1) Coordinates: * time (time) datetime64[ns] 67kB 2000-11-15T23:00:00 ... 2023-11-... * station (station) int64 5kB 1 2 3 4 5 6 7 ... 558 559 560 561 562 563 * ens_member (ens_member) int64 8B 1 * lead_time (lead_time) int32 4B 0 * station_id (station) <U8 18kB '18594010' '19629011' ... '28294677' Data variables: station_name (station) <U8 18kB '18594010' '19629011' ... '28294677' lat (station) float32 2kB -41.63 -41.72 -41.78 ... -41.82 -41.85 lon (station) float32 2kB 146.3 146.4 146.2 ... 145.6 145.6 145.6 area (station) float32 2kB 8.743 8.05 24.13 ... 3.353 1.76 4.988 Attributes: title: not provided institution: not provided catchment: not provided source: not provided comment: not provided history: not provided STF_convention_version: 2.0 STF_nc_spec: https://github.com/csiro-hydroinformatics/efts/b...
d_tmp = d.copy()
d_tmp.station
<xarray.DataArray 'station' (station: 563)> Size: 5kB array([ 1, 2, 3, ..., 561, 562, 563]) Coordinates: * station (station) int64 5kB 1 2 3 4 5 6 7 ... 558 559 560 561 562 563 * station_id (station) <U8 18kB '18594010' '19629011' ... '28294677'
da.station
<xarray.DataArray 'station' (station: 563)> Size: 2kB array([ 1, 2, 3, ..., 561, 562, 563], dtype=int32) Coordinates: * station (station) int32 2kB 1 2 3 4 5 6 7 8 ... 557 558 559 560 561 562 563
da = da.assign_coords(d_tmp.station.coords)
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[51], line 1 ----> 1 da = da.assign_coords(d_tmp.station.coords) File ~/src/efts-io/.venv/lib/python3.12/site-packages/xarray/core/common.py:645, in DataWithCoords.assign_coords(self, coords, **coords_kwargs) 642 else: 643 results = self._calc_assign_results(coords_combined) --> 645 data.coords.update(results) 646 return data File ~/src/efts-io/.venv/lib/python3.12/site-packages/xarray/core/coordinates.py:554, in Coordinates.update(self, other) 546 # Discard original indexed coordinates prior to merge allows to: 547 # - fail early if the new coordinates don't preserve the integrity of existing 548 # multi-coordinate indexes 549 # - drop & replace coordinates without alignment (note: we must keep indexed 550 # coordinates extracted from the DataArray objects passed as values to 551 # `other` - if any - as those are still used for aligning the old/new coordinates) 552 coords_to_align = drop_indexed_coords(set(other_coords) & set(other), self) --> 554 coords, indexes = merge_coords( 555 [coords_to_align, other_coords], 556 priority_arg=1, 557 indexes=coords_to_align.xindexes, 558 ) 560 # special case for PandasMultiIndex: updating only its dimension coordinate 561 # is still allowed but depreciated. 562 # It is the only case where we need to actually drop coordinates here (multi-index levels) 563 # TODO: remove when removing PandasMultiIndex's dimension coordinate. 
564 self._drop_coords(self._names - coords_to_align._names) File ~/src/efts-io/.venv/lib/python3.12/site-packages/xarray/core/merge.py:556, in merge_coords(objects, compat, join, priority_arg, indexes, fill_value) 554 _assert_compat_valid(compat) 555 coerced = coerce_pandas_values(objects) --> 556 aligned = deep_align( 557 coerced, join=join, copy=False, indexes=indexes, fill_value=fill_value 558 ) 559 collected = collect_variables_and_indexes(aligned, indexes=indexes) 560 prioritized = _get_priority_vars_and_indexes(aligned, priority_arg, compat=compat) File ~/src/efts-io/.venv/lib/python3.12/site-packages/xarray/core/alignment.py:946, in deep_align(objects, join, copy, indexes, exclude, raise_on_invalid, fill_value) 943 else: 944 out.append(variables) --> 946 aligned = align( 947 *targets, 948 join=join, 949 copy=copy, 950 indexes=indexes, 951 exclude=exclude, 952 fill_value=fill_value, 953 ) 955 for position, key, aligned_obj in zip(positions, keys, aligned): 956 if key is no_key: File ~/src/efts-io/.venv/lib/python3.12/site-packages/xarray/core/alignment.py:882, in align(join, copy, indexes, exclude, fill_value, *objects) 686 """ 687 Given any number of Dataset and/or DataArray objects, returns new 688 objects with aligned indexes and dimension sizes. (...) 872 873 """ 874 aligner = Aligner( 875 objects, 876 join=join, (...) 
880 fill_value=fill_value, 881 ) --> 882 aligner.align() 883 return aligner.results File ~/src/efts-io/.venv/lib/python3.12/site-packages/xarray/core/alignment.py:573, in Aligner.align(self) 571 self.find_matching_indexes() 572 self.find_matching_unindexed_dims() --> 573 self.assert_no_index_conflict() 574 self.align_indexes() 575 self.assert_unindexed_dim_sizes_equal() File ~/src/efts-io/.venv/lib/python3.12/site-packages/xarray/core/alignment.py:318, in Aligner.assert_no_index_conflict(self) 314 if dup: 315 items_msg = ", ".join( 316 f"{k!r} ({v} conflicting indexes)" for k, v in dup.items() 317 ) --> 318 raise ValueError( 319 "cannot re-index or align objects with conflicting indexes found for " 320 f"the following {msg}: {items_msg}\n" 321 "Conflicting indexes may occur when\n" 322 "- they relate to different sets of coordinate and/or dimension names\n" 323 "- they don't have the same type\n" 324 "- they may be used to reindex data along common dimensions" 325 ) ValueError: cannot re-index or align objects with conflicting indexes found for the following dimensions: 'station' (2 conflicting indexes) Conflicting indexes may occur when - they relate to different sets of coordinate and/or dimension names - they don't have the same type - they may be used to reindex data along common dimensions
d_tmp['rain_obs'] = da
d_tmp
There is a DataArray.rename method to rename coordinates, but since we also change the values of the station
and station_id
coordinates, we need to do more work anyway.
# make sure we manipulate the 4D dataset: do not assume a certain order in the dimensions:
coordinates_mapping = {
"time": "time",
"station": "station_id",
"ens_member": "ens_member",
"lead_time": "lead_time",
}
list(coordinates_mapping.keys())
['time', 'station', 'ens_member', 'lead_time']
rain_obs = rain_cfdecode.rain_obs
rain_obs
<xarray.DataArray 'rain_obs' (time: 8396, ens_member: 1, station: 563, lead_time: 1)> Size: 19MB [4726948 values with dtype=float32] Coordinates: * time (time) int32 34kB 1 2 3 4 5 6 ... 8391 8392 8393 8394 8395 8396 * station (station) int32 2kB 1 2 3 4 5 6 7 ... 558 559 560 561 562 563 * lead_time (lead_time) int32 4B 0 * ens_member (ens_member) int32 4B 1 Attributes: standard_name: rain_obs long_name: observed rainfall units: mm type: 2.0 type_description: accumulated over the preceding interval dat_type: der dat_type_description: derived (from observations) location_type: area
d.station_id.attrs
{'long_name': 'station or node identification code'}
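The "more work" could start from a rename driven by the mapping above; a sketch on synthetic data (the real case also needs attributes carried over, so this is not the whole story):

```python
import numpy as np
import xarray as xr

# Synthetic 4D stand-in for rain_obs, with the same dimension order
da = xr.DataArray(
    np.zeros((2, 1, 3, 1), dtype=np.float32),
    dims=("time", "ens_member", "station", "lead_time"),
)

# Rename the station dimension and attach string IDs as its coordinate values
da2 = da.rename({"station": "station_id"}).assign_coords(
    station_id=["18594010", "19629011", "18562015"]
)
```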
time axis¶
axis = "hours since 2010-08-01 13:00:00 +0000"
The cell was left incomplete in the original notebook; a plausible completion (an assumption, mirroring the earlier decode call) decodes a small synthetic variable carrying these units:
decod.decode(xr.Variable(("time",), [0, 1, 2], attrs={"units": axis}), name="time")
import cftime
cftime.time2index