Design rationale¶
The netCDF STF 2.0 compliant format is such that a file loaded from Python via xarray
is not the most convenient data model for users.
This notebook illustrates interactively the behaviors, and informs the design choices made to reconcile the xarray
view with the on-disk representation.
Loading an existing reference netCDF file¶
A file was created using (probably) a Matlab implementation of STF data handling and I/O. Let's load it via xarray
as well as the netCDF4
package, as we are not sure which will be most adequate for efts-io
for saving/loading operations.
import xarray as xr
import netCDF4 as nc
xarray.open_dataset
has arguments to turn on/off the decoding of climate and forecast (SF) and related conventions.
decode_times=False
is a must, otherwise the statement fails. Decoding would work for thetime
dimension, but decodinglead_time
fails.decode_cf
seems to influence at least how the station_name variable appears, notably whether it ends up of dimensions(station, strLen)
if True, or(station,)
if False.
import efts_io.helpers as hlp
fn = hlp.derived_rainfall_tas()
rain_xr = xr.open_dataset(fn, decode_times=False, decode_cf=False)
rain_nc = nc.Dataset(fn)
xarray read¶
rain_xr
If we use decode_cf=True
, we seem to get a one dimensional array of array of bytes, rather than a matrix of bytes (type 'S1'):
print(rain_xr)
rain_cfdecode = xr.open_dataset(fn, decode_times=False, decode_cf=True)
rain_cfdecode.station_name
rain_cfdecode.station_name.values[1]
netCDF4 read¶
rain_nc
Modulo the value of decode_cf
for xarray.open_dataset
, the shape of the data in memory appears consistent between xarray
and netCDF4
Requirements¶
Desired in-memory representation¶
See this discussion for background.
We assume that an "intuitive" data representation in an xarray dataset would have the following characteristics:
- The
time
coordinate has values with python representationsnp.datetime64
or similar - A
station_id
coordinate has values as strings rather than bytes, so that slicing can be done with statements such asdata.sel(station_id="407113A")
. The STF representation is such thatstation
is a dimension/coordinate, notstation_id
- In the example case loaded, the variable datatypes is 32-bits
np.float32
rather than 64 bitsnp.float
. The latter is probably more convenient in most use cases we can anticipate. However we may want to consider keeping a 32 bits representation: ensemble forecasting and modelling methods can be RAM-hungry even with 2024 typical machine setups. - coordinate data is in type
int32
. Memory footprint is not a consideration, we may want to change is to 64 bits, or not, based on other factors. - There should be a coordinate named "realisation" (or U.S. "realization"??) rather than "ens_member"
STF 2.0 compliance¶
It is imperative to be able to export the in-memory xarray representation to a netCDF
file that complies with documented conventions and is readable by existing toolsets in Matlab
, C++
or even Fortran
. A key question is whether we can use xarray.to_netcdf
or whether we need to use the lower level package netCDF4
to achieve that.
rain_cfdecode
time¶
For background in issue https://jira.csiro.au/browse/WIRADA-635. We cannot have xarray automagically decoding this axis, so we need to do the work manually, but using as much as possible work already done. Not sure how I had figured out about CFDatetimeCoder
, but:
from xarray.coding import times
decod = times.CFDatetimeCoder(use_cftime=True)
decod.decode?
We need to pass a "Variable", not a DataArray
type(rain_cfdecode.coords['time'])
TIME_DIMNAME="time"
var = xr.as_variable(rain_cfdecode.coords[TIME_DIMNAME])
time_zone = var.attrs["time_standard"]
time_coords = decod.decode(var, name=TIME_DIMNAME)
time_zone
timestamp = time_coords.values[0]
timestamp
Date/time, calendar and time zone handling are a topic of underappreciated complexity, to put it mildly. Let's look at what we get here.
Unfamiliar with this type of time stamp. It seems not to have time zone from the decoding operation, but can have it:
timestamp.tzinfo is None
Should our new time
axis hold time zone info with each time stamp, or still rely on the coordinate attribute time_standard
?
from efts_io.wrapper import cftimes_to_pdtstamps
new_time_values = cftimes_to_pdtstamps(
time_coords.values,
time_zone,
)
new_time_values
This may be a suitable time axis. Depending on usage needs we may want to revisit though. In particular, users may create "naive" date time stamps from strings: how would ds.sel()
then behave if time stamps have time zones??
import pandas as pd
pd.Timestamp('2000-11-15 23:00:00+0000')
pd.Timestamp('2000-11-15 23:00:00+0000') == new_time_values[0]
pd.Timestamp('2000-11-15 23:00:00')
pd.Timestamp('2000-11-15 23:00:00') == new_time_values[0]
As expected, the naive date time is not equal to the one with a time zone. Using time zone in the time stamps may be a fraught choice in practice. In particular there may be logical but unintuitive if we use a time slice
to subset data
See also github issue 3
new_time_values = cftimes_to_pdtstamps(
time_coords.values,
None,
)
new_time_values
station_id¶
station_ids = rain_cfdecode.station_id.values
station_ids.dtype
station_ids[:3]
STF conventions are such that the station ID can only be an integer. We want a str
in the in memory model. This is easy going one direction; when we consider going the other way (writing to STF 2.0) this will be trickier.
station_ids_str = [str(x) for x in station_ids]
rain_cfdecode.station_id
type(rain_cfdecode.station_id)
rain_cfdecode.station
type(rain_cfdecode.station)
A key thing here is that we will promote "station_id" which is a variable, to a coordinate, so we cannot just assign dimensions; we will need to reconstruct a new xarray.
station_name¶
rain_cfdecode.station_name
x = b'18594010'
str(x, encoding="UTF-8")
Using helper functions already included in the package at the time of writing:
from efts_io.wrapper import byte_stations_to_str
station_name_str = byte_stations_to_str(rain_cfdecode.station_name.values)
station_name_str[:3]
creating a new dataset¶
The package already includes a function to create high level xarray
from efts_io import wrapper as w
rain_cfdecode.ens_member.values
issue_times = new_time_values
station_ids = station_ids_str
lead_times = rain_cfdecode.lead_time.values
lead_time_tstep = "days"
ensemble_size = len(rain_cfdecode.ens_member.values)
station_names= station_name_str
nc_attributes = None
latitudes = rain_cfdecode.lat.values
longitudes = rain_cfdecode.lon.values
areas = rain_cfdecode.area.values
d = w.xr_efts(
issue_times,
station_ids,
lead_times,
lead_time_tstep,
ensemble_size,
station_names,
latitudes,
longitudes,
areas,
nc_attributes,
)
d.station_id
d.sizes
d.sel(station_id="28286670", drop=True)
set(d.variables.keys())
set(rain_cfdecode.variables.keys())
rain_cfdecode.rain_obs.dims
da = rain_cfdecode.rain_obs
Assigning the data variable straight is not possible due to the differing names for the coordinate(s) for the station ids: we'd end up with 5 dimensions:
d
d_tmp = d.copy()
d_tmp.station
da.station
da = da.assign_coords(d_tmp.station.coords)
d_tmp['rain_obs'] = da
d_tmp
There is a DataArray.rename method to rename coordinates, but since we also have a change of values for the station
and station_id
coordinates, we need to do more work anyway.
# make sure we manipulate the 4D dataset: do not assume a certain order in the dimensions:
coordinates_mapping = {
"time": "time",
"station": "station_id",
"ens_member": "ens_member",
"lead_time": "lead_time",
}
list(coordinates_mapping.keys())
rain_obs = rain_cfdecode.rain_obs
rain_obs
d.station_id.attrs
time axis¶
axis = "hours since 2010-08-01 13:00:00 +0000"
import cftime
cftime.time2index