Design rationale¶
The netCDF STF 2.0 compliant format is such that a file loaded from Python via xarray
does not yield the most convenient data model for users.
This notebook interactively illustrates these behaviours, and informs the design choices made to reconcile the xarray
view with the on-disk representation.
Loading an existing reference netCDF file¶
A file was created using (probably) a Matlab implementation of STF data handling and I/O. Let's load it via xarray
as well as the netCDF4
package, as we are not yet sure which will be the more suitable basis for efts-io
saving/loading operations.
import xarray as xr
import netCDF4 as nc
xarray.open_dataset
has arguments to turn on/off the decoding of the Climate and Forecast (CF) and related conventions.
decode_times=False
is a must, otherwise the statement fails. Decoding would work for the time
dimension, but decoding lead_time
fails. decode_cf
seems to influence at least how the station_name variable appears, notably whether it ends up with dimensions (station, strLen)
if False, or (station,)
if True.
fn = "/home/per202/data/sf/sample/HT_swiftRain_daily_stfv2_2000111523+0000-2023111023+0000.nc"
rain_xr = xr.open_dataset(fn, decode_times=False, decode_cf=False)
rain_nc = nc.Dataset(fn)
xarray read¶
rain_xr
<xarray.Dataset> Size: 19MB Dimensions: (time: 8396, station: 563, lead_time: 1, strLen: 30, ens_member: 1) Coordinates: * time (time) int32 34kB 1 2 3 4 5 6 ... 8392 8393 8394 8395 8396 * station (station) int32 2kB 1 2 3 4 5 6 7 ... 558 559 560 561 562 563 * lead_time (lead_time) int32 4B 0 * ens_member (ens_member) int32 4B 1 Dimensions without coordinates: strLen Data variables: station_id (station) int32 2kB ... station_name (station, strLen) |S1 17kB ... lat (station) float32 2kB ... lon (station) float32 2kB ... area (station) float32 2kB ... rain_obs (time, ens_member, station, lead_time) float32 19MB ... Attributes: title: Precip from Hydro Tasmania's observation network... institution: CSIRO Land & Water source: catchment: Hydro Tas STF_convention_version: 2.0 STF_nc_spec: https://wiki.csiro.au/display/wirada/NetCDF+for+... comment: history: 2024-07-25 15:28:34 +10.0 - File created
If we use decode_cf=True
, we seem to get a one-dimensional array of byte strings (dtype |S30), rather than a two-dimensional matrix of single bytes (dtype |S1):
rain_cfdecode = xr.open_dataset(fn, decode_times=False, decode_cf=True)
rain_cfdecode.station_name
<xarray.DataArray 'station_name' (station: 563)> Size: 17kB [563 values with dtype=|S30] Coordinates: * station (station) int32 2kB 1 2 3 4 5 6 7 8 ... 557 558 559 560 561 562 563 Attributes: long_name: station or node name
rain_cfdecode.station_name.values[1]
np.bytes_(b'19629011')
netCDF4 read¶
rain_nc
<class 'netCDF4._netCDF4.Dataset'> root group (NETCDF3_CLASSIC data model, file format NETCDF3): title: Precip from Hydro Tasmania's observation network areally averaged with inverse distance squared weighting institution: CSIRO Land & Water source: catchment: Hydro Tas STF_convention_version: 2.0 STF_nc_spec: https://wiki.csiro.au/display/wirada/NetCDF+for+SWIFT comment: history: 2024-07-25 15:28:34 +10.0 - File created dimensions(sizes): time(8396), station(563), lead_time(1), strLen(30), ens_member(1) variables(dimensions): int32 time(time), int32 station(station), int32 lead_time(lead_time), int32 station_id(station), |S1 station_name(station, strLen), int32 ens_member(ens_member), float32 lat(station), float32 lon(station), float32 area(station), float32 rain_obs(time, ens_member, station, lead_time) groups:
Modulo the value of decode_cf
for xarray.open_dataset
, the shape of the data in memory appears consistent between xarray
and netCDF4.
Requirements¶
Desired in-memory representation¶
See this discussion for background.
We assume that an "intuitive" data representation in an xarray dataset would have the following characteristics:
- The time
coordinate has values with Python representations np.datetime64
or similar.
- A station_id
coordinate has values as strings rather than bytes, so that slicing can be done with statements such as data.sel(station_id="407113A")
. The STF representation is such that station
is a dimension/coordinate, not station_id
.
- In the example case loaded, the variable datatype is 32-bit np.float32
rather than 64-bit np.float64
. The latter is probably more convenient in most use cases we can anticipate. However, we may want to consider keeping a 32-bit representation: ensemble forecasting and modelling methods can be RAM-hungry even with 2024-typical machine setups.
- Coordinate data is of type int32
. Memory footprint is not a consideration; we may want to change it to 64 bits, or not, based on other factors.
- There should be a coordinate named "realisation" (or U.S. "realization"?) rather than "ens_member".
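A minimal sketch of a couple of these adjustments on synthetic data (the variable and coordinate names mirror the file; the values, and the choice to widen to float64, are assumptions for illustration):

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the loaded dataset (values are made up)
ds = xr.Dataset(
    {"rain_obs": (("time", "ens_member"), np.zeros((3, 1), dtype=np.float32))},
    coords={
        "time": np.array([1, 2, 3], dtype=np.int32),
        "ens_member": np.array([1], dtype=np.int32),
    },
)

# Rename the ensemble dimension and widen the data to float64;
# keeping float32 remains an option if memory becomes a concern.
intuitive = ds.rename({"ens_member": "realisation"})
intuitive["rain_obs"] = intuitive["rain_obs"].astype(np.float64)
```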
STF 2.0 compliance¶
It is imperative to be able to export the in-memory xarray representation to a netCDF
file that complies with documented conventions and is readable by existing toolsets in Matlab
, C++
or even Fortran
. A key question is whether we can use xarray.to_netcdf
or whether we need to use the lower-level package netCDF4
to achieve that.
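A sketch of the xarray.to_netcdf route, unverified against the existing Matlab/C++ toolsets: xarray can target the NETCDF3_CLASSIC data model that the reference file uses, with per-variable encoding to pin on-disk dtypes (a suitable backend, scipy or netCDF4, is assumed to be installed; the path and attributes are made up):

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Small synthetic dataset standing in for the STF content
ds = xr.Dataset(
    {"rain_obs": (("time", "station"), np.zeros((2, 2), dtype=np.float32))},
    coords={
        "time": np.array([1, 2], dtype=np.int32),
        "station": np.array([1, 2], dtype=np.int32),
    },
    attrs={"STF_convention_version": "2.0"},
)

path = os.path.join(tempfile.mkdtemp(), "stf_sketch.nc")
# format="NETCDF3_CLASSIC" matches the data model reported by netCDF4 above;
# the encoding dict forces the on-disk dtype per variable.
ds.to_netcdf(path, format="NETCDF3_CLASSIC",
             encoding={"rain_obs": {"dtype": "float32"}})
```

Whether this alone is enough for full STF 2.0 compliance (attribute set, string handling for station_name) remains to be established.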
rain_cfdecode
<xarray.Dataset> Size: 19MB Dimensions: (time: 8396, station: 563, lead_time: 1, ens_member: 1) Coordinates: * time (time) int32 34kB 1 2 3 4 5 6 ... 8392 8393 8394 8395 8396 * station (station) int32 2kB 1 2 3 4 5 6 7 ... 558 559 560 561 562 563 * lead_time (lead_time) int32 4B 0 * ens_member (ens_member) int32 4B 1 Data variables: station_id (station) int32 2kB ... station_name (station) |S30 17kB b'18594010' b'19629011' ... b'28294677' lat (station) float32 2kB ... lon (station) float32 2kB ... area (station) float32 2kB ... rain_obs (time, ens_member, station, lead_time) float32 19MB ... Attributes: title: Precip from Hydro Tasmania's observation network... institution: CSIRO Land & Water source: catchment: Hydro Tas STF_convention_version: 2.0 STF_nc_spec: https://wiki.csiro.au/display/wirada/NetCDF+for+... comment: history: 2024-07-25 15:28:34 +10.0 - File created
time¶
For background, see issue https://jira.csiro.au/browse/WIRADA-635. We cannot have xarray automagically decode this axis, so we need to do the work manually, reusing as much as possible of the work already done. Not sure how I had figured out about CFDatetimeCoder
, but:
from xarray.coding import times
decod = times.CFDatetimeCoder(use_cftime=True)
decod.decode?
Signature: decod.decode(variable: 'Variable', name: 'T_Name' = None) -> 'Variable' Docstring: Convert an decoded variable to a encoded variable File: ~/src/efts-io/.venv/lib/python3.12/site-packages/xarray/coding/times.py Type: method
We need to pass a "Variable", not a DataArray:
type(rain_cfdecode.coords['time'])
xarray.core.dataarray.DataArray
TIME_DIMNAME="time"
var = xr.as_variable(rain_cfdecode.coords[TIME_DIMNAME])
time_zone = var.attrs["time_standard"]
time_coords = decod.decode(var, name=TIME_DIMNAME)
time_zone
'UTC'
timestamp = time_coords.values[0]
timestamp
cftime.DatetimeGregorian(2000, 11, 15, 23, 0, 0, 0, has_year_zero=False)
Date/time, calendar and time zone handling are a topic of underappreciated complexity, to put it mildly. Let's look at what we get here.
This type of time stamp is unfamiliar. It seems not to carry a time zone after the decoding operation, but it can hold one:
timestamp.tzinfo is None
True
Should our new time
axis hold time zone info with each time stamp, or still rely on the coordinate attribute time_standard
?
from efts_io.wrapper import cftimes_to_pdtstamps
new_time_values = cftimes_to_pdtstamps(
time_coords.values,
time_zone,
)
new_time_values
array([Timestamp('2000-11-15 23:00:00+0000', tz='UTC'), Timestamp('2000-11-16 23:00:00+0000', tz='UTC'), Timestamp('2000-11-17 23:00:00+0000', tz='UTC'), ..., Timestamp('2023-11-08 23:00:00+0000', tz='UTC'), Timestamp('2023-11-09 23:00:00+0000', tz='UTC'), Timestamp('2023-11-10 23:00:00+0000', tz='UTC')], dtype=object)
This may be a suitable time axis, though depending on usage needs we may want to revisit it. In particular, users may create "naive" date time stamps from strings: how would ds.sel()
then behave if the time stamps have time zones?
import pandas as pd
pd.Timestamp('2000-11-15 23:00:00+0000')
Timestamp('2000-11-15 23:00:00+0000', tz='UTC')
pd.Timestamp('2000-11-15 23:00:00+0000') == new_time_values[0]
True
pd.Timestamp('2000-11-15 23:00:00')
Timestamp('2000-11-15 23:00:00')
pd.Timestamp('2000-11-15 23:00:00') == new_time_values[0]
False
As expected, the naive date time is not equal to the one with a time zone. Using time zones in the time stamps may be a fraught choice in practice. In particular, behaviours may be logical but unintuitive if we use a time slice
to subset data.
See also GitHub issue 3.
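To illustrate the kind of behaviour a time slice could run into, a pandas-only sketch (independent of the package):

```python
import pandas as pd

# A tz-aware index, like the UTC time axis built above
idx = pd.date_range("2000-11-15 23:00", periods=3, freq="D", tz="UTC")
s = pd.Series([1.0, 2.0, 3.0], index=idx)

# Partial-string slicing with naive strings works: pandas interprets the
# strings as wall time in the index's time zone...
subset = s.loc["2000-11-15":"2000-11-16"]

# ...but a naive Timestamp is not considered a member of the index.
naive_in_index = pd.Timestamp("2000-11-15 23:00") in idx
```

So string-based slices remain convenient with a tz-aware axis, but exact lookups with naive timestamps silently miss.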
new_time_values = cftimes_to_pdtstamps(
time_coords.values,
None,
)
new_time_values
array([Timestamp('2000-11-15 23:00:00'), Timestamp('2000-11-16 23:00:00'), Timestamp('2000-11-17 23:00:00'), ..., Timestamp('2023-11-08 23:00:00'), Timestamp('2023-11-09 23:00:00'), Timestamp('2023-11-10 23:00:00')], dtype=object)
station_id¶
station_ids = rain_cfdecode.station_id.values
station_ids.dtype
dtype('int32')
station_ids[:3]
array([18594010, 19629011, 18562015], dtype=int32)
STF conventions are such that the station ID can only be an integer. We want a str
in the in-memory model. This is easy going in one direction; going the other way (writing to STF 2.0) will be trickier.
station_ids_str = [str(x) for x in station_ids]
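Going back the other way, a string ID must be representable as an integer; an alphanumeric ID such as "407113A" would not be. A sketch of what a validating reverse conversion might look like (the function name and behaviour are hypothetical, not part of the package):

```python
def station_ids_to_int(ids):
    """Convert string station IDs back to the integer IDs STF 2.0 requires.

    Raises ValueError for IDs that are not purely numeric, rather than
    silently corrupting them on write.
    """
    out = []
    for s in ids:
        if not s.isdigit():
            raise ValueError(f"station ID {s!r} cannot be stored as an STF integer ID")
        out.append(int(s))
    return out

ids_back = station_ids_to_int(["18594010", "19629011"])
```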
rain_cfdecode.station_id
<xarray.DataArray 'station_id' (station: 563)> Size: 2kB array([18594010, 19629011, 18562015, ..., 28286670, 28294676, 28294677], dtype=int32) Coordinates: * station (station) int32 2kB 1 2 3 4 5 6 7 8 ... 557 558 559 560 561 562 563 Attributes: long_name: station or node identification code
type(rain_cfdecode.station_id)
xarray.core.dataarray.DataArray
rain_cfdecode.station
<xarray.DataArray 'station' (station: 563)> Size: 2kB array([ 1, 2, 3, ..., 561, 562, 563], dtype=int32) Coordinates: * station (station) int32 2kB 1 2 3 4 5 6 7 8 ... 557 558 559 560 561 562 563
type(rain_cfdecode.station)
xarray.core.dataarray.DataArray
A key point here is that we will promote "station_id", which is a data variable, to a coordinate, so we cannot just assign dimensions; we will need to reconstruct a new xarray Dataset.
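On a simplified synthetic example at least, xarray's set_coords and swap_dims can perform this promotion without a full rebuild; whether this carries over to the real dataset (with the renaming and value changes we also need) is not yet established:

```python
import xarray as xr

# Synthetic stand-in: 'station_id' starts as a data variable over 'station'
ds = xr.Dataset(
    {
        "station_id": ("station", ["18594010", "19629011"]),
        "value": ("station", [1.0, 2.0]),
    },
    coords={"station": [1, 2]},
)

# Promote station_id to a coordinate, then make it the indexing dimension
ds2 = ds.set_coords("station_id").swap_dims({"station": "station_id"})
picked = ds2.sel(station_id="19629011")
```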
station_name¶
rain_cfdecode.station_name
<xarray.DataArray 'station_name' (station: 563)> Size: 17kB array([b'18594010', b'19629011', b'18562015', ..., b'28286670', b'28294676', b'28294677'], dtype='|S30') Coordinates: * station (station) int32 2kB 1 2 3 4 5 6 7 8 ... 557 558 559 560 561 562 563 Attributes: long_name: station or node name
x = b'18594010'
str(x, encoding="UTF-8")
'18594010'
Using helper functions already included in the package at the time of writing:
from efts_io.wrapper import byte_stations_to_str
station_name_str = byte_stations_to_str(rain_cfdecode.station_name.values)
station_name_str[:3]
array(['18594010', '19629011', '18562015'], dtype='<U8')
creating a new dataset¶
The package already includes a function to create a high-level xarray dataset:
from efts_io import wrapper as w
rain_cfdecode.ens_member.values
array([1], dtype=int32)
issue_times = new_time_values
station_ids = station_ids_str
lead_times = rain_cfdecode.lead_time.values
lead_time_tstep = "days"
ensemble_size = len(rain_cfdecode.ens_member.values)
station_names= station_name_str
nc_attributes = None
latitudes = rain_cfdecode.lat.values
longitudes = rain_cfdecode.lon.values
areas = rain_cfdecode.area.values
d = w.xr_efts(
issue_times,
station_ids,
lead_times,
lead_time_tstep,
ensemble_size,
station_names,
latitudes,
longitudes,
areas,
nc_attributes,
)
d.sizes
Frozen({'station': 563, 'time': 8396, 'ens_member': 1, 'lead_time': 1})
d.sel(station_id="18594010", drop=True)
<xarray.Dataset> Size: 67kB Dimensions: (time: 8396, ens_member: 1, lead_time: 1) Coordinates: * time (time) datetime64[ns] 67kB 2000-11-15T23:00:00 ... 2023-11-... * ens_member (ens_member) int64 8B 1 * lead_time (lead_time) int32 4B 0 Data variables: station_name <U8 32B '18594010' lat float32 4B -41.63 lon float32 4B 146.3 area float32 4B 8.743 Attributes: title: not provided institution: not provided catchment: not provided source: not provided comment: not provided history: not provided STF_convention_version: 2.0 STF_nc_spec: https://github.com/csiro-hydroinformatics/efts/b...
set(d.variables.keys())
{'area', 'ens_member', 'lat', 'lead_time', 'lon', 'station', 'station_id', 'station_name', 'time'}
set(rain_cfdecode.variables.keys())
{'area', 'ens_member', 'lat', 'lead_time', 'lon', 'rain_obs', 'station', 'station_id', 'station_name', 'time'}
rain_cfdecode.rain_obs.dims
('time', 'ens_member', 'station', 'lead_time')
da = rain_cfdecode.rain_obs
Assigning the data variable directly is not possible due to the differing names for the station-id coordinate(s): we'd end up with 5 dimensions:
d
<xarray.Dataset> Size: 114kB Dimensions: (station: 563, time: 8396, ens_member: 1, lead_time: 1) Coordinates: * time (time) datetime64[ns] 67kB 2000-11-15T23:00:00 ... 2023-11-... * station (station) int64 5kB 1 2 3 4 5 6 7 ... 558 559 560 561 562 563 * ens_member (ens_member) int64 8B 1 * lead_time (lead_time) int32 4B 0 * station_id (station) <U8 18kB '18594010' '19629011' ... '28294677' Data variables: station_name (station) <U8 18kB '18594010' '19629011' ... '28294677' lat (station) float32 2kB -41.63 -41.72 -41.78 ... -41.82 -41.85 lon (station) float32 2kB 146.3 146.4 146.2 ... 145.6 145.6 145.6 area (station) float32 2kB 8.743 8.05 24.13 ... 3.353 1.76 4.988 Attributes: title: not provided institution: not provided catchment: not provided source: not provided comment: not provided history: not provided STF_convention_version: 2.0 STF_nc_spec: https://github.com/csiro-hydroinformatics/efts/b...
d_tmp = d.copy()
d_tmp.station
<xarray.DataArray 'station' (station: 563)> Size: 5kB array([ 1, 2, 3, ..., 561, 562, 563]) Coordinates: * station (station) int64 5kB 1 2 3 4 5 6 7 ... 558 559 560 561 562 563 * station_id (station) <U8 18kB '18594010' '19629011' ... '28294677'
da.station
<xarray.DataArray 'station' (station: 563)> Size: 2kB array([ 1, 2, 3, ..., 561, 562, 563], dtype=int32) Coordinates: * station (station) int32 2kB 1 2 3 4 5 6 7 8 ... 557 558 559 560 561 562 563
da = da.assign_coords(d_tmp.station.coords)
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[51], line 1 ----> 1 da = da.assign_coords(d_tmp.station.coords) File ~/src/efts-io/.venv/lib/python3.12/site-packages/xarray/core/common.py:645, in DataWithCoords.assign_coords(self, coords, **coords_kwargs) 642 else: 643 results = self._calc_assign_results(coords_combined) --> 645 data.coords.update(results) 646 return data File ~/src/efts-io/.venv/lib/python3.12/site-packages/xarray/core/coordinates.py:554, in Coordinates.update(self, other) 546 # Discard original indexed coordinates prior to merge allows to: 547 # - fail early if the new coordinates don't preserve the integrity of existing 548 # multi-coordinate indexes 549 # - drop & replace coordinates without alignment (note: we must keep indexed 550 # coordinates extracted from the DataArray objects passed as values to 551 # `other` - if any - as those are still used for aligning the old/new coordinates) 552 coords_to_align = drop_indexed_coords(set(other_coords) & set(other), self) --> 554 coords, indexes = merge_coords( 555 [coords_to_align, other_coords], 556 priority_arg=1, 557 indexes=coords_to_align.xindexes, 558 ) 560 # special case for PandasMultiIndex: updating only its dimension coordinate 561 # is still allowed but depreciated. 562 # It is the only case where we need to actually drop coordinates here (multi-index levels) 563 # TODO: remove when removing PandasMultiIndex's dimension coordinate. 
564 self._drop_coords(self._names - coords_to_align._names) File ~/src/efts-io/.venv/lib/python3.12/site-packages/xarray/core/merge.py:556, in merge_coords(objects, compat, join, priority_arg, indexes, fill_value) 554 _assert_compat_valid(compat) 555 coerced = coerce_pandas_values(objects) --> 556 aligned = deep_align( 557 coerced, join=join, copy=False, indexes=indexes, fill_value=fill_value 558 ) 559 collected = collect_variables_and_indexes(aligned, indexes=indexes) 560 prioritized = _get_priority_vars_and_indexes(aligned, priority_arg, compat=compat) File ~/src/efts-io/.venv/lib/python3.12/site-packages/xarray/core/alignment.py:946, in deep_align(objects, join, copy, indexes, exclude, raise_on_invalid, fill_value) 943 else: 944 out.append(variables) --> 946 aligned = align( 947 *targets, 948 join=join, 949 copy=copy, 950 indexes=indexes, 951 exclude=exclude, 952 fill_value=fill_value, 953 ) 955 for position, key, aligned_obj in zip(positions, keys, aligned): 956 if key is no_key: File ~/src/efts-io/.venv/lib/python3.12/site-packages/xarray/core/alignment.py:882, in align(join, copy, indexes, exclude, fill_value, *objects) 686 """ 687 Given any number of Dataset and/or DataArray objects, returns new 688 objects with aligned indexes and dimension sizes. (...) 872 873 """ 874 aligner = Aligner( 875 objects, 876 join=join, (...) 
880 fill_value=fill_value, 881 ) --> 882 aligner.align() 883 return aligner.results File ~/src/efts-io/.venv/lib/python3.12/site-packages/xarray/core/alignment.py:573, in Aligner.align(self) 571 self.find_matching_indexes() 572 self.find_matching_unindexed_dims() --> 573 self.assert_no_index_conflict() 574 self.align_indexes() 575 self.assert_unindexed_dim_sizes_equal() File ~/src/efts-io/.venv/lib/python3.12/site-packages/xarray/core/alignment.py:318, in Aligner.assert_no_index_conflict(self) 314 if dup: 315 items_msg = ", ".join( 316 f"{k!r} ({v} conflicting indexes)" for k, v in dup.items() 317 ) --> 318 raise ValueError( 319 "cannot re-index or align objects with conflicting indexes found for " 320 f"the following {msg}: {items_msg}\n" 321 "Conflicting indexes may occur when\n" 322 "- they relate to different sets of coordinate and/or dimension names\n" 323 "- they don't have the same type\n" 324 "- they may be used to reindex data along common dimensions" 325 ) ValueError: cannot re-index or align objects with conflicting indexes found for the following dimensions: 'station' (2 conflicting indexes) Conflicting indexes may occur when - they relate to different sets of coordinate and/or dimension names - they don't have the same type - they may be used to reindex data along common dimensions
d_tmp['rain_obs'] = da
d_tmp
There is a DataArray.rename method to rename coordinates, but since we also change the values of the station
and station_id
coordinates, we need to do more work anyway.
# make sure we manipulate the 4D dataset: do not assume a certain order in the dimensions:
coordinates_mapping = {
"time": "time",
"station": "station_id",
"ens_member": "ens_member",
"lead_time": "lead_time",
}
list(coordinates_mapping.keys())
['time', 'station', 'ens_member', 'lead_time']
rain_obs = rain_cfdecode.rain_obs
rain_obs
<xarray.DataArray 'rain_obs' (time: 8396, ens_member: 1, station: 563, lead_time: 1)> Size: 19MB [4726948 values with dtype=float32] Coordinates: * time (time) int32 34kB 1 2 3 4 5 6 ... 8391 8392 8393 8394 8395 8396 * station (station) int32 2kB 1 2 3 4 5 6 7 ... 558 559 560 561 562 563 * lead_time (lead_time) int32 4B 0 * ens_member (ens_member) int32 4B 1 Attributes: standard_name: rain_obs long_name: observed rainfall units: mm type: 2.0 type_description: accumulated over the preceding interval dat_type: der dat_type_description: derived (from observations) location_type: area
d.station_id.attrs
{'long_name': 'station or node identification code'}
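The "more work" could start from a rename driven by the mapping above; a sketch on synthetic data (the real case also needs attributes carried over, so this is not the whole story):

```python
import numpy as np
import xarray as xr

# Synthetic 4D stand-in for rain_obs, with the same dimension order
da = xr.DataArray(
    np.zeros((2, 1, 3, 1), dtype=np.float32),
    dims=("time", "ens_member", "station", "lead_time"),
)

# Rename the station dimension and attach string IDs as its coordinate values
da2 = da.rename({"station": "station_id"}).assign_coords(
    station_id=["18594010", "19629011", "18562015"]
)
```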
time axis¶
axis = "hours since 2010-08-01 13:00:00 +0000"
The cell was left incomplete in the original notebook; a plausible completion (an assumption, mirroring the earlier decode call) decodes a small synthetic variable carrying these units:
decod.decode(xr.Variable(("time",), [0, 1, 2], attrs={"units": axis}), name="time")
import cftime
cftime.time2index