Sample code for log-likelihood calibration¶
from swift2.parameteriser import (
    concatenate_parameterisers,
    create_parameter_sampler,
    create_parameteriser,
    create_sce_termination_wila,
    extract_optimisation_log,
    get_default_sce_parameters,
    parameteriser_as_dataframe,
    sort_by_score,
)
from swift2.simulation import create_subarea
from swift2.utils import c, mk_full_data_id, paste0
from swift2.doc_helper import get_free_params, sample_series, set_loglik_param_keys
import pandas as pd
import numpy as np
from swift2.vis import OptimisationPlots
import xarray as xr
from cinterop.timeseries import as_timestamp
s = as_timestamp('1990-01-01')
e = as_timestamp('2005-12-31')
rain = sample_series('MMH', 'rain')[slice(s, e)]
evap = sample_series('MMH', 'evap')[slice(s, e)]
flow = sample_series('MMH', 'flow')[slice(s, e)]
rain.describe()
count    5844.000000
mean        3.545405
std         7.737554
min         0.000000
25%         0.000000
50%         0.283600
75%         3.308775
max        97.645500
dtype: float64
flow.describe()
count    5844.000000
mean       -1.993059
std        16.361702
min       -99.999000
25%         0.194400
50%         0.438400
75%         0.900200
max        17.221100
dtype: float64
We need to adjust the observed flow, as the SWIFTv1 legacy missing value code is -99.
flow[flow < 0] = np.nan
flow
1990-01-01    0.2577
1990-01-02    0.2459
1990-01-03    0.2374
1990-01-04    0.2218
1990-01-05    0.2127
               ...
2005-12-27    0.3477
2005-12-28    0.3314
2005-12-29    0.3333
2005-12-30    0.3066
2005-12-31    0.2896
Length: 5844, dtype: float64
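As a side check (plain pandas, not part of the swift2 workflow), we can count how many observations were flagged as missing by that substitution:
# Count the flow observations that were set to NaN (formerly the -99 sentinel value)
print(int(flow.isna().sum()), "missing values out of", len(flow))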
Catchment setup¶
Let's create a single catchment setup, using daily data. We need to specify the simulation time step to be consistent with the daily input data.
ms = create_subarea('GR4J', 1.0)
from cinterop.timeseries import xr_ts_start, xr_ts_end
s = xr_ts_start(rain)
e = xr_ts_end(rain)
ms.set_simulation_span(s, e)
ms.set_simulation_time_step('daily')
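Since the time step must be consistent with the forcing data, a quick pandas-only sanity check (not part of the swift2 API) can confirm that the sample series are contiguous daily series:
# Sanity check: the rainfall series should be strictly daily over the simulation period
steps = rain.index.to_series().diff().dropna()
print((steps == pd.Timedelta(days=1)).all(), len(rain), "time steps")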
Assign the input time series to the subarea:
sa_name = ms.get_subarea_names()[0]
ms.play_subarea_input(rain, sa_name, "P")
ms.play_subarea_input(evap, sa_name, "E")
Model variable identifiers are hierarchical, with both '.' and '|' supported as separators. The "dot" notation should now be preferred, as some R functions producing data frames may change the variable names and replace some characters with '.'.
sa_id = paste0("subarea.", sa_name)
root_id = paste0(sa_id, ".")
print(ms.get_variable_ids(sa_id))
['subarea.Subarea.areaKm2', 'subarea.Subarea.P', 'subarea.Subarea.E', 'subarea.Subarea.runoff', 'subarea.Subarea.S', 'subarea.Subarea.R', 'subarea.Subarea.Ps', 'subarea.Subarea.Es', 'subarea.Subarea.Pr', 'subarea.Subarea.ech1', 'subarea.Subarea.ech2', 'subarea.Subarea.Perc', 'subarea.Subarea.x1', 'subarea.Subarea.x2', 'subarea.Subarea.x3', 'subarea.Subarea.x4', 'subarea.Subarea.UHExponent', 'subarea.Subarea.PercFactor', 'subarea.Subarea.OutflowVolume', 'subarea.Subarea.OutflowRate']
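As a quick illustration of the two separators mentioned above, the two identifiers built below should name the same model state; the dotted spelling is the preferred one:
# Build the same identifier with the two supported separators ('.' preferred over '|')
dotted_id = paste0(root_id, 'S')
piped_id = dotted_id.replace('.', '|')
print(dotted_id, piped_id)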
gr4_state_names = paste0(root_id, c('runoff', 'S', 'R', 'Perc'))
for name in gr4_state_names:
    ms.record_state(name)
Let's check that one simulation runs fine, before we build a calibration definition.
ms.exec_simulation()
sState = ms.get_recorded(gr4_state_names[2])
sState.plot(figsize=(10,4))
Let's build the objective calculator that will guide the calibration process:
w = pd.Timestamp("1992-01-01")  # objective calculated from 1992 onwards; 1990-1991 effectively serves as a warm-up period
runoff_depth_varname = 'subarea.Subarea.runoff'
mod_runoff = ms.get_recorded(runoff_depth_varname)
# zoo::index(flow) = zoo::index(mod_runoff)
objective = ms.create_objective(runoff_depth_varname, flow, 'log-likelihood', w, e)
mod_runoff.plot()
Parameterisation¶
Define the feasible parameter space, using a generic parameter set for the model parameters. This is 'wrapped' by a log-likelihood parameter set which adds the extra parameters used in the log-likelihood calculation, and exposes all the parameters as 8 independent degrees of freedom to the optimiser.
pspec_gr4j = get_free_params('GR4J')
pspec_gr4j.Value = c(542.1981111, -0.4127542, 7.7403390, 1.2388548)
pspec_gr4j.Min = c(1,-30, 1,1)
pspec_gr4j.Max = c(3000, 30, 1000, 240)
pspec_gr4j.Name = paste0(root_id, pspec_gr4j.Name)
maxobs = np.max(flow)
p = create_parameteriser(type='Generic', specs=pspec_gr4j)
set_loglik_param_keys(a='a', b='b', m='m', s='s', ct="ct", censopt='censopt')
censor_threshold = maxobs / 100 # TBC
censopt = 0.0
loglik = create_parameteriser(type='no apply')
loglik.add_to_hypercube(
    pd.DataFrame({
        "Name": c('b', 'm', 's', 'a', 'maxobs', 'ct', 'censopt'),
        "Min": c(-30, 0, 1, -30, maxobs, censor_threshold, censopt),
        "Max": c(0, 0, 1000, 1, maxobs, censor_threshold, censopt),
        "Value": c(-7, 0, 100, -10, maxobs, censor_threshold, censopt),
    })
)
p = concatenate_parameterisers(p, loglik)
p.as_dataframe()
| | Name | Value | Min | Max |
|---|---|---|---|---|
| 0 | subarea.Subarea.x1 | 542.198111 | 1.000000 | 3000.000000 |
| 1 | subarea.Subarea.x2 | -0.412754 | -30.000000 | 30.000000 |
| 2 | subarea.Subarea.x3 | 7.740339 | 1.000000 | 1000.000000 |
| 3 | subarea.Subarea.x4 | 1.238855 | 1.000000 | 240.000000 |
| 4 | b | -7.000000 | -30.000000 | 0.000000 |
| 5 | m | 0.000000 | 0.000000 | 0.000000 |
| 6 | s | 100.000000 | 1.000000 | 1000.000000 |
| 7 | a | -10.000000 | -30.000000 | 1.000000 |
| 8 | maxobs | 17.221100 | 17.221100 | 17.221100 |
| 9 | ct | 0.172211 | 0.172211 | 0.172211 |
| 10 | censopt | 0.000000 | 0.000000 | 0.000000 |
Check that the objective calculator works, at least with the default values in the feasible parameter space:
score = objective.get_score(p)
print(score)
{'scores': {'Log-likelihood': -373767.46492496354},
 'sysconfig':                   Name       Value        Min          Max
 0   subarea.Subarea.x1  542.198111   1.000000  3000.000000
 1   subarea.Subarea.x2   -0.412754 -30.000000    30.000000
 2   subarea.Subarea.x3    7.740339   1.000000  1000.000000
 3   subarea.Subarea.x4    1.238855   1.000000   240.000000
 4                    b   -7.000000 -30.000000     0.000000
 5                    m    0.000000   0.000000     0.000000
 6                    s  100.000000   1.000000  1000.000000
 7                    a  -10.000000 -30.000000     1.000000
 8               maxobs   17.221100  17.221100    17.221100
 9                   ct    0.172211   0.172211     0.172211
 10             censopt    0.000000   0.000000     0.000000}
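The score is returned as a dictionary-like structure, as shown above, so the numeric value can be pulled out directly if needed:
# Extract just the log-likelihood value from the returned score
print(score['scores']['Log-likelihood'])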
mod_runoff = ms.get_recorded(runoff_depth_varname)
from swift2.vis import plot_two_series
plot_two_series(flow, mod_runoff, ylab="obs/mod runoff", start_time = "2000-01-01", end_time = "2002-12-31", names=['observed','modelled'])
Calibration¶
Build the optimiser definition and instrument it with a calibration logger.
# term = getMaxRuntimeTermination(max_hours = 0.3/60) # ~20 second appears enough with SWIFT binaries in Release mode
# term = getMarginalTermination(tolerance = 1e-06, cutoff_no_improvement = 10, max_hours = 0.3/60)
term = create_sce_termination_wila('relative standard deviation', c('0.005',str(1/60)))
sce_params = get_default_sce_parameters()
urs = create_parameter_sampler(0, p, 'urs')
optimiser = objective.create_sce_optim_swift(term, sce_params, urs)
calib_logger = optimiser.set_calibration_logger('')
%%time
calib_results = optimiser.execute_optimisation()
CPU times: user 2min 28s, sys: 824 ms, total: 2min 29s
Wall time: 29.5 s
opt_log = extract_optimisation_log(optimiser, fitness_name = 'Log-likelihood')
geom_ops = opt_log.subset_by_message(pattern= 'Initial.*|Reflec.*|Contrac.*|Add.*')
import matplotlib.pyplot as plt
# Axis limits for the parameter evolution plots: the lower bound is set to the
# median log-likelihood so that the worst scores do not stretch the y-axis.
ll_max = max(geom_ops._data['Log-likelihood'].values)
ll_min = np.median(geom_ops._data['Log-likelihood'].values)
Parameter plots¶
p_var_ids = p.as_dataframe().Name.values
v = OptimisationPlots(geom_ops)
for pVar in p_var_ids:
    g = v.parameter_evolution(pVar, obj_lims=[ll_min, ll_max])
    plt.gcf().set_size_inches(10, 8)
Finally, get a visual of the runoff time series simulated with the best known parameter set (the penultimate entry in the data frame holding the log of the calibration process).
sortedResults = sort_by_score(calib_results, 'Log-likelihood')
sortedResults.as_dataframe().head().T
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Log-likelihood | 3345.590909 | 3345.441572 | 3344.701720 | 3344.695334 | 3344.662174 |
| subarea.Subarea.x1 | 104.759827 | 103.692024 | 103.117646 | 102.709046 | 103.336997 |
| subarea.Subarea.x2 | -29.885853 | -29.970468 | -29.956270 | -29.932794 | -29.930185 |
| subarea.Subarea.x3 | 757.359331 | 754.030607 | 755.337493 | 754.940421 | 754.703053 |
| subarea.Subarea.x4 | 1.001295 | 1.001934 | 1.002566 | 1.001787 | 1.002196 |
| b | -1.860598 | -1.782860 | -1.936318 | -1.834915 | -1.876361 |
| m | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| s | 1.200191 | 1.131335 | 1.282491 | 1.180318 | 1.209338 |
| a | -10.214940 | -10.155087 | -10.362628 | -10.169567 | -10.199336 |
| maxobs | 17.221100 | 17.221100 | 17.221100 | 17.221100 | 17.221100 |
| ct | 0.172211 | 0.172211 | 0.172211 | 0.172211 | 0.172211 |
| censopt | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
best_pset = calib_results.get_best_score('Log-likelihood').parameteriser
best_pset.apply_sys_config(ms)
ms.exec_simulation()
mod_runoff = ms.get_recorded(runoff_depth_varname)
# joki::plot_two_series(flow, mod_runoff, ylab="obs/mod runoff", startTime = start(flow), endTime = end(flow))
mod_runoff
<xarray.DataArray (variable_identifiers: 1, ensemble: 1, time: 5844)>
array([[[0.        , 0.        , 0.        , ..., 0.39889484, 0.39177446,
         0.38486137]]])
Coordinates:
  * ensemble              (ensemble) int64 0
  * time                  (time) datetime64[ns] 1990-01-01 ... 2005-12-31
  * variable_identifiers  (variable_identifiers) object 'subarea.Subarea.runoff'
mod_runoff.squeeze(drop=True).sel(time=slice(e - pd.offsets.DateOffset(years=1), e)).plot(figsize=(16,9))
plot_two_series(flow, mod_runoff, ylab="obs/mod runoff", start_time = "2000-01-01", end_time = "2002-12-31", names=['observed','modelled'])