Functions to augment the user’s dataset with information from official sources.
gingado provides data augmentation functionalities that help users augment their datasets along the time series dimension. This can be done either on a stand-alone basis, with the user incorporating new data on top of the original dataset, or as part of a scikit-learn Pipeline that also includes other steps like data transformation and model estimation.
Data augmentation with SDMX
The Statistical Data and Metadata eXchange (SDMX) is an ISO standard comprising:
technical standards
statistical guidelines, including cross-domain concepts and codelists
an IT architecture and tools
SDMX is sponsored by the Bank for International Settlements, European Central Bank, Eurostat, International Monetary Fund, Organisation for Economic Co-operation and Development, United Nations, and World Bank Group.
More information about SDMX is available on its webpage.
gingado uses SDMX to augment user datasets through the transformer AugmentSDMX.
For example, the code below is a simple illustration of AugmentSDMX augmentation under two scenarios: without a variance threshold (ie, including all data regardless of whether they are constant) or with a relatively high variance threshold (such that no data is actually added).
In both cases, the object uses the default dataflow: the daily series of monetary policy rates set by central banks.
These AugmentSDMX objects are used to augment a data frame with simulated data for illustrative purposes. In real life, this would be the user's original data.
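The sketch below illustrates both scenarios. It assumes a simulated daily dataset and the import path gingado.augmentation; the threshold value of 10.0 matches the message shown below, but the other argument values are illustrative.

import numpy as np
import pandas as pd
from gingado.augmentation import AugmentSDMX

# simulated daily data standing in for the user's original dataset
rng = np.random.default_rng(seed=42)
idx = pd.date_range("2020-01-01", periods=100, freq="D")
X = pd.DataFrame({"simulated_feature": rng.normal(size=100)}, index=idx)

# scenario 1: no variance threshold, so all downloaded series are kept
X_augmented = AugmentSDMX().fit_transform(X)

# scenario 2: a high variance threshold filters out every downloaded series,
# leaving the original data unchanged
X_unchanged = AugmentSDMX(variance_threshold=10.0).fit_transform(X)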
Querying data from BIS's dataflow 'WS_CBPOL' - Policy rate...
No columns added to original data because no feature in x meets the variance threshold 10.00000
AugmentSDMX
A transformer that augments a dataset using SDMX data.
Attributes:
sources (dict): A dictionary with sources as keys and dataflows as values.
variance_threshold (float | None): Variables with lower variance through time are removed if specified. Otherwise, all variables are kept.
propagate_last_known_value (bool): Whether to propagate the last known non-NA value to following dates.
fillna (float | int): Value to use to fill missing data.
verbose (bool): Whether to inform the user about the process progress.
fit (self, X: 'pd.Series | pd.DataFrame', y: 'None' = None)
Fits the instance of AugmentSDMX to `X`, learning its time series frequency.
Args:
X (pd.Series | pd.DataFrame): Data having an index of `datetime` type.
y (None): `y` is kept as an argument for API consistency only.
Returns:
AugmentSDMX: A fitted version of the same AugmentSDMX instance.
transform (self, X: 'pd.Series | pd.DataFrame', y: 'None' = None, training: 'bool' = False)
Transforms input dataset `X` by adding the requested data using SDMX.
Args:
X (pd.Series | pd.DataFrame): Data having an index of `datetime` type.
y (None): `y` is kept as an argument for API consistency only.
training (bool): `True` if `transform` is called during training, `False` (default) if called during testing.
Returns:
np.ndarray: `X` augmented with data from SDMX with the same number of samples but more columns.
fit_transform (self, X: 'pd.Series | pd.DataFrame', y: 'None' = None)
Fit to data, then transform it.
Args:
X (pd.Series | pd.DataFrame): Data having an index of `datetime` type.
y (None): `y` is kept as an argument for API consistency only.
Returns:
np.ndarray: `X` augmented with data from SDMX with the same number of samples but more columns.
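For reference, a minimal instantiation covering the parameters documented above might look as follows; the argument values are illustrative only, not necessarily the library's defaults.

from gingado.augmentation import AugmentSDMX

aug = AugmentSDMX(
    sources={'BIS': 'WS_CBPOL'},        # source(s) and dataflow(s) to query
    variance_threshold=None,            # keep all downloaded series
    propagate_last_known_value=True,    # carry the last non-NA value forward
    fillna=0,                           # value used for any remaining missing data
    verbose=True,                       # report progress while downloading
)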
Compatibility with scikit-learn
As mentioned above, gingado’s transformers are built to be compatible with scikit-learn. The code below demonstrates this compatibility.
First, we create the example dataset. In this case, it comprises the daily foreign exchange rate of selected currencies to the Euro. The Brazilian Real (BRL) is chosen for this example as the dependent variable.
from gingado.utils import load_SDMX_data, Lag
from sklearn.model_selection import TimeSeriesSplit

X = load_SDMX_data(
    sources={'ECB': 'EXR'},
    keys={'FREQ': 'D', 'CURRENCY': ['EUR', 'AUD', 'BRL', 'CAD', 'CHF', 'GBP', 'JPY', 'SGD', 'USD']},
    params={"startPeriod": 2003}
)
# drop rows with empty values
X.dropna(inplace=True)
# adjust column names in this simple example for ease of understanding:
# remove parts related to source and dataflow names
X.columns = X.columns.str.replace("ECB__EXR_D__", "").str.replace("__EUR__SP00__A", "")
X = Lag(lags=1, jump=0, keep_contemporaneous_X=True).fit_transform(X)
y = X.pop('BRL')
# retain only the lagged variables in the X variable
X = X[X.columns[X.columns.str.contains('_lag_')]]
Querying data from ECB's dataflow 'EXR' - Exchange Rates...
Next, the data augmentation object provided by gingado adds more data. In this case, for brevity, only one dataflow from one source is listed. Users who want to add more SDMX sources can simply add more keys to the dictionary. And users who want data from all dataflows of a given source (provided the keys and parameters such as frequency and dates match) can set the value to 'all', as in {'ECB': ['CISS'], 'BIS': 'all'}.
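The sketch below illustrates this step, assuming a time-ordered train/test split of the exchange rate data built above. The variable name X_train__fit_transform matches the code further below, but the single-observation test split is an assumption.

from gingado.augmentation import AugmentSDMX
from sklearn.model_selection import train_test_split

# keep the time ordering: the last observation becomes the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1, shuffle=False)

aug_sdmx = AugmentSDMX(sources={'ECB': 'CISS'})
X_train__fit_transform = aug_sdmx.fit_transform(X_train)
X_test__transform = aug_sdmx.transform(X_test)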
Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
This is the dataset now after this particular augmentation:
print(f"No of columns: {len(X_train__fit_transform.columns)}{X_train__fit_transform.columns}")X_train__fit_transform
Tuning the data augmentation to enhance model performance
And since AugmentSDMX can be included in a Pipeline, it can also be fine-tuned by parameter search techniques (such as grid search), further helping users make the best of available data to enhance the performance of their models, as sketched below.
Tip
Users can cache the data augmentation step to avoid repeating potentially lengthy data downloads. See the memory argument in the sklearn.pipeline.Pipeline documentation.
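A sketch of such a pipeline and grid search follows. The step names ('augmentation', 'imp', 'forest') and the two candidates, AugmentSDMX versus 'passthrough', mirror the logs below; the choice of imputer and estimator is an assumption.

from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('augmentation', AugmentSDMX(sources={'ECB': 'CISS'})),
    ('imp', SimpleImputer()),
    ('forest', RandomForestRegressor()),
], verbose=True)

# 'passthrough' skips the augmentation step, so the search compares
# the model with and without the SDMX data
grid = GridSearchCV(
    pipeline,
    param_grid={'augmentation': ['passthrough', AugmentSDMX(sources={'ECB': 'CISS'})]},
    cv=TimeSeriesSplit(n_splits=2),
    verbose=2,
).fit(X_train, y_train)

y_pred_grid = grid.predict(X_test)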
Fitting 2 folds for each of 2 candidates, totalling 4 fits
[Pipeline] ...... (step 1 of 3) Processing augmentation, total= 0.0s
[Pipeline] ............... (step 2 of 3) Processing imp, total= 0.0s
[Pipeline] ............ (step 3 of 3) Processing forest, total= 1.6s
[CV] END ...........................augmentation=passthrough; total time= 1.7s
[Pipeline] ...... (step 1 of 3) Processing augmentation, total= 0.0s
[Pipeline] ............... (step 2 of 3) Processing imp, total= 0.0s
[Pipeline] ............ (step 3 of 3) Processing forest, total= 3.4s
[CV] END ...........................augmentation=passthrough; total time= 3.5s
Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
[Pipeline] ...... (step 1 of 3) Processing augmentation, total= 8.4s
[Pipeline] ............... (step 2 of 3) Processing imp, total= 0.4s
[Pipeline] ............ (step 3 of 3) Processing forest, total= 5.1s
Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
[CV] END ..augmentation=AugmentSDMX(sources={'ECB': 'CISS'}); total time= 21.3s
Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
[Pipeline] ...... (step 1 of 3) Processing augmentation, total= 15.0s
[Pipeline] ............... (step 2 of 3) Processing imp, total= 0.6s
[Pipeline] ............ (step 3 of 3) Processing forest, total= 11.1s
Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
[CV] END ..augmentation=AugmentSDMX(sources={'ECB': 'CISS'}); total time= 40.7s
[Pipeline] ...... (step 1 of 3) Processing augmentation, total= 0.0s
[Pipeline] ............... (step 2 of 3) Processing imp, total= 0.0s
[Pipeline] ............ (step 3 of 3) Processing forest, total= 5.3s
grid.best_params_
{'augmentation': 'passthrough'}
print(f"In this particular case, the best model was achieved by {'not 'if grid.best_params_['augmentation'] =='passthrough'else''}using the data augmentation.")
In this particular case, the best model was achieved by not using the data augmentation.
print(f"The last value in the training dataset was {y_train.tail(1).to_numpy()}. The predicted value was {y_pred_grid}, and the actual value was {y_test.to_numpy()}.")
The last value in the training dataset was [6.1749]. The predicted value was [6.178231], and the actual value was [6.1328].
Sources of data
gingado deliberately seeks to list only reliable data sources, with a focus on official ones. This is meant to give users confidence that their datasets will be complemented by data from reliable sources. Unfortunately, it is not possible at this stage to include all official sources, given the substantial manual and maintenance work this would require. gingado leverages the Statistical Data and Metadata eXchange (SDMX), an initiative of official data sources that establishes common data and metadata formats, to download data that is relevant (and hopefully also useful) to users.
The function list_SDMX_sources returns a list of codes corresponding to the data sources available to provide gingado users with data through SDMX.
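A minimal usage example (output omitted for brevity):

from gingado.utils import list_SDMX_sources

list_SDMX_sources()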
You can also see what the available dataflows are. The code below returns a dictionary where each key is the code for an SDMX source, and the values associated with each key are the code and name for the respective dataflows.
from gingado.utils import list_all_dataflows
dflows = list_all_dataflows()
dflows
ABS ABORIGINAL_POP_PROJ Projected population, Aboriginal and Torres St...
ABORIGINAL_POP_PROJ_REMOTE Projected population, Aboriginal and Torres St...
ABS_ABORIGINAL_POPPROJ_INDREGION Projected population, Aboriginal and Torres St...
ABS_ACLD_LFSTATUS Australian Census Longitudinal Dataset (ACLD):...
ABS_ACLD_TENURE Australian Census Longitudinal Dataset (ACLD):...
...
UNSD DF_UNData_UNFCC SDMX_GHG_UNDATA
WB DF_WITS_Tariff_TRAINS WITS - UNCTAD TRAINS Tariff Data
DF_WITS_TradeStats_Development WITS TradeStats Devlopment
DF_WITS_TradeStats_Tariff WITS TradeStats Tariff
DF_WITS_TradeStats_Trade WITS TradeStats Trade
Name: dataflow, Length: 24650, dtype: object
For example, the dataflows from the World Bank are:
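Since dflows is a pandas Series whose first index level is the source code, selecting the 'WB' level returns the World Bank dataflows (a sketch, assuming the index structure shown in the listing above):

dflows['WB']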