Utils

Functions to support the use of gingado

Support for model documentation

get_datetime

get_datetime ()

Returns the current date and time as a string.

d = get_datetime()
assert isinstance(d, str)
assert len(d) > 0
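For illustration, a timestamp string of this kind can be produced with the standard library alone. The helper name `timestamp_now` and the ISO 8601 format are assumptions of this sketch; the exact format returned by get_datetime is not stated on this page.

```python
from datetime import datetime, timezone


def timestamp_now() -> str:
    """Hypothetical stand-in for get_datetime: current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


stamp = timestamp_now()

# The string can be parsed back into a timezone-aware datetime
parsed = datetime.fromisoformat(stamp)
print(stamp)
```

Storing the time zone in the string keeps timestamps comparable when documentation is generated on machines in different locations.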

read_attr

read_attr (obj)

Reads and yields the type and values of fitted attributes from an object.

Args:
    obj: Object from which attributes will be read.

The function read_attr helps gingado Documenters read the object behind the scenes.

It collects the type of estimator and any attributes that result from fitting an object (i.e., those whose names end in “_” but not in a double underscore).

For example, the attributes of an untrained and a trained random forest are, in sequence:

from sklearn.ensemble import RandomForestRegressor

rf_unfit = RandomForestRegressor(n_estimators=3)
rf_fit = RandomForestRegressor(n_estimators=3)\
    .fit([[1, 0], [0, 1]], [[0.5], [0.5]])  # arbitrary toy data
list(read_attr(rf_unfit)), list(read_attr(rf_fit))

([{'_estimator_type': 'regressor'}],
 [{'_estimator_type': 'regressor'},
  {'estimator_': DecisionTreeRegressor()},
  {'estimators_': [DecisionTreeRegressor(max_features=1.0, random_state=313810551),
    DecisionTreeRegressor(max_features=1.0, random_state=1429526222),
    DecisionTreeRegressor(max_features=1.0, random_state=1916612900)]},
  {'estimators_samples_': [array([0, 1], dtype=int32),
    array([0, 0], dtype=int32),
    array([1, 1], dtype=int32)]},
  {'feature_importances_': array([0., 0.])},
  {'n_features_in_': 2},
  {'n_outputs_': 1}])
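The naming rule described above can be sketched in plain Python. The helper `fitted_attrs` and the class `ToyMean` are hypothetical and exist only for this illustration; unlike read_attr, the sketch does not report the estimator type and only looks at instance attributes.

```python
def fitted_attrs(obj):
    """Hypothetical sketch: yield {name: value} for attributes ending in a single underscore."""
    for name in sorted(vars(obj)):
        # Keep "mean_" but skip dunder-style names such as "__dict__"
        if name.endswith("_") and not name.endswith("__"):
            yield {name: getattr(obj, name)}


class ToyMean:
    """Toy estimator that follows the trailing-underscore convention for fitted attributes."""

    def __init__(self, shift=0.0):
        self.shift = shift  # hyperparameter: no trailing underscore

    def fit(self, values):
        self.mean_ = sum(values) / len(values) + self.shift
        self.n_samples_ = len(values)
        return self


before = list(fitted_attrs(ToyMean()))
after = list(fitted_attrs(ToyMean().fit([1.0, 2.0, 3.0])))
print(before)
print(after)
```

As with the random forest above, the unfitted object exposes no fitted attributes, while the fitted one exposes those created inside `fit`.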

Support for time series

Objects of the class Lag are similar to scikit-learn’s transformers.

Lag

Lag (lags=1, jump=0, keep_contemporaneous_X=False)

A transformer for lagging variables.

Args:
    lags (int): The number of lags to apply.
    jump (int): The number of initial observations to skip before applying the lag.
    keep_contemporaneous_X (bool): Whether to keep the contemporaneous values of X in the output.

fit

fit (self, X: numpy.ndarray, y=None)

Fits the Lag transformer.

Args:
    X (np.ndarray): Array-like data of shape (n_samples, n_features).
    y: Array-like data of shape (n_samples,) or (n_samples, n_targets) or None.
    
Returns:
    self: A fitted version of the `Lag` instance.

transform

transform (self, X: numpy.ndarray)

Applies the lag transformation to the dataset `X`.

Args:
    X (np.ndarray): Array-like data of shape (n_samples, n_features).
    
Returns:
    A lagged version of `X`.

fit_transform

fit_transform (self, X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to `X` and `y` with optional parameters `fit_params`
and returns a transformed version of `X`.

Parameters
----------
X : array-like of shape (n_samples, n_features)
    Input samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
    Target values (None for unsupervised transformations).

**fit_params : dict
    Additional fit parameters.

Returns
-------
X_new : ndarray of shape (n_samples, n_features_new)
    Transformed array.

The code below demonstrates how Lag works in practice. Note in particular that, because Lag is a transformer, it can be used as a step in a scikit-learn Pipeline.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

randomX = np.random.rand(15, 2)
randomY = np.random.rand(15)

lags = 3
jump = 2

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lagger', Lag(lags=lags, jump=jump, keep_contemporaneous_X=False))
]).fit_transform(randomX, randomY)

Below we confirm that the lagger removes the correct number of rows corresponding to the lagged observations:

assert randomX.shape[0] - lags - jump == pipe.shape[0]
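The row arithmetic in the assertion above can be reproduced with a pure-Python sketch. The function `lag_rows` is hypothetical and reflects one plausible reading of `lags` and `jump` (skip the `jump` most recent observations, then stack `lags` earlier ones); it is not gingado's implementation, and the column order produced by Lag may differ.

```python
def lag_rows(rows, lags=1, jump=0):
    """Hypothetical sketch: for each time t, stack the rows at t-jump-1, ..., t-jump-lags."""
    first_usable = lags + jump  # earlier rows lack a full lag history
    lagged = []
    for t in range(first_usable, len(rows)):
        stacked = []
        for k in range(jump + 1, jump + lags + 1):
            stacked.extend(rows[t - k])
        lagged.append(stacked)
    return lagged


# 15 observations of 2 features, mirroring the shape of randomX
rows = [[t, 10 * t] for t in range(15)]
lagged = lag_rows(rows, lags=3, jump=2)

print(len(lagged), len(lagged[0]))  # 15 - 3 - 2 rows, 3 lags x 2 features
print(lagged[0])
```

The first `lags + jump` observations are dropped because they do not have enough history, which is why the output has 10 rows.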

And because Lag is a transformer, its parameters (lags and jump) can be calibrated using hyperparameter tuning to achieve the best performance for a model.
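As a rough sketch of what such a search space looks like, the snippet below enumerates candidate settings using scikit-learn's `<step>__<parameter>` naming, with the step name `lagger` taken from the pipeline above. The candidate values are arbitrary, and the snippet only lists the combinations; it does not run a search.

```python
from itertools import product

# Candidate values are illustrative assumptions, not recommendations
param_grid = {
    "lagger__lags": [1, 2, 3],
    "lagger__jump": [0, 1, 2],
}

# Expand the grid into one dict per candidate, as a grid search would
names = list(param_grid)
candidates = [dict(zip(names, combo)) for combo in product(*param_grid.values())]

# Each candidate leaves a different number of usable rows out of 15 samples
n_samples = 15
rows_left = {
    (c["lagger__lags"], c["lagger__jump"]): n_samples - c["lagger__lags"] - c["lagger__jump"]
    for c in candidates
}
print(len(candidates), rows_left[(3, 2)])
```

Because each candidate changes how many rows survive the transformation, larger values of `lags` and `jump` leave less data for the downstream model.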

Support for data augmentation with SDMX

Note

Working with SDMX may take a few minutes, depending on the amount of data you are downloading.

list_SDMX_sources

list_SDMX_sources ()

Fetches the list of SDMX sources.

Returns:
    The list of codes representing the SDMX sources available for data download.

sources = list_SDMX_sources()
print(sources)

assert len(sources) > 0
# all elements are of type 'str'
assert all(isinstance(src, str) for src in sources)

['ABS', 'ABS_JSON', 'AR1', 'BBK', 'BIS', 'COMP', 'ECB', 'EMPL', 'ESTAT', 'ESTAT3', 'ESTAT_COMEXT', 'GROW', 'ILO', 'IMF', 'IMF_beta', 'IMF_beta3', 'INEGI', 'INSEE', 'ISTAT', 'LSD', 'NB', 'NBB', 'OECD', 'OECD_JSON', 'SGR', 'SPC', 'STAT_EE', 'StatCan', 'UNESCO', 'UNICEF', 'UNSD', 'UY110', 'WB', 'WB_WDI']

list_all_dataflows

list_all_dataflows (codes_only: bool = False, return_pandas: bool = True)

Lists all SDMX dataflows. Note: when passing the result as a parameter to an `AugmentSDMX` object
or to the `load_SDMX_data` function, set `codes_only=True`.

Args:
    codes_only (bool): Whether to return only the dataflow codes.
    return_pandas (bool): Whether to return the result in a pandas DataFrame format.
    
Returns:
    All available dataflows for all SDMX sources.

dflows = list_all_dataflows(return_pandas=False)

assert isinstance(dflows, dict)
all_sources = list_SDMX_sources()
assert all(s in all_sources for s in dflows)

By default, list_all_dataflows returns a pandas Series, which makes it easy for users to explore the available data:

import pandas as pd

dflows = list_all_dataflows(return_pandas=True)
assert isinstance(dflows, pd.Series)

dflows

ABS    ABORIGINAL_POP_PROJ                 Projected population, Aboriginal and Torres St...
       ABORIGINAL_POP_PROJ_REMOTE          Projected population, Aboriginal and Torres St...
       ABS_ABORIGINAL_POPPROJ_INDREGION    Projected population, Aboriginal and Torres St...
       ABS_ACLD_LFSTATUS                   Australian Census Longitudinal Dataset (ACLD):...
       ABS_ACLD_TENURE                     Australian Census Longitudinal Dataset (ACLD):...
                                                                 ...                        
UY110  DF_TTT_SEXO_AREA_REF                Tasa de Teletrabajo por sexo y area de referencia
WB     DF_WITS_Tariff_TRAINS                                WITS - UNCTAD TRAINS Tariff Data
       DF_WITS_TradeStats_Development                             WITS TradeStats Devlopment
       DF_WITS_TradeStats_Tariff                                      WITS TradeStats Tariff
       DF_WITS_TradeStats_Trade                                        WITS TradeStats Trade
Name: dataflow, Length: 33154, dtype: object

Passing codes_only=True returns only the dataflow codes, and the Series format also makes it easy to select the dataflows of a single source:

list_all_dataflows(codes_only=True, return_pandas=True)

ABS    0                  ABORIGINAL_POP_PROJ
       1           ABORIGINAL_POP_PROJ_REMOTE
       2     ABS_ABORIGINAL_POPPROJ_INDREGION
       3                    ABS_ACLD_LFSTATUS
       4                      ABS_ACLD_TENURE
                           ...               
UY110  86                DF_TTT_SEXO_AREA_REF
WB     0                DF_WITS_Tariff_TRAINS
       1       DF_WITS_TradeStats_Development
       2            DF_WITS_TradeStats_Tariff
       3             DF_WITS_TradeStats_Trade
Name: dataflow, Length: 33154, dtype: object

dflows['BIS']

BIS_REL_CAL                                     BIS_RELEASE_CALENDAR
WS_CBPOL                                   Central bank policy rates
WS_CBS_PUB                                  BIS consolidated banking
WS_CBTA                                    Central bank total assets
WS_CPMI_CASHLESS                      CPMI cashless payments (T5,T6)
WS_CPMI_CT1                           CPMI comparative tables type 1
WS_CPMI_CT2                           CPMI comparative tables type 2
WS_CPMI_DEVICES                            CPMI payment devices (T4)
WS_CPMI_INSTITUT                              CPMI institutions (T3)
WS_CPMI_MACRO                                     CPMI macro (T1,T2)
WS_CPMI_PARTICIP                  CPMI participants (T7,T10,T12,T15)
WS_CPMI_SYSTEMS     CPMI systems (T8,T9,T11,T13,T14,T16,T17,T18,T19)
WS_CPP                                    Commercial property prices
WS_CREDIT_GAP                                 BIS credit-to-GDP gaps
WS_DEBT_SEC2_PUB    BIS international debt securities (BIS-compiled)
WS_DER_OTC_TOV                              OTC derivatives turnover
WS_DPP                          Detailed residential property prices
WS_DSR                                        BIS debt service ratio
WS_EER                                  BIS effective exchange rates
WS_GLI                                   Global liquidity indicators
WS_LBS_D_PUB                                  BIS locational banking
WS_LONG_CPI                                 BIS long consumer prices
WS_NA_SEC_C3                          BIS debt securities statistics
WS_NA_SEC_DSS                         BIS Debt securities statistics
WS_OTC_DERIV2                            OTC derivatives outstanding
WS_SPP                          Selected residential property prices
WS_TC                                BIS long series on total credit
WS_XRU                                      US dollar exchange rates
WS_XTD_DERIV                             Exchange traded derivatives
Name: dataflow, dtype: object

Or the user can search dataflows by their human-readable name instead of their code. For example, this is one way to see if any dataflow has information on interest rates:

dflows[dflows.str.contains('Interest rate', case=False)]

ECB     IRS                                            Interest rate statistics
        MIR                                        MFI Interest Rate Statistics
        RIR                                               Retail Interest Rates
ESTAT   TEIMF040                                          3-month-interest rate
        TEIMF100                         Day-to-day money market interest rates
        IRT_ST_A                      Money market interest rates - annual data
        IRT_ST_M                     Money market interest rates - monthly data
        IRT_ST_Q                   Money market interest rates - quarterly data
        EI_MFIR_M                                 Interest rates - monthly data
        ENPE_IRT_LD                     Loan and deposit one year interest rate
        ENPE_IRT_ST                                 Money market interest rates
ESTAT3  IRT_ST_A                      Money market interest rates - annual data
        IRT_ST_M                     Money market interest rates - monthly data
        IRT_ST_Q                   Money market interest rates - quarterly data
        EI_MFIR_M                                 Interest rates - monthly data
        ENPE_IRT_LD                     Loan and deposit one year interest rate
        TEIMF040                                          3-month-interest rate
        ENPE_IRT_ST                                 Money market interest rates
        TEIMF100                         Day-to-day money market interest rates
IMF     6SR                   M&B: Interest Rates and Share Prices (6SR) for...
        INR                                                      Interest rates
        INR_NSTD                                    Interest rates_Non-Standard
NB      GOVT_GENERIC_RATES                               Generic interest rates
        GOVT_IRS                                            Interest rate swaps
Name: dataflow, dtype: object
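The same kind of search can be sketched without pandas, for instance when working with the dictionary returned by `return_pandas=False`. The nested layout `{source: {code: name}}` is an assumption of this sketch, the helper `search_dataflows` is hypothetical, and the toy entries are copied from the outputs above.

```python
# Toy subset of dataflows; the nested {source: {code: name}} layout is assumed
toy_dflows = {
    "ECB": {
        "IRS": "Interest rate statistics",
        "MIR": "MFI Interest Rate Statistics",
    },
    "BIS": {
        "WS_CBPOL": "Central bank policy rates",
        "WS_EER": "BIS effective exchange rates",
    },
    "NB": {"GOVT_IRS": "Interest rate swaps"},
}


def search_dataflows(dflows, text):
    """Hypothetical helper: case-insensitive substring search over dataflow names."""
    needle = text.lower()
    return {
        (source, code): name
        for source, flows in dflows.items()
        for code, name in flows.items()
        if needle in name.lower()
    }


hits = search_dataflows(toy_dflows, "interest rate")
for (source, code), name in sorted(hits.items()):
    print(source, code, name)
```

Note that this sketch matches plain substrings, whereas pandas' `str.contains` interprets the pattern as a regular expression by default.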

codelists

codelists (dflow)

Retrieves the codelist for specific SDMX dataflows.

Args:
    dflow (dict): A dictionary specifying the source as the key and the dataflow (or list of dataflows) as the value.
    
Returns:
    dict: A dictionary with each source as a key and the values containing the codelist for the specified dataflow(s), detailing code information relevant to the SDMX data structure.

Once the user finds a dataflow of interest, the function codelists returns a dictionary keyed by source. When a single dataflow is requested for a source, the entry for that source is itself a dictionary in which each key is a dimension of that dataflow and each value is that dimension’s codelist.

For example, the dimensions and codelists of the BIS dataflow on OTC derivatives outstanding are the following:

print(dflows[dflows.str.contains("OTC", case=False)])

cl_OTC = codelists(dflow={"BIS": "WS_OTC_DERIV2"})

BIS  WS_DER_OTC_TOV       OTC derivatives turnover
     WS_OTC_DERIV2     OTC derivatives outstanding
Name: dataflow, dtype: object

Here is a list of all dimensions for the OTC derivatives outstanding dataflow:

cl_OTC_BIS = cl_OTC["BIS"]
cl_OTC_BIS.keys()

dict_keys(['CL_AVAILABILITY', 'CL_BIS_IF_REF_AREA', 'CL_BIS_UNIT', 'CL_COLLECTION', 'CL_CONF_STATUS', 'CL_DECIMALS', 'CL_DER_BASIS', 'CL_DER_INSTR', '3', 'W', 'CL_EX_METHOD', 'CL_FREQ', 'CL_ISSUE_MAT', 'C', 'CL_MARKET_RISK', 'CL_OBS_STATUS', 'D', 'H', 'CL_OD_TYPE', 'CL_RATING', 'CL_SECTOR_CPY', 'CL_SECTOR_UDL', 'CL_SUB_CHANNEL', 'CL_TIME_FORMAT', 'CL_UNIT_MULT'])

Below are the codelists of the frequency dimension (“CL_FREQ”) and the counterparty sector (“CL_SECTOR_CPY”):

cl_OTC_BIS["CL_FREQ"]

CL_FREQ
A                                   Annual
B    Daily - business week (not supported)
D                                    Daily
E                    Event (not supported)
H                              Half-yearly
M                                  Monthly
Q                                Quarterly
W                                   Weekly
Name: Code list for Frequency (FREQ), dtype: object

cl_OTC_BIS["CL_SECTOR_CPY"]

	name	parent
CL_SECTOR_CPY
A	Total (all counterparties)
B	Reporting dealers	A
C	Other financial institutions	A
D	Non-reporting banks	C
E	Institutional investors	C
F	Hedge funds and proprietary trading firms	C
G	Official sector financial institutions	C
H	Undistributed	C
K	Central Counterparties	C
L	Banks and securities firms	C
M	Insurance and financial guaranty firms	C
N	SPVs, SPCs or SPEs	C
O	Hedge funds	C
P	Other residual financial institutions	C
U	Non-financial customers	A
V	Prime brokered	A
W	Retail-driven	A
X	Related Party Trades	A
Y	Own branches and subsidiaries	A
Z	Non-reporters	A
Q	Non-bank electronic market-makers	A
R	Other customers	A
I	Back-to-back trades	A
J	Compression trades	A
0	Technical residual (total)
1	Technical residual (other financial institutions)
2	Technical residual (Prime brokered)
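The `parent` column makes the codelist hierarchical. A small sketch of how that hierarchy can be navigated is shown below; it uses a hand-copied subset of the table above stored in a plain dictionary, and the helpers `ancestors` and `children` are hypothetical.

```python
# Subset of CL_SECTOR_CPY copied from the table above: code -> parent (None for top level)
parents = {
    "A": None,  # Total (all counterparties)
    "B": "A",   # Reporting dealers
    "C": "A",   # Other financial institutions
    "D": "C",   # Non-reporting banks
    "F": "C",   # Hedge funds and proprietary trading firms
    "O": "C",   # Hedge funds
    "U": "A",   # Non-financial customers
}


def ancestors(code, parents):
    """Walk up the hierarchy from a code to the top-level aggregate."""
    chain = []
    parent = parents.get(code)
    while parent is not None:
        chain.append(parent)
        parent = parents.get(parent)
    return chain


def children(code, parents):
    """Return the codes whose direct parent is the given code."""
    return sorted(c for c, p in parents.items() if p == code)


print(ancestors("D", parents))
print(children("C", parents))
```

Knowing which codes are aggregates of others helps avoid double counting when selecting several counterparty sectors at once.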

You can also get codelists for multiple dataflows from the same source:

# Get codelists for both Exchange Rates and Consumer Prices
cl_multiple = codelists({"ECB": ["EXR", "ICP"]})

# Show dimensions for each dataflow
for dataflow, codelist in cl_multiple["ECB"].items():
    print(f"\nDimensions in {dataflow} dataflow:")
    print(codelist.keys())


Dimensions in EXR dataflow:
dict_keys(['CL_COLLECTION', 'CL_CURRENCY', 'CL_DECIMALS', 'CL_EXR_SUFFIX', 'CL_EXR_TYPE', 'CL_FREQ', 'CL_OBS_CONF', 'CL_OBS_STATUS', 'CL_ORGANISATION', 'CL_UNIT', 'CL_UNIT_MULT'])

Dimensions in ICP dataflow:
dict_keys(['CL_ADJUSTMENT', 'CL_AREA_EE', 'CL_COLLECTION', 'CL_DECIMALS', 'CL_FREQ', 'CL_ICP_ITEM', 'CL_ICP_SUFFIX', 'CL_OBS_CONF', 'CL_OBS_STATUS', 'CL_ORGANISATION', 'CL_STS_INSTITUTION', 'CL_UNIT', 'CL_UNIT_MULT'])

Codelists can also be requested from multiple sources at once:

cl_sources = codelists({"ECB": "EXR", "BIS": "WS_OTC_DERIV2"})

print("Available sources:", cl_sources.keys())

Available sources: dict_keys(['ECB', 'BIS'])

load_SDMX_data is a convenience function that downloads data from the specified SDMX sources and dataflows, provided the data match the keys and parameters set by the user.

load_SDMX_data

load_SDMX_data (sources: dict, keys: dict, params: dict, verbose: bool = True)

Loads datasets from SDMX.

Args:
    sources (dict): A dictionary with the sources and dataflows per source.
    keys (dict): The keys to be used in the SDMX query.
    params (dict): The parameters to be used in the SDMX query.
    verbose (bool): Whether to communicate download steps to the user.
    
Returns:
    A pandas DataFrame with data from SDMX or None if no data matches the sources, keys, and parameters.

df = load_SDMX_data(sources={'ECB': 'CISS', 'BIS': 'WS_CBPOL_D'}, keys={'FREQ': 'D'}, params={'startPeriod': 2003})

assert isinstance(df, pd.DataFrame)
assert df.shape[0] > 0
assert df.shape[1] > 0

Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
Querying data from BIS's dataflow 'WS_CBPOL' - Central bank policy rates...

Temporal features

Temporal features, such as the day of the week, month, or hour, provide valuable information for time series data, helping to capture seasonality, trends, and cyclic patterns. These features are especially useful because they represent known future information that can enhance model predictions. The gingado library offers the get_timefeat function to extract these features from a time series:

get_timefeat

get_timefeat (df: pandas.core.frame.DataFrame | pandas.core.series.Series, freq: str | gingado.internals.Frequency, columns: list[str] | None = None, add_to_df: bool = True) -> pandas.core.frame.DataFrame

Generate temporal features from a DataFrame with a DatetimeIndex.

This function creates various time-based features such as day of week,
day of month, week of year, etc., based on the DatetimeIndex of the input DataFrame.

Args:
    df (pd.DataFrame | pd.Series): Input DataFrame or Series with a DatetimeIndex.
    freq (FrequencyLike): Frequency of the input DataFrame. Can be either a string that is
        a supported pandas frequency alias or a gingado-internal Frequency object.
    columns (list[str], optional): List of columns with temporal feature names that should be
        kept. If None, all default temporal features are returned. Defaults to None.
    add_to_df (bool, optional): If True, append the generated features to the input DataFrame.
        If False, return only the generated features. Defaults to True.

Returns:
    pd.DataFrame: A DataFrame containing the generated temporal features,
        either appended to the input DataFrame or as a separate DataFrame.

Raises:
    ValueError: If the input DataFrame's index is not a DatetimeIndex.

For instance, using daily data from a DataFrame:

# Display the first few rows of the DataFrame
display(df.head())

# Extract temporal features for daily data
temporal = get_timefeat(df, freq="D", add_to_df=False)
display(temporal.head())

	ECB__CISS_D__AT__Z0Z__4F__EC__SS_CIN__IDX	ECB__CISS_D__BE__Z0Z__4F__EC__SS_CIN__IDX	ECB__CISS_D__CN__Z0Z__4F__EC__SS_CIN__IDX	ECB__CISS_D__DE__Z0Z__4F__EC__SS_CIN__IDX	ECB__CISS_D__ES__Z0Z__4F__EC__SS_CIN__IDX	ECB__CISS_D__FI__Z0Z__4F__EC__SS_CIN__IDX	ECB__CISS_D__FR__Z0Z__4F__EC__SS_CIN__IDX	ECB__CISS_D__GB__Z0Z__4F__EC__SS_CIN__IDX	ECB__CISS_D__IE__Z0Z__4F__EC__SS_CIN__IDX	ECB__CISS_D__IT__Z0Z__4F__EC__SS_CIN__IDX	...	BIS__WS_CBPOL_D__RS	BIS__WS_CBPOL_D__RU	BIS__WS_CBPOL_D__SA	BIS__WS_CBPOL_D__SE	BIS__WS_CBPOL_D__TH	BIS__WS_CBPOL_D__TR	BIS__WS_CBPOL_D__US	BIS__WS_CBPOL_D__XM	BIS__WS_CBPOL_D__ZA	BIS__WS_CBPOL_D__AR
TIME_PERIOD
2003-01-01	0.017774	0.042273	NaN	0.107753	0.028552	0.053814	0.005528	0.060809	0.004191	0.057108	...	9.5	NaN	2.0	NaN	1.75	NaN	1.25	2.75	NaN	NaN
2003-01-02	0.023427	0.047823	NaN	0.148028	0.039988	0.075186	0.013415	0.049041	0.014820	0.064289	...	9.5	NaN	NaN	3.75	1.75	44.0	1.25	2.75	13.5	5.99
2003-01-03	0.021899	0.043292	NaN	0.141700	0.040378	0.077400	0.014249	0.047883	0.016874	0.064880	...	9.5	NaN	NaN	3.75	1.75	44.0	1.25	2.75	13.5	6.05
2003-01-04	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	9.5	21.0	2.0	NaN	1.75	NaN	1.25	2.75	13.5	NaN
2003-01-05	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	9.5	21.0	2.0	NaN	1.75	NaN	1.25	2.75	NaN	NaN

5 rows × 62 columns

	day_of_week	day_of_month	day_of_quarter	day_of_year	week_of_month	week_of_quarter	week_of_year	month_of_quarter	month_of_year	quarter_of_year	quarter_end	year_end
TIME_PERIOD
2003-01-01	2	1	1	1	1	1	1	1	1	1	0	0
2003-01-02	3	2	2	2	1	1	1	1	1	1	0	0
2003-01-03	4	3	3	3	1	1	1	1	1	1	0	0
2003-01-04	5	4	4	4	1	1	1	1	1	1	0	0
2003-01-05	6	5	5	5	1	1	1	1	1	1	0	0
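Several of these features depend only on the calendar, which is why they are known in advance for any future date. The sketch below recomputes a few of them with the standard library; the function `calendar_features` is hypothetical and its formulas are assumptions chosen to agree with the rows displayed above, not gingado's implementation.

```python
from datetime import date, timedelta


def calendar_features(d: date) -> dict:
    """Hypothetical sketch of a few calendar features for a single date."""
    return {
        "day_of_week": d.weekday(),  # Monday is 0, so Wednesday is 2
        "day_of_month": d.day,
        "day_of_year": d.timetuple().tm_yday,
        "month_of_quarter": (d.month - 1) % 3 + 1,
        "month_of_year": d.month,
        "quarter_of_year": (d.month - 1) // 3 + 1,
    }


# Same five dates as in the table above
start = date(2003, 1, 1)
rows = [calendar_features(start + timedelta(days=i)) for i in range(5)]

# The features of a date far in the future are already known today
future = calendar_features(date(2030, 12, 31))
print(rows[0])
print(future)
```

Since no observed data are needed, such features can be generated for the whole forecast horizon before any prediction is made.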

You can also integrate the temporal features directly into the original DataFrame by setting the add_to_df parameter to True:

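# Random generator for the sample data (seed 42 is consistent with the weekly values printed below)
rng = np.random.default_rng(42)

# Generate a sample DataFrame with a weekly index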
df_weekly = pd.DataFrame(
    data={"value": rng.normal(size=100)},
    index=pd.date_range('2000-01-01', periods=100, freq='W-MON')
)

# Add temporal features to the weekly data
df_with_timefeat = get_timefeat(df_weekly, freq="W", add_to_df=True)
display(df_with_timefeat.head())

	value	week_of_month	week_of_quarter	week_of_year	month_of_quarter	month_of_year	quarter_of_year
2000-01-03	0.304717	1	1	1	1	1	1
2000-01-10	-1.039984	2	2	2	1	1	1
2000-01-17	0.750451	3	3	3	1	1	1
2000-01-24	0.940565	4	4	4	1	1	1
2000-01-31	-1.951035	5	5	5	1	1	1

If you only need a subset of the temporal features, you can specify the desired feature names:

# Generate a new DataFrame with a monthly index
df_monthly = pd.DataFrame(
    data={"value": rng.normal(size=24)},
    index=pd.date_range("2023-01-01", periods=24, freq='MS')
)
# Only select a subset of temporal features:
df_with_timefeat = get_timefeat(df_monthly, freq="MS", columns=["month_of_year", "quarter_of_year"])
display(df_with_timefeat.head())

	value	month_of_year	quarter_of_year
2023-01-01	-0.378163	1	1
2023-02-01	1.299228	2	1
2023-03-01	-0.356264	3	1
2023-04-01	0.737516	4	2
2023-05-01	-0.933618	5	2

In addition to get_timefeat, the gingado library provides the TemporalFeatureTransformer class, which can be used to transform a DataFrame with a temporal index into a DataFrame with additional features:

temp_trf = TemporalFeatureTransformer(freq="W", features=["week_of_month", "week_of_year", "quarter_of_year"])
df_with_timefeat = temp_trf.fit_transform(df_weekly)
display(df_with_timefeat.head())

	value	week_of_month	week_of_year	quarter_of_year
2000-01-03	0.304717	1	1	1
2000-01-10	-1.039984	2	2	1
2000-01-17	0.750451	3	3	1
2000-01-24	0.940565	4	4	1
2000-01-31	-1.951035	5	5	1
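To show how feature selection fits the fit/transform convention, here is a minimal pure-Python sketch. The class `SimpleTemporalFeatures` and the mapping `FEATURE_FUNCS` are hypothetical, and the week-of-month formula is an assumption that happens to agree with the rows displayed above; none of this is gingado's implementation.

```python
from datetime import date, timedelta

# Assumed formulas, chosen to reproduce the weekly rows displayed above
FEATURE_FUNCS = {
    "week_of_month": lambda d: (d.day - 1) // 7 + 1,
    "week_of_year": lambda d: d.isocalendar()[1],
    "quarter_of_year": lambda d: (d.month - 1) // 3 + 1,
}


class SimpleTemporalFeatures:
    """Hypothetical stand-in that follows the fit/transform convention."""

    def __init__(self, features=None):
        self.features = features

    def fit(self, dates, y=None):
        # Stateless: fitting only validates and stores the requested feature names
        self.features_ = list(self.features or FEATURE_FUNCS)
        unknown = [name for name in self.features_ if name not in FEATURE_FUNCS]
        if unknown:
            raise ValueError(f"Unknown temporal features: {unknown}")
        return self

    def transform(self, dates):
        return [{name: FEATURE_FUNCS[name](d) for name in self.features_} for d in dates]

    def fit_transform(self, dates, y=None):
        return self.fit(dates, y).transform(dates)


# Five consecutive Mondays starting on 2000-01-03, as in df_weekly
mondays = [date(2000, 1, 3) + timedelta(weeks=i) for i in range(5)]
subset = SimpleTemporalFeatures(features=["week_of_month", "quarter_of_year"]).fit_transform(mondays)
everything = SimpleTemporalFeatures().fit_transform(mondays)
print(subset[0])
print(everything[-1])
```

Wrapping the feature extraction in a transformer means the selected features become constructor parameters, so the step can sit in a pipeline alongside other preprocessing.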