d = get_datetime()
assert isinstance(d, str)
assert len(d) > 0
Utils
gingado
Support for model documentation
get_datetime
get_datetime()
Returns the time now
read_attr
read_attr(obj)
Reads and yields the type and values of fitted attributes from an object.
Args:
    obj: Object from which attributes will be read.
The function read_attr helps gingado Documenters read the object behind the scenes. It collects the type of estimator and any attributes resulting from fitting the object (i.e., those whose names end in “_” but not in a double underscore).
For example, the attributes of an untrained and a trained random forest are, in sequence:
from sklearn.ensemble import RandomForestRegressor

rf_unfit = RandomForestRegressor(n_estimators=3)
rf_fit = RandomForestRegressor(n_estimators=3)\
    .fit([[1, 0], [0, 1]], [[0.5], [0.5]]) # random numbers

list(read_attr(rf_unfit)), list(read_attr(rf_fit))
([{'_estimator_type': 'regressor'}],
[{'_estimator_type': 'regressor'},
{'estimator_': DecisionTreeRegressor()},
{'estimators_': [DecisionTreeRegressor(max_features=1.0, random_state=352967703),
DecisionTreeRegressor(max_features=1.0, random_state=1346575655),
DecisionTreeRegressor(max_features=1.0, random_state=168455287)]},
{'estimators_samples_': [array([1, 0]), array([1, 1]), array([1, 0])]},
{'feature_importances_': array([0., 0.])},
{'n_features_in_': 2},
{'n_outputs_': 1}])
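The selection rule described above (names ending in a single trailing underscore) can be sketched in a few lines. This is an illustrative approximation, not gingado's actual implementation: `iter_fitted_attrs` and `ToyEstimator` are hypothetical names, and unlike read_attr the sketch does not report `_estimator_type`.

```python
def iter_fitted_attrs(obj):
    """Yield {name: value} for attributes that look like fitted attributes.

    Hypothetical helper: keeps instance attributes whose names end in a
    single underscore (e.g. ``coef_``) and skips names ending in ``__``.
    """
    for name, value in vars(obj).items():
        if name.endswith("_") and not name.endswith("__"):
            yield {name: value}


class ToyEstimator:
    """Stand-in for an estimator; not a scikit-learn class."""

    def __init__(self, scale=1.0):
        self.scale = scale  # hyperparameter, set before fitting

    def fit(self, values):
        # Attributes learned from data follow the trailing-underscore convention
        self.mean_ = sum(values) / len(values)
        self.n_samples_seen_ = len(values)
        return self


unfit = ToyEstimator()
fit = ToyEstimator().fit([1.0, 2.0, 3.0])

print(list(iter_fitted_attrs(unfit)))  # no fitted attributes yet
print(list(iter_fitted_attrs(fit)))
```

Because the sketch only looks at `vars(obj)`, it would miss fitted attributes that are exposed as properties, which is the case for `feature_importances_` on scikit-learn forests; a fuller version would need to inspect the class as well.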
Support for time series
Objects of the class Lag are similar to scikit-learn’s transformers.
Lag
Lag(lags=1, jump=0, keep_contemporaneous_X=False)
A transformer for lagging variables.
Args:
    lags (int): The number of lags to apply.
    jump (int): The number of initial observations to skip before applying the lag.
    keep_contemporaneous_X (bool): Whether to keep the contemporaneous values of X in the output.
fit
fit(self, X: numpy.ndarray, y=None)
Fits the Lag transformer.
Args:
    X (np.ndarray): Array-like data of shape (n_samples, n_features).
    y: Array-like data of shape (n_samples,) or (n_samples, n_targets) or None.
Returns:
    self: A fitted version of the `Lag` instance.
transform
transform(self, X: numpy.ndarray)
Applies the lag transformation to the dataset `X`.
Args:
    X (np.ndarray): Array-like data of shape (n_samples, n_features).
Returns:
    A lagged version of `X`.
fit_transform
fit_transform(self, X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to `X` and `y` with optional parameters `fit_params` and returns a transformed version of `X`.
Parameters
----------
X : array-like of shape (n_samples, n_features)
    Input samples.
y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
    Target values (None for unsupervised transformations).
**fit_params : dict
    Additional fit parameters.
Returns
-------
X_new : ndarray array of shape (n_samples, n_features_new)
    Transformed array.
The code below demonstrates how Lag works in practice. Note in particular that, because Lag is a transformer, it can be used as part of a scikit-learn Pipeline.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

randomX = np.random.rand(15, 2)
randomY = np.random.rand(15)

lags = 3
jump = 2

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lagger', Lag(lags=lags, jump=jump, keep_contemporaneous_X=False))
]).fit_transform(randomX, randomY)
Below we confirm that the lagger removes the correct number of rows corresponding to the lagged observations:
assert randomX.shape[0] - lags - jump == pipe.shape[0]
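To make the row arithmetic concrete, here is a minimal NumPy sketch of one way such a lag matrix can be built. It is an illustration under stated assumptions rather than gingado's code: `lag_matrix` is a hypothetical helper, and the column ordering (one block per lag, from lag `jump + 1` to lag `jump + lags`) is a guess at a reasonable layout.

```python
import numpy as np


def lag_matrix(X, lags=1, jump=0):
    """Stack lagged copies of X side by side (hypothetical helper).

    Row t of the result holds X[t - k] for k = jump + 1, ..., jump + lags.
    The first ``lags + jump`` rows are dropped because they have no
    complete history.
    """
    X = np.asarray(X)
    n_drop = lags + jump
    n_rows = X.shape[0] - n_drop
    blocks = [
        X[n_drop - k : n_drop - k + n_rows]  # X shifted down by k rows
        for k in range(jump + 1, n_drop + 1)
    ]
    return np.concatenate(blocks, axis=1)


X = np.arange(30, dtype=float).reshape(15, 2)
lagged = lag_matrix(X, lags=3, jump=2)

# Same row-count relationship as in the assertion above
assert lagged.shape == (15 - 3 - 2, 2 * 3)
# The first output row corresponds to t = 5; its first block is X[5 - 3] = X[2]
assert np.array_equal(lagged[0, :2], X[2])
```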
And because Lag is a transformer, its parameters (lags and jump) can be calibrated using hyperparameter tuning to achieve the best performance for a model.
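As a sketch of what such tuning amounts to, the loop below picks the number of lags by validation error on a simulated series. Everything here is illustrative and independent of gingado: the AR(2) data, the hypothetical `make_lagged` helper, the ordinary least squares fit, and the candidate grid are all assumptions. With a scikit-learn pipeline, the same search would normally be delegated to a tool such as GridSearchCV.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(2) series: y[t] = 0.6*y[t-1] - 0.3*y[t-2] + noise
n = 300
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal(scale=0.5)


def make_lagged(series, lags):
    """Return (features, target) where features hold lags 1..lags of the target."""
    cols = [series[lags - k : len(series) - k] for k in range(1, lags + 1)]
    return np.column_stack(cols), series[lags:]


def validation_mse(series, lags, n_val=60):
    """Fit OLS on the early part of the sample, score on the last n_val rows."""
    X, target = make_lagged(series, lags)
    X = np.column_stack([np.ones(len(X)), X])  # add an intercept
    X_train, X_val = X[:-n_val], X[-n_val:]
    y_train, y_val = target[:-n_val], target[-n_val:]
    coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return float(np.mean((y_val - X_val @ coef) ** 2))


candidates = [1, 2, 3, 4, 5]
scores = {lags: validation_mse(y, lags) for lags in candidates}
best_lags = min(scores, key=scores.get)
print(scores, best_lags)
```

The validation window always covers the same final 60 observations, so the errors are comparable across candidate values even though each candidate discards a different number of initial rows.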
Support for data augmentation with SDMX
Please note that working with SDMX may take a few minutes, depending on the amount of data you are downloading.
list_SDMX_sources
list_SDMX_sources()
Fetches the list of SDMX sources.
Returns:
    The list of codes representing the SDMX sources available for data download.
sources = list_SDMX_sources()
print(sources)
assert len(sources) > 0
# all elements are of type 'str'
assert sum([isinstance(src, str) for src in sources]) == len(sources)
['ABS', 'ABS_JSON', 'BBK', 'BIS', 'COMP', 'ECB', 'EMPL', 'ESTAT', 'ESTAT3', 'ESTAT_COMEXT', 'GROW', 'ILO', 'IMF', 'INEGI', 'INSEE', 'ISTAT', 'LSD', 'NB', 'NBB', 'OECD', 'OECD_JSON', 'SGR', 'SPC', 'STAT_EE', 'UNESCO', 'UNICEF', 'UNSD', 'WB', 'WB_WDI']
list_all_dataflows
list_all_dataflows(codes_only: bool = False, return_pandas: bool = True)
Lists all SDMX dataflows.
Note: When using as a parameter to an `AugmentSDMX` object or to the `load_SDMX_data` function, set `codes_only=True`.
Args:
    codes_only (bool): Whether to return only the dataflow codes.
    return_pandas (bool): Whether to return the result in a pandas DataFrame format.
Returns:
    All available dataflows for all SDMX sources.
dflows = list_all_dataflows(return_pandas=False)

assert isinstance(dflows, dict)
all_sources = list_SDMX_sources()
assert len([s for s in dflows.keys() if s in all_sources]) == len(dflows.keys())
By default, list_all_dataflows returns a pandas Series, which facilitates data discovery by users like so:
dflows = list_all_dataflows(return_pandas=True)
assert type(dflows) == pd.core.series.Series
dflows
ABS ABORIGINAL_POP_PROJ Projected population, Aboriginal and Torres St...
ABORIGINAL_POP_PROJ_REMOTE Projected population, Aboriginal and Torres St...
ABS_ABORIGINAL_POPPROJ_INDREGION Projected population, Aboriginal and Torres St...
ABS_ACLD_LFSTATUS Australian Census Longitudinal Dataset (ACLD):...
ABS_ACLD_TENURE Australian Census Longitudinal Dataset (ACLD):...
...
UNSD DF_UNData_UNFCC SDMX_GHG_UNDATA
WB DF_WITS_Tariff_TRAINS WITS - UNCTAD TRAINS Tariff Data
DF_WITS_TradeStats_Development WITS TradeStats Devlopment
DF_WITS_TradeStats_Tariff WITS TradeStats Tariff
DF_WITS_TradeStats_Trade WITS TradeStats Trade
Name: dataflow, Length: 24655, dtype: object
This format makes it easier to search dflows by source:
list_all_dataflows(codes_only=True, return_pandas=True)
ABS 0 ABORIGINAL_POP_PROJ
1 ABORIGINAL_POP_PROJ_REMOTE
2 ABS_ABORIGINAL_POPPROJ_INDREGION
3 ABS_ACLD_LFSTATUS
4 ABS_ACLD_TENURE
...
UNSD 3 DF_UNData_UNFCC
WB 0 DF_WITS_Tariff_TRAINS
1 DF_WITS_TradeStats_Development
2 DF_WITS_TradeStats_Tariff
3 DF_WITS_TradeStats_Trade
Name: dataflow, Length: 24655, dtype: object
dflows['BIS']
BIS_REL_CAL BIS_RELEASE_CALENDAR
WS_CBPOL Policy rate
WS_CBS_PUB BIS consolidated banking
WS_CBTA Central bank total assets
WS_CPMI_CASHLESS CPMI cashless payments (T5,T6)
WS_CPMI_CT1 CPMI comparative tables type 1
WS_CPMI_CT2 CPMI comparative tables type 2
WS_CPMI_DEVICES CPMI payment devices (T4)
WS_CPMI_INSTITUT CPMI institutions (T3)
WS_CPMI_MACRO CPMI macro (T1,T2)
WS_CPMI_PARTICIP CPMI participants (T7,T10,T12,T15)
WS_CPMI_SYSTEMS CPMI systems (T8,T9,T11,T13,T14,T16,T17,T18,T19)
WS_CPP Commercial property prices
WS_CREDIT_GAP BIS credit-to-GDP gaps
WS_DEBT_SEC2_PUB BIS international debt securities (BIS-compiled)
WS_DER_OTC_TOV OTC derivatives turnover
WS_DPP Detailed residential property prices
WS_DSR BIS debt service ratio
WS_EER BIS effective exchange rates
WS_GLI Global liquidity indicators
WS_LBS_D_PUB BIS locational banking
WS_LONG_CPI BIS long consumer prices
WS_NA_SEC_C3 BIS debt securities statistics
WS_NA_SEC_DSS BIS Debt securities statistics
WS_OTC_DERIV2 OTC derivatives outstanding
WS_SPP Selected residential property prices
WS_TC BIS long series on total credit
WS_XRU US dollar exchange rates
WS_XTD_DERIV Exchange traded derivatives
Name: dataflow, dtype: object
Or the user can search dataflows by their human-readable name instead of their code. For example, this is one way to see if any dataflow has information on interest rates:
dflows[dflows.str.contains('Interest rate', case=False)]
ECB IRS Interest rate statistics
MIR MFI Interest Rate Statistics
RIR Retail Interest Rates
ESTAT EI_MFIR_M Interest rates - monthly data
ENPE_IRT_LD Loan and deposit one year interest rate
ENPE_IRT_ST Money market interest rates
TEIMF040 3-month-interest rate
TEIMF100 Day-to-day money market interest rates
IRT_ST_A Money market interest rates - annual data
IRT_ST_M Money market interest rates - monthly data
IRT_ST_Q Money market interest rates - quarterly data
IMF 6SR M&B: Interest Rates and Share Prices (6SR) for...
INR Interest rates
INR_NSTD Interest rates_Non-Standard
NB GOVT_GENERIC_RATES Generic interest rates
GOVT_IRS Interest rate swaps
Name: dataflow, dtype: object
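Both kinds of lookup rely only on standard pandas indexing, so they can be tried offline on a small stand-in. The Series below is toy data assembled by hand from a few rows of the outputs shown above; it is not fetched from SDMX, and the two index level names are assumptions added for readability.

```python
import pandas as pd

# Toy stand-in for the Series returned by list_all_dataflows():
# first index level = source, second level = dataflow code, value = name.
toy_dflows = pd.Series(
    {
        ("BIS", "WS_CBPOL"): "Policy rate",
        ("BIS", "WS_XRU"): "US dollar exchange rates",
        ("ECB", "IRS"): "Interest rate statistics",
        ("ECB", "MIR"): "MFI Interest Rate Statistics",
        ("NB", "GOVT_IRS"): "Interest rate swaps",
    },
    name="dataflow",
)
toy_dflows.index.names = ["source", "code"]  # assumed names

# Search by source: select on the first index level
bis_flows = toy_dflows["BIS"]

# Search by name: case-insensitive substring match on the values
rate_flows = toy_dflows[toy_dflows.str.contains("interest rate", case=False)]

# Combine both: matching dataflow codes within a single source
ecb_rate_codes = rate_flows["ECB"].index.tolist()
print(ecb_rate_codes)
```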
The function load_SDMX_data is a convenience function that downloads data from SDMX sources (and any specific dataflows passed as arguments) if they match the key and parameters set by the user.
load_SDMX_data
load_SDMX_data(sources: dict, keys: dict, params: dict, verbose: bool = True)
Loads datasets from SDMX.
Args:
    sources (dict): A dictionary with the sources and dataflows per source.
    keys (dict): The keys to be used in the SDMX query.
    params (dict): The parameters to be used in the SDMX query.
    verbose (bool): Whether to communicate download steps to the user.
Returns:
    A pandas DataFrame with data from SDMX or None if no data matches the sources, keys, and parameters.
df = load_SDMX_data(sources={'ECB': 'CISS', 'BIS': 'WS_CBPOL_D'}, keys={'FREQ': 'D'}, params={'startPeriod': 2003})

assert type(df) == pd.DataFrame
assert df.shape[0] > 0
assert df.shape[1] > 0
Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
Querying data from BIS's dataflow 'WS_CBPOL' - Policy rate...
Temporal features
Temporal features, such as the day of the week, month, or hour, provide valuable information for time series data, helping to capture seasonality, trends, and cyclic patterns. These features are especially useful because they represent known future information that can enhance model predictions. The gingado library offers the get_timefeat function to extract these features from a time series:
get_timefeat
get_timefeat(df: pandas.core.frame.DataFrame | pandas.core.series.Series, freq: str | gingado.internals.Frequency, columns: list[str] | None = None, add_to_df: bool = True) -> pandas.core.frame.DataFrame
Generate temporal features from a DataFrame with a DatetimeIndex.
This function creates various time-based features such as day of week, day of month, week of year, etc., based on the DatetimeIndex of the input DataFrame.
Args:
    df (pd.DataFrame | pd.Series): Input DataFrame or Series with a DatetimeIndex.
    freq (FrequencyLike): Frequency of the input DataFrame. Can be either a string that is a supported pandas frequency alias or a gingado-internal Frequency object.
    columns (list[str], optional): List of columns with temporal feature names that should be kept. If None, all default temporal features are returned. Defaults to None.
    add_to_df (bool, optional): If True, append the generated features to the input DataFrame. If False, return only the generated features. Defaults to True.
Returns:
    pd.DataFrame: A DataFrame containing the generated temporal features, either appended to the input DataFrame or as a separate DataFrame.
Raises:
    ValueError: If the input DataFrame's index is not a DatetimeIndex.
For instance, using daily data from a DataFrame:
# Display the first few rows of the DataFrame
display(df.head())
# Extract temporal features for daily data
temporal = get_timefeat(df, freq="D", add_to_df=False)
display(temporal.head())
TIME_PERIOD | ECB__CISS_D__AT__Z0Z__4F__EC__SS_CIN__IDX | ECB__CISS_D__BE__Z0Z__4F__EC__SS_CIN__IDX | ECB__CISS_D__CN__Z0Z__4F__EC__SS_CIN__IDX | ECB__CISS_D__DE__Z0Z__4F__EC__SS_CIN__IDX | ECB__CISS_D__ES__Z0Z__4F__EC__SS_CIN__IDX | ECB__CISS_D__FI__Z0Z__4F__EC__SS_CIN__IDX | ECB__CISS_D__FR__Z0Z__4F__EC__SS_CIN__IDX | ECB__CISS_D__GB__Z0Z__4F__EC__SS_CIN__IDX | ECB__CISS_D__IE__Z0Z__4F__EC__SS_CIN__IDX | ECB__CISS_D__IT__Z0Z__4F__EC__SS_CIN__IDX | ... | BIS__WS_CBPOL_D__TR | BIS__WS_CBPOL_D__US | BIS__WS_CBPOL_D__XM | BIS__WS_CBPOL_D__ZA | BIS__WS_CBPOL_D__AU | BIS__WS_CBPOL_D__AR | BIS__WS_CBPOL_D__CH | BIS__WS_CBPOL_D__CL | BIS__WS_CBPOL_D__CN | BIS__WS_CBPOL_D__CO |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2003-01-01 | 0.017774 | 0.042273 | NaN | 0.107753 | 0.028552 | 0.053814 | 0.005528 | 0.061118 | 0.004191 | 0.057108 | ... | NaN | 1.25 | 2.75 | NaN | NaN | NaN | 0.75 | NaN | 5.31 | 5.25 |
2003-01-02 | 0.023427 | 0.047823 | NaN | 0.148028 | 0.039988 | 0.075186 | 0.013415 | 0.048480 | 0.014820 | 0.064289 | ... | 44.0 | 1.25 | 2.75 | 13.5 | 4.75 | 5.99 | 0.75 | 3.0 | 5.31 | 5.25 |
2003-01-03 | 0.021899 | 0.043292 | NaN | 0.141700 | 0.040378 | 0.077400 | 0.014249 | 0.047644 | 0.016874 | 0.064880 | ... | 44.0 | 1.25 | 2.75 | 13.5 | 4.75 | 6.05 | 0.75 | 3.0 | 5.31 | 5.25 |
2003-01-04 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | 1.25 | 2.75 | 13.5 | NaN | NaN | NaN | NaN | 5.31 | 5.25 |
2003-01-05 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | 1.25 | 2.75 | NaN | NaN | NaN | NaN | NaN | 5.31 | 5.25 |
5 rows × 61 columns
TIME_PERIOD | day_of_week | day_of_month | day_of_quarter | day_of_year | week_of_month | week_of_quarter | week_of_year | month_of_quarter | month_of_year | quarter_of_year | quarter_end | year_end |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2003-01-01 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
2003-01-02 | 3 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
2003-01-03 | 4 | 3 | 3 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
2003-01-04 | 5 | 4 | 4 | 4 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
2003-01-05 | 6 | 5 | 5 | 5 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
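Several of these columns correspond directly to attributes of a pandas DatetimeIndex. The sketch below recomputes a handful of them for the same five dates; it is an approximation for intuition, not gingado's implementation, and `basic_timefeat` is a hypothetical helper. Features without an obvious pandas one-liner (such as week_of_month or day_of_quarter) are left out.

```python
import pandas as pd


def basic_timefeat(index: pd.DatetimeIndex) -> pd.DataFrame:
    """Derive a few calendar features from a DatetimeIndex (hypothetical helper)."""
    return pd.DataFrame(
        {
            "day_of_week": index.dayofweek,  # Monday=0 ... Sunday=6
            "day_of_month": index.day,
            "day_of_year": index.dayofyear,
            # ISO 8601 week number, converted to a plain integer array
            "week_of_year": index.isocalendar().week.astype(int).to_numpy(),
            "month_of_year": index.month,
            "quarter_of_year": index.quarter,
            "quarter_end": index.is_quarter_end.astype(int),
            "year_end": index.is_year_end.astype(int),
        },
        index=index,
    )


idx = pd.date_range("2003-01-01", periods=5, freq="D", name="TIME_PERIOD")
feats = basic_timefeat(idx)
print(feats)
```

For these five dates the sketch gives the same values as the corresponding columns in the table above, for example a day_of_week of 2 for 2003-01-01, which was a Wednesday.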
You can also integrate the temporal features directly into the original DataFrame by setting the add_to_df parameter to True:
# Generate a sample DataFrame with a weekly index
df_weekly = pd.DataFrame(
    data={"value": rng.normal(size=100)},
    index=pd.date_range('2000-01-01', periods=100, freq='W-MON')
)

# Add temporal features to the weekly data
df_with_timefeat = get_timefeat(df_weekly, freq="W", add_to_df=True)
display(df_with_timefeat.head())
value | week_of_month | week_of_quarter | week_of_year | month_of_quarter | month_of_year | quarter_of_year | quarter_end | year_end | |
---|---|---|---|---|---|---|---|---|---|
2000-01-03 | 0.304717 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
2000-01-10 | -1.039984 | 2 | 2 | 2 | 1 | 1 | 1 | 0 | 0 |
2000-01-17 | 0.750451 | 3 | 3 | 3 | 1 | 1 | 1 | 0 | 0 |
2000-01-24 | 0.940565 | 4 | 4 | 4 | 1 | 1 | 1 | 0 | 0 |
2000-01-31 | -1.951035 | 5 | 5 | 5 | 1 | 1 | 1 | 0 | 0 |
If you only need a subset of the temporal features, you can specify the desired feature names:
# Generate a new DataFrame with a monthly index
df_monthly = pd.DataFrame(
    data={"value": rng.normal(size=24)},
    index=pd.date_range("2023-01-01", periods=24, freq='MS')
)

# Only select a subset of temporal features:
df_with_timefeat = get_timefeat(df_monthly, freq="MS", columns=["month_of_year", "quarter_of_year"])
display(df_with_timefeat.head())
value | month_of_year | quarter_of_year | |
---|---|---|---|
2023-01-01 | -0.378163 | 1 | 1 |
2023-02-01 | 1.299228 | 2 | 1 |
2023-03-01 | -0.356264 | 3 | 1 |
2023-04-01 | 0.737516 | 4 | 2 |
2023-05-01 | -0.933618 | 5 | 2 |
In addition to get_timefeat, the gingado library provides the TemporalFeatureTransformer class, which can be used to transform a DataFrame with a temporal index into a DataFrame with additional features:
temp_trf = TemporalFeatureTransformer(freq="W", features=["week_of_month", "week_of_year", "quarter_of_year"])
df_with_timefeat = temp_trf.fit_transform(df_weekly)
display(df_with_timefeat.head())
value | week_of_month | week_of_year | quarter_of_year | |
---|---|---|---|---|
2000-01-03 | 0.304717 | 1 | 1 | 1 |
2000-01-10 | -1.039984 | 2 | 2 | 1 |
2000-01-17 | 0.750451 | 3 | 3 | 1 |
2000-01-24 | 0.940565 | 4 | 4 | 1 |
2000-01-31 | -1.951035 | 5 | 5 | 1 |