
Machine learning-based estimators of economic models

In many instances, economists are interested in using machine learning models for specific purposes that go beyond their ability to predict variables accurately. For example:

The gingado.estimators module contains machine learning algorithms adapted to enable the types of analyses described above. More estimators can be expected over time.

For more academic discussions of machine learning methods in economics covering a broad range of topics, see Athey and Imbens (2019).

Covariate selection


The clustering algorithms used below are not themselves adapted from the general-purpose methods. Rather, the functions offer convenience functionality to find and retain the other variables that belong to the same cluster as a variable of interest.

These variables are usually entities (individuals, countries, stocks, etc) in a larger population.

The gingado clustering routines are designed for standalone use or for seamless integration as part of a pipeline.

There are three levels of sophistication that users can choose from:

  • using the off-the-shelf clustering routines provided by gingado, which were selected to be applicable across various use cases;

  • selecting an existing clustering routine from the sklearn.cluster module of scikit-learn; or

  • designing their own clustering algorithm.
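For the third option, a custom algorithm presumably only needs to follow the scikit-learn clustering interface suggested by the type hint of cluster_alg (BaseEstimator and ClusterMixin) and expose cluster labels after fitting. The sketch below works under that assumption; the MedianSplitCluster class is hypothetical and not part of gingado.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClusterMixin


class MedianSplitCluster(BaseEstimator, ClusterMixin):
    """Hypothetical toy clusterer: splits samples into two groups by their row mean."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        row_means = X.mean(axis=1)
        # Label 1 for samples whose mean is above the median of all row means, else 0
        self.labels_ = (row_means > np.median(row_means)).astype(int)
        return self


# Four samples described by two features each
entities = np.array([[0.1, 0.2], [0.0, 0.3], [5.0, 5.5], [6.0, 4.5]])

# fit_predict is inherited from ClusterMixin and returns labels_ after fitting
labels = MedianSplitCluster().fit_predict(entities)
print(labels)
```

An instance of such a class could then be passed wherever a scikit-learn clustering estimator is expected, provided the consuming code relies only on fit and labels_.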


FindCluster (cluster_alg: '[BaseEstimator, ClusterMixin]' = AffinityPropagation(), auto_document: 'ggdModelDocumentation' = <class 'gingado.model_documentation.ModelCard'>, random_state: 'int | None' = None)

Retain only the columns of `X` that are in the same cluster as `y`.

    cluster_alg (BaseEstimator|ClusterMixin): An instance of the clustering algorithm to use.
    auto_document (ggdModelDocumentation): gingado Documenter template to facilitate model documentation.
    random_state (int|None): The random seed to be used by the algorithm, if relevant. Defaults to None.


fit (self, X, y)

Fit `FindCluster`.

    X: The population of entities, organized in columns.
    y: The entity of interest.


transform (self, X) -> 'np.array'

Keep only the entities in `X` that belong to the same cluster as `y`.

    X: The population of entities, organized in columns.

    np.array: Columns of `X` that are in the same cluster as `y`.


fit_transform (self, X, y) -> 'np.array'

Fit a `FindCluster` object and keep only the entities in `X` that belong to the same cluster as `y`.

    X: The population of entities, organized in columns.
    y: The entity of interest.

    np.array: Columns of `X` that are in the same cluster as `y`.
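The idea of retaining the columns of `X` that share a cluster with `y` can be illustrated without gingado. The sketch below is an assumption about the general logic, not the library's implementation: it clusters all entities together with plain scikit-learn and keeps those with the same label as the entity of interest. KMeans is swapped in for affinity propagation only because the number of clusters can be fixed, which makes the toy example easy to follow.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy population: each column is an entity, each row a variable
X = pd.DataFrame({
    "a": [1.0, 1.1, 0.9],
    "b": [1.05, 1.0, 1.0],
    "c": [9.0, 9.2, 8.8],
    "d": [9.1, 9.0, 9.1],
})
# Entity of interest, observed on the same variables
y = pd.Series([1.0, 1.0, 1.0], name="target")

# Cluster all entities together (entities as rows, hence the transpose)
everyone = pd.concat([X, y], axis=1).T
labels = pd.Series(
    KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(everyone),
    index=everyone.index,
)

# Keep only the columns of X that share the label of the entity of interest
same_cluster = X.loc[:, (labels[X.columns] == labels["target"]).to_numpy()]
print(list(same_cluster.columns))
```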


document (self, documenter: 'ggdModelDocumentation | None' = None)

Document the `FindCluster` model using the template in `documenter`.

    documenter (ggdModelDocumentation|None): A gingado Documenter or the documenter set in `auto_document` if None.
        Defaults to None.

Example: finding similar countries

The Barro and Lee (1994) dataset is used to illustrate the use of FindCluster. It is a country-level dataset. Let’s use it to answer the following question: for some specific country, what other countries are the closest to it considering the data available?

First, we import the data:

from gingado.datasets import load_BarroLee_1994

The data is organized by rows: each row is a different country, and the variables are organised in columns.

The dataset is originally organised for a regression of GDP growth (here denoted y) on the covariates (X). This is not what we want to do in this case. So instead of keeping GDP as a separate variable, the next step is to include it in the X DataFrame.

X, y = load_BarroLee_1994()
X['gdp'] = y
X.head()
Unnamed: 0 gdpsh465 bmp1l freeop freetar h65 hm65 hf65 p65 pm65 ... syr65 syrm65 syrf65 teapri65 teasec65 ex1 im1 xr65 tot1 gdp
0 0 6.591674 0.2837 0.153491 0.043888 0.007 0.013 0.001 0.29 0.37 ... 0.033 0.057 0.010 47.6 17.3 0.0729 0.0667 0.348 -0.014727 -0.024336
1 1 6.829794 0.6141 0.313509 0.061827 0.019 0.032 0.007 0.91 1.00 ... 0.173 0.274 0.067 57.1 18.0 0.0940 0.1438 0.525 0.005750 0.100473
2 2 8.895082 0.0000 0.204244 0.009186 0.260 0.325 0.201 1.00 1.00 ... 2.573 2.478 2.667 26.5 20.7 0.1741 0.1750 1.082 -0.010040 0.067051
3 3 7.565275 0.1997 0.248714 0.036270 0.061 0.070 0.051 1.00 1.00 ... 0.438 0.453 0.424 27.8 22.7 0.1265 0.1496 6.625 -0.002195 0.064089
4 4 7.162397 0.1740 0.299252 0.037367 0.017 0.027 0.007 0.82 0.85 ... 0.257 0.287 0.229 34.5 17.6 0.1211 0.1308 2.500 0.003283 0.027930

5 rows × 63 columns

Now we remove the first column (an identifier) and transpose the DataFrame, so that countries are organized in columns.

Each country is identified by a number: 0, 1, …

X = X.iloc[:, 1:]
countries = X.T
countries.columns = ['country_' + str(c) for c in countries.columns]
countries.head()
country_0 country_1 country_2 country_3 country_4 country_5 country_6 country_7 country_8 country_9 ... country_80 country_81 country_82 country_83 country_84 country_85 country_86 country_87 country_88 country_89
gdpsh465 6.591674 6.829794 8.895082 7.565275 7.162397 7.218910 7.853605 7.703910 9.063463 8.151910 ... 9.030974 8.995537 8.234830 8.332549 8.645586 8.991064 8.025189 9.030137 8.865312 8.912339
bmp1l 0.283700 0.614100 0.000000 0.199700 0.174000 0.000000 0.000000 0.277600 0.000000 0.148400 ... 0.000000 0.000000 0.036300 0.000000 0.000000 0.000000 0.005000 0.000000 0.000000 0.000000
freeop 0.153491 0.313509 0.204244 0.248714 0.299252 0.258865 0.182525 0.215275 0.109614 0.110885 ... 0.293138 0.304720 0.288405 0.345485 0.288440 0.371898 0.296437 0.265778 0.282939 0.150366
freetar 0.043888 0.061827 0.009186 0.036270 0.037367 0.020880 0.014385 0.029713 0.002171 0.028579 ... 0.005517 0.011658 0.011589 0.006503 0.005995 0.014586 0.013615 0.008629 0.005048 0.024377
h65 0.007000 0.019000 0.260000 0.061000 0.017000 0.023000 0.039000 0.024000 0.402000 0.145000 ... 0.245000 0.246000 0.183000 0.188000 0.256000 0.255000 0.108000 0.288000 0.188000 0.257000

5 rows × 90 columns

Suppose we are interested in country No 13. What other countries are similar to it?

First, country No 13 needs to be carved out of the DataFrame with the other countries.

Second, we can now pass the larger DataFrame and country 13’s data separately to an instance of FindCluster.

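from gingado.estimators import FindCluster
from sklearn.cluster import AffinityPropagation

country_of_interest = countries.pop('country_13')
similar = FindCluster(AffinityPropagation(convergence_iter=5000))
same_cluster = similar.fit_transform(X=countries, y=country_of_interest)

# fit followed by transform should give the same result as fit_transform
assert same_cluster.equals(similar.fit(X=countries, y=country_of_interest).transform(X=countries))
same_cluster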

country_2 country_9 country_41 country_48 country_49 country_52 country_60 country_64 country_66
gdpsh465 8.895082 8.151910 7.360740 6.469250 5.762051 9.224933 8.346168 7.655864 7.830028
bmp1l 0.000000 0.148400 0.418100 0.538800 0.600500 0.000000 0.319900 0.134500 0.488000
freeop 0.204244 0.110885 0.218471 0.153491 0.151848 0.204244 0.110885 0.164598 0.136287
freetar 0.009186 0.028579 0.027087 0.043888 0.024100 0.009186 0.028579 0.044446 0.046730
h65 0.260000 0.145000 0.032000 0.015000 0.002000 0.393000 0.272000 0.080000 0.146000
... ... ... ... ... ... ... ... ... ...
ex1 0.174100 0.052400 0.190500 0.069200 0.148400 0.255800 0.062500 0.052500 0.076400
im1 0.175000 0.052300 0.225700 0.074800 0.186400 0.241200 0.057800 0.057200 0.086600
xr65 1.082000 2.119000 3.949000 0.348000 7.367000 1.017000 36.603000 30.929000 40.500000
tot1 -0.010040 0.007584 0.205768 0.035226 0.007548 0.018636 0.014286 -0.004592 -0.007018
gdp 0.067051 0.039147 0.016775 -0.048712 0.024477 0.050757 -0.034045 0.046010 -0.011384

62 rows × 9 columns

The default clustering algorithm used by FindCluster is affinity propagation (Frey and Dueck 2007). It is the algorithm of choice because it combines several desirable characteristics, in particular:

  • the number of clusters is data-driven instead of set by the user;

  • the number of entities in each cluster is also chosen by the model;

  • all entities are part of a cluster; and

  • each cluster might have a different number of entities.
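These characteristics can be checked on a small example using scikit-learn's AffinityPropagation directly. The data, the damping value and max_iter below are illustrative choices, not gingado defaults; on such cleanly separated points the algorithm should converge, assign every point to a cluster, and produce clusters of unequal size.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Three well-separated groups of different sizes (4, 3 and 2 points)
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [10.0, 0.0], [10.1, 0.0],
])

# No number of clusters is passed: it is inferred from the data
ap = AffinityPropagation(damping=0.9, max_iter=1000, random_state=0).fit(points)

n_clusters = len(ap.cluster_centers_indices_)  # one exemplar per cluster
sizes = np.bincount(ap.labels_)                # number of points in each cluster
print(n_clusters, sizes)
```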

However, we may want to try different clustering algorithms. Let’s compare the result above with the same analysis using DBSCAN (Ester et al. 1996).

from sklearn.cluster import DBSCAN
similar_dbscan = FindCluster(cluster_alg=DBSCAN())
same_cluster_dbscan = similar_dbscan.fit_transform(X=countries, y=country_of_interest)

# fit followed by transform should give the same result as fit_transform
assert same_cluster_dbscan.equals(similar_dbscan.fit(X=countries, y=country_of_interest).transform(X=countries))
same_cluster_dbscan

country_0 country_1 country_2 country_3 country_4 country_5 country_6 country_7 country_8 country_9 ... country_80 country_81 country_82 country_83 country_84 country_85 country_86 country_87 country_88 country_89
gdpsh465 6.591674 6.829794 8.895082 7.565275 7.162397 7.218910 7.853605 7.703910 9.063463 8.151910 ... 9.030974 8.995537 8.234830 8.332549 8.645586 8.991064 8.025189 9.030137 8.865312 8.912339
bmp1l 0.283700 0.614100 0.000000 0.199700 0.174000 0.000000 0.000000 0.277600 0.000000 0.148400 ... 0.000000 0.000000 0.036300 0.000000 0.000000 0.000000 0.005000 0.000000 0.000000 0.000000
freeop 0.153491 0.313509 0.204244 0.248714 0.299252 0.258865 0.182525 0.215275 0.109614 0.110885 ... 0.293138 0.304720 0.288405 0.345485 0.288440 0.371898 0.296437 0.265778 0.282939 0.150366
freetar 0.043888 0.061827 0.009186 0.036270 0.037367 0.020880 0.014385 0.029713 0.002171 0.028579 ... 0.005517 0.011658 0.011589 0.006503 0.005995 0.014586 0.013615 0.008629 0.005048 0.024377
h65 0.007000 0.019000 0.260000 0.061000 0.017000 0.023000 0.039000 0.024000 0.402000 0.145000 ... 0.245000 0.246000 0.183000 0.188000 0.256000 0.255000 0.108000 0.288000 0.188000 0.257000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ex1 0.072900 0.094000 0.174100 0.126500 0.121100 0.063400 0.034200 0.086400 0.059400 0.052400 ... 0.166200 0.259700 0.104400 0.286600 0.129600 0.440700 0.166900 0.323800 0.184500 0.187600
im1 0.066700 0.143800 0.175000 0.149600 0.130800 0.076200 0.042800 0.093100 0.046000 0.052300 ... 0.161700 0.228800 0.179600 0.350000 0.145800 0.425700 0.220100 0.313400 0.194000 0.200700
xr65 0.348000 0.525000 1.082000 6.625000 2.500000 1.000000 12.499000 7.000000 1.000000 2.119000 ... 4.286000 2.460000 32.051000 0.452000 652.850000 2.529000 25.553000 4.152000 0.452000 0.886000
tot1 -0.014727 0.005750 -0.010040 -0.002195 0.003283 -0.001747 0.009092 0.011630 0.008169 0.007584 ... -0.006642 -0.003241 -0.034352 -0.001660 -0.046278 -0.011883 -0.039080 0.005175 -0.029551 -0.036482
gdp -0.024336 0.100473 0.067051 0.064089 0.027930 0.046407 0.067332 0.020978 0.033551 0.039147 ... 0.038095 0.034213 0.052759 0.038416 0.031895 0.031196 0.034096 0.046900 0.039773 0.040642

62 rows × 89 columns

As illustrated above, the results can be quite different. In this case, affinity propagation converged to more tightly defined clusters, while DBSCAN selected a cluster that contains almost all other countries (therefore, not useful in this particular case).
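One possible explanation for such a large cluster, offered here as an assumption and not verified against the gingado source or this dataset, is that DBSCAN with default parameters labels isolated points as noise (label -1). If the entities are far apart relative to the default radius (eps=0.5), all of them receive the same noise label and would therefore appear to be in the same group. The toy data below is hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Entities far apart relative to DBSCAN's default neighbourhood radius (eps=0.5)
entities = np.array([
    [0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0], [5.0, 30.0],
])

# With the defaults, no point has enough neighbours: all are labelled -1 (noise)
labels = DBSCAN().fit_predict(entities)
print(labels)

# A larger radius and a smaller minimum cluster size change the picture
labels_wide = DBSCAN(eps=15.0, min_samples=2).fit_predict(entities)
print(labels_wide)
```

In practice this suggests tuning eps and min_samples, or rescaling the data, before relying on DBSCAN to select similar entities.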

Note that model documentation is already jumpstarted when the cluster is fit.

FindCluster can also be used as part of a pipeline. In this case, only the entities in the same cluster as the entity of interest will continue on to the next steps of the estimation.

from gingado.benchmark import RegressionBenchmark
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('cluster', FindCluster(AffinityPropagation(convergence_iter=5000))),
    ('rf', RegressionBenchmark())
]).fit(X=countries, y=country_of_interest)
                 RegressionBenchmark(cv=ShuffleSplit(n_splits=10, random_state=None, test_size=None, train_size=None)))])

Causal inference

Comparative case studies


MachineControl (cluster_alg: '[BaseEstimator, ClusterMixin] | None' = AffinityPropagation(), estimator: 'BaseEstimator' = RegressionBenchmark(), manifold: 'BaseEstimator' = TSNE(), with_placebo: 'bool' = True, auto_document: 'ggdModelDocumentation' = <class 'gingado.model_documentation.ModelCard'>, random_state: 'int | None' = None)

Synthetic controls with machine learning methods

    cluster_alg (BaseEstimator | ClusterMixin | None): An instance of the clustering algorithm to use, or None to retain all entities.
    estimator (BaseEstimator): Method to weight the control entities.
    manifold (BaseEstimator): Algorithm for manifold learning.
    with_placebo (bool): Include placebo estimations during prediction?
    auto_document (ggdModelDocumentation): gingado Documenter template to facilitate model documentation.
    random_state (int | None): The random seed to be used by the algorithm, if relevant.


fit (self, X: 'pd.DataFrame', y: 'pd.DataFrame | pd.Series')

Fit the `MachineControl` model.

    X (pd.DataFrame): A pandas DataFrame with pre-intervention data of shape (n_samples, n_control_entities).
    y (pd.DataFrame | pd.Series): A pandas DataFrame or Series with pre-intervention data of shape (n_samples,).


predict (self, X: 'pd.DataFrame', y: 'pd.DataFrame | pd.Series')

Calculate the model predictions before and after the intervention.

    X (pd.DataFrame): A pandas DataFrame with complete time series (pre- and post-intervention) of shape (n_samples, n_control_entities).
    y (pd.DataFrame | pd.Series): A pandas DataFrame or Series with complete time series of shape (n_samples,).


get_controls (self)

Get the list of control entities


document (self, documenter: 'ggdModelDocumentation | None' = None)

Document the `MachineControl` model using the template in `documenter`.

    documenter (ggdModelDocumentation | None): A gingado Documenter or the documenter set in `auto_document` if None.

Brief econometric description

The goal of MachineControl is to estimate:

\[ \tau_t = Y_{1, t}^{I} - Y_{1, t}^{N}, \quad t > T_0, \]

where:

  • \(\tau_t\) is the effect on entity \(i=1\) of the intervention of interest at time \(t\)

  • without loss of generality, \(i=1\) is the entity that has undergone the intervention of interest, amongst \(N\) total entities

  • time period \(T_0\) is the date on which the intervention occurred

  • superscript \(I\) in an outcome variable denotes the occurrence of the intervention, whereas superscript \(N\) denotes its absence

  • for \(t > T_0\), \(Y_{1, t}^{I}\) is observed while \(Y_{1, t}^{N}\) must be estimated because it is a counterfactual.

\(Y_{1, t}^{N}\) is estimated from the values of the other entities, \(i \neq 1\). Collect this data in a vector \(\mathbb{Y}_{-1, t}^{N}\). Then, following Doudchenko and Imbens (2016):

\[ \hat{Y}_{1, t}^{N} = f^*(\mathbb{Y}_{-1, t}^{N}), \]

with the star (\(*\)) superscript on the function \(f(\cdot)\) representing that it was trained only with data up until the intervention date. The exact form of \(f(\cdot)\) depends on the argument estimator. A general use estimator is the random forest (Breiman 2001).
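The estimation logic can be sketched with synthetic data. This is an illustration of the idea only, not the MachineControl implementation: ordinary least squares stands in for the estimator (instead of a random forest), the data is noise-free, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_periods, t0 = 40, 30  # t0 is the index of the first post-intervention period

# Outcomes of two control entities over time (synthetic random walks)
controls = rng.normal(size=(n_periods, 2)).cumsum(axis=0)

# Treated entity: a fixed mix of the controls, plus an effect of 2.0 from t0 onwards
true_effect = 2.0
treated = 0.6 * controls[:, 0] + 0.4 * controls[:, 1]
treated[t0:] += true_effect

# f* is fitted only on pre-intervention data (least squares with an intercept)
design = np.column_stack([np.ones(n_periods), controls])
coef, *_ = np.linalg.lstsq(design[:t0], treated[:t0], rcond=None)

# Counterfactual outcome without intervention, and the estimated effect
counterfactual = design[t0:] @ coef
tau = treated[t0:] - counterfactual
print(tau.round(3))
```

Because the treated series is an exact linear combination of the controls before the intervention, the estimated effect recovers the true value in every post-intervention period; with real data the fit, and hence the estimate, would be noisy.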

The panel data itself might be the whole population in the data, or a subset when using the whole population might be too cumbersome to run analyses (eg, if the data contains too many entities). One way to select this subsample of control units without introducing subjective judgment is quantitative: the control units are selected through a clustering algorithm (argument cluster_alg). One clustering algorithm that can be used is affinity propagation (Frey and Dueck 2007).

Finally, the quality of the synthetic control can be assessed in many ways. One fully data-driven way is manifold learning, which produces lower-dimensional embeddings of higher-dimensional data. A preferred manifold learning algorithm is t-SNE (Van der Maaten and Hinton 2008).

The relative distances between the embeddings of the other entities and the target, as well as between the synthetic control and the target, indicate how likely it is that a better feasible control (either a real entity or a combination of entities) exists. The intuition behind this test is:

  • let \(d_{i,j}\) be the Euclidean distance between the embeddings (2d points) of entities \(i\) and \(j\)

  • if only a very small percentage of \(d_{1, j \in (2, ..., N)}\) are lower than \(d_{1, \text{Synthetic control}}\), then the synthetic control produced with \(f(\cdot)\) is indeed one of the best available alternatives.
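The distance comparison described above can be sketched as follows. The 2d embeddings are made-up numbers standing in for the output of a manifold learning step, and the variable names are hypothetical; the code only illustrates the share computation, not how gingado reports it.

```python
import numpy as np

# Hypothetical 2d embeddings (e.g. produced by a manifold learning step)
target = np.array([0.0, 0.0])
synthetic_control = np.array([0.5, 0.0])
other_entities = np.array([
    [0.3, 0.0],   # closer to the target than the synthetic control
    [2.0, 1.0],
    [-3.0, 0.5],
    [1.5, -2.0],
    [4.0, 4.0],
])

# Euclidean distances d_{1,j} and d_{1, synthetic control}
d_entities = np.linalg.norm(other_entities - target, axis=1)
d_synthetic = np.linalg.norm(synthetic_control - target)

# Share of entities that sit closer to the target than the synthetic control
share_closer = np.mean(d_entities < d_synthetic)
print(share_closer)
```

A small share suggests that few real entities would have served as a better control than the synthetic one.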

Main references:

  • Abadie and Gardeazabal (2003)
  • Abadie, Diamond, and Hainmueller (2010)
  • Abadie, Diamond, and Hainmueller (2015)
  • Doudchenko and Imbens (2016)
  • Abadie (2021)

Example: impact of labour reform on productivity

See Machine controls: Synthetic controls with machine learning.


