Estimators
In many instances, economists are interested in using machine learning models for specific purposes that go beyond their ability to predict variables to a good accuracy. For example:
understanding the relationship between covariates and the outcome, usually to demonstrate that a non-trivial effect of one variable on another exists;
identifying which covariates are related or not to a certain outcome, often to demonstrate the relevance of a certain theory;
estimating a measure with certain desirable statistical and econometric properties, as in causal inference, where the object of interest is the predicted outcome of an adapted algorithm; and
processing non-traditional data (eg, text) for inclusion in a traditional econometric regression, which is especially useful in settings where measurable quantitative data are complemented by this other type of data.
The gingado.estimators module contains machine learning algorithms adapted to enable the types of analyses described above. More estimators can be expected over time.
For more academic discussions of machine learning methods in economics covering a broad range of topics, see Athey and Imbens (2019).
Covariate selection
Clustering
The clustering algorithms used below are not themselves adaptations of general-use methods. Rather, the functions offer convenience functionality to find and retain the other variables in the same cluster.
These variables are usually entities (individuals, countries, stocks, etc) in a larger population.
The gingado clustering routines are designed to allow for standalone usage or seamless integration as part of a pipeline.
There are three levels of sophistication that users can choose from:
using the off-the-shelf clustering routines provided by gingado, which were selected to be applicable across various use cases;
selecting an existing clustering routine from the sklearn.cluster module; or
designing their own clustering algorithm, as sketched below.
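The sketch below illustrates these three options. It is a minimal, hypothetical example: the import path follows the gingado.estimators module described above, and MyClusterAlg is a made-up user-defined clusterer following the scikit-learn API.

```python
from sklearn.base import BaseEstimator, ClusterMixin
from sklearn.cluster import DBSCAN
from gingado.estimators import FindCluster

# 1. Off-the-shelf: gingado's default clustering algorithm
cluster_default = FindCluster()

# 2. An existing clustering routine from sklearn.cluster
cluster_sklearn = FindCluster(cluster_alg=DBSCAN())

# 3. A user-designed clustering algorithm (placeholder logic only)
class MyClusterAlg(BaseEstimator, ClusterMixin):
    def fit(self, X, y=None):
        # trivial placeholder: assign every entity to a single cluster
        self.labels_ = [0] * len(X)
        return self

cluster_custom = FindCluster(cluster_alg=MyClusterAlg())
```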
FindCluster
FindCluster(cluster_alg: [BaseEstimator, ClusterMixin] = AffinityPropagation(), auto_document: ggdModelDocumentation = ModelCard, random_state: int | None = None)

Retain only the columns of `X` that are in the same cluster as `y`.

Args:
cluster_alg (BaseEstimator | ClusterMixin): An instance of the clustering algorithm to use.
auto_document (ggdModelDocumentation): gingado Documenter template to facilitate model documentation.
random_state (int | None): The random seed to be used by the algorithm, if relevant. Defaults to None.
fit
fit(self, X, y)

Fit `FindCluster`.

Args:
X: The population of entities, organized in columns.
y: The entity of interest.
transform
transform(self, X) -> np.array

Keep only the entities in `X` that belong to the same cluster as `y`.

Args:
X: The population of entities, organized in columns.

Returns:
np.array: Columns of `X` that are in the same cluster as `y`.
fit_transform
fit_transform(self, X, y) -> np.array

Fit a `FindCluster` object and keep only the entities in `X` that belong to the same cluster as `y`.

Args:
X: The population of entities, organized in columns.
y: The entity of interest.

Returns:
np.array: Columns of `X` that are in the same cluster as `y`.
document
document(self, documenter: ggdModelDocumentation | None = None)

Document the `FindCluster` model using the template in `documenter`.

Args:
documenter (ggdModelDocumentation | None): A gingado Documenter; if None, the documenter set in `auto_document` is used. Defaults to None.
Example: finding similar countries
The Barro and Lee (1994) dataset is used to illustrate the use of FindCluster. It is a country-level dataset. Let's use it to answer the following question: for some specific country, what other countries are the closest to it, considering the data available?
First, we import the data:
The data is organised by rows: each row is a different country, and the variables are organised in columns.
The dataset is originally organised for a regression of GDP growth (here denoted y) on the covariates (X). This is not what we want to do in this case, so instead of keeping GDP as a separate variable, the next step is to include it in the X DataFrame.
```python
from gingado.datasets import load_BarroLee_1994

X, y = load_BarroLee_1994()
X['gdp'] = y
X.head()
```
| Unnamed: 0 | gdpsh465 | bmp1l | freeop | freetar | h65 | hm65 | hf65 | p65 | pm65 | ... | syr65 | syrm65 | syrf65 | teapri65 | teasec65 | ex1 | im1 | xr65 | tot1 | gdp |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 6.591674 | 0.2837 | 0.153491 | 0.043888 | 0.007 | 0.013 | 0.001 | 0.29 | 0.37 | ... | 0.033 | 0.057 | 0.010 | 47.6 | 17.3 | 0.0729 | 0.0667 | 0.348 | -0.014727 | -0.024336 |
1 | 1 | 6.829794 | 0.6141 | 0.313509 | 0.061827 | 0.019 | 0.032 | 0.007 | 0.91 | 1.00 | ... | 0.173 | 0.274 | 0.067 | 57.1 | 18.0 | 0.0940 | 0.1438 | 0.525 | 0.005750 | 0.100473 |
2 | 2 | 8.895082 | 0.0000 | 0.204244 | 0.009186 | 0.260 | 0.325 | 0.201 | 1.00 | 1.00 | ... | 2.573 | 2.478 | 2.667 | 26.5 | 20.7 | 0.1741 | 0.1750 | 1.082 | -0.010040 | 0.067051 |
3 | 3 | 7.565275 | 0.1997 | 0.248714 | 0.036270 | 0.061 | 0.070 | 0.051 | 1.00 | 1.00 | ... | 0.438 | 0.453 | 0.424 | 27.8 | 22.7 | 0.1265 | 0.1496 | 6.625 | -0.002195 | 0.064089 |
4 | 4 | 7.162397 | 0.1740 | 0.299252 | 0.037367 | 0.017 | 0.027 | 0.007 | 0.82 | 0.85 | ... | 0.257 | 0.287 | 0.229 | 34.5 | 17.6 | 0.1211 | 0.1308 | 2.500 | 0.003283 | 0.027930 |
5 rows × 63 columns
Now we remove the first column (an identifier) and transpose the DataFrame, so that countries are organized in columns.
Each country is identified by a number: 0, 1, …
```python
X = X.iloc[:, 1:]
countries = X.T
countries.columns = ['country_' + str(c) for c in countries.columns]
countries.head()
```
| country_0 | country_1 | country_2 | country_3 | country_4 | country_5 | country_6 | country_7 | country_8 | country_9 | ... | country_80 | country_81 | country_82 | country_83 | country_84 | country_85 | country_86 | country_87 | country_88 | country_89 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gdpsh465 | 6.591674 | 6.829794 | 8.895082 | 7.565275 | 7.162397 | 7.218910 | 7.853605 | 7.703910 | 9.063463 | 8.151910 | ... | 9.030974 | 8.995537 | 8.234830 | 8.332549 | 8.645586 | 8.991064 | 8.025189 | 9.030137 | 8.865312 | 8.912339 |
bmp1l | 0.283700 | 0.614100 | 0.000000 | 0.199700 | 0.174000 | 0.000000 | 0.000000 | 0.277600 | 0.000000 | 0.148400 | ... | 0.000000 | 0.000000 | 0.036300 | 0.000000 | 0.000000 | 0.000000 | 0.005000 | 0.000000 | 0.000000 | 0.000000 |
freeop | 0.153491 | 0.313509 | 0.204244 | 0.248714 | 0.299252 | 0.258865 | 0.182525 | 0.215275 | 0.109614 | 0.110885 | ... | 0.293138 | 0.304720 | 0.288405 | 0.345485 | 0.288440 | 0.371898 | 0.296437 | 0.265778 | 0.282939 | 0.150366 |
freetar | 0.043888 | 0.061827 | 0.009186 | 0.036270 | 0.037367 | 0.020880 | 0.014385 | 0.029713 | 0.002171 | 0.028579 | ... | 0.005517 | 0.011658 | 0.011589 | 0.006503 | 0.005995 | 0.014586 | 0.013615 | 0.008629 | 0.005048 | 0.024377 |
h65 | 0.007000 | 0.019000 | 0.260000 | 0.061000 | 0.017000 | 0.023000 | 0.039000 | 0.024000 | 0.402000 | 0.145000 | ... | 0.245000 | 0.246000 | 0.183000 | 0.188000 | 0.256000 | 0.255000 | 0.108000 | 0.288000 | 0.188000 | 0.257000 |
5 rows × 90 columns
Suppose we are interested in country No 13. What other countries are similar to it?
First, country No 13 needs to be carved out of the DataFrame with the other countries.
Second, we can now pass the larger DataFrame and country 13's data separately to an instance of FindCluster.
```python
country_of_interest = countries.pop('country_13')
```
```python
from gingado.estimators import FindCluster
from sklearn.cluster import AffinityPropagation

similar = FindCluster(AffinityPropagation(convergence_iter=5000))
similar
```

FindCluster(cluster_alg=AffinityPropagation(convergence_iter=5000))
```python
same_cluster = similar.fit_transform(X=countries, y=country_of_interest)
assert same_cluster.equals(similar.fit(X=countries, y=country_of_interest).transform(X=countries))
same_cluster
```
| country_2 | country_9 | country_41 | country_48 | country_49 | country_52 | country_60 | country_64 | country_66 |
---|---|---|---|---|---|---|---|---|---|
gdpsh465 | 8.895082 | 8.151910 | 7.360740 | 6.469250 | 5.762051 | 9.224933 | 8.346168 | 7.655864 | 7.830028 |
bmp1l | 0.000000 | 0.148400 | 0.418100 | 0.538800 | 0.600500 | 0.000000 | 0.319900 | 0.134500 | 0.488000 |
freeop | 0.204244 | 0.110885 | 0.218471 | 0.153491 | 0.151848 | 0.204244 | 0.110885 | 0.164598 | 0.136287 |
freetar | 0.009186 | 0.028579 | 0.027087 | 0.043888 | 0.024100 | 0.009186 | 0.028579 | 0.044446 | 0.046730 |
h65 | 0.260000 | 0.145000 | 0.032000 | 0.015000 | 0.002000 | 0.393000 | 0.272000 | 0.080000 | 0.146000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
ex1 | 0.174100 | 0.052400 | 0.190500 | 0.069200 | 0.148400 | 0.255800 | 0.062500 | 0.052500 | 0.076400 |
im1 | 0.175000 | 0.052300 | 0.225700 | 0.074800 | 0.186400 | 0.241200 | 0.057800 | 0.057200 | 0.086600 |
xr65 | 1.082000 | 2.119000 | 3.949000 | 0.348000 | 7.367000 | 1.017000 | 36.603000 | 30.929000 | 40.500000 |
tot1 | -0.010040 | 0.007584 | 0.205768 | 0.035226 | 0.007548 | 0.018636 | 0.014286 | -0.004592 | -0.007018 |
gdp | 0.067051 | 0.039147 | 0.016775 | -0.048712 | 0.024477 | 0.050757 | -0.034045 | 0.046010 | -0.011384 |
62 rows × 9 columns
The default clustering algorithm used by FindCluster is affinity propagation (Frey and Dueck 2007). It is the algorithm of choice because it combines several desirable characteristics, in particular:
the number of clusters is data-driven instead of set by the user,
the number of entities in each cluster is also chosen by the model,
all entities are part of a cluster, and
each cluster might have a different number of entities.
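To illustrate the first property, one can fit affinity propagation directly and count the clusters it finds. This is a minimal sketch using the countries DataFrame from the example above (entities are in columns, hence the transpose):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Affinity propagation picks the number of clusters from the data
ap = AffinityPropagation(convergence_iter=5000).fit(countries.T)
print(np.unique(ap.labels_).size)  # number of clusters, chosen by the model
```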
However, we may want to try different clustering algorithms. Let's compare the result above with the same analysis using DBSCAN (Ester et al. 1996).
```python
from sklearn.cluster import DBSCAN

similar_dbscan = FindCluster(cluster_alg=DBSCAN())
similar_dbscan
```

FindCluster(cluster_alg=DBSCAN())
```python
same_cluster_dbscan = similar_dbscan.fit_transform(X=countries, y=country_of_interest)
assert same_cluster_dbscan.equals(similar_dbscan.fit(X=countries, y=country_of_interest).transform(X=countries))
same_cluster_dbscan
```
| country_0 | country_1 | country_2 | country_3 | country_4 | country_5 | country_6 | country_7 | country_8 | country_9 | ... | country_80 | country_81 | country_82 | country_83 | country_84 | country_85 | country_86 | country_87 | country_88 | country_89 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gdpsh465 | 6.591674 | 6.829794 | 8.895082 | 7.565275 | 7.162397 | 7.218910 | 7.853605 | 7.703910 | 9.063463 | 8.151910 | ... | 9.030974 | 8.995537 | 8.234830 | 8.332549 | 8.645586 | 8.991064 | 8.025189 | 9.030137 | 8.865312 | 8.912339 |
bmp1l | 0.283700 | 0.614100 | 0.000000 | 0.199700 | 0.174000 | 0.000000 | 0.000000 | 0.277600 | 0.000000 | 0.148400 | ... | 0.000000 | 0.000000 | 0.036300 | 0.000000 | 0.000000 | 0.000000 | 0.005000 | 0.000000 | 0.000000 | 0.000000 |
freeop | 0.153491 | 0.313509 | 0.204244 | 0.248714 | 0.299252 | 0.258865 | 0.182525 | 0.215275 | 0.109614 | 0.110885 | ... | 0.293138 | 0.304720 | 0.288405 | 0.345485 | 0.288440 | 0.371898 | 0.296437 | 0.265778 | 0.282939 | 0.150366 |
freetar | 0.043888 | 0.061827 | 0.009186 | 0.036270 | 0.037367 | 0.020880 | 0.014385 | 0.029713 | 0.002171 | 0.028579 | ... | 0.005517 | 0.011658 | 0.011589 | 0.006503 | 0.005995 | 0.014586 | 0.013615 | 0.008629 | 0.005048 | 0.024377 |
h65 | 0.007000 | 0.019000 | 0.260000 | 0.061000 | 0.017000 | 0.023000 | 0.039000 | 0.024000 | 0.402000 | 0.145000 | ... | 0.245000 | 0.246000 | 0.183000 | 0.188000 | 0.256000 | 0.255000 | 0.108000 | 0.288000 | 0.188000 | 0.257000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
ex1 | 0.072900 | 0.094000 | 0.174100 | 0.126500 | 0.121100 | 0.063400 | 0.034200 | 0.086400 | 0.059400 | 0.052400 | ... | 0.166200 | 0.259700 | 0.104400 | 0.286600 | 0.129600 | 0.440700 | 0.166900 | 0.323800 | 0.184500 | 0.187600 |
im1 | 0.066700 | 0.143800 | 0.175000 | 0.149600 | 0.130800 | 0.076200 | 0.042800 | 0.093100 | 0.046000 | 0.052300 | ... | 0.161700 | 0.228800 | 0.179600 | 0.350000 | 0.145800 | 0.425700 | 0.220100 | 0.313400 | 0.194000 | 0.200700 |
xr65 | 0.348000 | 0.525000 | 1.082000 | 6.625000 | 2.500000 | 1.000000 | 12.499000 | 7.000000 | 1.000000 | 2.119000 | ... | 4.286000 | 2.460000 | 32.051000 | 0.452000 | 652.850000 | 2.529000 | 25.553000 | 4.152000 | 0.452000 | 0.886000 |
tot1 | -0.014727 | 0.005750 | -0.010040 | -0.002195 | 0.003283 | -0.001747 | 0.009092 | 0.011630 | 0.008169 | 0.007584 | ... | -0.006642 | -0.003241 | -0.034352 | -0.001660 | -0.046278 | -0.011883 | -0.039080 | 0.005175 | -0.029551 | -0.036482 |
gdp | -0.024336 | 0.100473 | 0.067051 | 0.064089 | 0.027930 | 0.046407 | 0.067332 | 0.020978 | 0.033551 | 0.039147 | ... | 0.038095 | 0.034213 | 0.052759 | 0.038416 | 0.031895 | 0.031196 | 0.034096 | 0.046900 | 0.039773 | 0.040642 |
62 rows × 89 columns
As illustrated above, the results can be quite different. In this case, affinity propagation converged to more tightly defined clusters, while DBSCAN selected a cluster that contains almost all other countries (therefore, not useful in this particular case).
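A quick way to see the difference is to compare how many peers each algorithm placed in the same cluster as country 13, using the objects created above:

```python
# Number of peers in the same cluster as country_13
print(same_cluster.shape[1])         # 9 with affinity propagation
print(same_cluster_dbscan.shape[1])  # 89 with DBSCAN: nearly all countries
```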
Note that model documentation is already jumpstarted when the cluster is fit. A glimpse of the current template, including the questions that have been automatically filled in, is shown below.
```python
similar.model_documentation.show_json()
```
{'model_details': {'developer': 'Person or organisation developing the model',
'datetime': '2024-02-27 08:49:13 ',
'version': 'Model version',
'type': 'Model type',
'info': {'_estimator_type': 'clusterer',
'affinity_matrix_': array([[-4.23373922e+08, -5.97375771e+07, -5.35974361e+07, ...,
-1.92434215e+09, -8.60822083e+07, -3.77976931e+07],
[-5.97375771e+07, -4.23373922e+08, -2.26471602e+08, ...,
-2.66217555e+09, -2.43057326e+06, -1.92555486e+08],
[-5.35974361e+07, -2.26471602e+08, -4.23373922e+08, ...,
-1.33575671e+09, -2.75395788e+08, -1.37934978e+06],
...,
[-1.92434215e+09, -2.66217555e+09, -1.33575671e+09, ...,
-4.23373922e+08, -2.82418157e+09, -1.42280304e+09],
[-8.60822083e+07, -2.43057326e+06, -2.75395788e+08, ...,
-2.82418157e+09, -4.23373922e+08, -2.37881124e+08],
[-3.77976931e+07, -1.92555486e+08, -1.37934978e+06, ...,
-1.42280304e+09, -2.37881124e+08, -4.23373922e+08]]),
'cluster_centers_': array([[ 6.82979374e+00, 6.14100000e-01, 3.13509000e-01, ...,
5.25000000e-01, 5.75000000e-03, 1.00472567e-01],
[ 8.89508153e+00, 0.00000000e+00, 2.04244000e-01, ...,
1.08200000e+00, -1.00400000e-02, 6.70514822e-02],
[ 7.56527528e+00, 1.99700000e-01, 2.48714000e-01, ...,
6.62500000e+00, -2.19500000e-03, 6.40891662e-02],
...,
[ 8.33254894e+00, 0.00000000e+00, 3.45485000e-01, ...,
4.52000000e-01, -1.66000000e-03, 3.84156381e-02],
[ 8.86531163e+00, 0.00000000e+00, 2.82939000e-01, ...,
4.52000000e-01, -2.95510000e-02, 3.97733722e-02],
[ 8.91233857e+00, 0.00000000e+00, 1.50366000e-01, ...,
8.86000000e-01, -3.64820000e-02, 4.06415381e-02]]),
'cluster_centers_indices_': array([ 1, 2, 3, 4, 5, 7, 8, 10, 13, 14, 16, 18, 19, 25, 27, 32, 35,
39, 42, 45, 46, 49, 50, 52, 53, 55, 57, 58, 60, 62, 67, 68, 69, 71,
76, 82, 87, 88], dtype=int64),
'feature_names_in_': array(['gdpsh465', 'bmp1l', 'freeop', 'freetar', 'h65', 'hm65', 'hf65',
'p65', 'pm65', 'pf65', 's65', 'sm65', 'sf65', 'fert65', 'mort65',
'lifee065', 'gpop1', 'fert1', 'mort1', 'invsh41', 'geetot1',
'geerec1', 'gde1', 'govwb1', 'govsh41', 'gvxdxe41', 'high65',
'highm65', 'highf65', 'highc65', 'highcm65', 'highcf65', 'human65',
'humanm65', 'humanf65', 'hyr65', 'hyrm65', 'hyrf65', 'no65',
'nom65', 'nof65', 'pinstab1', 'pop65', 'worker65', 'pop1565',
'pop6565', 'sec65', 'secm65', 'secf65', 'secc65', 'seccm65',
'seccf65', 'syr65', 'syrm65', 'syrf65', 'teapri65', 'teasec65',
'ex1', 'im1', 'xr65', 'tot1', 'gdp'], dtype=object),
'labels_': array([29, 0, 1, 2, 3, 4, 18, 5, 6, 1, 7, 30, 14, 8, 9, 29, 10,
29, 11, 12, 12, 18, 29, 36, 18, 13, 18, 14, 29, 36, 36, 14, 15, 36,
29, 16, 18, 14, 36, 17, 1, 14, 18, 29, 29, 19, 20, 1, 1, 21, 22,
1, 23, 24, 21, 25, 36, 26, 27, 1, 28, 12, 29, 1, 14, 1, 29, 30,
31, 32, 12, 33, 18, 29, 30, 18, 34, 14, 18, 36, 36, 29, 35, 36, 29,
29, 14, 36, 37, 1], dtype=int64),
'n_features_in_': 62,
'n_iter_': 200},
'paper': 'Paper or other resource for more information',
'citation': 'Citation details',
'license': 'License',
'contact': 'Where to send questions or comments about the model'},
'intended_use': {'primary_uses': 'Primary intended uses',
'primary_users': 'Primary intended users',
'out_of_scope': 'Out-of-scope use cases'},
'factors': {'relevant': 'Relevant factors',
'evaluation': 'Evaluation factors'},
'metrics': {'performance_measures': 'Model performance measures',
'thresholds': 'Decision thresholds',
'variation_approaches': 'Variation approaches'},
'evaluation_data': {'datasets': 'Datasets',
'motivation': 'Motivation',
'preprocessing': 'Preprocessing'},
'training_data': {'training_data': 'Information on training data'},
'quant_analyses': {'unitary': 'Unitary results',
'intersectional': 'Intersectional results'},
'ethical_considerations': {'sensitive_data': 'Does the model use any sensitive data (e.g., protected classes)?',
'human_life': 'Is the model intended to inform decisions about matters central to human life or flourishing - e.g., health or safety? Or could it be used in such a way?',
'mitigations': 'What risk mitigation strategies were used during model development?',
'risks_and_harms': 'What risks may be present in model usage? Try to identify the potential recipients, likelihood, and magnitude of harms. If these cannot be determined, note that they were considered but remain unknown',
'use_cases': 'Are there any known model use cases that are especially fraught?',
'additional_information': 'If possible, this section should also include any additional ethical considerations that went into model development, for example, review by an external board, or testing with a specific community.'},
'caveats_recommendations': {'caveats': 'For example, did the results suggest any further testing? Were there any relevant groups that were not represented in the evaluation dataset?',
'recommendations': 'Are there additional recommendations for model use? What are the ideal characteristics of an evaluation dataset for this model?'}}
FindCluster can also be used as part of a pipeline. In this case, only the entities in the same cluster as the entity of interest will continue on to the next steps of the estimation.
```python
from gingado.benchmark import RegressionBenchmark
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('cluster', FindCluster(AffinityPropagation(convergence_iter=5000))),
    ('rf', RegressionBenchmark())
])
pipe.fit(X=countries, y=country_of_interest)
```
Pipeline(steps=[('cluster', FindCluster(cluster_alg=AffinityPropagation(convergence_iter=5000))), ('rf', RegressionBenchmark(cv=ShuffleSplit(n_splits=10, random_state=None, test_size=None, train_size=None)))])
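Once fitted, the pipeline behaves like any other scikit-learn pipeline; for instance, one might obtain in-sample predictions (a sketch, not part of the original example):

```python
# In-sample predictions: FindCluster filters the columns, then the
# benchmark regressor predicts the entity of interest
y_hat = pipe.predict(countries)
```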
Causal inference
Comparative case studies
MachineControl
MachineControl(cluster_alg: [BaseEstimator, ClusterMixin] | None = AffinityPropagation(), estimator: BaseEstimator = RegressionBenchmark(), manifold: BaseEstimator = TSNE(), with_placebo: bool = True, auto_document: ggdModelDocumentation = ModelCard, random_state: int | None = None)

Synthetic controls with machine learning methods.

Args:
cluster_alg (BaseEstimator | ClusterMixin | None): An instance of the clustering algorithm to use, or None to retain all entities.
estimator (BaseEstimator): Method to weight the control entities.
manifold (BaseEstimator): Algorithm for manifold learning.
with_placebo (bool): Whether to include placebo estimations during prediction.
auto_document (ggdModelDocumentation): gingado Documenter template to facilitate model documentation.
random_state (int | None): The random seed to be used by the algorithm, if relevant.
fit
fit(self, X: pd.DataFrame, y: pd.DataFrame | pd.Series)

Fit the `MachineControl` model.

Args:
X (pd.DataFrame): A pandas DataFrame with pre-intervention data of shape (n_samples, n_control_entities).
y (pd.DataFrame | pd.Series): A pandas DataFrame or Series with pre-intervention data of shape (n_samples,).
predict
predict(self, X: pd.DataFrame, y: pd.DataFrame | pd.Series)

Calculate the model predictions before and after the intervention.

Args:
X (pd.DataFrame): A pandas DataFrame with the complete time series (pre- and post-intervention) of shape (n_samples, n_control_entities).
y (pd.DataFrame | pd.Series): A pandas DataFrame or Series with the complete time series of shape (n_samples,).
get_controls
get_controls(self)

Get the list of control entities.
document
document(self, documenter: ggdModelDocumentation | None = None)

Document the `MachineControl` model using the template in `documenter`.

Args:
documenter (ggdModelDocumentation | None): A gingado Documenter; if None, the documenter set in `auto_document` is used.
Brief econometric description
The goal of MachineControl is to estimate:
\[ \tau_t = Y_{1, t}^{I} - Y_{1, t}^{N}, \quad t > T_0 \]
where:
\(\tau_t\) is the effect at time \(t\) on entity \(i=1\) of the intervention of interest
without loss of generality, \(i=1\) is the entity that has undergone the intervention of interest, amongst \(N\) total entities
time period \(T_0\) is the date on which the intervention occurred
superscript \(I\) in an outcome variable denotes the occurrence of the intervention, whereas superscript \(N\) denotes its absence
for \(t > T_0\), \(Y_{1, t}^{I}\) is observed while \(Y_{1, t}^{N}\) must be estimated because it is a counterfactual.
\(Y_{1, t}^{N}\) is calculated from the values of the other entities, \(i \neq 1\). Collect these data in a vector \(\mathbb{Y}_{-1, t}^{N}\). Then, following Doudchenko and Imbens (2016):
\[ \hat{Y}_{1, t}^{N} = f^*(\mathbb{Y}_{-1, t}^{N}), \]
with the star (\(*\)) superscript on the function \(f(\cdot)\) indicating that it was trained only with data up until the intervention date. The exact form of \(f(\cdot)\) depends on the argument estimator. A general-purpose estimator is the random forest (Breiman 2001).
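A minimal sketch of the workflow, based on the API above: the panel data here are randomly generated placeholders, and the random forest is passed explicitly as the estimator. The import path assumes MachineControl lives in the gingado.estimators module described in this section.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from gingado.estimators import MachineControl

# Hypothetical panel: 60 periods (intervention after period 40), 10 controls
rng = np.random.default_rng(42)
Y_all = pd.DataFrame(rng.normal(size=(60, 10)),
                     columns=[f"control_{i}" for i in range(10)])
y_all = Y_all.mean(axis=1) + rng.normal(scale=0.1, size=60)
y_all.iloc[40:] += 1.0  # stylised intervention effect after T_0

Y_pre, y_pre = Y_all.iloc[:40], y_all.iloc[:40]  # pre-intervention sample

mc = MachineControl(estimator=RandomForestRegressor())
mc.fit(X=Y_pre, y=y_pre)                # train f* on pre-intervention data
results = mc.predict(X=Y_all, y=y_all)  # predictions before and after T_0
print(mc.get_controls())                # entities retained as controls
```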
The panel data itself might comprise the whole population in the data, or a subset when using the whole population would be too cumbersome for the analysis (eg, if the data contain too many entities). One way to select this subsample of control units without introducing subjective judgment is quantitative: the control units are selected through a clustering algorithm (argument cluster_alg). One clustering algorithm that can be used is affinity propagation (Frey and Dueck 2007).
Finally, the quality of the synthetic control can be assessed in many ways. One fully data-driven way to achieve this is manifold learning: lower-dimensional embeddings of higher-dimensional data. A preferred manifold learning algorithm is t-SNE (Van der Maaten and Hinton 2008).
The relative distances between the embeddings of the control entities and the target, and between the synthetic control and the target, represent the chance that a better feasible control (either an actual entity or a combination of entities) exists. The intuition behind this test is:
let \(d_{i,j}\) be the Euclidean distance between the embeddings (2d points) of entities \(i\) and \(j\)
if only a very small percentage of the \(d_{1, j},\ j \in \{2, ..., N\}\) are lower than \(d_{1, \text{synthetic control}}\), then the synthetic control produced with \(f(\cdot)\) indeed provides one of the best feasible alternatives.
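This test can be sketched in a few lines, assuming emb is an array of 2-D embeddings with the target entity in row 0, the synthetic control in the last row, and the control entities in between (all names here are hypothetical):

```python
import numpy as np

def distance_test(emb: np.ndarray) -> float:
    """Share of control entities closer to the target than the synthetic control."""
    target, synth, controls = emb[0], emb[-1], emb[1:-1]
    d_controls = np.linalg.norm(controls - target, axis=1)  # d_{1, j}
    d_synth = np.linalg.norm(synth - target)                # d_{1, synthetic control}
    return float(np.mean(d_controls < d_synth))

# A small share suggests f(.) indeed yields one of the best feasible controls.
```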
Main references:
Athey, S and G Imbens (2019): "Machine learning methods that economists should know about", Annual Review of Economics, vol 11, pp 685–725.
Barro, R and J-W Lee (1994): "Sources of economic growth", Carnegie-Rochester Conference Series on Public Policy, vol 40, pp 1–46.
Breiman, L (2001): "Random forests", Machine Learning, vol 45, no 1, pp 5–32.
Doudchenko, N and G Imbens (2016): "Balancing, regression, difference-in-differences and synthetic control methods: a synthesis", NBER Working Paper, no 22791.
Ester, M, H-P Kriegel, J Sander and X Xu (1996): "A density-based algorithm for discovering clusters in large spatial databases with noise", Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pp 226–231.
Frey, B and D Dueck (2007): "Clustering by passing messages between data points", Science, vol 315, no 5814, pp 972–976.
Van der Maaten, L and G Hinton (2008): "Visualizing data using t-SNE", Journal of Machine Learning Research, vol 9, pp 2579–2605.
Example: impact of labour reform on productivity
See Machine controls: Synthetic controls with machine learning.