Estimators

Machine learning-based estimators of economic models

In many instances, economists are interested in using machine learning models for specific purposes that go beyond their ability to predict variables with good accuracy. For example:

  • selecting covariates, such as finding the entities in a population that are most similar to an entity of interest; or

  • estimating causal effects, such as constructing synthetic controls for comparative case studies.

The gingado.estimators module contains machine learning algorithms adapted to enable the types of analyses described above. More estimators are expected to be added over time.

For more academic discussions of machine learning methods in economics covering a broad range of topics, see Athey and Imbens (2019).

Covariate selection

Clustering

The clustering algorithms used below are not themselves adaptations of general-purpose methods. Rather, the functions offer convenience functionality on top of them: finding and retaining the other variables that fall in the same cluster as the variable of interest.

These variables are usually entities (individuals, countries, stocks, etc) in a larger population.

The gingado clustering routines are designed for standalone use or for seamless integration as part of a pipeline.

There are three levels of sophistication that users can choose from, as the sketch after this list illustrates:

  • using the off-the-shelf clustering routines provided by gingado, which were selected to be applicable across various use cases;

  • selecting an existing clustering routine from the sklearn.cluster module; or

  • designing their own clustering algorithm.
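
A minimal sketch of the three levels is below. It assumes FindCluster can be imported from gingado.estimators, and that a custom clusterer only needs to follow the scikit-learn clusterer API (a fit method that sets labels_); both assumptions should be checked against the installed gingado version.

import numpy as np
from sklearn.base import BaseEstimator, ClusterMixin
from sklearn.cluster import KMeans

from gingado.estimators import FindCluster  # assumed import path

# Level 1: gingado's off-the-shelf default (affinity propagation)
finder_default = FindCluster()

# Level 2: any existing routine from sklearn.cluster
finder_kmeans = FindCluster(cluster_alg=KMeans(n_clusters=5))

# Level 3: a user-designed clusterer; the assumption here is that anything
# following the scikit-learn clusterer API (a `fit` that sets `labels_`)
# can be plugged in
class PairwiseClusterer(BaseEstimator, ClusterMixin):
    def fit(self, X, y=None):
        self.labels_ = np.arange(X.shape[0]) // 2  # toy rule: group samples two by two
        return self

finder_custom = FindCluster(cluster_alg=PairwiseClusterer())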

FindCluster

FindCluster (cluster_alg: '[BaseEstimator, ClusterMixin]' = AffinityPropagation(), auto_document: 'ggdModelDocumentation' = <class 'gingado.model_documentation.ModelCard'>, random_state: 'int | None' = None)

Retain only the columns of `X` that are in the same cluster as `y`.

Args:
    cluster_alg (BaseEstimator|ClusterMixin): An instance of the clustering algorithm to use.
    auto_document (ggdModelDocumentation): gingado Documenter template to facilitate model documentation.
    random_state (int|None): The random seed to be used by the algorithm, if relevant. Defaults to None.

fit

fit (self, X, y)

Fit `FindCluster`.

Args:
    X: The population of entities, organized in columns.
    y: The entity of interest.

transform

transform (self, X) -> 'np.array'

Keep only the entities in `X` that belong to the same cluster as `y`.

Args:
    X: The population of entities, organized in columns.

Returns:
    np.array: Columns of `X` that are in the same cluster as `y`.

fit_transform

fit_transform (self, X, y) -> 'np.array'

Fit a `FindCluster` object and keep only the entities in `X` that belong to the same cluster as `y`.

Args:
    X: The population of entities, organized in columns.
    y: The entity of interest.

Returns:
    np.array: Columns of `X` that are in the same cluster as `y`.

document

document (self, documenter: 'ggdModelDocumentation | None' = None)

Document the `FindCluster` model using the template in `documenter`.

Args:
    documenter (ggdModelDocumentation|None): A gingado Documenter or the documenter set in `auto_document` if None.
        Defaults to None.
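
As an illustration, a fitted FindCluster could be re-documented with a different template at any time. The snippet below assumes ForecastCard is another Documenter available in gingado.model_documentation, and uses finder as a stand-in for a fitted instance.

from gingado.model_documentation import ForecastCard  # assumption: another gingado template

# `finder` stands for a fitted FindCluster instance
finder.document(documenter=ForecastCard)
finder.model_documentation.show_json()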

Example: finding similar countries

The Barro and Lee (1994) dataset is used to illustrate the use of FindCluster. It is a country-level dataset. Let’s use it to answer the following question: for some specific country, what other countries are the closest to it considering the data available?

First, we import the function that loads the data:

from gingado.datasets import load_BarroLee_1994

The data is organised by rows: each row is a different country, and the variables are organised in columns.

The dataset is originally organised for a regression of GDP growth (here denoted y) on the covariates (X). This is not what we want to do in this case. So instead of keeping GDP as a separate variable, the next step is to include it in the X DataFrame.

X, y = load_BarroLee_1994()
X['gdp'] = y
X.head()
Unnamed: 0 gdpsh465 bmp1l freeop freetar h65 hm65 hf65 p65 pm65 ... syr65 syrm65 syrf65 teapri65 teasec65 ex1 im1 xr65 tot1 gdp
0 0 6.591674 0.2837 0.153491 0.043888 0.007 0.013 0.001 0.29 0.37 ... 0.033 0.057 0.010 47.6 17.3 0.0729 0.0667 0.348 -0.014727 -0.024336
1 1 6.829794 0.6141 0.313509 0.061827 0.019 0.032 0.007 0.91 1.00 ... 0.173 0.274 0.067 57.1 18.0 0.0940 0.1438 0.525 0.005750 0.100473
2 2 8.895082 0.0000 0.204244 0.009186 0.260 0.325 0.201 1.00 1.00 ... 2.573 2.478 2.667 26.5 20.7 0.1741 0.1750 1.082 -0.010040 0.067051
3 3 7.565275 0.1997 0.248714 0.036270 0.061 0.070 0.051 1.00 1.00 ... 0.438 0.453 0.424 27.8 22.7 0.1265 0.1496 6.625 -0.002195 0.064089
4 4 7.162397 0.1740 0.299252 0.037367 0.017 0.027 0.007 0.82 0.85 ... 0.257 0.287 0.229 34.5 17.6 0.1211 0.1308 2.500 0.003283 0.027930

5 rows × 63 columns

Now we remove the first column (an identifier) and transpose the DataFrame, so that countries are organized in columns.

Each country is identified by a number: 0, 1, …

X = X.iloc[:, 1:]
countries = X.T
countries.columns = ['country_' + str(c) for c in countries.columns]
countries.head()
country_0 country_1 country_2 country_3 country_4 country_5 country_6 country_7 country_8 country_9 ... country_80 country_81 country_82 country_83 country_84 country_85 country_86 country_87 country_88 country_89
gdpsh465 6.591674 6.829794 8.895082 7.565275 7.162397 7.218910 7.853605 7.703910 9.063463 8.151910 ... 9.030974 8.995537 8.234830 8.332549 8.645586 8.991064 8.025189 9.030137 8.865312 8.912339
bmp1l 0.283700 0.614100 0.000000 0.199700 0.174000 0.000000 0.000000 0.277600 0.000000 0.148400 ... 0.000000 0.000000 0.036300 0.000000 0.000000 0.000000 0.005000 0.000000 0.000000 0.000000
freeop 0.153491 0.313509 0.204244 0.248714 0.299252 0.258865 0.182525 0.215275 0.109614 0.110885 ... 0.293138 0.304720 0.288405 0.345485 0.288440 0.371898 0.296437 0.265778 0.282939 0.150366
freetar 0.043888 0.061827 0.009186 0.036270 0.037367 0.020880 0.014385 0.029713 0.002171 0.028579 ... 0.005517 0.011658 0.011589 0.006503 0.005995 0.014586 0.013615 0.008629 0.005048 0.024377
h65 0.007000 0.019000 0.260000 0.061000 0.017000 0.023000 0.039000 0.024000 0.402000 0.145000 ... 0.245000 0.246000 0.183000 0.188000 0.256000 0.255000 0.108000 0.288000 0.188000 0.257000

5 rows × 90 columns

Suppose we are interested in country No 13. What other countries are similar to it?

First, country No 13 needs to be carved out of the DataFrame with the other countries.

Second, we pass the larger DataFrame and country 13’s data separately to an instance of FindCluster.

country_of_interest = countries.pop('country_13')
similar = FindCluster(AffinityPropagation(convergence_iter=5000))
similar
FindCluster(cluster_alg=AffinityPropagation(convergence_iter=5000))
same_cluster = similar.fit_transform(X=countries, y=country_of_interest)

assert same_cluster.equals(similar.fit(X=countries, y=country_of_interest).transform(X=countries))

same_cluster
country_2 country_9 country_41 country_48 country_49 country_52 country_60 country_64 country_66
gdpsh465 8.895082 8.151910 7.360740 6.469250 5.762051 9.224933 8.346168 7.655864 7.830028
bmp1l 0.000000 0.148400 0.418100 0.538800 0.600500 0.000000 0.319900 0.134500 0.488000
freeop 0.204244 0.110885 0.218471 0.153491 0.151848 0.204244 0.110885 0.164598 0.136287
freetar 0.009186 0.028579 0.027087 0.043888 0.024100 0.009186 0.028579 0.044446 0.046730
h65 0.260000 0.145000 0.032000 0.015000 0.002000 0.393000 0.272000 0.080000 0.146000
... ... ... ... ... ... ... ... ... ...
ex1 0.174100 0.052400 0.190500 0.069200 0.148400 0.255800 0.062500 0.052500 0.076400
im1 0.175000 0.052300 0.225700 0.074800 0.186400 0.241200 0.057800 0.057200 0.086600
xr65 1.082000 2.119000 3.949000 0.348000 7.367000 1.017000 36.603000 30.929000 40.500000
tot1 -0.010040 0.007584 0.205768 0.035226 0.007548 0.018636 0.014286 -0.004592 -0.007018
gdp 0.067051 0.039147 0.016775 -0.048712 0.024477 0.050757 -0.034045 0.046010 -0.011384

62 rows × 9 columns

The default clustering algorithm used by FindCluster is affinity propagation (Frey and Dueck 2007). It is the algorithm of choice because it combines several desirable characteristics, in particular:

  • the number of clusters is data-driven instead of set by the user;

  • the number of entities in each cluster is also chosen by the model;

  • all entities are part of a cluster; and

  • each cluster might have a different number of entities.
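
One way to see the data-driven number of clusters at work is to inspect the fitted algorithm directly. The sketch below assumes the fitted clusterer remains accessible through the cluster_alg attribute of the FindCluster instance (the attribute name mirrors the constructor argument but should be verified).

import numpy as np

# assumption: the fitted clustering algorithm is reachable via `cluster_alg`
labels = similar.cluster_alg.labels_
print(f"{np.unique(labels).size} clusters across {labels.size} entities")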

However, we may want to try different clustering algorithms. Let’s compare the result above with the same analysis using DBSCAN (Ester et al. 1996).

from sklearn.cluster import DBSCAN
similar_dbscan = FindCluster(cluster_alg=DBSCAN())
similar_dbscan
FindCluster(cluster_alg=DBSCAN())
same_cluster_dbscan = similar_dbscan.fit_transform(X=countries, y=country_of_interest)

assert same_cluster_dbscan.equals(similar_dbscan.fit(X=countries, y=country_of_interest).transform(X=countries))

same_cluster_dbscan
country_0 country_1 country_2 country_3 country_4 country_5 country_6 country_7 country_8 country_9 ... country_80 country_81 country_82 country_83 country_84 country_85 country_86 country_87 country_88 country_89
gdpsh465 6.591674 6.829794 8.895082 7.565275 7.162397 7.218910 7.853605 7.703910 9.063463 8.151910 ... 9.030974 8.995537 8.234830 8.332549 8.645586 8.991064 8.025189 9.030137 8.865312 8.912339
bmp1l 0.283700 0.614100 0.000000 0.199700 0.174000 0.000000 0.000000 0.277600 0.000000 0.148400 ... 0.000000 0.000000 0.036300 0.000000 0.000000 0.000000 0.005000 0.000000 0.000000 0.000000
freeop 0.153491 0.313509 0.204244 0.248714 0.299252 0.258865 0.182525 0.215275 0.109614 0.110885 ... 0.293138 0.304720 0.288405 0.345485 0.288440 0.371898 0.296437 0.265778 0.282939 0.150366
freetar 0.043888 0.061827 0.009186 0.036270 0.037367 0.020880 0.014385 0.029713 0.002171 0.028579 ... 0.005517 0.011658 0.011589 0.006503 0.005995 0.014586 0.013615 0.008629 0.005048 0.024377
h65 0.007000 0.019000 0.260000 0.061000 0.017000 0.023000 0.039000 0.024000 0.402000 0.145000 ... 0.245000 0.246000 0.183000 0.188000 0.256000 0.255000 0.108000 0.288000 0.188000 0.257000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ex1 0.072900 0.094000 0.174100 0.126500 0.121100 0.063400 0.034200 0.086400 0.059400 0.052400 ... 0.166200 0.259700 0.104400 0.286600 0.129600 0.440700 0.166900 0.323800 0.184500 0.187600
im1 0.066700 0.143800 0.175000 0.149600 0.130800 0.076200 0.042800 0.093100 0.046000 0.052300 ... 0.161700 0.228800 0.179600 0.350000 0.145800 0.425700 0.220100 0.313400 0.194000 0.200700
xr65 0.348000 0.525000 1.082000 6.625000 2.500000 1.000000 12.499000 7.000000 1.000000 2.119000 ... 4.286000 2.460000 32.051000 0.452000 652.850000 2.529000 25.553000 4.152000 0.452000 0.886000
tot1 -0.014727 0.005750 -0.010040 -0.002195 0.003283 -0.001747 0.009092 0.011630 0.008169 0.007584 ... -0.006642 -0.003241 -0.034352 -0.001660 -0.046278 -0.011883 -0.039080 0.005175 -0.029551 -0.036482
gdp -0.024336 0.100473 0.067051 0.064089 0.027930 0.046407 0.067332 0.020978 0.033551 0.039147 ... 0.038095 0.034213 0.052759 0.038416 0.031895 0.031196 0.034096 0.046900 0.039773 0.040642

62 rows × 89 columns

As illustrated above, the results can be quite different. In this case, affinity propagation converged to a more tightly defined cluster, while DBSCAN selected a cluster that contains almost all other countries (and is therefore not useful in this particular case).
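
A quick numerical check of how selective each algorithm was in this example:

print(f"affinity propagation: {same_cluster.shape[1]} countries in the same cluster")
print(f"DBSCAN: {same_cluster_dbscan.shape[1]} countries in the same cluster")

With the results above, these are 9 and 89 countries respectively.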

Note that model documentation is already jumpstarted when the cluster is fit. A glimpse of the current template, including the questions that have been automatically filled, is shown below.

similar.model_documentation.show_json()
{'model_details': {'developer': 'Person or organisation developing the model',
  'datetime': '2024-02-27 08:49:13 ',
  'version': 'Model version',
  'type': 'Model type',
  'info': {'_estimator_type': 'clusterer',
   'affinity_matrix_': array([[-4.23373922e+08, -5.97375771e+07, -5.35974361e+07, ...,
           -1.92434215e+09, -8.60822083e+07, -3.77976931e+07],
          [-5.97375771e+07, -4.23373922e+08, -2.26471602e+08, ...,
           -2.66217555e+09, -2.43057326e+06, -1.92555486e+08],
          [-5.35974361e+07, -2.26471602e+08, -4.23373922e+08, ...,
           -1.33575671e+09, -2.75395788e+08, -1.37934978e+06],
          ...,
          [-1.92434215e+09, -2.66217555e+09, -1.33575671e+09, ...,
           -4.23373922e+08, -2.82418157e+09, -1.42280304e+09],
          [-8.60822083e+07, -2.43057326e+06, -2.75395788e+08, ...,
           -2.82418157e+09, -4.23373922e+08, -2.37881124e+08],
          [-3.77976931e+07, -1.92555486e+08, -1.37934978e+06, ...,
           -1.42280304e+09, -2.37881124e+08, -4.23373922e+08]]),
   'cluster_centers_': array([[ 6.82979374e+00,  6.14100000e-01,  3.13509000e-01, ...,
            5.25000000e-01,  5.75000000e-03,  1.00472567e-01],
          [ 8.89508153e+00,  0.00000000e+00,  2.04244000e-01, ...,
            1.08200000e+00, -1.00400000e-02,  6.70514822e-02],
          [ 7.56527528e+00,  1.99700000e-01,  2.48714000e-01, ...,
            6.62500000e+00, -2.19500000e-03,  6.40891662e-02],
          ...,
          [ 8.33254894e+00,  0.00000000e+00,  3.45485000e-01, ...,
            4.52000000e-01, -1.66000000e-03,  3.84156381e-02],
          [ 8.86531163e+00,  0.00000000e+00,  2.82939000e-01, ...,
            4.52000000e-01, -2.95510000e-02,  3.97733722e-02],
          [ 8.91233857e+00,  0.00000000e+00,  1.50366000e-01, ...,
            8.86000000e-01, -3.64820000e-02,  4.06415381e-02]]),
   'cluster_centers_indices_': array([ 1,  2,  3,  4,  5,  7,  8, 10, 13, 14, 16, 18, 19, 25, 27, 32, 35,
          39, 42, 45, 46, 49, 50, 52, 53, 55, 57, 58, 60, 62, 67, 68, 69, 71,
          76, 82, 87, 88], dtype=int64),
   'feature_names_in_': array(['gdpsh465', 'bmp1l', 'freeop', 'freetar', 'h65', 'hm65', 'hf65',
          'p65', 'pm65', 'pf65', 's65', 'sm65', 'sf65', 'fert65', 'mort65',
          'lifee065', 'gpop1', 'fert1', 'mort1', 'invsh41', 'geetot1',
          'geerec1', 'gde1', 'govwb1', 'govsh41', 'gvxdxe41', 'high65',
          'highm65', 'highf65', 'highc65', 'highcm65', 'highcf65', 'human65',
          'humanm65', 'humanf65', 'hyr65', 'hyrm65', 'hyrf65', 'no65',
          'nom65', 'nof65', 'pinstab1', 'pop65', 'worker65', 'pop1565',
          'pop6565', 'sec65', 'secm65', 'secf65', 'secc65', 'seccm65',
          'seccf65', 'syr65', 'syrm65', 'syrf65', 'teapri65', 'teasec65',
          'ex1', 'im1', 'xr65', 'tot1', 'gdp'], dtype=object),
   'labels_': array([29,  0,  1,  2,  3,  4, 18,  5,  6,  1,  7, 30, 14,  8,  9, 29, 10,
          29, 11, 12, 12, 18, 29, 36, 18, 13, 18, 14, 29, 36, 36, 14, 15, 36,
          29, 16, 18, 14, 36, 17,  1, 14, 18, 29, 29, 19, 20,  1,  1, 21, 22,
           1, 23, 24, 21, 25, 36, 26, 27,  1, 28, 12, 29,  1, 14,  1, 29, 30,
          31, 32, 12, 33, 18, 29, 30, 18, 34, 14, 18, 36, 36, 29, 35, 36, 29,
          29, 14, 36, 37,  1], dtype=int64),
   'n_features_in_': 62,
   'n_iter_': 200},
  'paper': 'Paper or other resource for more information',
  'citation': 'Citation details',
  'license': 'License',
  'contact': 'Where to send questions or comments about the model'},
 'intended_use': {'primary_uses': 'Primary intended uses',
  'primary_users': 'Primary intended users',
  'out_of_scope': 'Out-of-scope use cases'},
 'factors': {'relevant': 'Relevant factors',
  'evaluation': 'Evaluation factors'},
 'metrics': {'performance_measures': 'Model performance measures',
  'thresholds': 'Decision thresholds',
  'variation_approaches': 'Variation approaches'},
 'evaluation_data': {'datasets': 'Datasets',
  'motivation': 'Motivation',
  'preprocessing': 'Preprocessing'},
 'training_data': {'training_data': 'Information on training data'},
 'quant_analyses': {'unitary': 'Unitary results',
  'intersectional': 'Intersectional results'},
 'ethical_considerations': {'sensitive_data': 'Does the model use any sensitive data (e.g., protected classes)?',
  'human_life': 'Is the model intended to inform decisions about matters central to human life or flourishing - e.g., health or safety? Or could it be used in such a way?',
  'mitigations': 'What risk mitigation strategies were used during model development?',
  'risks_and_harms': 'What risks may be present in model usage? Try to identify the potential recipients,likelihood, and magnitude of harms. If these cannot be determined, note that they were considered but remain unknown',
  'use_cases': 'Are there any known model use cases that are especially fraught?',
  'additional_information': 'If possible, this section should also include any additional ethical considerations that went into model development, for example, review by an external board, or testing with a specific community.'},
 'caveats_recommendations': {'caveats': 'For example, did the results suggest any further testing? Were there any relevant groups that were not represented in the evaluation dataset?',
  'recommendations': 'Are there additional recommendations for model use? What are the ideal characteristics of an evaluation dataset for this model?'}}

FindCluster can also be used as part of a pipeline. In this case, only the entities in the same cluster as the entity of interest will continue on to the next steps of the estimation.

from gingado.benchmark import RegressionBenchmark
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('cluster', FindCluster(AffinityPropagation(convergence_iter=5000))),
    ('rf', RegressionBenchmark())
])
pipe.fit(X=countries, y=country_of_interest)
Pipeline(steps=[('cluster',
                 FindCluster(cluster_alg=AffinityPropagation(convergence_iter=5000))),
                ('rf',
                 RegressionBenchmark(cv=ShuffleSplit(n_splits=10, random_state=None, test_size=None, train_size=None)))])

Causal inference

Comparative case studies

MachineControl

MachineControl (cluster_alg: '[BaseEstimator, ClusterMixin] | None' = AffinityPropagation(), estimator: 'BaseEstimator' = RegressionBenchmark(), manifold: 'BaseEstimator' = TSNE(), with_placebo: 'bool' = True, auto_document: 'ggdModelDocumentation' = <class 'gingado.model_documentation.ModelCard'>, random_state: 'int | None' = None)

Synthetic controls with machine learning methods

Args:
    cluster_alg (BaseEstimator | ClusterMixin | None): An instance of the clustering algorithm to use, or None to retain all entities.
    estimator (BaseEstimator): Method to weight the control entities.
    manifold (BaseEstimator): Algorithm for manifold learning.
    with_placebo (bool): Whether to include placebo estimations during prediction.
    auto_document (ggdModelDocumentation): gingado Documenter template to facilitate model documentation.
    random_state (int | None): The random seed to be used by the algorithm, if relevant.
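
For concreteness, a minimal instantiation sketch, assuming MachineControl can be imported from gingado.estimators:

from sklearn.cluster import AffinityPropagation
from sklearn.manifold import TSNE

from gingado.benchmark import RegressionBenchmark
from gingado.estimators import MachineControl  # assumed import path

mc = MachineControl(
    cluster_alg=AffinityPropagation(),  # pre-selects control entities; None keeps all
    estimator=RegressionBenchmark(),    # weights the control entities
    manifold=TSNE(),                    # used to assess the synthetic control
    with_placebo=True,                  # also run placebo estimations when predicting
    random_state=42,
)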

fit

fit (self, X: 'pd.DataFrame', y: 'pd.DataFrame | pd.Series')

Fit the `MachineControl` model.

Args:
    X (pd.DataFrame): A pandas DataFrame with pre-intervention data of shape (n_samples, n_control_entities).
    y (pd.DataFrame | pd.Series): A pandas DataFrame or Series with pre-intervention data of shape (n_samples,).

predict

predict (self, X: 'pd.DataFrame', y: 'pd.DataFrame | pd.Series')

Calculate the model predictions before and after the intervention.

Args:
    X (pd.DataFrame): A pandas DataFrame with complete time series (pre- and post-intervention) of shape (n_samples, n_control_entities).
    y (pd.DataFrame | pd.Series): A pandas DataFrame or Series with complete time series of shape (n_samples,).

get_controls

get_controls (self)

Get the list of control entities.
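
Continuing the instantiation sketch above, a hedged end-to-end workflow with hypothetical names (outcomes, a DataFrame of entity time series; 'treated', the treated entity’s column; intervention_date):

# `outcomes`, 'treated' and `intervention_date` are all hypothetical names
y_all = outcomes.pop('treated')
pre = outcomes.index < intervention_date

mc.fit(X=outcomes.loc[pre], y=y_all.loc[pre])  # fit on pre-intervention data only
results = mc.predict(X=outcomes, y=y_all)      # predictions before and after the intervention
controls = mc.get_controls()                   # list of entities used as controls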

document

document (self, documenter: 'ggdModelDocumentation | None' = None)

Document the `MachineControl` model using the template in `documenter`.

Args:
    documenter (ggdModelDocumentation | None): A gingado Documenter or the documenter set in `auto_document` if None.

Brief econometric description

The goal of MachineControl is to estimate:

\[ \tau_t = Y_{1, t}^{I} - Y_{1, t}^{N}, \quad t > T_0 \]

where:

  • \(\tau_t\) is the effect on entity \(i=1\) of the intervention of interest

  • without loss of generality, \(i=1\) is the entity that has undergone the intervention of interest, amongst \(N\) total entities

  • \(T_0\) is the time period in which the intervention occurred

  • superscript \(I\) on an outcome variable denotes the occurrence of the intervention, whereas superscript \(N\) denotes its absence

  • for \(t > T_0\), \(Y_{1, t}^{I}\) is observed while \(Y_{1, t}^{N}\) must be estimated because it is a counterfactual.

\(Y_{1, t}^{N}\) is calculated from the values of the other entities, \(i \neq 1\). Collect these data in a vector \(\mathbb{Y}_{-1, t}^{N}\). Then, following Doudchenko and Imbens (2016):

\[ \hat{Y}_{1, t}^{N} = f^*(\mathbb{Y}_{-1, t}^{N}), \]

with the star (\(*\)) superscript on the function \(f(\cdot)\) indicating that it was trained only with data up until the intervention date. The exact form of \(f(\cdot)\) depends on the argument estimator. A general-purpose estimator is the random forest (Breiman 2001).
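
As a rough illustration of \(f^*\) (not gingado’s internal implementation), a plain random forest could be trained on hypothetical pre-intervention arrays:

from sklearn.ensemble import RandomForestRegressor

# `Y_controls_pre`, `y_treated_pre` and `Y_controls_post` are hypothetical
# arrays with the control entities' and the treated entity's outcomes
f_star = RandomForestRegressor(random_state=42)
f_star.fit(Y_controls_pre, y_treated_pre)   # trained with pre-intervention data only

y_counterfactual = f_star.predict(Y_controls_post)  # estimates Y^N for entity 1, t > T_0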

The panel data itself might be the whole population in the data, or a subset when using the whole population might be too cumbersome to run analyses (eg, if the data contains too many entities). One way to select this subsample of control units without introducing subjective judgment is quantitative: the control units are selected through a clustering algorithm (argument cluster_alg). One clustering algorithm that can be used is affinity propagation (Frey and Dueck 2007).

Finally, the quality of the synthetic control can be assessed in many ways. One fully data-driven way to achieve this is manifold learning: lower-dimensional embeddings of higher-dimensional data. A preferred manifold learning algorithm is t-SNE (Van der Maaten and Hinton 2008).

The relative distances between the embeddings of the control entities and the target, and between the synthetic control and the target, indicate the chance that a better feasible control (real or a combination) exists. The intuition behind this test is the following (a sketch in code appears after the list):

  • let \(d_{i,j}\) be the Euclidean distance between the embeddings (2d points) of entities \(i\) and \(j\)

  • if only a very small percentage of \(d_{1, j \in (2, ..., N)}\) are lower than \(d_{1, \text{synthetic control}}\), then the synthetic control produced with \(f(\cdot)\) is indeed one of the best feasible alternatives.
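
A sketch of this diagnostic, with panel as a hypothetical array stacking the treated entity, the synthetic control and the control entities:

import numpy as np
from sklearn.manifold import TSNE

# `panel` is hypothetical: its rows stack the treated entity (row 0), the
# synthetic control (row 1) and the control entities (rows 2 onwards)
emb = TSNE(n_components=2, perplexity=5, random_state=42).fit_transform(panel)

d = np.linalg.norm(emb - emb[0], axis=1)  # distances d_{1,j} to the treated entity
share_closer = np.mean(d[2:] < d[1])      # share of entities closer than the synthetic control

# a small `share_closer` suggests f(.) delivers one of the best feasible controls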

Main references:

  • Abadie and Gardeazabal (2003)
  • Abadie, Diamond, and Hainmueller (2010)
  • Abadie, Diamond, and Hainmueller (2015)
  • Doudchenko and Imbens (2016)
  • Abadie (2021)

Example: impact of labour reform on productivity

See Machine controls: Synthetic controls with machine learning.

References

Abadie, Alberto. 2021. “Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects.” Journal of Economic Literature 59 (2): 391–425.
Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2010. “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program.” Journal of the American Statistical Association 105 (490): 493–505.
———. 2015. “Comparative Politics and the Synthetic Control Method.” American Journal of Political Science 59 (2): 495–510.
Abadie, Alberto, and Javier Gardeazabal. 2003. “The Economic Costs of Conflict: A Case Study of the Basque Country.” American Economic Review 93 (1): 113–32.
Athey, Susan, and Guido W. Imbens. 2019. “Machine Learning Methods That Economists Should Know About.” Annual Review of Economics 11 (1): 685–725. https://doi.org/10.1146/annurev-economics-080217-053433.
Barro, Robert J., and Jong-Wha Lee. 1994. “Sources of Economic Growth.” Carnegie-Rochester Conference Series on Public Policy 40: 1–46. https://doi.org/10.1016/0167-2231(94)90002-7.
Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32.
Doudchenko, Nikolay, and Guido W Imbens. 2016. “Balancing, Regression, Difference-in-Differences and Synthetic Control Methods: A Synthesis.” National Bureau of Economic Research.
Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 96:226–31.
Frey, Brendan J, and Delbert Dueck. 2007. “Clustering by Passing Messages Between Data Points.” Science 315 (5814): 972–76.
Van der Maaten, Laurens, and Geoffrey Hinton. 2008. “Visualizing Data Using t-SNE.” Journal of Machine Learning Research 9 (11).