Automatic benchmark model

Functions to create a relevant, fast and reasonably well-performing benchmark

A Benchmark object has an API similar to a scikit-learn estimator: you build an instance with the desired arguments and fit it to the data at a later moment. A Benchmark is a convenience wrapper that reads the training data, passes it through a simplified pipeline consisting of data imputation and a standard scaler, and then calibrates the benchmark estimator with a grid search.

A gingado Benchmark object seeks to automatise a significant part of creating a benchmark model. Importantly, the Benchmark object also has a compare method that helps users evaluate whether candidate models perform better than the benchmark; if one of them does, it becomes the new benchmark. This compare method takes as its argument another fitted estimator (which could itself be a standalone estimator or a whole pipeline) or a list of fitted estimators.

Benchmarks start with default values that should perform reasonably well in most settings, but the user is also free to choose any of the benchmark’s components by passing as arguments the data split, pipeline, and/or a dictionary of parameters for the hyperparameter tuning.
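For example, a user could keep most defaults but swap in a different estimator and a narrower hyperparameter grid. The sketch below assumes the benchmark classes are importable from gingado.benchmark; the estimator and grid shown are illustrative choices, not gingado defaults.

from sklearn.ensemble import ExtraTreesClassifier
from gingado.benchmark import ClassificationBenchmark  # assumed import path

# Illustrative override of the default components: a different estimator and a
# smaller hyperparameter grid; all other arguments keep their defaults.
bm = ClassificationBenchmark(
    estimator=ExtraTreesClassifier(oob_score=True, bootstrap=True),
    param_grid={"n_estimators": [50, 100], "max_features": ["sqrt", None]},
)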

Base class

gingado has a ggdBenchmark base class that contains the basic functionalities for Benchmark objects. It is not meant to be used by itself, but only as a parent class for Benchmark objects. gingado ships with two of these objects that subclass ggdBenchmark: ClassificationBenchmark and RegressionBenchmark. They are both described below in their respective sections.

Users are encouraged to submit a PR with their own benchmark models subclassing ggdBenchmark.

ggdBenchmark

ggdBenchmark ()

The base class for gingado's Benchmark objects.

This class provides the foundational functionality for benchmarking models, including
setting up data splitters for time series data, fitting models, and comparing candidate models.

compare

compare (self, X: 'np.ndarray', y: 'np.ndarray', candidates, ensemble_method='object_default', update_benchmark: 'bool' = True)

Compares the performance of the benchmark model with candidate models.

Args:
    X: Input data of shape (n_samples, n_features).
    y: Target data of shape (n_samples,) or (n_samples, n_targets).
    candidates: Candidate estimator(s) for comparison.
    ensemble_method: Method to combine candidate estimators. Default is 'object_default'.
    update_benchmark: Whether to update the benchmark with the best performing model. Default is True.

compare_fitted_candidates

compare_fitted_candidates (self, X, y, candidates, scoring_func)

No documentation available.

document

document (self, documenter: 'ggdModelDocumentation | None' = None)

Documents the benchmark model using the specified template.

Args:
    documenter: A gingado Documenter or the documenter set in `auto_document`. Default is None.
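As a minimal sketch, assuming bm is a fitted Benchmark and that leaving documenter as None falls back to the documenter set in auto_document (a ModelCard by default):

# document the fitted benchmark and inspect the resulting model card
bm.document()
bm.model_documentation.show_json()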

predict

predict (self, X, **predict_params)

Note: only available if the benchmark implements this method.

fit_predict

fit_predict (self, X, y=None, **predict_params)

Note: only available if the benchmark implements this method.

predict_proba

predict_proba (self, X, **predict_proba_params)

Note: only available if the benchmark implements this method.

predict_log_proba

predict_log_proba (self, X, **predict_log_proba_params)

Note: only available if the benchmark implements this method.

decision_function

decision_function (self, X)

Note: only available if the benchmark implements this method.

score

score (self, X)

Note: only available if the benchmark implements this method.

score_samples

score_samples (self, X)

Note: only available if the benchmark implements this method.
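Because availability depends on the underlying benchmark estimator, a defensive way to use these delegated methods is to check for them first. The snippet below is a sketch assuming the default classification benchmark (a random forest), which implements predict_proba; the import path is an assumption.

from sklearn.datasets import make_classification
from gingado.benchmark import ClassificationBenchmark  # assumed import path

X, y = make_classification()
bm = ClassificationBenchmark().fit(X, y)

# predict_proba is exposed here only because the underlying random forest
# implements it; other benchmark estimators may not offer it.
if hasattr(bm, "predict_proba"):
    proba = bm.predict_proba(X)
    assert proba.shape == (X.shape[0], 2)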

Classification tasks

The default benchmark for classification tasks is a RandomForestClassifier object. Its parameters are fine-tuned in each case according to the user’s data.

ClassificationBenchmark

ClassificationBenchmark (cv=None, default_cv=StratifiedShuffleSplit(n_splits=10, random_state=None, test_size=None, train_size=None), estimator=RandomForestClassifier(oob_score=True), param_grid={'n_estimators': [100, 250], 'max_features': ['sqrt', 'log2', None]}, param_search=<class 'sklearn.model_selection._search.GridSearchCV'>, scoring=None, auto_document=<class 'gingado.model_documentation.ModelCard'>, random_state=None, verbose_grid=False, ensemble_method=<class 'sklearn.ensemble._voting.VotingClassifier'>)

A gingado Benchmark object used for classification tasks

fit

fit (self, X: 'np.ndarray', y: 'np.ndarray | None' = None)

Fit the ClassificationBenchmark model.

Args:
    X (np.ndarray): Array-like data of shape (n_samples, n_features), representing the input data.
    y (np.ndarray, optional): Array-like data of shape (n_samples,) or (n_samples, n_targets), representing the target values. Defaults to None.

Returns:
    ClassificationBenchmark: The instance of the model after fitting.
from sklearn.datasets import make_classification
# some mock up data
X, y = make_classification()

# the gingado benchmark
bm = ClassificationBenchmark(verbose_grid=2).fit(X, y)

# note that now the `bm` object can be used as an estimator
assert bm.predict(X).shape == y.shape
Fitting 10 folds for each of 6 candidates, totalling 60 fits
[CV] END ................max_features=sqrt, n_estimators=100; total time=   0.2s
[CV] END ................max_features=sqrt, n_estimators=100; total time=   0.2s
[CV] END ................max_features=sqrt, n_estimators=100; total time=   0.2s
[CV] END ................max_features=sqrt, n_estimators=100; total time=   0.2s
[CV] END ................max_features=sqrt, n_estimators=100; total time=   0.2s
[CV] END ................max_features=sqrt, n_estimators=100; total time=   0.2s
[CV] END ................max_features=sqrt, n_estimators=100; total time=   0.2s
[CV] END ................max_features=sqrt, n_estimators=100; total time=   0.2s
[CV] END ................max_features=sqrt, n_estimators=100; total time=   0.2s
[CV] END ................max_features=sqrt, n_estimators=100; total time=   0.2s
[CV] END ................max_features=sqrt, n_estimators=250; total time=   0.5s
[CV] END ................max_features=sqrt, n_estimators=250; total time=   0.5s
[CV] END ................max_features=sqrt, n_estimators=250; total time=   0.5s
[CV] END ................max_features=sqrt, n_estimators=250; total time=   0.5s
[CV] END ................max_features=sqrt, n_estimators=250; total time=   0.5s
[CV] END ................max_features=sqrt, n_estimators=250; total time=   0.5s
[CV] END ................max_features=sqrt, n_estimators=250; total time=   0.5s
[CV] END ................max_features=sqrt, n_estimators=250; total time=   0.5s
[CV] END ................max_features=sqrt, n_estimators=250; total time=   0.5s
[CV] END ................max_features=sqrt, n_estimators=250; total time=   0.5s
[CV] END ................max_features=log2, n_estimators=100; total time=   0.2s
[CV] END ................max_features=log2, n_estimators=100; total time=   0.2s
[CV] END ................max_features=log2, n_estimators=100; total time=   0.2s
[CV] END ................max_features=log2, n_estimators=100; total time=   0.2s
[CV] END ................max_features=log2, n_estimators=100; total time=   0.2s
[CV] END ................max_features=log2, n_estimators=100; total time=   0.2s
[CV] END ................max_features=log2, n_estimators=100; total time=   0.2s
[CV] END ................max_features=log2, n_estimators=100; total time=   0.2s
[CV] END ................max_features=log2, n_estimators=100; total time=   0.2s
[CV] END ................max_features=log2, n_estimators=100; total time=   0.2s
[CV] END ................max_features=log2, n_estimators=250; total time=   0.5s
[CV] END ................max_features=log2, n_estimators=250; total time=   0.5s
[CV] END ................max_features=log2, n_estimators=250; total time=   0.5s
[CV] END ................max_features=log2, n_estimators=250; total time=   0.5s
[CV] END ................max_features=log2, n_estimators=250; total time=   0.5s
[CV] END ................max_features=log2, n_estimators=250; total time=   0.5s
[CV] END ................max_features=log2, n_estimators=250; total time=   0.5s
[CV] END ................max_features=log2, n_estimators=250; total time=   0.5s
[CV] END ................max_features=log2, n_estimators=250; total time=   0.5s
[CV] END ................max_features=log2, n_estimators=250; total time=   0.5s
[CV] END ................max_features=None, n_estimators=100; total time=   0.2s
[CV] END ................max_features=None, n_estimators=100; total time=   0.2s
[CV] END ................max_features=None, n_estimators=100; total time=   0.2s
[CV] END ................max_features=None, n_estimators=100; total time=   0.2s
[CV] END ................max_features=None, n_estimators=100; total time=   0.2s
[CV] END ................max_features=None, n_estimators=100; total time=   0.2s
[CV] END ................max_features=None, n_estimators=100; total time=   0.2s
[CV] END ................max_features=None, n_estimators=100; total time=   0.2s
[CV] END ................max_features=None, n_estimators=100; total time=   0.2s
[CV] END ................max_features=None, n_estimators=100; total time=   0.2s
[CV] END ................max_features=None, n_estimators=250; total time=   0.6s
[CV] END ................max_features=None, n_estimators=250; total time=   0.6s
[CV] END ................max_features=None, n_estimators=250; total time=   0.6s
[CV] END ................max_features=None, n_estimators=250; total time=   0.6s
[CV] END ................max_features=None, n_estimators=250; total time=   0.6s
[CV] END ................max_features=None, n_estimators=250; total time=   0.6s
[CV] END ................max_features=None, n_estimators=250; total time=   0.6s
[CV] END ................max_features=None, n_estimators=250; total time=   0.6s
[CV] END ................max_features=None, n_estimators=250; total time=   0.6s
[CV] END ................max_features=None, n_estimators=250; total time=   0.6s

Importantly, gingado automatically provides some information to help the user document the benchmark model. More specifically, ggdBenchmark objects collect model information and store it in a dictionary under the key info within the model_details field.

bm.model_documentation.show_json()
{'model_details': {'developer': 'Person or organisation developing the model',
  'datetime': '2024-06-20 23:11:57 ',
  'version': 'Model version',
  'type': 'Model type',
  'info': {'_estimator_type': 'classifier',
   'best_estimator_': RandomForestClassifier(max_features=None, n_estimators=250, oob_score=True),
   'best_index_': 5,
   'best_params_': {'max_features': None, 'n_estimators': 250},
   'best_score_': 0.8800000000000001,
   'classes_': array([0, 1]),
   'cv_results_': {'mean_fit_time': array([0.19944484, 0.49079413, 0.1985281 , 0.50103121, 0.2273536 ,
           0.55590255]),
    'std_fit_time': array([0.00253929, 0.00309827, 0.00089202, 0.01321682, 0.00281543,
           0.00344071]),
    'mean_score_time': array([0.00645785, 0.01405029, 0.00631735, 0.0143662 , 0.00627313,
           0.0141314 ]),
    'std_score_time': array([0.00055059, 0.00018452, 0.00022965, 0.00041602, 0.00013258,
           0.00024788]),
    'param_max_features': masked_array(data=['sqrt', 'sqrt', 'log2', 'log2', None, None],
                 mask=[False, False, False, False, False, False],
           fill_value='?',
                dtype=object),
    'param_n_estimators': masked_array(data=[100, 250, 100, 250, 100, 250],
                 mask=[False, False, False, False, False, False],
           fill_value='?',
                dtype=object),
    'params': [{'max_features': 'sqrt', 'n_estimators': 100},
     {'max_features': 'sqrt', 'n_estimators': 250},
     {'max_features': 'log2', 'n_estimators': 100},
     {'max_features': 'log2', 'n_estimators': 250},
     {'max_features': None, 'n_estimators': 100},
     {'max_features': None, 'n_estimators': 250}],
    'split0_test_score': array([0.8, 0.8, 0.8, 0.8, 0.8, 0.8]),
    'split1_test_score': array([0.8, 0.8, 0.8, 0.8, 0.8, 0.8]),
    'split2_test_score': array([0.9, 0.8, 0.8, 0.8, 0.8, 0.8]),
    'split3_test_score': array([1., 1., 1., 1., 1., 1.]),
    'split4_test_score': array([0.9, 0.9, 0.9, 0.9, 0.9, 0.9]),
    'split5_test_score': array([0.8, 0.8, 0.8, 0.8, 0.8, 0.8]),
    'split6_test_score': array([0.8, 0.8, 0.8, 0.8, 0.9, 0.9]),
    'split7_test_score': array([0.9, 0.9, 0.9, 0.9, 0.9, 0.9]),
    'split8_test_score': array([0.9, 0.9, 0.9, 0.9, 0.9, 0.9]),
    'split9_test_score': array([0.9, 0.9, 0.9, 0.9, 0.9, 1. ]),
    'mean_test_score': array([0.87, 0.86, 0.86, 0.86, 0.87, 0.88]),
    'std_test_score': array([0.06403124, 0.0663325 , 0.0663325 , 0.0663325 , 0.06403124,
           0.07483315]),
    'rank_test_score': array([2, 4, 4, 4, 2, 1], dtype=int32)},
   'multimetric_': False,
   'n_features_in_': 20,
   'n_splits_': 10,
   'refit_time_': 0.5836849212646484,
   'scorer_': <sklearn.metrics._scorer._PassthroughScorer at 0x1473a7a60>},
  'paper': 'Paper or other resource for more information',
  'citation': 'Citation details',
  'license': 'License',
  'contact': 'Where to send questions or comments about the model'},
 'intended_use': {'primary_uses': 'Primary intended uses',
  'primary_users': 'Primary intended users',
  'out_of_scope': 'Out-of-scope use cases'},
 'factors': {'relevant': 'Relevant factors',
  'evaluation': 'Evaluation factors'},
 'metrics': {'performance_measures': 'Model performance measures',
  'thresholds': 'Decision thresholds',
  'variation_approaches': 'Variation approaches'},
 'evaluation_data': {'datasets': 'Datasets',
  'motivation': 'Motivation',
  'preprocessing': 'Preprocessing'},
 'training_data': {'training_data': 'Information on training data'},
 'quant_analyses': {'unitary': 'Unitary results',
  'intersectional': 'Intersectional results'},
 'ethical_considerations': {'sensitive_data': 'Does the model use any sensitive data (e.g., protected classes)?',
  'human_life': 'Is the model intended to inform decisions about matters central to human life or flourishing - e.g., health or safety? Or could it be used in such a way?',
  'mitigations': 'What risk mitigation strategies were used during model development?',
  'risks_and_harms': 'What risks may be present in model usage? Try to identify the potential recipients,likelihood, and magnitude of harms. If these cannot be determined, note that they were considered but remain unknown',
  'use_cases': 'Are there any known model use cases that are especially fraught?',
  'additional_information': 'If possible, this section should also include any additional ethical considerations that went into model development, for example, review by an external board, or testing with a specific community.'},
 'caveats_recommendations': {'caveats': 'For example, did the results suggest any further testing? Were there any relevant groups that were not represented in the evaluation dataset?',
  'recommendations': 'Are there additional recommendations for model use? What are the ideal characteristics of an evaluation dataset for this model?'}}

It is also simple to set as benchmark a model that you have already fitted, while still benefiting from the other functionalities provided by the Benchmark class. This can also be done when you are using a saved version of a fitted model (e.g., the model you are using in production) and want to have that as the benchmark.

from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier().fit(X, y)

bm.set_benchmark(estimator=forest)

assert forest == bm.benchmark
assert hasattr(bm.benchmark, "predict")
assert bm.predict(X).shape == y.shape
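
For the saved-model case mentioned above, one possibility (sketched here with a hypothetical file name) is to load the persisted estimator with joblib and set it as the benchmark:

import joblib

# "production_model.joblib" is a placeholder for wherever the fitted model is stored
production_model = joblib.load("production_model.joblib")
bm.set_benchmark(estimator=production_model)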

Regression tasks

The default benchmark for regression tasks is a RandomForestRegressor object. Its parameters are fine-tuned in each case according to the user’s data.

RegressionBenchmark

RegressionBenchmark (cv=None, default_cv=ShuffleSplit(n_splits=10, random_state=None, test_size=None, train_size=None), estimator=RandomForestRegressor(oob_score=True), param_grid={'n_estimators': [100, 250], 'max_features': ['sqrt', 'log2', None]}, param_search=<class 'sklearn.model_selection._search.GridSearchCV'>, scoring=None, auto_document=<class 'gingado.model_documentation.ModelCard'>, random_state=None, verbose_grid=False, ensemble_method=<class 'sklearn.ensemble._voting.VotingRegressor'>)

A gingado Benchmark object used for regression tasks

fit

fit (self, X: 'np.ndarray', y: 'np.ndarray | None' = None)

Fit the `RegressionBenchmark` model.

Args:
    X (np.ndarray): Array-like data of shape (n_samples, n_features).
    y (np.ndarray | None, optional): Array-like data of shape (n_samples,) or (n_samples, n_targets) or None. Defaults to None.

Returns:
    RegressionBenchmark: The instance of the model.
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
# some mock up data
X, y = make_regression()

# the gingado benchmark
bm = RegressionBenchmark().fit(X, y)

# note that now the `bm` object can be used as an estimator
assert bm.predict(X).shape == y.shape

# the user might also like to set another model as the benchmark
adaboost = AdaBoostRegressor().fit(X, y)
bm.set_benchmark(estimator=adaboost)

assert adaboost == bm.benchmark
assert hasattr(bm.benchmark, "predict")
assert bm.predict(X).shape == y.shape

Below we compare the benchmark (set above manually to be the adaboost algorithm) with two other candidate models: a Gaussian process and a linear Support Vector Machine (SVM).

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.svm import LinearSVR
gauss_reg = GaussianProcessRegressor().fit(X, y)
svm_reg = LinearSVR().fit(X, y)

bm.compare(X, y, candidates=[gauss_reg, svm_reg])
Benchmark updated!
New benchmark:
Pipeline(steps=[('candidate_estimator', LinearSVR())])

Note that when the benchmark object finds a model that performs better than it does, the user is informed that the benchmark is updated and the new benchmark model is shown. This only happens when the argument update_benchmark is set to True (the default).
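
If the user prefers to keep the current benchmark regardless of the outcome, the comparison can be run with update_benchmark=False. A sketch, continuing the example above:

# With update_benchmark=False the comparison is still carried out,
# but bm.benchmark is left untouched even if a candidate scores better.
previous_benchmark = bm.benchmark
bm.compare(X, y, candidates=[gauss_reg, svm_reg], update_benchmark=False)
assert bm.benchmark is previous_benchmark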

Below we can see by how much it outperformed the other candidates, including the previous benchmark model and an ensemble of the previous benchmark and all the candidates. It is also a good opportunity to see how stable the performance of each model was, as judged by the standard deviation of the scores across the validation folds.

import pandas as pd

pd.DataFrame(bm.benchmark.cv_results_)[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
params mean_test_score std_test_score rank_test_score
0 {'candidate_estimator': (DecisionTreeRegressor... 0.371883 0.183710 2
1 {'candidate_estimator': GaussianProcessRegress... -0.157062 0.242157 4
2 {'candidate_estimator': LinearSVR(), 'candidat... 0.480159 0.114643 1
3 {'candidate_estimator': VotingRegressor(estima... 0.275088 0.161351 3

General comments on benchmarks

Scoring

ClassificationBenchmark and RegressionBenchmark use the default scoring method for comparing model alternatives, both during estimation of the benchmark model and when comparing this benchmark with candidate models. Users are encouraged to consider whether another scoring method is more suitable for their use case. More information on available scoring methods that are compatible with gingado Benchmark objects can be found here.
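
For instance, a regression user who cares about absolute errors could pass a different scorer. This is a sketch; any scoring string or callable accepted by scikit-learn should work, since the scoring argument is presumably forwarded to the underlying grid search (the import path is an assumption).

from sklearn.datasets import make_regression
from gingado.benchmark import RegressionBenchmark  # assumed import path

X, y = make_regression()
bm_mae = RegressionBenchmark(scoring="neg_mean_absolute_error").fit(X, y)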

Data split

gingado benchmarks rely on hyperparameter tuning to discover the benchmark specification that is most likely to perform better with the user data. This tuning in turn depends on a data splitting strategy for the cross-validation. By default, gingado uses StratifiedShuffleSplit (in classification problems) or ShuffleSplit (in regression problems) if the data is not a time series, and TimeSeriesSplit otherwise.

The user may override these defaults by setting either the cv or the default_cv parameter when instantiating the gingado benchmark class. The difference is that default_cv is only used after gingado checks that the data is not a time series (if a time dimension exists, then TimeSeriesSplit is used).

from sklearn.model_selection import ShuffleSplit, TimeSeriesSplit

X, y = make_classification()
bm_cls = ClassificationBenchmark(cv=TimeSeriesSplit(n_splits=3)).fit(X, y)
assert bm_cls.benchmark.n_splits_ == 3

X, y = make_regression()
bm_reg = RegressionBenchmark(default_cv=ShuffleSplit(n_splits=7)).fit(X, y)
assert bm_reg.benchmark.n_splits_ == 7

Please refer to this page for more information on the different Splitter classes available in scikit-learn, and this page for practical advice on how to choose a splitter for data that are not time series. Any one of these objects (or a custom splitter that is compatible with them) can be passed to a Benchmark object.

Users that wish to use specific parameters should include the actual Splitter object as the parameter, as done with the n_splits parameter in the chunk above.

Custom benchmarks

gingado provides users with two Benchmark objects out of the box: ClassificationBenchmark and RegressionBenchmark, to be used depending on the task at hand. Both classes derive from a base class ggdBenchmark, which implements methods that facilitate model comparison. Users that want to create a customised benchmark model for themselves have two options:

  • the simpler possibility is to train the estimator as usual, and then assign the fitted estimator to a Benchmark object.

  • if the user wants more control over the fitting process of estimating the benchmark, they can create a class that subclasses from ggdBenchmark and either implements custom fit, predict and score methods, or also subclasses from scikit-learn’s BaseEstimator (see the sketch after this list).

    • In any case, if the user wants the benchmark to automatically detect if the data is a time series and also to document the model right after fitting, the fit method should call self._fit on the data. Otherwise, the user can simply implement any consistent logic in fit as the user sees fit (pun intended).
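
Below is a minimal, hypothetical sketch of the second option. The class name, the wrapped estimator, and the attribute names are illustrative; only the subclassing of ggdBenchmark and the call to self._fit follow from the description above, so the actual internals expected by gingado may differ.

from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression
from gingado.benchmark import ggdBenchmark  # assumed import path

class LogisticBenchmark(ggdBenchmark, BaseEstimator):
    """A hypothetical custom benchmark wrapping a plain logistic regression."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X, y=None):
        # The estimator attribute name is an assumption made for this sketch.
        self.estimator = LogisticRegression(random_state=self.random_state)
        # Calling self._fit (as recommended above) lets gingado detect time
        # series data and document the model right after fitting.
        self._fit(X, y)
        return self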