Automatic benchmark model
A Benchmark object has a similar API to a scikit-learn estimator: you build an instance with the desired arguments, and fit it to the data at a later moment. A Benchmark is a convenience wrapper that reads the training data, passes it through a simplified pipeline consisting of data imputation and a standard scaler, and then fits the benchmark estimator, with hyperparameters calibrated by a grid search.
A gingado Benchmark object seeks to automatise a significant part of creating a benchmark model. Importantly, the Benchmark object also has a compare method that helps users evaluate whether candidate models are better than the benchmark; if one of them is, it becomes the new benchmark. This compare method takes as argument another fitted estimator (which could itself be a single estimator or a whole pipeline) or a list of fitted estimators.
Benchmarks start with default values that should perform reasonably well in most settings, but the user is also free to choose any of the benchmark’s components by passing as arguments the data split, pipeline, and/or a dictionary of parameters for the hyperparameter tuning, as sketched below.
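As an illustration of that flexibility, the sketch below instantiates a benchmark with a custom data splitter and hyperparameter grid. The import path gingado.benchmark and the grid values are assumptions for the example; cv and param_grid are the constructor arguments documented later on this page.
from sklearn.model_selection import ShuffleSplit

from gingado.benchmark import ClassificationBenchmark

# a benchmark with a user-chosen data split and hyperparameter grid;
# `cv` and `param_grid` are documented in the constructor signature below
bm_custom = ClassificationBenchmark(
    cv=ShuffleSplit(n_splits=5),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
)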
Base class
gingado has a ggdBenchmark base class that contains the basic functionalities for Benchmark objects. It is not meant to be used by itself, but only as a parent class for Benchmark objects. gingado ships with two of these objects that subclass ggdBenchmark: ClassificationBenchmark and RegressionBenchmark. They are both described below in their respective sections.
Users are encouraged to submit a PR with their own benchmark models subclassing ggdBenchmark.
ggdBenchmark
ggdBenchmark()
The base class for gingado's Benchmark objects. This class provides the foundational functionality for benchmarking models, including setting up data splitters for time series data, fitting models, and comparing candidate models.
compare
compare(self, X: 'np.ndarray', y: 'np.ndarray', candidates, ensemble_method='object_default', update_benchmark: 'bool' = True)
Compares the performance of the benchmark model with candidate models.
Args:
- X: Input data of shape (n_samples, n_features).
- y: Target data of shape (n_samples,) or (n_samples, n_targets).
- candidates: Candidate estimator(s) for comparison.
- ensemble_method: Method to combine candidate estimators. Default is 'object_default'.
- update_benchmark: Whether to update the benchmark with the best performing model. Default is True.
compare_fitted_candidates
compare_fitted_candidates(self, X, y, candidates, scoring_func)
No documentation available.
document
document(self, documenter: 'ggdModelDocumentation | None' = None)
Documents the benchmark model using the specified template.
Args:
- documenter: A gingado Documenter, or the documenter set in `auto_document`. Default is None.
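As a minimal sketch of this method in use (assuming the import path gingado.benchmark), the model card can be filled right after fitting and inspected as JSON, as also shown further below:
from sklearn.datasets import make_classification

from gingado.benchmark import ClassificationBenchmark

X, y = make_classification()
bm = ClassificationBenchmark().fit(X, y)

# document the fitted benchmark with its default documenter
# (a ModelCard, per the constructor signature below) and inspect it
bm.document()
bm.model_documentation.show_json()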
predict
predict(self, X, **predict_params)
Note: only available if the benchmark implements this method.
fit_predict
fit_predict(self, X, y=None, **predict_params)
Note: only available if the benchmark implements this method.
predict_proba
predict_proba(self, X, **predict_proba_params)
Note: only available if the benchmark implements this method.
predict_log_proba
predict_log_proba(self, X, **predict_log_proba_params)
Note: only available if the benchmark implements this method.
decision_function
decision_function(self, X)
Note: only available if the benchmark implements this method.
score
score(self, X)
Note: only available if the benchmark implements this method.
score_samples
score_samples(self, X)
Note: only available if the benchmark implements this method.
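Since these methods are surfaced only when the underlying benchmark supports them, a defensive check can be useful when the benchmark model may vary. A small sketch, assuming a fitted benchmark bm and data X as in the examples below:
# `predict_proba` is only exposed if the benchmark implements it
# (true for the default RandomForestClassifier-based benchmark)
if hasattr(bm, "predict_proba"):
    proba = bm.predict_proba(X)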
Classification tasks
The default benchmark for classification tasks is a RandomForestClassifier object. Its parameters are fine-tuned in each case according to the user’s data.
ClassificationBenchmark
ClassificationBenchmark(cv=None, default_cv=StratifiedShuffleSplit(n_splits=10, random_state=None, test_size=None, train_size=None), estimator=RandomForestClassifier(oob_score=True), param_grid={'n_estimators': [100, 250], 'max_features': ['sqrt', 'log2', None]}, param_search=<class 'sklearn.model_selection._search.GridSearchCV'>, scoring=None, auto_document=<class 'gingado.model_documentation.ModelCard'>, random_state=None, verbose_grid=False, ensemble_method=<class 'sklearn.ensemble._voting.VotingClassifier'>)
A gingado Benchmark object used for classification tasks.
fit
fit(self, X: 'np.ndarray', y: 'np.ndarray | None' = None)
Fit the ClassificationBenchmark model.
Args:
- X (np.ndarray): Array-like data of shape (n_samples, n_features), representing the input data.
- y (np.ndarray, optional): Array-like data of shape (n_samples,) or (n_samples, n_targets), representing the target values. Defaults to None.
Returns:
- ClassificationBenchmark: The instance of the model after fitting.
from sklearn.datasets import make_classification

from gingado.benchmark import ClassificationBenchmark

# some mock-up data
X, y = make_classification()

# the gingado benchmark
bm = ClassificationBenchmark(verbose_grid=2).fit(X, y)

# note that now the `bm` object can be used as an estimator
assert bm.predict(X).shape == y.shape
Fitting 10 folds for each of 6 candidates, totalling 60 fits
[CV] END ................max_features=sqrt, n_estimators=100; total time= 0.2s
[CV] END ................max_features=sqrt, n_estimators=100; total time= 0.2s
[CV] END ................max_features=sqrt, n_estimators=100; total time= 0.2s
[CV] END ................max_features=sqrt, n_estimators=100; total time= 0.2s
[CV] END ................max_features=sqrt, n_estimators=100; total time= 0.2s
[CV] END ................max_features=sqrt, n_estimators=100; total time= 0.2s
[CV] END ................max_features=sqrt, n_estimators=100; total time= 0.2s
[CV] END ................max_features=sqrt, n_estimators=100; total time= 0.2s
[CV] END ................max_features=sqrt, n_estimators=100; total time= 0.2s
[CV] END ................max_features=sqrt, n_estimators=100; total time= 0.2s
[CV] END ................max_features=sqrt, n_estimators=250; total time= 0.5s
[CV] END ................max_features=sqrt, n_estimators=250; total time= 0.5s
[CV] END ................max_features=sqrt, n_estimators=250; total time= 0.5s
[CV] END ................max_features=sqrt, n_estimators=250; total time= 0.5s
[CV] END ................max_features=sqrt, n_estimators=250; total time= 0.5s
[CV] END ................max_features=sqrt, n_estimators=250; total time= 0.5s
[CV] END ................max_features=sqrt, n_estimators=250; total time= 0.5s
[CV] END ................max_features=sqrt, n_estimators=250; total time= 0.5s
[CV] END ................max_features=sqrt, n_estimators=250; total time= 0.5s
[CV] END ................max_features=sqrt, n_estimators=250; total time= 0.5s
[CV] END ................max_features=log2, n_estimators=100; total time= 0.2s
[CV] END ................max_features=log2, n_estimators=100; total time= 0.2s
[CV] END ................max_features=log2, n_estimators=100; total time= 0.2s
[CV] END ................max_features=log2, n_estimators=100; total time= 0.2s
[CV] END ................max_features=log2, n_estimators=100; total time= 0.2s
[CV] END ................max_features=log2, n_estimators=100; total time= 0.2s
[CV] END ................max_features=log2, n_estimators=100; total time= 0.2s
[CV] END ................max_features=log2, n_estimators=100; total time= 0.2s
[CV] END ................max_features=log2, n_estimators=100; total time= 0.2s
[CV] END ................max_features=log2, n_estimators=100; total time= 0.2s
[CV] END ................max_features=log2, n_estimators=250; total time= 0.5s
[CV] END ................max_features=log2, n_estimators=250; total time= 0.5s
[CV] END ................max_features=log2, n_estimators=250; total time= 0.5s
[CV] END ................max_features=log2, n_estimators=250; total time= 0.5s
[CV] END ................max_features=log2, n_estimators=250; total time= 0.5s
[CV] END ................max_features=log2, n_estimators=250; total time= 0.5s
[CV] END ................max_features=log2, n_estimators=250; total time= 0.5s
[CV] END ................max_features=log2, n_estimators=250; total time= 0.5s
[CV] END ................max_features=log2, n_estimators=250; total time= 0.5s
[CV] END ................max_features=log2, n_estimators=250; total time= 0.5s
[CV] END ................max_features=None, n_estimators=100; total time= 0.2s
[CV] END ................max_features=None, n_estimators=100; total time= 0.2s
[CV] END ................max_features=None, n_estimators=100; total time= 0.2s
[CV] END ................max_features=None, n_estimators=100; total time= 0.2s
[CV] END ................max_features=None, n_estimators=100; total time= 0.2s
[CV] END ................max_features=None, n_estimators=100; total time= 0.2s
[CV] END ................max_features=None, n_estimators=100; total time= 0.2s
[CV] END ................max_features=None, n_estimators=100; total time= 0.2s
[CV] END ................max_features=None, n_estimators=100; total time= 0.2s
[CV] END ................max_features=None, n_estimators=100; total time= 0.2s
[CV] END ................max_features=None, n_estimators=250; total time= 0.6s
[CV] END ................max_features=None, n_estimators=250; total time= 0.6s
[CV] END ................max_features=None, n_estimators=250; total time= 0.6s
[CV] END ................max_features=None, n_estimators=250; total time= 0.6s
[CV] END ................max_features=None, n_estimators=250; total time= 0.6s
[CV] END ................max_features=None, n_estimators=250; total time= 0.6s
[CV] END ................max_features=None, n_estimators=250; total time= 0.6s
[CV] END ................max_features=None, n_estimators=250; total time= 0.6s
[CV] END ................max_features=None, n_estimators=250; total time= 0.6s
[CV] END ................max_features=None, n_estimators=250; total time= 0.6s
Importantly, gingado automatically provides some information to help the user document the benchmark model. More specifically, ggdBenchmark objects collect model information and pass it to a dictionary with key info in a field called model_details.
bm.model_documentation.show_json()
{'model_details': {'developer': 'Person or organisation developing the model',
'datetime': '2024-06-20 23:11:57 ',
'version': 'Model version',
'type': 'Model type',
'info': {'_estimator_type': 'classifier',
'best_estimator_': RandomForestClassifier(max_features=None, n_estimators=250, oob_score=True),
'best_index_': 5,
'best_params_': {'max_features': None, 'n_estimators': 250},
'best_score_': 0.8800000000000001,
'classes_': array([0, 1]),
'cv_results_': {'mean_fit_time': array([0.19944484, 0.49079413, 0.1985281 , 0.50103121, 0.2273536 ,
0.55590255]),
'std_fit_time': array([0.00253929, 0.00309827, 0.00089202, 0.01321682, 0.00281543,
0.00344071]),
'mean_score_time': array([0.00645785, 0.01405029, 0.00631735, 0.0143662 , 0.00627313,
0.0141314 ]),
'std_score_time': array([0.00055059, 0.00018452, 0.00022965, 0.00041602, 0.00013258,
0.00024788]),
'param_max_features': masked_array(data=['sqrt', 'sqrt', 'log2', 'log2', None, None],
mask=[False, False, False, False, False, False],
fill_value='?',
dtype=object),
'param_n_estimators': masked_array(data=[100, 250, 100, 250, 100, 250],
mask=[False, False, False, False, False, False],
fill_value='?',
dtype=object),
'params': [{'max_features': 'sqrt', 'n_estimators': 100},
{'max_features': 'sqrt', 'n_estimators': 250},
{'max_features': 'log2', 'n_estimators': 100},
{'max_features': 'log2', 'n_estimators': 250},
{'max_features': None, 'n_estimators': 100},
{'max_features': None, 'n_estimators': 250}],
'split0_test_score': array([0.8, 0.8, 0.8, 0.8, 0.8, 0.8]),
'split1_test_score': array([0.8, 0.8, 0.8, 0.8, 0.8, 0.8]),
'split2_test_score': array([0.9, 0.8, 0.8, 0.8, 0.8, 0.8]),
'split3_test_score': array([1., 1., 1., 1., 1., 1.]),
'split4_test_score': array([0.9, 0.9, 0.9, 0.9, 0.9, 0.9]),
'split5_test_score': array([0.8, 0.8, 0.8, 0.8, 0.8, 0.8]),
'split6_test_score': array([0.8, 0.8, 0.8, 0.8, 0.9, 0.9]),
'split7_test_score': array([0.9, 0.9, 0.9, 0.9, 0.9, 0.9]),
'split8_test_score': array([0.9, 0.9, 0.9, 0.9, 0.9, 0.9]),
'split9_test_score': array([0.9, 0.9, 0.9, 0.9, 0.9, 1. ]),
'mean_test_score': array([0.87, 0.86, 0.86, 0.86, 0.87, 0.88]),
'std_test_score': array([0.06403124, 0.0663325 , 0.0663325 , 0.0663325 , 0.06403124,
0.07483315]),
'rank_test_score': array([2, 4, 4, 4, 2, 1], dtype=int32)},
'multimetric_': False,
'n_features_in_': 20,
'n_splits_': 10,
'refit_time_': 0.5836849212646484,
'scorer_': <sklearn.metrics._scorer._PassthroughScorer at 0x1473a7a60>},
'paper': 'Paper or other resource for more information',
'citation': 'Citation details',
'license': 'License',
'contact': 'Where to send questions or comments about the model'},
'intended_use': {'primary_uses': 'Primary intended uses',
'primary_users': 'Primary intended users',
'out_of_scope': 'Out-of-scope use cases'},
'factors': {'relevant': 'Relevant factors',
'evaluation': 'Evaluation factors'},
'metrics': {'performance_measures': 'Model performance measures',
'thresholds': 'Decision thresholds',
'variation_approaches': 'Variation approaches'},
'evaluation_data': {'datasets': 'Datasets',
'motivation': 'Motivation',
'preprocessing': 'Preprocessing'},
'training_data': {'training_data': 'Information on training data'},
'quant_analyses': {'unitary': 'Unitary results',
'intersectional': 'Intersectional results'},
'ethical_considerations': {'sensitive_data': 'Does the model use any sensitive data (e.g., protected classes)?',
'human_life': 'Is the model intended to inform decisions about matters central to human life or flourishing - e.g., health or safety? Or could it be used in such a way?',
'mitigations': 'What risk mitigation strategies were used during model development?',
'risks_and_harms': 'What risks may be present in model usage? Try to identify the potential recipients, likelihood, and magnitude of harms. If these cannot be determined, note that they were considered but remain unknown',
'use_cases': 'Are there any known model use cases that are especially fraught?',
'additional_information': 'If possible, this section should also include any additional ethical considerations that went into model development, for example, review by an external board, or testing with a specific community.'},
'caveats_recommendations': {'caveats': 'For example, did the results suggest any further testing? Were there any relevant groups that were not represented in the evaluation dataset?',
'recommendations': 'Are there additional recommendations for model use? What are the ideal characteristics of an evaluation dataset for this model?'}}
It is also simple to set as the benchmark a model that you already fitted, and still benefit from the other functionalities provided by the Benchmark class. This can also be done if you are using a saved version of a fitted model (e.g., the model you are using in production) and want to have that as the benchmark, as sketched after the example below.
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier().fit(X, y)
bm.set_benchmark(estimator=forest)
assert forest == bm.benchmark
assert hasattr(bm.benchmark, "predict")
assert bm.predict(X).shape == y.shape
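If the fitted model lives on disk, the same pattern applies once it is loaded. The sketch below uses joblib, and the file name is purely illustrative:
import joblib

# load a previously persisted model (hypothetical file name) and
# promote it to benchmark status
production_model = joblib.load("production_model.joblib")
bm.set_benchmark(estimator=production_model)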
Regression tasks
The default benchmark for regression tasks is a RandomForestRegressor object. Its parameters are fine-tuned in each case according to the user’s data.
RegressionBenchmark
RegressionBenchmark(cv=None, default_cv=ShuffleSplit(n_splits=10, random_state=None, test_size=None, train_size=None), estimator=RandomForestRegressor(oob_score=True), param_grid={'n_estimators': [100, 250], 'max_features': ['sqrt', 'log2', None]}, param_search=<class 'sklearn.model_selection._search.GridSearchCV'>, scoring=None, auto_document=<class 'gingado.model_documentation.ModelCard'>, random_state=None, verbose_grid=False, ensemble_method=<class 'sklearn.ensemble._voting.VotingRegressor'>)
A gingado Benchmark object used for regression tasks.
fit
fit(self, X: 'np.ndarray', y: 'np.ndarray | None' = None)
Fit the `RegressionBenchmark` model.
Args:
- X (np.ndarray): Array-like data of shape (n_samples, n_features).
- y (np.ndarray | None, optional): Array-like data of shape (n_samples,) or (n_samples, n_targets), or None. Defaults to None.
Returns:
- RegressionBenchmark: The instance of the model.
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor

from gingado.benchmark import RegressionBenchmark

# some mock-up data
X, y = make_regression()

# the gingado benchmark
bm = RegressionBenchmark().fit(X, y)

# note that now the `bm` object can be used as an estimator
assert bm.predict(X).shape == y.shape
# the user might also like to set another model as the benchmark
adaboost = AdaBoostRegressor().fit(X, y)
bm.set_benchmark(estimator=adaboost)
assert adaboost == bm.benchmark
assert hasattr(bm.benchmark, "predict")
assert bm.predict(X).shape == y.shape
Below we compare the benchmark (manually set above to be the AdaBoost model) with two other candidate models: a Gaussian process and a linear Support Vector Machine (SVM).
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.svm import LinearSVR
gauss_reg = GaussianProcessRegressor().fit(X, y)
svm_reg = LinearSVR().fit(X, y)

bm.compare(X, y, candidates=[gauss_reg, svm_reg])
Benchmark updated!
New benchmark:
Pipeline(steps=[('candidate_estimator', LinearSVR())])
Note that when the benchmark object finds a model that performs better than it does, the user is informed that the benchmark is updated, and the new benchmark model is shown. This only happens when the argument update_benchmark is set to True (the default).
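If you only want to inspect a comparison without replacing the current benchmark, the same call can be made with update_benchmark switched off, per the compare signature above:
# compare candidates but keep the current benchmark unchanged
bm.compare(X, y, candidates=[gauss_reg, svm_reg], update_benchmark=False)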
Below we can see by how much it outperformed the other candidates, including the previous benchmark model and an ensemble of the previous benchmark and all the candidates. It is also a good opportunity to see how stable the performance of each model was, as judged by the standard deviation of the scores across the validation folds.
import pandas as pd

pd.DataFrame(bm.benchmark.cv_results_)[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
| | params | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|
| 0 | {'candidate_estimator': (DecisionTreeRegressor... | 0.371883 | 0.183710 | 2 |
| 1 | {'candidate_estimator': GaussianProcessRegress... | -0.157062 | 0.242157 | 4 |
| 2 | {'candidate_estimator': LinearSVR(), 'candidat... | 0.480159 | 0.114643 | 1 |
| 3 | {'candidate_estimator': VotingRegressor(estima... | 0.275088 | 0.161351 | 3 |
General comments on benchmarks
Scoring
ClassificationBenchmark and RegressionBenchmark use the default scoring method for comparing model alternatives, both during estimation of the benchmark model and when comparing this benchmark with candidate models. Users are encouraged to consider whether another scoring method is more suitable for their use case. More information on available scoring methods that are compatible with gingado Benchmark objects can be found here.
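For instance, a classification benchmark could be tuned and compared on the F1 score instead, using the scoring constructor argument shown above (the string follows scikit-learn's scorer naming):
from sklearn.datasets import make_classification

X_cls, y_cls = make_classification()

# use F1 instead of the default scorer for tuning and comparison;
# make_classification yields binary targets, which "f1" expects
bm_f1 = ClassificationBenchmark(scoring="f1").fit(X_cls, y_cls)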
Data split
gingado benchmarks rely on hyperparameter tuning to discover the benchmark specification that is most likely to perform better with the user’s data. This tuning in turn depends on a data splitting strategy for the cross-validation. By default, gingado uses StratifiedShuffleSplit (in classification problems) or ShuffleSplit (in regression problems) if the data is not a time series, and TimeSeriesSplit otherwise.
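As a sketch of that automatic behaviour (assuming gingado detects the time dimension from a pandas DatetimeIndex; check the ggdBenchmark source for the exact detection logic):
import numpy as np
import pandas as pd

# hypothetical time series data: features indexed by a DatetimeIndex
idx = pd.date_range("2020-01-01", periods=100, freq="D")
rng = np.random.default_rng(42)
X_ts = pd.DataFrame(rng.normal(size=(100, 5)), index=idx)
y_ts = pd.Series(rng.integers(0, 2, size=100), index=idx)

# with a time dimension present, TimeSeriesSplit should be used
# instead of the `default_cv` splitter
bm_ts = ClassificationBenchmark().fit(X_ts, y_ts)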
The user may override these defaults, either by directly setting the parameter cv or default_cv when instantiating the gingado benchmark class. The difference is that default_cv is only used after gingado checks that the data is not a time series (if a time dimension exists, then TimeSeriesSplit is used).
from sklearn.model_selection import ShuffleSplit, TimeSeriesSplit

X, y = make_classification()
bm_cls = ClassificationBenchmark(cv=TimeSeriesSplit(n_splits=3)).fit(X, y)
assert bm_cls.benchmark.n_splits_ == 3

X, y = make_regression()
bm_reg = RegressionBenchmark(default_cv=ShuffleSplit(n_splits=7)).fit(X, y)
assert bm_reg.benchmark.n_splits_ == 7
Please refer to this page for more information on the different Splitter classes available in scikit-learn, and to this page for practical advice on how to choose a splitter for data that are not time series. Any one of these objects (or a custom splitter that is compatible with them) can be passed to a Benchmark object.
Users that wish to use specific parameters should include the actual Splitter object as the parameter, as done with the n_splits parameter in the chunk above.
Custom benchmarks
gingado provides users with two Benchmark objects out of the box: ClassificationBenchmark and RegressionBenchmark, to be used depending on the task at hand. Both classes derive from a base class ggdBenchmark, which implements methods that facilitate model comparison. Users that want to create a customised benchmark model for themselves have two options:
- The simpler possibility is to train the estimator as usual, and then assign the fitted estimator to a Benchmark object.
- If the user wants more control over the fitting process of estimating the benchmark, they can create a class that subclasses ggdBenchmark and either implements custom fit, predict and score methods, or also subclasses scikit-learn’s BaseEstimator (see the sketch below).
  - In any case, if the user wants the benchmark to automatically detect if the data is a time series and also to document the model right after fitting, the fit method should call self._fit on the data. Otherwise, the user can simply implement any consistent logic in fit as the user sees fit (pun intended).
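A minimal sketch of the second option follows. The import path, the absence of a super().__init__() call, and the choice of gradient boosting are assumptions for illustration; here fit implements its own consistent tuning logic rather than calling self._fit, so time series detection and automatic documentation are not triggered.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, ShuffleSplit

from gingado.benchmark import ggdBenchmark


class GradientBoostingBenchmark(ggdBenchmark):
    """A custom benchmark built around gradient boosting (illustrative)."""

    def __init__(self, param_grid=None, cv=None):
        self.param_grid = param_grid or {"n_estimators": [50, 100]}
        self.cv = cv or ShuffleSplit(n_splits=5)

    def fit(self, X, y=None):
        # a simple, consistent tuning logic; call `self._fit` here instead
        # if time series detection and auto-documentation are desired
        self.benchmark = GridSearchCV(
            GradientBoostingRegressor(), self.param_grid, cv=self.cv
        ).fit(X, y)
        return self

    def predict(self, X):
        return self.benchmark.predict(X)

    def score(self, X, y):
        return self.benchmark.score(X, y)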