GSTCV

class pybear.model_selection.GSTCV(estimator, param_grid, *, thresholds=None, scoring='accuracy', n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score=False)

Bases: _GSTCVMixin

Exhaustive cross-validated search over a grid of hyperparameter values and decision thresholds for a binary classifier.

The optimal hyperparameters and decision threshold are those that maximize the average score (or minimize the average loss) on the held-out data (test sets).

pybear GSTCV (grid search threshold CV) is intended to closely parallel the interface and user-experience of scikit-learn GridSearchCV. Users who are familiar with that GridSearch implementation should find that GSTCV differs with respect to 4 things:

1) the init parameter thresholds (which can also be passed as a parameter to param_grid)

2) additional columns in the cv_results_ attribute to report the best thresholds for each scorer

3) a new post-run attribute, best_threshold_, which informs about the overall best threshold

4) callables passed to scoring SHOULD NOT be wrapped in ‘make_scorer’ as would be done with GridSearchCV. Pass scoring callables in raw metric form. See the scikit-learn docs and the ‘Parameters’ section of the GSTCV docs for more information about ‘make_scorer’ and ‘metrics’.

Users who are familiar with the scikit implementation of GridSearch should focus on the above 4 areas in the GSTCV ‘Parameters’ and ‘Attributes’ sections of the docs for mastery of GSTCV.

GSTCV implements fit, predict_proba, predict, score, get_params, and set_params methods. It also implements decision_function, predict_log_proba, score_samples, transform and inverse_transform if they are exposed by the classifier passed to estimator.

Parameters:
estimatorobject

Required. Must be a binary classifier that conforms to the scikit-learn estimator API. The classifier must have the fit, set_params, get_params, and predict_proba methods. If the classifier does not have predict_proba, try wrapping it with CalibratedClassifierCV. The classifier does not need a score method: GSTCV never calls the estimator’s score method, because that method always uses a 0.5 threshold.

GSTCV deliberately blocks dask classifiers (including, but not limited to, dask_ml, xgboost, and lightGBM dask classifiers). To use dask classifiers, use GSTCVDask from pybear-dask.

param_gridParamGridInputType | ParamGridsInputType

Required. A dictionary with hyperparameter names (str) as keys and list-likes of respective settings to try as values. Can also be a list-like of such dictionaries, in which case the grid spanned by each dictionary is explored. This enables searching over any combination of hyperparameter settings.
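As a rough illustration of how a list of grids is explored, the sketch below expands each grid into its candidate settings with itertools.product. The helper name expand_grids is hypothetical and is not part of pybear; the order in which GSTCV actually visits candidates may differ.

```python
from itertools import product


def expand_grids(param_grid):
    """Hypothetical helper: list every hyperparameter combination."""
    # Accept a single grid (dict) or a list-like of grids.
    grids = [param_grid] if isinstance(param_grid, dict) else list(param_grid)
    candidates = []
    for grid in grids:
        names = sorted(grid)  # fixed key order, chosen for this sketch only
        for values in product(*(grid[name] for name in names)):
            candidates.append(dict(zip(names, values)))
    return candidates


grids = [
    {'C': [0.1, 1.0], 'penalty': ['l2']},
    {'C': [1.0], 'penalty': ['l1'], 'solver': ['saga', 'liblinear']},
]
candidates = expand_grids(grids)
```

Each grid contributes only its own combinations, so the two grids above yield four candidates in total rather than a cross of all listed values.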

thresholdsThresholdsInputType, default=None

The decision threshold search grid to use when performing the hyperparameter search. Other GridSearchCV modules only use the conventional decision threshold for binary classifiers, which is 0.5. This module can search over any set of decision threshold values in the 0 to 1 interval (inclusive) in the same manner as any other hyperparameter while performing the grid search.

The thresholds value passed via the init parameter can be None, a single number from 0 to 1 (inclusive) or a list-like of such numbers. If None, (and thresholds are not passed directly inside the param grid(s)), the default threshold grid is used, numpy.linspace(0, 1, 21).

Thresholds may also be passed to individual param grids via a ‘thresholds’ key. However, when passed directly to a param grid, ‘thresholds’ cannot be None or a single number; it must be a list-like of numbers, as is normally done with param grids.

Because thresholds can be passed in 2 different ways, there is a hierarchy that dictates which thresholds are used during searching and scoring. Any threshold values passed directly within a param grid always supersede any passed (or not passed) to the thresholds init parameter. When a param grid does not have any thresholds passed inside it, the value passed to the init parameter is used – if a value was not passed to the init parameter, then the GSTCV default value is used.
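The precedence described above can be sketched in plain Python. The function resolve_thresholds is a hypothetical helper written for this illustration, not a pybear function, and it uses a list comprehension in place of numpy.linspace(0, 1, 21).

```python
def resolve_thresholds(init_thresholds, param_grid):
    """Hypothetical helper: pick the threshold grid for one param grid."""
    # 1) thresholds passed inside the param grid always win
    if 'thresholds' in param_grid:
        return list(param_grid['thresholds'])
    # 2) otherwise fall back to the value passed to the init parameter
    if init_thresholds is not None:
        # a single number is treated as a one-element grid
        if isinstance(init_thresholds, (int, float)):
            return [init_thresholds]
        return list(init_thresholds)
    # 3) otherwise use the documented default: 21 evenly spaced values in [0, 1]
    return [i / 20 for i in range(21)]


in_grid = resolve_thresholds(0.4, {'C': [1], 'thresholds': [0.2, 0.3]})
from_init = resolve_thresholds(0.4, {'C': [1]})
default = resolve_thresholds(None, {'C': [1]})
```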

When one scorer is used, the best threshold is always exposed and is accessible via the best_threshold_ attribute. When multiple scorers are used, the best_threshold_ attribute is only exposed when a string value is passed to refit. The best threshold is never reported in the best_params_ attribute, even if thresholds were passed via a param grid; the best threshold is only available via the best_threshold_ attribute. Another way to discover the best threshold for each scorer is by inspection of the cv_results_ attribute.

The scores reported for test data in cv_results_ are those for the best threshold. Also note that when return_train_score is True, the scores returned for the train data are only for the best threshold found for the test data. That is, every threshold is scored on the test data, the best score is found, and the best threshold is the threshold corresponding to the best score. That best score and best threshold are reported for the test data. Then when scoring train data, only the best threshold is scored and reported in cv_results_.
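The selection just described can be sketched for a single test set as follows. This is a simplified illustration, not pybear’s implementation: the function names are hypothetical, the comparison probability >= threshold is an assumption, ties are resolved in favor of the first threshold, and aggregation across cv folds is not shown.

```python
def accuracy(y_true, y_pred):
    # raw metric with signature (y_true, y_pred)
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def apply_threshold(proba, threshold):
    # Assumption for this sketch: predict 1 when probability >= threshold.
    return [int(p >= threshold) for p in proba]


def best_threshold(y_test, proba_test, thresholds, metric):
    # Score every threshold on the test data and keep the best one.
    scores = [metric(y_test, apply_threshold(proba_test, t)) for t in thresholds]
    best_idx = max(range(len(thresholds)), key=scores.__getitem__)
    return thresholds[best_idx], scores[best_idx]


# positive-class probabilities, as would come from predict_proba
y_test, proba_test = [0, 0, 1, 1], [0.10, 0.30, 0.40, 0.90]
y_train, proba_train = [0, 1, 1, 0], [0.20, 0.50, 0.30, 0.10]

best_t, test_score = best_threshold(y_test, proba_test, [0.25, 0.35, 0.75], accuracy)
# Train data is scored only at the threshold that was best on the test data.
train_score = accuracy(y_train, apply_threshold(proba_train, best_t))
```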

scoringScorerInputType, default=’accuracy’

Strategy to evaluate the performance of the cross-validated model on the test set (and also train set, if return_train_score is True.)

For any number of scorers, scoring can be a dictionary with user-assigned scorer names as keys and callables as values. See below for clarification on allowed callables.

For a single scoring metric, a single string or a single callable is also allowed. Valid strings that can be passed are ‘accuracy’, ‘balanced_accuracy’, ‘average_precision’, ‘f1’, ‘precision’, and ‘recall’.

For evaluating multiple metrics, scoring can be a vector-like of unique strings, containing a combination of the allowed strings.

The default scorer of the estimator cannot be used by this module because the decision threshold cannot be manipulated. Therefore, scoring cannot accept a None argument.

About the scorer callable: This module’s scorers differ from other GSCV implementations in an important way. Some of those implementations accept ‘make_scorer’ functions, e.g. sklearn.metrics.make_scorer, but this module cannot accept this. ‘make_scorer’ implicitly assumes a decision threshold of 0.5, but this module needs to be able to calculate predictions based on any user-entered threshold. Therefore, in place of ‘make_scorer’ functions, this module uses scoring metrics directly (whereas they would otherwise be passed to ‘make_scorer’.) An example of a valid scoring metric is sklearn.metrics.accuracy_score.

Additionally, this module can accept any scoring function that has signature (y_true, y_pred) and returns a single number. Metric functions that return a list/array of values can be wrapped in multiple scorers that return one value each.

This module cannot directly accept scorer kwargs and pass them to scorers. To pass kwargs to your scoring metric, create a wrapper with signature (y_true, y_pred) around the metric and hard-code the kwargs into the metric, e.g.,

def your_metric_wrapper(y_true, y_pred):
    return your_metric(y_true, y_pred, **hard_coded_kwargs)
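A slightly fuller sketch of both wrapping patterns follows. The metrics fbeta and precision_and_recall are hypothetical pure-Python stand-ins written for this example; in practice the wrapped function would be whatever metric is being used.

```python
def fbeta(y_true, y_pred, beta=1.0):
    """Hypothetical metric that takes a keyword argument."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denom if denom else 0.0


def f2(y_true, y_pred):
    # kwargs are hard-coded; the wrapper keeps the (y_true, y_pred) signature
    return fbeta(y_true, y_pred, beta=2.0)


def precision_and_recall(y_true, y_pred):
    """Hypothetical metric that returns two values."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    n_pred_pos = sum(p == 1 for p in y_pred)
    n_true_pos = sum(t == 1 for t in y_true)
    precision = tp / n_pred_pos if n_pred_pos else 0.0
    recall = tp / n_true_pos if n_true_pos else 0.0
    return precision, recall


def precision_only(y_true, y_pred):
    # one scorer per returned value
    return precision_and_recall(y_true, y_pred)[0]


def recall_only(y_true, y_pred):
    return precision_and_recall(y_true, y_pred)[1]


# a scoring dictionary with user-assigned names as keys and raw callables as values
scoring = {'f2': f2, 'precision': precision_only, 'recall': recall_only}
```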

n_jobsint | None, default=None

Number of jobs to run in parallel. -1 means using all processors. For best speed benefit, pybear recommends setting n_jobs in both GSTCV and the wrapped estimator to None, whether under a joblib context manager or standing alone. When under a joblib context manager, also set n_jobs in the context manager to None.

refitbool | str | Callable, default=True

After the grid search is done, fit the estimator on the whole dataset using the best found hyperparameters and expose this fitted estimator via the best_estimator_ attribute. Also, when the estimator is refit on the best hyperparameters, the GSTCV instance itself becomes the best estimator, exposing the predict_proba(), predict(), and score() methods (and possibly others). When refit is not performed, the search simply finds the best hyperparameters and exposes them via the best_params_ attribute (unless there are multiple scorers and refit is False, in which case information about the grid search is only available via the cv_results_ attribute).

The values accepted by refit depend on the scoring scheme, that is, whether a single or multiple scorers are used. In all cases, refit can be boolean False (to disable refit), a string that indicates the name of the scorer to use (when there is only one scorer there is only one possible string value), or a callable. See below for more information about the refit callable. When one scorer is used, refit can be boolean True or False, but boolean True cannot be used when there is more than one scorer.

Where there are considerations other than maximum score in choosing a best estimator, refit can be set to a function that takes in cv_results_ and returns best_index_ (an integer). In that case, the returned best_index_ sets both the best_estimator_ and best_params_ attributes. When more than one scorer is used, the best_score_ and best_threshold_ attributes will not be available, but are available if there is only one scorer.
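As one possible illustration of a refit callable, the sketch below prefers the most stable candidate among those whose mean test score is close to the best. The function name, the 0.02 tolerance, and the selection rule are all assumptions made for this example; the key names follow the single-scorer ‘_score’ convention described in the cv_results_ attribute.

```python
def refit_most_stable(cv_results):
    """Hypothetical refit callable: takes cv_results_, returns best_index_."""
    means = cv_results['mean_test_score']
    stds = cv_results['std_test_score']
    best_mean = max(means)
    # candidates whose mean score is within the tolerance of the best
    eligible = [i for i in range(len(means)) if best_mean - means[i] <= 0.02]
    # among those, pick the smallest standard deviation across folds
    return int(min(eligible, key=lambda i: stds[i]))


# plain lists stand in for the masked arrays held by cv_results_
example_results = {
    'mean_test_score': [0.80, 0.81, 0.70],
    'std_test_score': [0.001, 0.020, 0.0001],
}
best_index = refit_most_stable(example_results)
```

Here candidate 1 has the highest mean, but candidate 0 is within the tolerance and varies less across folds, so index 0 is returned.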

See the scoring parameter to know more about multiple metric evaluation.

cvint | Iterable | None, default=None

Sets the cross-validation splitting strategy.

Possible inputs for cv are:

None, to use the default 5-fold cross validation,

an integer, must be 2 or greater, to specify the number of folds in a StratifiedKFold split,

an iterable yielding pairs of (train, test) split indices as arrays.

For passed iterables: This module will convert generators to lists. No validation is done beyond verifying that it is an iterable that contains pairs of iterables. GSTCV will catch out of range indices and raise an error but any validation beyond that is up to the user outside of GSTCV.
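A minimal sketch of a custom iterable of (train, test) index pairs is shown below. The generator contiguous_folds is hypothetical; it produces unshuffled, unstratified folds and uses plain lists of integers for the indices.

```python
def contiguous_folds(n_samples, n_splits):
    """Hypothetical generator of (train, test) index pairs."""
    # spread the remainder over the first folds so sizes differ by at most 1
    sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


# materialized here for inspection; per the text above, a generator
# passed to cv would be converted to a list anyway
cv = list(contiguous_folds(n_samples=10, n_splits=3))
```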

verbosenumbers.Real, default=0

The amount of verbosity to display to screen during the grid search. Accepts integers from 0 to 10. 0 means no information displayed to the screen, 10 means full verbosity. Non-numbers are rejected. Boolean False is set to 0, boolean True is set to 10. Negative numbers are rejected. Numbers greater than 10 are set to 10. Floats are rounded to integers.
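The rules above can be written out as a small function. This is a sketch of the stated rules only, with a hypothetical name; the exception types and the exact rounding behavior for values such as 2.5 are assumptions.

```python
import numbers


def normalize_verbose(verbose):
    """Hypothetical helper mirroring the verbose rules described above."""
    # check booleans first, because bool is a subclass of int in Python
    if isinstance(verbose, bool):
        return 10 if verbose else 0
    if not isinstance(verbose, numbers.Real):
        raise TypeError("verbose must be a bool or a real number")
    if verbose < 0:
        raise ValueError("verbose cannot be negative")
    # cap at 10, then round floats to an integer
    return int(round(min(verbose, 10)))


levels = [normalize_verbose(v) for v in (False, True, 3.2, 25)]
```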

pre_dispatchPreDispatchType, default=’2*n_jobs’

The number of batches (of tasks) to be pre-dispatched. Default is ‘2*n_jobs’. See the joblib.Parallel docs for more information.

error_scoreErrorScoreType, default=’raise’

Score to assign if the estimator raises an error while fitting on a train fold. If set to ‘raise’, the error is raised. If a numeric value is given, a warning is raised and the error score value is inserted into the subsequent calculations in place of the missing value(s). This parameter does not affect the refit step, which will always raise the error.

return_train_scorebool, default=False

If False, the cv_results_ attribute will not include training scores. If True, the train data is scored using all the scorers at the best respective threshold(s) found for the test data. Training scores are used to assess conditions of under- or over-fitting. However, computing the scores on the training set can be computationally expensive and is not required to select the hyperparameters that yield the best performance.

Attributes:
cv_results_CVResultsType

A dictionary with column headers as keys and results as values, that can be conveniently converted into a pandas DataFrame. Always exposed after fit.

Below is an example of cv_results_ for a logistic classifier, with:

cv=3,

param_grid={‘C’: [1e-5, 1e-4]},

thresholds=np.linspace(0,1,21),

scoring=[‘accuracy’, ‘balanced_accuracy’]

return_train_score=False

on random data.

{

‘mean_fit_time’: [1.227847, 0.341168]

‘std_fit_time’: [0.374309, 0.445982]

‘mean_score_time’: [0.001638, 0.001676]

‘std_score_time’: [0.000551, 0.000647]

‘param_C’: [0.00001, 0.0001]

‘params’: [{‘C’: 1e-05}, {‘C’: 0.0001}]

‘best_threshold_accuracy’: [0.5, 0.51]

‘split0_test_accuracy’: [0.785243, 0.79844]

‘split1_test_accuracy’: [0.80228, 0.814281]

‘split2_test_accuracy’: [0.805881, 0.813381]

‘mean_test_accuracy’: [0.797801, 0.808701]

‘std_test_accuracy’: [0.009001, 0.007265]

‘rank_test_accuracy’: [2, 1]

‘best_threshold_balanced_accuracy’: [0.5, 0.51]

‘split0_test_balanced_accuracy’: [0.785164, 0.798407]

‘split1_test_balanced_accuracy’: [0.802188, 0.814252]

‘split2_test_balanced_accuracy’: [0.805791, 0.813341]

‘mean_test_balanced_accuracy’: [0.797714, 0.808667]

‘std_test_balanced_accuracy’: [0.008995, 0.007264]

‘rank_test_balanced_accuracy’: [2, 1]

}

Slicing across the dictionary values gives the results for a single permutation of grid search. That is, indexing into all of the masked arrays at position zero gives the result for the first permutation of the grid search, index 1 contains the results for the second permutation, and so forth.
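For instance, one row can be pulled out of the dictionary like this. The helper candidate_row is hypothetical, and plain lists stand in for the masked arrays; the values are taken from the example above.

```python
cv_results = {
    'param_C': [0.00001, 0.0001],
    'params': [{'C': 1e-05}, {'C': 0.0001}],
    'mean_test_accuracy': [0.797801, 0.808701],
    'rank_test_accuracy': [2, 1],
}


def candidate_row(cv_results, index):
    """Hypothetical helper: collect every column's value at one position."""
    return {key: values[index] for key, values in cv_results.items()}


second = candidate_row(cv_results, 1)
```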

The key ‘params’ is used to store a list of parameter settings dictionaries for all the hyperparameter candidates. That is, the ‘params’ key holds all the possible permutations of hyperparameters for the given search grid(s).

‘mean_fit_time’, ‘std_fit_time’, ‘mean_score_time’ and ‘std_score_time’ are all in seconds.

For single-metric evaluation, the scores for the single scorer are available in the cv_results_ dict at the keys ending with ‘_score’. For multi-metric evaluation, the scores for all the scorers are available in the cv_results_ dict at the keys ending with that scorer’s name (‘_<scorer_name>’). E.g., ‘split0_test_precision’, ‘mean_train_precision’, etc.

best_estimator_object

The estimator that was chosen by the search, i.e. the estimator which gave the highest score (or smallest loss) on the held-out (test) data. Only exposed when refit is not False; see the refit parameter for more information on allowed values.

best_score_float

The mean of the scores of the held-out (test) cv folds for the best estimator with the best threshold applied. Always exposed when there is one scorer, or when refit is specified as a string for 2+ scorers.

best_params_dict[str, Any]

The dictionary found in the cv_results_ ‘params’ column in the best_index_ position, which gives the hyperparameter settings that resulted in the highest mean score (best_score_) on the held-out (test) data with the best threshold applied.

best_params_ never holds the best threshold. Access the best threshold via the best_threshold_ attribute (if available) or the cv_results_ attribute.

best_params_ is always exposed when there is one scorer, or when refit is not False for 2+ scorers.

best_index_int

The index of the cv_results_ arrays which corresponds to the best hyperparameter settings. Always exposed when there is one scorer, or when refit is not False for 2+ scorers.

scorer_dict

Scorer metric(s) used on the held out data to choose the best hyperparameters for the model. Always exposed after fit.

This attribute holds the validated scoring dictionary which maps the scorer key to the scorer metric callable, i.e., a dictionary of {scorer_name: scorer_metric}.

n_splits_int

The number of cross-validation splits (folds). Always exposed after fit.

refit_time_float

Seconds elapsed when refitting the best model on the whole dataset. Only exposed when refit is not False.

multimetric_bool

Whether several scoring metrics were used. False if one scorer was used, otherwise True. Always exposed after fit.

classes_

Class labels.

n_features_in_

Number of features seen during fit.

feature_names_in_

Feature names seen during fit.

best_threshold_float

The threshold that, along with the hyperparameter values found in best_params_, yields the highest score for the given estimator and data.

The best threshold is only available conditionally via the best_threshold_ attribute. When one scorer is used, the best threshold found is always exposed. When multiple scorers are used, the best_threshold_ attribute is only exposed when a string value is passed to refit. Another way to find the best threshold for each scorer and overall is by inspection of cv_results_.
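The availability rules stated for the attributes above can be condensed into one function. This is only a summary sketch with a hypothetical name, not something pybear provides. It assumes a valid combination of inputs (the refit parameter does not allow boolean True with more than one scorer).

```python
def exposed_attributes(n_scorers, refit):
    """Hypothetical summary of which post-fit attributes are exposed."""
    single = n_scorers == 1
    refit_on = refit is not False       # True, a string, or a callable
    refit_is_str = isinstance(refit, str)
    return {
        'cv_results_': True,
        'best_estimator_': refit_on,
        'refit_time_': refit_on,
        'best_params_': single or refit_on,
        'best_index_': single or refit_on,
        'best_score_': single or refit_is_str,
        'best_threshold_': single or refit_is_str,
    }


one_scorer_no_refit = exposed_attributes(1, False)
two_scorers_no_refit = exposed_attributes(2, False)
two_scorers_str_refit = exposed_attributes(2, 'accuracy')
two_scorers_callable = exposed_attributes(2, lambda cv_results: 0)
```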

Methods

Notes

Type Aliases

class ClassifierProtocol(Protocol):

    def fit(self, X: Any, y: Any) -> Self

    def get_params(self, **kwargs) -> dict[str, Any]

    def set_params(self, **kwargs) -> Self

    def predict_proba(self, X: Any) -> Any

ParamGridInputType:

dict[str, Sequence[Any]]

ParamGridsInputType:

Sequence[ParamGridInputType]

ThresholdsInputType:

None | numbers.Real | Sequence[numbers.Real]

SKSlicerType:

Sequence[int]

SKKFoldType:

tuple[SKSlicerType, SKSlicerType]

ScorerNameTypes:

Literal[‘accuracy’, ‘balanced_accuracy’, ‘average_precision’, ‘f1’, ‘precision’, ‘recall’]

ScorerCallableType:

Callable[[Iterable, Iterable], numbers.Real]

ScorerInputType:

ScorerNameTypes | Sequence[ScorerNameTypes] | ScorerCallableType | dict[str, ScorerCallableType]

RefitCallableType:

Callable[[CVResultsType], int]

RefitType:

bool | ScorerNameTypes | RefitCallableType

PreDispatchType:

Literal[‘all’] | str | int

SKXType:

Iterable

SKYType:

Sequence[int]

CVResultsType:

dict[str, np.ma.masked_array[Any]]

FeatureNamesInType:

npt.NDArray[str]

Examples

>>> import numpy as np
>>> from pybear.model_selection import GSTCV
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import make_classification
>>>
>>> clf = LogisticRegression(
...     solver='saga',
...     penalty='elasticnet'
... )
>>> X, y = make_classification(n_samples=1000, n_features=5, random_state=19)
>>> param_grid = {
...     'C': np.logspace(-6, -3, 4),
...     'l1_ratio': np.linspace(0, 1, 5)
... }
>>> gstcv = GSTCV(
...     estimator=clf,
...     param_grid=param_grid,
...     thresholds=np.linspace(0, 1, 5),
...     scoring='balanced_accuracy',
...     n_jobs=-1,
...     refit=False,
...     cv=5,
...     verbose=0,
...     pre_dispatch='2*n_jobs',
...     error_score='raise',
...     return_train_score=False
... )
>>> gstcv.fit(X, y)
GSTCV(cv=5, estimator=LogisticRegression(penalty='elasticnet', solver='saga'),
      n_jobs=-1,
      param_grid={'C': array([1.e-06, 1.e-05, 1.e-04, 1.e-03]),
                  'l1_ratio': array([0.  , 0.25, 0.5 , 0.75, 1.  ])},
      refit=False, scoring='balanced_accuracy',
      thresholds=array([0.  , 0.25, 0.5 , 0.75, 1.  ]))
>>>
>>> gstcv.best_params_
{'C': 0.001, 'l1_ratio': 0.25}
>>>
>>> gstcv.best_threshold_
0.5

property classes_

Class labels.

Only exposed when refit is not False. Because GSTCV imposes a restriction that y must be binary in [0, 1], this must always return [0, 1].

Returns:
classes_numpy.ndarray[np.int64]

The class labels for the target.

decision_function(X)

Call decision_function on the estimator with the best parameters.

Only available if refit is not False and the underlying estimator supports decision_function.

Parameters:
Xarray_like, shape (n_samples, n_features)

Must fulfill the input assumptions of the underlying estimator.

Returns:
outAny

The best_estimator_ decision_function method result for X.

property feature_names_in_

Feature names seen during fit.

Only available when refit is not False and GSTCV was fit on data that exposes feature names.

Returns:
feature_names_in_FeatureNamesInType

The feature names seen at first fit if the data was passed in a container that has a header with valid feature names.

fit(X, y, **fit_params)

Perform the grid search with the hyperparameter settings in param_grid to find the unique hyperparameter values that maximize score (minimize loss) for the estimator and data being used.

Parameters:
Xarray_like, shape (n_samples, n_features)

The data on which to perform the grid search. Must fulfill the input assumptions of the underlying estimator.

yvector-like, shape (n_samples,) or (n_samples, 1)

The target relative to X. Must be binary in [0, 1]. Must fulfill the input assumptions of the underlying estimator.

**fit_paramsdict[str, Any]

Parameters passed to the fit method of the estimator. If a fit parameter is an array-like whose length is equal to n_samples, then it will be split across CV groups along with X and y. For example, the sample_weight parameter is split because len(sample_weight) == len(X). For array-likes intended to be subject to CV splits, care must be taken to ensure that any such vector is shaped (n_samples, ) or (n_samples, 1), otherwise it will not be split.

For pipelines, fit parameters can be passed to the fit method of any of the steps. Prefix the parameter name with the name of the step, such that parameter p for step s has key s__p.
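The splitting of per-sample fit parameters can be pictured with the sketch below. The helper split_fit_params is hypothetical and handles only one-dimensional sequences; it is meant to show the length-based rule, not how GSTCV implements it.

```python
def split_fit_params(fit_params, train_idx, n_samples):
    """Hypothetical helper: slice per-sample fit parameters for one train fold."""
    out = {}
    for name, value in fit_params.items():
        # a parameter is treated as per-sample when its length equals n_samples
        per_sample = (
            hasattr(value, '__len__')
            and not isinstance(value, (str, bytes, dict))
            and len(value) == n_samples
        )
        out[name] = [value[i] for i in train_idx] if per_sample else value
    return out


fit_params = {'sample_weight': [1.0, 2.0, 1.0, 0.5, 3.0], 'check_input': False}
fold_params = split_fit_params(fit_params, train_idx=[0, 2, 4], n_samples=5)
```

The length-5 sample_weight is sliced to the train indices, while the scalar parameter is passed through unchanged.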

Returns:
selfobject

The fitted GSTCV instance.

get_metadata_routing()

get_metadata_routing is not implemented in GSTCV.

get_params(deep=True)

Get parameters for this instance.

The ‘instance’ may be a pybear estimator, transformer, or a gridsearch module that wraps a nested estimator or pipeline.

Parameters:
deepbool, default=True

For instances that do not have nested instances in an estimator attribute (such as estimators or transformers), this parameter is ignored and the same (full) set of parameters for the instance is returned regardless of the value of this parameter.

For instances that have nested instances (such as a GridSearch instance with a nested estimator or pipeline) in the estimator attribute, deep=False will only return the parameters for the wrapping instance. For example, a GridSearch module wrapping an estimator will only return parameters for the GridSearch instance, ignoring the parameters of the nested instance. When deep=True, this method returns the parameters of the wrapping instance as well as the parameters of the nested instance. When the nested instance is a single estimator, the full set of parameters for the single estimator are returned in addition to the parameters of the wrapping instance. If the nested object is a pipeline, the parameters of the pipeline and the parameters of each of the steps in the pipeline are returned in addition to the parameters of the wrapping instance. The estimator’s parameters are prefixed with estimator__.

Returns:
paramsdict[str, Any]

Parameter names mapped to their values.

inverse_transform(X)

Call inverse_transform on the estimator with the best parameters.

Only available if refit is not False and the underlying estimator supports inverse_transform.

Parameters:
Xarray_like

Must fulfill the input assumptions of the underlying estimator.

Returns:
outAny

The best_estimator_ inverse_transform method result for X.

property n_features_in_

Number of features seen during fit.

Only available when refit is not False.

Returns:
n_features_in_int

The number of features seen in the data at first fit.

predict(X)

Pass X to predict_proba on the estimator with the best parameters and apply the best threshold to predict the classes for X.

When only one scorer is used, predict is available if refit is not False. When more than one scorer is used, predict is only available if refit is set to a string.

Parameters:
Xarray_like, shape (n_samples, n_features)

Must fulfill the input assumptions of the underlying estimator.

Returns:
outAny

A vector in [0,1] indicating the class label for the examples in X.

predict_log_proba(X)

Call predict_log_proba on the estimator with the best parameters.

Only available if refit is not False and the underlying estimator supports predict_log_proba.

Parameters:
Xarray_like, shape (n_samples, n_features)

Must fulfill the input assumptions of the underlying estimator.

Returns:
outAny

The best_estimator_ predict_log_proba method result for X.

predict_proba(X)

Call predict_proba on the estimator with the best parameters.

Only available if refit is not False. The underlying estimator must support this method, as it is a characteristic that is validated.

Parameters:
Xarray_like, shape (n_samples, n_features)

Must fulfill the input assumptions of the underlying estimator.

Returns:
outAny

The best_estimator_ predict_proba method result for X.

score(X, y)

Score the given X and y using the best estimator, best threshold, and the defined scorer.

When there is only one scorer, that is the defined scorer, and the score method is available if refit is not False. When there are multiple scorers, the score method is only available if refit is set to a string value, and the defined scorer is the one named by refit.

See the documentation for the scoring parameter for information about passing kwargs to the scorer.

Parameters:
Xarray_like, shape (n_samples, n_features)

Must fulfill the input assumptions of the underlying estimator.

yvector-like, shape (n_samples, ) or (n_samples, 1)

The target relative to X. Must be binary in [0, 1].

Returns:
scorefloat

The score for X and y on the best estimator and best threshold using the defined scorer.

score_samples(X)

Call score_samples on the estimator with the best parameters.

Only available if refit is not False and the underlying estimator supports score_samples.

Parameters:
Xarray_like, shape (n_samples, n_features)

Must fulfill the input assumptions of the underlying estimator.

Returns:
outAny

The best_estimator_ score_samples method result for X.

set_params(**params)

Set the parameters of an instance or a nested instance.

This method works on simple estimator and transformer instances as well as on nested objects (such as GridSearch instances).

Setting the parameters of simple estimators and transformers is straightforward. Pass the exact parameter name and its value as a keyword argument to the set_params method call. Or use ** dictionary unpacking on a dictionary keyed with exact parameter names and the new parameter values as the dictionary values. Valid parameter keys can be listed with get_params().

Setting the parameters of a GridSearch instance (but not the nested instance) can be done in the same way as above. The parameters of nested instances can be updated using prefixes on the parameter names.

Simple estimators in a GridSearch instance can be updated by prefixing the estimator’s parameters with estimator__. For example, if some estimator has a ‘depth’ parameter, then setting the value of that parameter to 3 would be accomplished by passing estimator__depth=3 as a keyword argument to the set_params method call.

The parameters of a pipeline nested in a GridSearch instance can be updated using the form estimator__<pipe_parameter>. The parameters of the steps of a pipeline have the form <step>__<parameter> so that it’s also possible to update a step’s parameters through the set_params method interface. The parameters of steps in the pipeline can be updated using estimator__<step>__<parameter>.
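The prefix convention can be illustrated by splitting parameter keys on the first double underscore. The function route_params is a hypothetical sketch of that idea and does not reflect how set_params is implemented internally.

```python
def route_params(params):
    """Hypothetical helper: separate search-level params from nested ones."""
    own, nested = {}, {}
    for key, value in params.items():
        prefix, sep, rest = key.partition('__')
        if sep and prefix == 'estimator':
            # forwarded to the nested estimator or pipeline, prefix removed
            nested[rest] = value
        else:
            # belongs to the wrapping GridSearch instance itself
            own[key] = value
    return own, nested


own, nested = route_params(
    {'cv': 3, 'estimator__depth': 3, 'estimator__clf__C': 0.1}
)
```

In this sketch ‘clf’ plays the role of a pipeline step name, so ‘clf__C’ is what the nested pipeline would receive.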

Parameters:
**paramsdict[str, Any]

The parameters to be updated and their new values.

Returns:
selfobject

The instance with new parameter values.

transform(X)

Call transform on the estimator with the best parameters.

Only available if refit is not False and the underlying estimator supports transform.

Parameters:
Xarray_like, shape (n_samples, n_features)

Must fulfill the input assumptions of the underlying estimator.

Returns:
X_trAny

The best_estimator_ transform method result for X.

visualize(*args, **kwargs)

Call visualize on the estimator with the best parameters.

Only available if refit is not False and the underlying estimator supports visualize.

Parameters:
*argslist[Any]

Positional arguments for the best estimator’s visualize method.

**kwargsdict[str, Any]

Keyword arguments for the best estimator’s visualize method.

Returns:
outAny

The best_estimator_ visualize output.