InterceptManager
- class pybear.preprocessing.InterceptManager(*, keep='last', equal_nan=True, rtol=1e-05, atol=1e-08)
Bases: FeatureMixin, FitTransformMixin, GetParamsMixin, ReprMixin, SetOutputMixin, SetParamsMixin
A scikit-style transformer that identifies and manages the constant columns in a dataset.
A dataset may contain columns with constant values for a variety of reasons, some intentional, some circumstantial. The use of a column of constants in a dataset may be a design consideration for some data analytics algorithms, such as multiple linear regression. Therefore, the presence of one such column may be desirable.
The presence of multiple constant columns is generally a degenerate condition. In many data analytics learning algorithms, such a condition can cause convergence problems, inversion problems, or other undesirable effects. The analyst is often forced to address the issue to perform a meaningful analysis of the data.
InterceptManager (IM) has several key characteristics that make it a versatile and powerful tool that can help fix this condition.
IM…
- handles numerical and non-numerical data
- accepts nan-like values, and has flexibility in dealing with them
- has a partial fit method for batch-wise fitting and transforming
- has parameters that give flexibility to how ‘constant’ is defined
- can remove all, keep one, or append a column of constants to data
The methodology that IM uses to identify a constant column is different for numerical and non-numerical data.
In the simplest situation with non-numerical data where nan-like values are not involved, the computation is simply to determine the number of unique values in the column. If there is only one unique value, then the column is constant.
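The unique-count test for non-numerical data can be sketched in a few lines. This is only an illustration that assumes no nan-like values are present; the helper name `is_constant_non_numeric` is hypothetical and is not part of pybear.

```python
def is_constant_non_numeric(column):
    """Return True if the column holds exactly one distinct value."""
    # A set keeps one copy of each distinct value in the column.
    return len(set(column)) == 1


colors = ["red", "red", "red"]
mixed = ["red", "blue", "red"]

print(is_constant_non_numeric(colors))  # True
print(is_constant_non_numeric(mixed))   # False
```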
The computation for numerical columns is slightly more complex. IM calculates the mean of the column then compares it against the individual values via numpy.allclose. allclose has ‘rtol’ and ‘atol’ parameters that give latitude to the definition of ‘equal’. They provide a tolerance window whereby numerical data that are not exactly equal are considered equal if their difference falls within the tolerance. IM affords some flexibility in defining ‘equal’ when identifying constants by providing direct access to the numpy.allclose ‘rtol’ and ‘atol’ parameters via its own, identically named, rtol and atol parameters. IM requires that rtol and atol be non-boolean, non-negative real numbers, in addition to any other restrictions enforced by numpy.allclose. See the numpy docs for clarification of the technical details.
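The mean-versus-values comparison can be illustrated directly with numpy.allclose. The sketch below assumes a column without nan-like values; the helper name `is_constant_numeric` is hypothetical and this is not pybear's internal code.

```python
import numpy as np


def is_constant_numeric(column, rtol=1e-5, atol=1e-8):
    """Compare every value against the column mean with numpy.allclose."""
    column = np.asarray(column, dtype=np.float64)
    # allclose tests |value - mean| <= atol + rtol * |mean| for every value.
    return bool(np.allclose(column, column.mean(), rtol=rtol, atol=atol))


nearly_constant = [1.0, 1.0 + 1e-9, 1.0 - 1e-9]
varying = [1.0, 1.1, 0.9]

print(is_constant_numeric(nearly_constant))    # True
print(is_constant_numeric(varying))            # False
print(is_constant_numeric(varying, atol=0.2))  # True, the window is wider
```

The last call shows why rtol and atol matter: the same column is classified differently once the tolerance window is widened.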
The equal_nan parameter controls how IM handles nan-like values. If equal_nan is True, IM excludes any nan-like values from the allclose comparison. This essentially assumes that the nan values are equal to the mean of the non-nan values within their column, so nan-like values will not in and of themselves cause a column to be considered non-constant. If equal_nan is False, IM does not assume that the nan values are implicitly equal to the mean of the non-nan values, and any column containing a nan-like value is considered non-constant. This is in line with the normal numpy handling of nan-like values. See the Notes section below for a discussion on the handling of nan-like values.
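The effect of equal_nan can be sketched with plain numpy. The helper `is_constant_with_nans` is hypothetical, and the treatment of an all-nan column is an arbitrary choice made for this sketch; it does not describe what pybear does in that case.

```python
import numpy as np


def is_constant_with_nans(column, equal_nan=True, rtol=1e-5, atol=1e-8):
    """Hypothetical helper mirroring the equal_nan behavior described above."""
    column = np.asarray(column, dtype=np.float64)
    nan_mask = np.isnan(column)
    if nan_mask.any() and not equal_nan:
        # A nan is not assumed equal to anything, so the column is not constant.
        return False
    values = column[~nan_mask]
    if values.size == 0:
        # Arbitrary choice for this sketch only: nothing left to compare.
        return True
    # Only the non-nan values take part in the comparison against the mean.
    return bool(np.allclose(values, values.mean(), rtol=rtol, atol=atol))


column = [3.0, np.nan, 3.0]
print(is_constant_with_nans(column, equal_nan=True))   # True
print(is_constant_with_nans(column, equal_nan=False))  # False
```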
IM also has a keep parameter that allows the user to manage the constant columns that are identified. keep accepts several types of arguments. The ‘Keep’ discussion section has a list of all the options that can be passed to keep, what they do, and how to use them.
The partial_fit(), fit(), and inverse_transform() methods of IM accept data as numpy arrays, pandas dataframes, polars dataframes, and scipy sparse matrices/arrays. transform() and fit_transform() also accept these containers, but the type of output container for these methods can be controlled by the set_output() method. The user can set the type of output container regardless of the type of input container. Output containers available via set_output are numpy arrays, pandas dataframes, and polars dataframes. When set_output is None, the output container is the same as the input, that is, numpy array, pandas or polars dataframe, or scipy sparse matrix/array.
The partial_fit method allows for incremental fitting of data. This makes IM suitable for use with packages that do batch-wise fitting and transforming, such as dask_ml via the Incremental and ParallelPostFit wrappers.
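One way to picture incremental fitting is that a column can only remain constant if it is constant in every batch, with the same value each time. The sketch below illustrates that idea in pure Python using exact equality instead of rtol/atol; the function `update_constants` is hypothetical and is not how pybear implements partial_fit.

```python
def update_constants(known, batch):
    """Merge one batch's constant columns into the running result.

    known: dict {column index: value} from earlier batches, or None
           before the first batch.
    batch: list of rows, each row a list of values.
    """
    n_columns = len(batch[0])
    batch_constants = {}
    for idx in range(n_columns):
        values = {row[idx] for row in batch}
        if len(values) == 1:
            batch_constants[idx] = values.pop()
    if known is None:
        return batch_constants
    # A column stays constant only if this batch agrees with earlier ones.
    return {
        idx: value
        for idx, value in known.items()
        if idx in batch_constants and batch_constants[idx] == value
    }


batch_1 = [[1, 0, 5], [2, 0, 5]]
batch_2 = [[3, 0, 6], [4, 0, 6]]

state = None
for batch in (batch_1, batch_2):
    state = update_constants(state, batch)

print(state)  # {1: 0}
```

Column 2 is constant within each batch but its value changes from 5 to 6, so it drops out after the second batch.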
There are no safeguards in place to prevent the user from changing the rtol, atol, or equal_nan values between calls to partial_fit. These three parameters strongly influence whether IM classifies a column as constant, and are therefore instrumental in dictating what IM learns during fitting. Changes to these parameters between partial fits can drastically change IM’s understanding of the constant columns in the data versus what would otherwise be learned under constant settings. pybear recommends against this practice, but it is not strictly blocked.
When performing multiple batch-wise transformations of data, that is, making sequential calls to transform, it is critical that the same column indices be kept / removed at each call. This issue manifests when keep is set to ‘random’; the random index to keep must be the same at all calls to transform, and cannot be dynamically randomized within transform. IM handles this by generating a static random column index to keep at fit time, and does not change this number during transform time. This number is dynamic with each call to partial_fit, and will likely change at each call. Fits performed after calls to transform will change the random index away from that used in the previous transforms, causing IM to perform entirely different transformations than those previously being done. IM cannot block calls to partial_fit after transform has been called, but pybear strongly discourages this practice because the output will be nonsensical. pybear recommends doing all partial fits consecutively, then doing all transformations consecutively.
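The reason the random index has to be fixed at fit time can be shown with a toy class. `RandomKeepSketch`, its attributes, and the `seed` argument are all hypothetical names made up for this illustration; they are not pybear's API.

```python
import random


class RandomKeepSketch:
    """Hypothetical illustration, not pybear's implementation."""

    def fit(self, constant_indices, seed=None):
        # Draw the index to keep once, at fit time, and store it.
        rng = random.Random(seed)
        self.constant_indices_ = sorted(constant_indices)
        self.keep_index_ = rng.choice(self.constant_indices_)
        return self

    def transform(self, row):
        # Reuse the stored index; nothing is re-randomized here.
        drop = set(self.constant_indices_) - {self.keep_index_}
        return [value for idx, value in enumerate(row) if idx not in drop]


sketch = RandomKeepSketch().fit([1, 2, 4], seed=0)
first = sketch.transform([9, 0, 1, 8, 7])
second = sketch.transform([3, 0, 1, 6, 7])
# Both batches drop the same columns, so their layouts line up.
```

Calling fit again would redraw keep_index_, which is the situation the paragraph above warns about: batches transformed before and after the refit could keep different columns.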
The ‘keep’ Parameter
IM learns which columns are constant during fitting. At the time of transform, IM applies the instruction given to it via the keep parameter. The keep parameter takes several types of arguments, providing various ways to manage the columns of constants within a dataset. Below is a comprehensive list of all the arguments that can be passed to keep.
- Literal ‘first’:
Retains the constant column left-most in the data (if any) and deletes any others. Must be lower case. Does not raise an exception if there are no constant columns.
- Literal ‘last’:
The default setting, keeps the constant column right-most in the data (if any) and deletes any others. Must be lower case. Does not raise an exception if there are no constant columns.
- Literal ‘random’:
Keeps a single randomly-selected constant column (if any) and deletes any others. Must be lower case. Does not raise an exception if there are no constant columns.
- Literal ‘none’:
Removes all constant columns (if any). Must be lower case. Does not raise an exception if there are no constant columns.
- integer:
An integer indicating the column index in the original data to keep, while removing all other columns of constants. IM will raise an exception if this passed index is not a column of constants.
- string:
A string indicating the feature name to keep if a container with a header is passed, while deleting all other constant columns. Case sensitive. IM will raise an exception if 1) a string is passed that is not an allowed string literal (‘first’, ‘last’, ‘random’, ‘none’) but a valid container is not passed to fit, 2) a valid container is passed to fit but the given feature name is not valid, 3) the feature name is valid but the column is not constant.
- callable(X):
A callable that returns a valid column index when the data is passed to it, indicating the index of the column of constants to keep while deleting all other columns of constants. This enables the analyst to use characteristics of the data being transformed to determine which column of constants to keep. IM passes the data as-is directly to the callable without any preprocessing. The callable needs to operate on the data object directly. IM will raise an exception if 1) the callable does not return an integer, 2) the integer returned is out of the range of columns in the passed data, 3) the integer that is returned does not correspond to a constant column. IM does not retain state information about what indices have been returned from the callable during transform. IM cannot catch if the callable is returning different indices for different batches of data within a sequence of calls to transform. When doing multiple batch-wise transforms, it is up to the user to ensure that the callable returns the same index for each call to transform. If the callable returns a different index for any of the batches of data passed in a sequence of transforms then the results will be nonsensical.
- dictionary[str, Any]:
Dictionary of {feature name:str, constant value:Any}. A column of constants is appended to the right end of the data, with the constant being the value in the dictionary. The keep dictionary requires a single key:value pair. The key must be a string indicating feature name. This applies to any format of data that is transformed. If the data is a pandas or polars dataframe, then this string will become the feature name of the new constant feature. If the fitted data is a numpy array or scipy sparse, then this column name is ignored. The dictionary value is the constant value for the new feature. This value has only two restrictions: it cannot be a non-string sequence (e.g. list, tuple, etc.) and it cannot be a callable. Essentially, the constant value is restricted to being integer, float, string, or boolean.
When appending a constant value to a pandas dataframe, if the constant is numeric it is appended as numpy.float64; if it is not numeric it is appended as Python object. When appending a constant value to a polars dataframe, if the constant is numeric it is appended as polars.Float64; if it is not numeric it is appended as polars.Object. Otherwise, if the constant is being appended to a numpy array or scipy sparse it will be forced to the dtype of the transformed data (with some caveats).
When transforming a pandas dataframe or polars dataframe and the new feature name is already a feature in the data, there are two possible outcomes. 1) If the original feature is not constant, the new constant values will overwrite the old column (generally an undesirable outcome). 2) If the original feature is constant, the original column will be removed and a new column with the same name will be appended with the new constant values. IM will warn about this condition but not terminate the program. It is up to the user to manage the feature names in this situation.
The column_mask_ attribute is not adjusted for the new feature appended by the keep dictionary (see the discussion on column_mask_.) But the keep dictionary does make adjustment to get_feature_names_out(). Because get_feature_names_out reflects the state of transformed data, and the keep dictionary modifies the data at transform time, get_feature_names_out reflects this modification. column_mask_ is intended to be applied to pre-transform data, therefore that dimensionality is preserved.
To access the keep literals (‘first’, ‘last’, ‘random’, ‘none’), these MUST be passed as lower-case. If a pandas or polars dataframe is fitted and there is a conflict between a literal that has been passed to keep and a feature name, IM will raise because it is not clear to IM whether you want to indicate the literal or the feature name. To afford a little more flexibility with feature names, IM does not normalize case for this parameter. This means that if keep is passed as ‘first’, feature names such as ‘First’, ‘FIRST’, ‘FiRsT’, etc. will not raise, only ‘first’ will.
The only value that removes all constant columns is ‘none’. All other valid arguments for keep leave one column of constants behind and all other constant columns are removed from the dataset. If IM does not find any constant columns, ‘first’, ‘last’, ‘random’, and ‘none’ will not raise an exception. It is like telling IM: “I don’t know if there are any constant columns, but if you find some, then apply this rule.” However, if using an integer, feature name, or callable, IM will raise an exception if it does not find a constant column at that index. It is like telling IM: “I know that this column is constant, and you need to keep it and remove any others.” If IM finds that it is not constant, it will raise an exception because you lied to it.
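The rules for the string literals and the integer option can be condensed into a small function. This is a sketch only: `resolve_keep` is a hypothetical name, the exception types are assumptions, and the feature name, callable, and dictionary options are left out.

```python
import random


def resolve_keep(keep, constant_columns):
    """Split constant columns into (kept, removed) dictionaries.

    constant_columns: dict {column index: constant value}.
    """
    indices = sorted(constant_columns)
    if keep in ("first", "last", "random", "none"):
        if keep == "none" or not indices:
            kept_index = None  # literals never raise when nothing is found
        elif keep == "first":
            kept_index = indices[0]
        elif keep == "last":
            kept_index = indices[-1]
        else:
            kept_index = random.choice(indices)
    elif isinstance(keep, int) and not isinstance(keep, bool):
        if keep not in constant_columns:
            # Exception type is an assumption of this sketch.
            raise ValueError(f"column {keep} is not constant")
        kept_index = keep
    else:
        raise TypeError("this sketch covers only the literals and integers")
    kept = {i: v for i, v in constant_columns.items() if i == kept_index}
    removed = {i: v for i, v in constant_columns.items() if i != kept_index}
    return kept, removed


constants = {1: 0, 2: 1}
print(resolve_keep("first", constants))  # ({1: 0}, {2: 1})
print(resolve_keep("none", constants))   # ({}, {1: 0, 2: 1})
print(resolve_keep("last", {}))          # ({}, {})
```

The integer branch is the "I know this column is constant" case: it raises when the claim does not hold, while the literals quietly do nothing when there is nothing to manage.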
- Parameters:
- keep : KeepType, default=‘last’
The strategy for handling the constant columns. See ‘The ‘keep’ Parameter’ section for a lengthy explanation of the ‘keep’ parameter.
- equal_nan : bool, default=True
If equal_nan is True, exclude nan-likes from computations that discover constant columns. This essentially assumes that the nan value would otherwise be equal to the mean of the non-nan values in the same column. If equal_nan is False and any value in a column is nan, do not assume that the nan value is equal to the mean of the non-nan values in the same column, thus making the column non-constant. This is in line with the normal numpy handling of nan values.
- rtol : numbers.Real, default=1e-5
The relative difference tolerance for equality. Must be a non-boolean, non-negative, real number. See numpy.allclose.
- atol : numbers.Real, default=1e-8
The absolute difference tolerance for equality. Must be a non-boolean, non-negative, real number. See numpy.allclose.
- Attributes:
- n_features_in_ : int
Number of features in the fitted data before transform.
- feature_names_in_ : numpy.ndarray[object]
The feature names seen during fitting. Only accessible if X is passed to partial_fit() or fit() as a pandas or polars dataframe that has a header.
- constant_columns_: Get the constant_columns_ attribute.
- kept_columns_: Get the kept_columns_ attribute.
- removed_columns_: Get the removed_columns_ attribute.
- column_mask_: Get the column_mask_ attribute.
Methods
- fit(X[, y]): Perform a single fitting on a dataset.
- fit_transform(X[, y]): Fit to data, then transform it.
- get_feature_names_out([input_features]): Get the feature names for the output of transform.
- get_metadata_routing(): Get metadata routing is not implemented.
- get_params([deep]): Get parameters for this instance.
- inverse_transform(X[, copy]): Revert transformed data back to its original state.
- partial_fit(X[, y]): Perform incremental fitting on one or more batches of data.
- score(X[, y]): Dummy method to spoof dask Incremental and ParallelPostFit wrappers.
- set_output([transform]): Set the output container when the transform and fit_transform methods of the transformer are called.
- set_params(**params): Set the parameters of an instance or a nested instance.
- transform(X[, copy]): Manage the constant columns in X.
See also
numpy.ndarray, pandas.DataFrame, polars.DataFrame, scipy.sparse, numpy.allclose, numpy.isclose, numpy.unique
Notes
Concerning the handling of nan-like representations. While IM accepts data in the form of numpy arrays, pandas dataframes, polars dataframes, and scipy sparse matrices/arrays, internally copies are extracted from the source data as numpy arrays (see below for more detail about how scipy sparse is handled.) After the conversion to numpy array and prior to calculating the mean and applying numpy.allclose, IM identifies any nan-like representations in the numpy array and standardizes all of them to numpy.nan. The user needs to be wary that whatever is used to indicate ‘not-a-number’ in the original data must first survive the conversion to numpy array, then be recognizable by IM as nan-like, so that IM can standardize it to numpy.nan. nan-like representations that are recognized by IM include, at least, numpy.nan, pandas.NA, None (of type None, not string ‘None’), and string representations of ‘nan’ (not case sensitive).
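The standardization step can be pictured with a small pure-Python check. The helper names `is_nan_like` and `standardize` are hypothetical, and the sketch only covers None, float nan (which includes numpy.nan), and case-insensitive 'nan' strings; pandas.NA and any other representations pybear may recognize are not handled here.

```python
import math


def is_nan_like(value):
    """Hypothetical check covering a few of the representations above."""
    if value is None:
        return True
    if isinstance(value, float):
        return math.isnan(value)
    if isinstance(value, str):
        return value.lower() == "nan"
    return False


def standardize(column):
    """Replace every recognized nan-like entry with a float nan."""
    return [float("nan") if is_nan_like(v) else v for v in column]


raw = ["a", None, "NaN", float("nan"), "nan", 1.5]
flags = [is_nan_like(v) for v in raw]
print(flags)  # [False, True, True, True, True, False]
```

Note that the string 'None' is not flagged, which matches the distinction drawn above between the None object and its string form.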
Concerning the handling of infinity. IM has no special handling for the various infinity-types, e.g., numpy.inf, -numpy.inf, float(‘inf’), float(‘-inf’), etc. This is a design decision to not force infinity values to numpy.nan. IM falls back to the native handling of these values for Python and numpy. Specifically, numpy.inf==numpy.inf and float(‘inf’)==float(‘inf’).
Concerning the handling of scipy sparse arrays. When searching for constant columns, chunks of columns are converted to dense numpy arrays one chunk at a time. Each chunk is sliced from the data in sparse form and is converted to numpy ndarray via the ‘toarray’ method. This is a compromise that causes some memory expansion but allows for efficient handling of constant column calculations that would otherwise involve implicit non-dense values.
Type Aliases
- XContainer:
numpy.ndarray | pandas.DataFrame | polars.DataFrame | ss._csr.csr_matrix | ss._csc.csc_matrix | ss._coo.coo_matrix | ss._dia.dia_matrix | ss._lil.lil_matrix | ss._dok.dok_matrix | ss._bsr.bsr_matrix | ss._csr.csr_array | ss._csc.csc_array | ss._coo.coo_array | ss._dia.dia_array | ss._lil.lil_array | ss._dok.dok_array | ss._bsr.bsr_array
- KeepType:
Literal[‘first’, ‘last’, ‘random’, ‘none’] | dict[str, Any] | int | str | Callable[[XContainer], int]
- ConstantColumnsType:
dict[int, Any]
- KeptColumnsType:
dict[int, Any]
- RemovedColumnsType:
dict[int, Any]
- ColumnMaskType:
numpy.ndarray[bool]
- NFeaturesInType:
int
- FeatureNamesInType:
numpy.ndarray[str]
Examples
>>> from pybear.preprocessing import InterceptManager as IM
>>> import numpy as np
>>> np.random.seed(99)
>>> X = np.random.randint(0, 10, (5, 5))
>>> X[:, 1] = 0
>>> X[:, 2] = 1
>>> print(X)
[[1 0 1 8 9]
 [8 0 1 5 4]
 [1 0 1 7 1]
 [1 0 1 4 7]
 [2 0 1 8 4]]
>>> trf = IM(keep='first', equal_nan=True)
>>> trf.fit(X)
InterceptManager(keep='first')
>>> out = trf.transform(X)
>>> print(out)
[[1 0 8 9]
 [8 0 5 4]
 [1 0 7 1]
 [1 0 4 7]
 [2 0 8 4]]
>>> print(trf.n_features_in_)
5
>>> print(trf.constant_columns_)
{1: np.float64(0.0), 2: np.float64(1.0)}
>>> print(trf.removed_columns_)
{2: np.float64(1.0)}
>>> print(trf.column_mask_)
[ True  True False  True  True]
- property column_mask_
Get the column_mask_ attribute.
Indicates which columns of the fitted data are kept (True) and which are removed (False) during transform. When keep is a dictionary, all original constant columns are removed and a new column of constants is appended to the data. This new column is NOT appended to column_mask_. This mask is intended to be applied to data of the same dimension as that seen during fit, and the new column of constants is a feature added after transform.
- Returns:
- column_mask_ : numpy.ndarray[bool] of shape (n_features,)
Indicates which columns of the fitted data are kept (True) and which are removed (False) during transform.
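Because the mask has one entry per fitted column, it can be applied to pre-transform data with ordinary numpy boolean indexing. The sketch below hard-codes the first three rows and the mask from the Examples section (keep='first') so that it runs without pybear installed.

```python
import numpy as np

# Data rows and mask copied from the Examples section (keep='first').
X = np.array([
    [1, 0, 1, 8, 9],
    [8, 0, 1, 5, 4],
    [1, 0, 1, 7, 1],
])
column_mask = np.array([True, True, False, True, True])

# Boolean indexing on the column axis keeps only the True columns.
X_kept = X[:, column_mask]
print(X_kept.shape)  # (3, 4)
```

The same mask can be used on any per-column array of the fitted width, such as a vector of feature names.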
- property constant_columns_
Get the constant_columns_ attribute.
A dictionary whose keys are the indices of the constant columns found during fit, indexed by their column location in the original data. The dictionary values are the constant values in those columns. For example, if a dataset has two constant columns, one at column index 3 with constant value 1 and the other at column index 10 with constant value 0, then constant_columns_ will be {3:1, 10:0}. If there are no constant columns, then constant_columns_ is an empty dictionary.
- Returns:
- constant_columns_ : dict[int, Any]
A dictionary whose keys are the indices of the constant columns found during fit, indexed by their column location in the original data.
- fit(X, y=None)
Perform a single fitting on a dataset.
Determine the constant columns in the data.
- Parameters:
- X : XContainer of shape (n_samples, n_features)
Required. The data to find constant columns in.
- y : Any, default=None
Ignored. The target for the data.
- Returns:
- self : object
The fitted InterceptManager instance.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
- X : array_like of shape (n_samples, n_features)
Required. The data.
- y : array_like of shape (n_samples, n_outputs) or (n_samples,)
Optional, default=None. Target values (None for unsupervised transformations).
- **fit_params : dict
Additional fit parameters.
- Returns:
- X_tr : array_like of shape (n_samples, n_features_new)
Transformed array.
- get_feature_names_out(input_features=None)
Get the feature names for the output of transform.
When keep is a dictionary, the appended column of constants is included in the outputted feature name vector.
- Parameters:
- input_features : Sequence[str] | None, default=None
Externally provided feature names for the fitted data, not the transformed data.
- If input_features is None:
if feature_names_in_ is defined, then feature_names_in_ is used as the input features.
if feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].
- If input_features is not None:
if feature_names_in_ is not defined, then input_features is used as the input features.
if feature_names_in_ is defined, then input_features must exactly match the features in feature_names_in_.
- Returns:
- feature_names_out : FeatureNamesInType
The feature names of the transformed data.
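The input_features rules above, followed by the mask and the optional appended column, can be sketched as two small functions. Both `resolve_input_features` and `names_out` are hypothetical helpers, the ValueError is an assumed exception type, and any further validation pybear performs (for example on the length of input_features) is not modeled.

```python
def resolve_input_features(input_features, feature_names_in, n_features_in):
    """Pick the input feature names following the rules listed above."""
    if input_features is None:
        if feature_names_in is not None:
            return list(feature_names_in)
        # No names were seen during fit, so generate x0, x1, ...
        return [f"x{i}" for i in range(n_features_in)]
    input_features = list(input_features)
    if feature_names_in is not None and input_features != list(feature_names_in):
        # Exception type is an assumption of this sketch.
        raise ValueError("input_features must match feature_names_in_")
    return input_features


def names_out(names, column_mask, appended=None):
    """Apply the column mask, then add the keep-dictionary name if given."""
    out = [name for name, kept in zip(names, column_mask) if kept]
    if appended is not None:
        out.append(appended)
    return out


names = resolve_input_features(None, None, 5)
print(names)  # ['x0', 'x1', 'x2', 'x3', 'x4']
mask = [True, True, False, True, True]
print(names_out(names, mask))  # ['x0', 'x1', 'x3', 'x4']
```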
- get_metadata_routing()
Get metadata routing is not implemented.
- get_params(deep=True)
Get parameters for this instance.
The ‘instance’ may be a pybear estimator, transformer, or a gridsearch module that wraps a nested estimator or pipeline.
- Parameters:
- deep : bool, default=True
For instances that do not have nested instances in an estimator attribute (such as estimators or transformers), this parameter is ignored and the same (full) set of parameters for the instance is returned regardless of the value of this parameter.
For instances that have nested instances (such as a GridSearch instance with a nested estimator or pipeline) in the estimator attribute, deep=False will only return the parameters for the wrapping instance. For example, a GridSearch module wrapping an estimator will only return parameters for the GridSearch instance, ignoring the parameters of the nested instance. When deep=True, this method returns the parameters of the wrapping instance as well as the parameters of the nested instance. When the nested instance is a single estimator, the full set of parameters for the single estimator are returned in addition to the parameters of the wrapping instance. If the nested object is a pipeline, the parameters of the pipeline and the parameters of each of the steps in the pipeline are returned in addition to the parameters of the wrapping instance. The estimator’s parameters are prefixed with estimator__.
- Returns:
- params : dict[str, Any]
Parameter names mapped to their values.
- inverse_transform(X, copy=None)
Revert transformed data back to its original state.
This operation cannot restore any nan-like values that may have been in the original untransformed data. set_output() does not control the output container here; the output container is always the same as the container passed.
Very little validation is possible to ensure that the passed data is valid for the current state of IM. It is only possible to ensure that the number of columns in the passed data matches the number of columns that are expected to be outputted by transform() for the current state of IM. It is up to the user to ensure the state of IM aligns with the state of the data that is to undergo inverse transform. Otherwise, the output will be nonsensical.
- Parameters:
- X : XContainer of shape (n_samples, n_transformed_features)
A transformed data set.
- copy : bool | None, default=None
Whether to make a deepcopy of X before the inverse transform.
- Returns:
- X_inv : XContainer of shape (n_samples, n_features)
Transformed data reverted to its original untransformed state.
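Conceptually, reverting the transform means putting each removed constant column back at its original index. The sketch below does this for a numpy array with numpy.insert; `reinsert_constants` is a hypothetical helper, the data is taken from the Examples section, and this is not pybear's implementation. Note that numpy.insert casts the inserted value to the dtype of the array.

```python
import numpy as np


def reinsert_constants(X_tr, removed_columns):
    """Put removed constant columns back at their original positions.

    removed_columns: dict {original column index: constant value}.
    """
    X_inv = np.asarray(X_tr)
    # Insert in ascending order so each index refers to the original layout.
    for idx in sorted(removed_columns):
        X_inv = np.insert(X_inv, idx, removed_columns[idx], axis=1)
    return X_inv


X_tr = np.array([[1, 0, 8, 9], [8, 0, 5, 4]])
X_inv = reinsert_constants(X_tr, {2: 1})
print(X_inv.tolist())  # [[1, 0, 1, 8, 9], [8, 0, 1, 5, 4]]
```

Every entry of a restored column is the stored constant, which is why nan-like values that sat in that column before the transform cannot be recovered.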
- property kept_columns_
Get the kept_columns_ attribute.
A subset of the constant_columns_ dictionary, constructed with the same format. This holds the subset of constant columns that are retained in the data. If a constant column is kept, then this contains one key:value pair from constant_columns_. If there are no constant columns or no columns are kept, then this is an empty dictionary. When keep is a dictionary, all the original constant columns are removed and a new constant column is appended to the data. That column is NOT included in kept_columns_.
- Returns:
- kept_columns_ : dict[int, Any]
A subset of the constant_columns_ dictionary, constructed with the same format.
- partial_fit(X, y=None)
Perform incremental fitting on one or more batches of data.
Determine the constant columns in the data.
- Parameters:
- X : XContainer of shape (n_samples, n_features)
Required. Data to find constant columns in.
- y : Any, default=None
Ignored. The target for the data.
- Returns:
- self : object
The fitted InterceptManager instance.
- property removed_columns_
Get the removed_columns_ attribute.
A subset of the constant_columns_ dictionary, constructed with the same format. This holds the subset of constant columns that are removed from the data. If there are no constant columns or no constant columns are removed, then this is an empty dictionary.
- Returns:
- removed_columns_ : dict[int, Any]
A subset of the constant_columns_ dictionary, constructed with the same format.
- score(X, y=None)
Dummy method to spoof dask Incremental and ParallelPostFit wrappers.
Verified that this method must be present for the dask wrappers.
- Parameters:
- X : Any
The data. Ignored.
- y : Any, default=None
The target for the data. Ignored.
- Returns:
- None
- set_output(transform=None)
Set the output container when the transform and fit_transform methods of the transformer are called.
- Parameters:
- transform : Literal[‘default’, ‘pandas’, ‘polars’] | None, default=None
Configure the output of transform and fit_transform.
‘default’: Default output format (numpy array)
‘pandas’: pandas dataframe output
‘polars’: polars dataframe output
None: The output container is the same as the given container.
- Returns:
- self : object
The transformer instance.
- set_params(**params)
Set the parameters of an instance or a nested instance.
This method works on simple estimator and transformer instances as well as on nested objects (such as GridSearch instances).
Setting the parameters of simple estimators and transformers is straightforward. Pass the exact parameter name and its value as a keyword argument to the set_params method call. Or use ** dictionary unpacking on a dictionary keyed with exact parameter names and the new parameter values as the dictionary values. Valid parameter keys can be listed with get_params().
Setting the parameters of a GridSearch instance (but not the nested instance) can be done in the same way as above. The parameters of nested instances can be updated using prefixes on the parameter names.
Simple estimators in a GridSearch instance can be updated by prefixing the estimator’s parameters with estimator__. For example, if some estimator has a ‘depth’ parameter, then setting the value of that parameter to 3 would be accomplished by passing estimator__depth=3 as a keyword argument to the set_params method call.
The parameters of a pipeline nested in a GridSearch instance can be updated using the form estimator__<pipe_parameter>. The parameters of the steps of a pipeline have the form <step>__<parameter> so that it’s also possible to update a step’s parameters through the set_params method interface. The parameters of steps in the pipeline can be updated using estimator__<step>__<parameter>.
- Parameters:
- **params : dict[str, Any]
The parameters to be updated and their new values.
- Returns:
- self : object
The instance with new parameter values.
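The prefix convention can be illustrated by separating a parameter dictionary into the wrapper's own parameters and those destined for the nested instance. `route_params` is a hypothetical helper written for this illustration, and the parameter names `cv`, `depth`, `scaler`, and `with_mean` are made-up examples; none of this is pybear's code.

```python
def route_params(params, prefix="estimator__"):
    """Split params into (own, nested) based on the estimator__ prefix."""
    own, nested = {}, {}
    for name, value in params.items():
        if name.startswith(prefix):
            # Strip the prefix; what remains is addressed to the nested object.
            nested[name[len(prefix):]] = value
        else:
            own[name] = value
    return own, nested


new_values = {
    "cv": 5,                                # wrapper parameter (hypothetical)
    "estimator__depth": 3,                  # nested estimator parameter
    "estimator__scaler__with_mean": False,  # pipeline step parameter
}
own, nested = route_params(new_values)
print(own)     # {'cv': 5}
print(nested)  # {'depth': 3, 'scaler__with_mean': False}
```

For a simple transformer such as IM there is no nested instance, so every key is an exact parameter name like keep, equal_nan, rtol, or atol.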
- transform(X, copy=None)
Manage the constant columns in X.
Apply the removal criteria given by keep to the constant columns found during fit.
- Parameters:
- X : XContainer of shape (n_samples, n_features)
Required. The data to be transformed.
- copy : bool | None, default=None
Whether to make a deepcopy of X before the transform.
- Returns:
- X_tr : XContainer of shape (n_samples, n_transformed_features)
The transformed data.