ColumnDeduplicator

class pybear.preprocessing.ColumnDeduplicator(*, keep='first', do_not_drop=None, conflict='raise', equal_nan=False, rtol=1e-05, atol=1e-08, n_jobs=None, job_size=50)

Bases: FeatureMixin, FitTransformMixin, GetParamsMixin, ReprMixin, SetOutputMixin, SetParamsMixin

ColumnDeduplicator (CDT) is a scikit-style transformer that removes duplicate columns from data, leaving behind one column out of a set of duplicate columns.

Duplicate columns are a point of concern for analysts. In many data analytics learning algorithms, such a condition can cause convergence problems, inversion problems, or other undesirable effects. The analyst is often forced to address this issue to perform a meaningful analysis of the data.

Columns with identical values within the same dataset may occur coincidentally in a sampling of data, during one-hot encoding of categorical data, or during polynomial feature expansion.

CDT is a tool that can help fix this problem. CDT identifies duplicate columns and selectively keeps one from a group of duplicates based on the configuration set by the user.

CDT provides parameters that give some flexibility to the definition of ‘equal’ for the sake of identifying duplicates, namely rtol, atol, and equal_nan.

The rtol and atol parameters provide a tolerance window whereby numerical data that are not exactly equal are considered equal if their difference falls within the tolerance. See the numpy docs for clarification of the technical details. CDT requires that rtol and atol be non-boolean, non-negative real numbers, in addition to any other restrictions enforced by numpy.allclose.
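As a rough illustration of the tolerance window, the element-wise test documented for numpy.allclose is |a - b| <= atol + rtol * |b|. The pure-Python sketch below applies that test to two columns. The helper name columns_close is hypothetical and is not part of pybear; CDT's internal comparison may differ in detail.

```python
def columns_close(col_a, col_b, rtol=1e-05, atol=1e-08):
    """Return True if every pair of values passes the numpy.allclose test.

    Hypothetical helper for illustration only; not a pybear function.
    """
    if len(col_a) != len(col_b):
        return False
    # numpy's documented element-wise criterion: |a - b| <= atol + rtol * |b|
    return all(
        abs(a - b) <= atol + rtol * abs(b)
        for a, b in zip(col_a, col_b)
    )


exact = columns_close([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
near = columns_close([1.0, 2.0, 3.0], [1.0, 2.0, 3.00001])  # inside the window
far = columns_close([1.0, 2.0, 3.0], [1.0, 2.0, 3.1])       # outside the window
```

With the default tolerances, the second pair would be treated as duplicates and the third would not.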

The equal_nan parameter controls how CDT handles nan-like representations during comparisons. If equal_nan is True, exclude from comparison any rows where one or both of the values is/are nan. If one value is nan, this essentially assumes that the nan value would otherwise be the same as its non-nan counterpart. When both are nan, this considers the nans as equal (contrary to the default numpy handling of nan, where numpy.nan does not equal numpy.nan) and will not in and of itself cause a pair of columns to be marked as unequal. If equal_nan is False and either one or both of the values in the compared pair of values is/are nan, consider the pair to be not equivalent, thus making the column pair not equal. This is in line with the normal numpy handling of nan values. See the Notes section below for a discussion on the handling of nan-like values.
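The row-by-row rules above can be sketched in plain Python. This is a simplified model that uses exact equality instead of the rtol/atol window, and the function name columns_equal is an assumption made for illustration, not pybear's internal code.

```python
import math


def columns_equal(col_a, col_b, equal_nan=False):
    """Compare two columns row by row under the equal_nan rules.

    Hypothetical helper for illustration only; not a pybear function.
    """
    for a, b in zip(col_a, col_b):
        a_nan = isinstance(a, float) and math.isnan(a)
        b_nan = isinstance(b, float) and math.isnan(b)
        if a_nan or b_nan:
            if equal_nan:
                # Rows where one or both values are nan are excluded.
                continue
            # With equal_nan=False, any nan makes the column pair unequal.
            return False
        if a != b:
            return False
    return True


nan = float("nan")
col_1 = [1.0, nan, 3.0, nan]
col_2 = [1.0, 2.0, 3.0, nan]

strict = columns_equal(col_1, col_2, equal_nan=False)
lenient = columns_equal(col_1, col_2, equal_nan=True)
```

Here strict is False because of the nan rows, while lenient is True because only the non-nan rows are compared.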

CDT has parameters that allow the user to control which column is retained out of a set of duplicates: keep, do_not_drop, and conflict.

The keep parameter sets the strategy for keeping a single representative from a set of identical columns. It accepts one of three values: ‘first’, ‘last’, or ‘random’. The default setting is ‘first’. ‘first’ retains the column left-most in the data; ‘last’ keeps the column right-most in the data; ‘random’ keeps a single randomly-selected column from the set of duplicates. All other columns in the set of duplicates are removed from the dataset.
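A minimal sketch of the three keep strategies applied to one group of duplicate column indices follows. The function choose_kept and its rng argument are hypothetical names used only for illustration.

```python
import random


def choose_kept(group, keep="first", rng=None):
    """Pick the column index to retain from one group of duplicates.

    Hypothetical helper for illustration only; not a pybear function.
    """
    ordered = sorted(group)
    if keep == "first":
        return ordered[0]   # left-most column
    if keep == "last":
        return ordered[-1]  # right-most column
    if keep == "random":
        if rng is None:
            rng = random.Random()
        return rng.choice(ordered)
    raise ValueError(f"keep must be 'first', 'last', or 'random', got {keep!r}")


group = [1, 4, 7]
kept_first = choose_kept(group, keep="first")
kept_last = choose_kept(group, keep="last")
kept_random = choose_kept(group, keep="random", rng=random.Random(0))

# Every other member of the group is removed.
removed = [idx for idx in group if idx != kept_first]
```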

The do_not_drop parameter allows the user to indicate columns not to be removed from the data. This is to be given as a list-like of integers or strings. If fitting is done with a data container that has a header (such as pandas or polars dataframes), a list of feature names may be provided. The values within must exactly match the features as named in the dataframe header (case-sensitive). Otherwise, a list of column indices must be provided. The do_not_drop instructions could conflict with the keep instructions. If a conflict arises, such as when two columns specified in do_not_drop are duplicates of each other, the behavior is managed by conflict.

conflict is ignored when do_not_drop is not passed. Otherwise, conflict accepts two possible values: ‘raise’ or ‘ignore’. This parameter instructs CDT how to deal with conflict between keep and do_not_drop. A conflict arises when the instruction in keep (‘first’, ‘last’, ‘random’) is applied and a column in do_not_drop is found to be a member of the columns to be removed. In this case, an exception is raised when conflict is ‘raise’. But when conflict is ‘ignore’, there are 2 possible scenarios:

1) when only one column in do_not_drop is among the columns to be removed, the keep instruction is overruled and the do-not-drop column is kept.

2) when multiple columns in do_not_drop are among the columns to be removed, the keep instruction (‘first’, ‘last’, ‘random’) is applied to that subset of do-not-drop columns — this may not give the same result as applying the keep instruction to the entire set of duplicate columns. This also causes at least one member of the columns not to be dropped to be removed.
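The interaction of keep, do_not_drop, and conflict for a single group of duplicates might be sketched as below. This is one reading of the rules above, not pybear's implementation: it assumes the two ‘ignore’ scenarios are distinguished by how many do_not_drop columns sit in the duplicate group, it uses ValueError as a stand-in for the unspecified exception, and it omits keep=‘random’ for brevity. The name resolve_group is hypothetical.

```python
def resolve_group(group, keep="first", do_not_drop=(), conflict="raise"):
    """Return (kept, removed) column indices for one group of duplicates.

    Hypothetical helper for illustration only; not a pybear function.
    """
    ordered = sorted(group)

    def pick(candidates):
        # Only 'first' and 'last' are sketched here.
        return candidates[0] if keep == "first" else candidates[-1]

    kept = pick(ordered)
    protected = [idx for idx in ordered if idx in do_not_drop]
    in_conflict = [idx for idx in protected if idx != kept]

    if in_conflict:
        if conflict == "raise":
            # Assumed exception type; the docs only say an exception is raised.
            raise ValueError(f"do_not_drop columns {in_conflict} would be removed")
        # conflict == 'ignore': a single protected column overrules keep;
        # with several, keep is applied to the protected subset only.
        kept = protected[0] if len(protected) == 1 else pick(protected)

    removed = [idx for idx in ordered if idx != kept]
    return kept, removed


# Scenario 1: the single protected column 5 overrules keep='first'.
one_protected = resolve_group([0, 2, 5], "first", do_not_drop=[5], conflict="ignore")
# Scenario 2: keep='last' is applied to the protected subset [0, 2].
two_protected = resolve_group([0, 2, 5], "last", do_not_drop=[0, 2], conflict="ignore")
```

In the second call column 2 is kept rather than column 5, and the protected column 0 is removed, which mirrors the caveat in scenario 2.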

The partial_fit(), fit(), and inverse_transform() methods of CDT accept data as numpy arrays, pandas dataframes, polars dataframes, and scipy sparse matrices/arrays. inverse_transform always returns output in the same type of container as passed to it. The transform() and fit_transform() methods can take all the containers listed above but can return output in a variety of containers. CDT has a set_output() method, whereby the user can set the type of output container for these two methods regardless of the type of container the data is in when passed. set_output can return transformed outputs as numpy arrays, pandas dataframes, or polars dataframes. When set_output is None, the output container is the same as the input, that is, numpy array, pandas or polars dataframe, or scipy sparse matrix/array.

The partial_fit method allows for incremental fitting of data. This makes CDT suitable for use with packages that do batch-wise fitting and transforming, such as dask_ml via the Incremental and ParallelPostFit wrappers.

There are no safeguards in place to prevent the user from changing the rtol, atol, or equal_nan parameters between calls to partial_fit. These three parameters strongly influence whether CDT classifies two columns as equal, and are therefore instrumental in dictating what CDT learns during fitting. Changes to these parameters between partial fits can drastically change CDT’s understanding of the duplicate columns in the data versus what would otherwise be learned under constant settings. pybear recommends against this practice, but it is not strictly blocked.

When performing multiple batch-wise transformations of data, that is, making sequential calls to transform, it is critical that the same column indices be kept / removed at each call. This issue manifests when keep is set to ‘random’; the random indices to keep must be the same at all calls to transform, and cannot be dynamically randomized within transform. CDT handles this by generating a static list of random indices to keep at fit time, and does not mutate this list during transform time. This list is dynamic with each call to partial_fit, and will likely change at each call. Fits performed after calls to transform will change the random indices away from those used in the previous transforms, causing CDT to perform entirely different transformations than those previously being done. CDT cannot block calls to partial_fit after transform has been called, but pybear strongly discourages this practice because the output will be nonsensical. pybear recommends doing all partial fits consecutively, then doing all transformations consecutively.
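The toy class below models why the random selection has to be frozen at fit time. It is a deliberately simplified sketch: the class name StaticRandomKeeper, its constructor arguments, and the data-free partial_fit are assumptions for illustration and do not reflect how CDT is written.

```python
import random


class StaticRandomKeeper:
    """Toy model: choose random keepers at fit time, reuse them in transform.

    Hypothetical class for illustration only; not part of pybear.
    """

    def __init__(self, duplicates, seed=None):
        self.duplicates = duplicates  # groups of duplicate column indices
        self._rng = random.Random(seed)
        self.kept_ = None

    def partial_fit(self):
        # The selection is redrawn on every fit, so it may change between fits.
        self.kept_ = [self._rng.choice(group) for group in self.duplicates]
        return self

    def transform(self, row):
        # transform never redraws; it only reads the selection made at fit time.
        dropped = {
            idx
            for group, keep in zip(self.duplicates, self.kept_)
            for idx in group
            if idx != keep
        }
        return [value for idx, value in enumerate(row) if idx not in dropped]


keeper = StaticRandomKeeper([[0, 2], [1, 4]], seed=7).partial_fit()
selection_before = list(keeper.kept_)
batch_1 = keeper.transform([1, 3, 1, 8, 3])
batch_2 = keeper.transform([8, 2, 8, 5, 2])
selection_after = list(keeper.kept_)
```

Both batches are reduced with the same column selection. Calling partial_fit again between the two transforms could redraw the selection and make the batches inconsistent.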

Parameters:

keep: Literal[‘first’, ‘last’, ‘random’], default=‘first’

The strategy for keeping a single representative from a set of identical columns. ‘first’ retains the column left-most in the data; ‘last’ keeps the column right-most in the data; ‘random’ keeps a single randomly-selected column of the set of duplicates.

do_not_drop: Sequence[int] | Sequence[str] | None, default=None

A list of columns not to be dropped. If fitting is done with a container that has a header, a list of feature names may be provided. Otherwise, a list of column indices must be given. If a conflict arises, such as when two columns specified in do_not_drop are duplicates of each other, the behavior is managed by conflict.

conflict: Literal[‘raise’, ‘ignore’], default=‘raise’

Ignored when do_not_drop is not passed. Instructs CDT how to deal with a conflict between the instructions in keep and do_not_drop. A conflict arises when the instruction in keep (‘first’, ‘last’, ‘random’) is applied and a column in do_not_drop is found to be a member of the columns to be removed. In this case, when conflict is ‘raise’, an exception is raised. When conflict is ‘ignore’, there are 2 possible scenarios:

1) when only one column in do_not_drop is among the columns to be removed, the keep instruction is overruled and the do-not-drop column is kept.

2) when multiple columns in do_not_drop are among the columns to be removed, the keep instruction (‘first’, ‘last’, ‘random’) is applied to the set of do-not-drop columns that are amongst the duplicates; this may not give the same result as applying the keep instruction to the entire set of duplicate columns. This also causes at least one member of the columns not to be dropped to be removed.

equal_nan: bool, default=False

When comparing pairs of columns row by row:

If equal_nan is True, exclude from comparison any rows where one or both of the values is/are nan. If one value is nan, this essentially assumes that the nan value would otherwise be the same as its non-nan counterpart. When both are nan, this considers the nans as equal (contrary to the default numpy handling of nan, where numpy.nan does not equal numpy.nan) and will not in and of itself cause a pair of columns to be marked as unequal. If equal_nan is False and either one or both of the values in the compared pair of values is/are nan, consider the pair to be not equivalent, thus making the column pair not equal. This is in line with the normal numpy handling of nan values.

rtol: numbers.Real, default=1e-5

The relative difference tolerance for equality. Must be a non-boolean, non-negative, real number. See numpy.allclose.

atol: numbers.Real, default=1e-8

The absolute difference tolerance for equality. Must be a non-boolean, non-negative, real number. See numpy.allclose.

n_jobs: int | None, default=None

The number of joblib Parallel jobs to use when comparing columns. The default is to use processes, but can be overridden externally using a joblib parallel_config context manager. The default value for n_jobs is None, which uses the joblib default setting. To get maximum speed benefit, pybear recommends setting this to -1, which means use all processors.

job_size: int, default=50

The number of columns to send to a joblib job. Must be an integer greater than or equal to 2. This allows the user to optimize CPU utilization for their particular circumstance. Long, thin datasets should use fewer columns, and wide, flat datasets should use more columns. Bear in mind that the columns sent to joblib jobs are deep copies of the original data, and larger job sizes increase RAM usage. Also note that joblib is only engaged when the number of columns in the data is at least 2*job_size. For example, if job_size is 10, data with 20 or more columns will be processed with joblib, data with 19 or fewer columns will be processed linearly.
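The engagement rule stated for job_size can be written as a one-line check. The function name uses_joblib is hypothetical; only the 2*job_size threshold and the minimum of 2 come from the description above.

```python
def uses_joblib(n_columns, job_size=50):
    """Return True if the data is wide enough for joblib to be engaged.

    Hypothetical helper for illustration only; not a pybear function.
    """
    if job_size < 2:
        raise ValueError("job_size must be an integer >= 2")
    # joblib is only engaged when there are at least 2 * job_size columns.
    return n_columns >= 2 * job_size


parallel_at_20 = uses_joblib(20, job_size=10)  # 20 >= 20
parallel_at_19 = uses_joblib(19, job_size=10)  # 19 < 20, processed linearly
```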

Attributes:

n_features_in_: int: The number of features in the fitted data before deduplication.
feature_names_in_: numpy.ndarray[object]: The names of the features as seen during fitting. Only accessible if X is passed to partial_fit() or fit() in a container that has a header.
duplicates_: Get the duplicates_ attribute.
removed_columns_: Get the removed_columns_ attribute.
column_mask_: Get the column_mask_ attribute.

Methods

`fit`(X[, y])	Perform a single fitting on a dataset.
`fit_transform`(X[, y])	Fit to data, then transform it.
`get_feature_names_out`([input_features])	Get the feature names for the output of transform.
`get_metadata_routing`()	Get metadata routing is not implemented.
`get_params`([deep])	Get parameters for this instance.
`inverse_transform`(X[, copy])	Revert deduplicated data back to its original state.
`partial_fit`(X[, y])	Perform incremental fitting on one or more batches of data.
`score`(X[, y])	Dummy method to spoof dask Incremental and ParallelPostFit wrappers.
`set_output`([transform])	Set the output container when the transform and fit_transform methods of the transformer are called.
`set_params`(**params)	Set the parameters of an instance or a nested instance.
`transform`(X[, copy])	Remove the duplicate columns from X.

See also

numpy.ndarray
pandas.DataFrame
polars.DataFrame
scipy.sparse
numpy.allclose
numpy.array_equal

Notes

Concerning the handling of nan-like representations. While CDT accepts data in the form of numpy arrays, pandas dataframes, polars dataframes, and scipy sparse matrices/arrays, at comparison time the two columns of data to be compared are extracted from the passed data and converted to numpy arrays. After the conversion and prior to the comparison, CDT identifies any nan-like representations in both numpy arrays and standardizes all of them to numpy.nan. The user needs to be wary that whatever is used to indicate ‘not-a-number’ in the original data must first survive the conversion to numpy array, then be recognizable by CDT as nan-like, so that CDT can standardize it to numpy.nan. nan-like representations that are recognized by CDT include, at least, numpy.nan, pandas.NA, None (of type None, not string ‘None’), and string representations of ‘nan’ (not case sensitive).
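The standardization step might look roughly like the following when restricted to the standard library. It covers None, float nan, and case-insensitive ‘nan’ strings; pandas.NA is left out only because the sketch avoids third-party imports. The function name standardize_nan is an assumption, not a pybear function.

```python
import math


def standardize_nan(column):
    """Replace recognizable nan-like entries with float('nan').

    Hypothetical helper for illustration only; not a pybear function.
    """
    cleaned = []
    for value in column:
        is_nan_like = (
            value is None
            or (isinstance(value, float) and math.isnan(value))
            or (isinstance(value, str) and value.lower() == "nan")
        )
        cleaned.append(float("nan") if is_nan_like else value)
    return cleaned


raw = [1.5, None, "NaN", "nan", float("nan"), "None", 2.0]
clean = standardize_nan(raw)

# Positions that now hold a float nan.
nan_positions = [
    i for i, v in enumerate(clean) if isinstance(v, float) and math.isnan(v)
]
```

The string ‘None’ at position 5 is left untouched, consistent with the note that only None of type None is treated as nan-like.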

Concerning the handling of infinity. CDT has no special handling for the various infinity-types, e.g., numpy.inf, -numpy.inf, float(‘inf’), float(‘-inf’), etc. This is a design decision to not force infinity values to numpy.nan. CDT falls back to the native handling of these values for Python and numpy. Specifically, numpy.inf==numpy.inf and float(‘inf’)==float(‘inf’).

Type Aliases

XContainer:: numpy.ndarray | pandas.DataFrame | polars.DataFrame | ss._csr.csr_matrix | ss._csc.csc_matrix | ss._coo.coo_matrix | ss._dia.dia_matrix | ss._lil.lil_matrix | ss._dok.dok_matrix | ss._bsr.bsr_matrix | ss._csr.csr_array | ss._csc.csc_array | ss._coo.coo_array | ss._dia.dia_array | ss._lil.lil_array | ss._dok.dok_array | ss._bsr.bsr_array
KeepType:: Literal[‘first’, ‘last’, ‘random’]
DoNotDropType:: Sequence[int] | Sequence[str] | None
ConflictType:: Literal[‘raise’, ‘ignore’]
DuplicatesType:: list[list[int]]
RemovedColumnsType:: dict[int, int]
ColumnMaskType:: numpy.ndarray[bool]
FeatureNamesInType:: numpy.ndarray[str]

Examples

>>> from pybear.preprocessing import ColumnDeduplicator as CDT
>>> import numpy as np
>>> np.random.seed(99)
>>> X = np.random.randint(0, 10, (5, 5))
>>> X[:, 2] = X[:, 0]
>>> X[:, 4] = X[:, 1]
>>> print(X)
[[1 3 1 8 3]
 [8 2 8 5 2]
 [1 7 1 7 7]
 [1 0 1 4 0]
 [2 0 2 8 0]]
>>> trf = CDT(keep='first', do_not_drop=None)
>>> trf.fit(X)
ColumnDeduplicator()
>>> out = trf.transform(X)
>>> print(out)
[[1 3 8]
 [8 2 5]
 [1 7 7]
 [1 0 4]
 [2 0 8]]
>>> print(trf.n_features_in_)
5
>>> print(trf.duplicates_)
[[0, 2], [1, 4]]
>>> print(trf.removed_columns_)
{2: 0, 4: 1}
>>> print(trf.column_mask_)
[ True  True False  True False]
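The three fitted attributes printed above are closely related. As a sketch of that relationship for keep=‘first’, the mapping and mask can be derived from the duplicate groups in plain Python. The function derive_outputs is hypothetical and only mirrors the printed values; it is not how CDT computes them.

```python
def derive_outputs(duplicates, n_features):
    """Build the removed-column mapping and keep-mask for keep='first'.

    Hypothetical helper for illustration only; not a pybear function.
    """
    removed_columns = {}
    for group in duplicates:
        kept = min(group)  # 'first' keeps the left-most column
        for idx in group:
            if idx != kept:
                removed_columns[idx] = kept  # removed index -> kept index
    column_mask = [idx not in removed_columns for idx in range(n_features)]
    return removed_columns, column_mask


removed_columns, column_mask = derive_outputs([[0, 2], [1, 4]], n_features=5)
# removed_columns → {2: 0, 4: 1}
# column_mask → [True, True, False, True, False]
```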

property column_mask_

Get the column_mask_ attribute.

Returns:

column_mask_: numpy.ndarray[bool] of shape (n_features_in_,): Indicates which columns of the fitted data are kept (True) and which are removed (False) during transform.

property duplicates_

Get the duplicates_ attribute.

Returns:

duplicates_: list[list[int]]: A list of the groups of identical columns, indicated by their zero-based column index positions in the originally fit data.

fit(X, y=None)

Perform a single fitting on a dataset.

Determine the duplicate columns in the data.

Parameters:

X: XContainer of shape (n_samples, n_features): Required. The data to remove duplicate columns from.
y: Any, default=None: Ignored. The target for the data.

Returns:

self: object: The fitted ColumnDeduplicator instance.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X: array_like of shape (n_samples, n_features): Required. The data.
y: array_like of shape (n_samples, n_outputs) or (n_samples,): Optional, default=None. Target values (None for unsupervised transformations).
**fit_params: dict: Additional fit parameters.

Returns:

X_tr: array_like of shape (n_samples, n_features_new): Transformed array.

get_feature_names_out(input_features=None)

Get the feature names for the output of transform.

Parameters:

input_features: Sequence[str] | None, default=None

Externally provided feature names for the fitted data, not the transformed data.

If input_features is None:

if feature_names_in_ is defined, then feature_names_in_ is used as the input features.

if feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].

If input_features is not None:

if feature_names_in_ is not defined, then input_features is used as the input features.

if feature_names_in_ is defined, then input_features must exactly match the features in feature_names_in_.

Returns:

feature_names_out: FeatureNamesInType: The feature names of the transformed data.
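The name-resolution rules above can be sketched as follows. The sketch assumes that the output names are the resolved input names filtered by column_mask_, which is a natural reading but is not stated explicitly here. The function feature_names_out and its arguments are hypothetical, and ValueError stands in for whatever exception the library raises on a mismatch.

```python
def feature_names_out(column_mask, input_features=None, feature_names_in=None):
    """Resolve input feature names, then keep those where the mask is True.

    Hypothetical helper for illustration only; not a pybear function.
    """
    n_features = len(column_mask)
    if input_features is None:
        if feature_names_in is not None:
            names = list(feature_names_in)
        else:
            # Generated default names: x0, x1, ..., x(n_features - 1)
            names = [f"x{i}" for i in range(n_features)]
    else:
        names = list(input_features)
        if feature_names_in is not None and names != list(feature_names_in):
            raise ValueError("input_features must match feature_names_in_")
    if len(names) != n_features:
        raise ValueError("number of names must equal number of fitted features")
    return [name for name, keep in zip(names, column_mask) if keep]


mask = [True, True, False, True, False]
default_names = feature_names_out(mask)
custom_names = feature_names_out(mask, input_features=["a", "b", "c", "d", "e"])
```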

get_metadata_routing()

Get metadata routing is not implemented.

get_params(deep=True)

Get parameters for this instance.

The ‘instance’ may be a pybear estimator, transformer, or a gridsearch module that wraps a nested estimator or pipeline.

Parameters:

deep: bool, default=True

For instances that do not have nested instances in an estimator attribute (such as estimators or transformers), this parameter is ignored and the same (full) set of parameters for the instance is returned regardless of the value of this parameter.

For instances that have nested instances (such as a GridSearch instance with a nested estimator or pipeline) in the estimator attribute, deep=False will only return the parameters for the wrapping instance. For example, a GridSearch module wrapping an estimator will only return parameters for the GridSearch instance, ignoring the parameters of the nested instance. When deep=True, this method returns the parameters of the wrapping instance as well as the parameters of the nested instance. When the nested instance is a single estimator, the full set of parameters for the single estimator are returned in addition to the parameters of the wrapping instance. If the nested object is a pipeline, the parameters of the pipeline and the parameters of each of the steps in the pipeline are returned in addition to the parameters of the wrapping instance. The estimator’s parameters are prefixed with estimator__.

Returns:

params: dict[str, Any]: Parameter names mapped to their values.

inverse_transform(X, copy=None)

Revert deduplicated data back to its original state.

This operation cannot restore any nan-like values that may have been in the original untransformed data. set_output() does not control the output container here; the output container is always the same as the one passed.

Very little validation is possible to ensure that the passed data is valid for the current state of CDT. It is only possible to ensure that the number of columns in the passed data matches the number of columns that transform() is expected to output for the current state of CDT. It is up to the user to ensure the state of CDT aligns with the state of the data that is to undergo inverse transform. Otherwise, the output will be nonsensical.

Parameters:

X: XContainer of shape (n_samples, n_transformed_features): A transformed data set.
copy: bool | None, default=None: Whether to make a deepcopy of X before the inverse transform.

Returns:

X_inv: array-like of shape (n_samples, n_features): Transformed data reverted to its original untransformed state.
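Conceptually, the inverse transform can be pictured as placing the kept columns back at their original positions and filling each removed position with a copy of its kept duplicate. The sketch below works on lists of rows; the function inverse_rows is hypothetical and uses the column_mask_ and removed_columns_ values from the Examples section as plain Python objects.

```python
def inverse_rows(rows, column_mask, removed_columns):
    """Rebuild full-width rows from deduplicated rows.

    Hypothetical helper for illustration only; not a pybear function.
    """
    kept_idxs = [i for i, keep in enumerate(column_mask) if keep]
    restored = []
    for row in rows:
        # The column count is the only thing that can be validated here.
        if len(row) != len(kept_idxs):
            raise ValueError("unexpected number of columns in transformed data")
        full = [None] * len(column_mask)
        for idx, value in zip(kept_idxs, row):
            full[idx] = value
        # Each removed column is filled with a copy of its kept duplicate.
        for removed_idx, kept_idx in removed_columns.items():
            full[removed_idx] = full[kept_idx]
        restored.append(full)
    return restored


deduplicated = [[1, 3, 8], [8, 2, 5]]
restored = inverse_rows(
    deduplicated,
    column_mask=[True, True, False, True, False],
    removed_columns={2: 0, 4: 1},
)
```

Because the removed columns are rebuilt from the kept ones, any nan-like values that existed only in a removed column cannot be recovered, as noted above.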

partial_fit(X, y=None)

Perform incremental fitting on one or more batches of data.

Determine the duplicate columns in the data.

Parameters:

X: XContainer of shape (n_samples, n_features): Required. Data to remove duplicate columns from.
y: Any, default=None: Ignored. The target for the data.

Returns:

self: object: The fitted ColumnDeduplicator instance.

property removed_columns_

Get the removed_columns_ attribute.

Returns:

removed_columns_: dict[int, int]: Dictionary whose keys are the indices of the duplicate columns removed from the original data, and whose values are the indices, also in the original data, of the respective duplicates that were kept.

score(X, y=None)

Dummy method to spoof dask Incremental and ParallelPostFit wrappers.

It has been verified that this method must be present for the dask wrappers.

Parameters:

X: Any: The data. Ignored.
y: Any: The target for the data. Ignored.

Returns:

None

set_output(transform=None)

Set the output container when the transform and fit_transform methods of the transformer are called.

Parameters:

transform: Literal[‘default’, ‘pandas’, ‘polars’] | None, default=None

Configure the output of transform and fit_transform.

‘default’: Default output format (numpy array)

‘pandas’: pandas dataframe output

‘polars’: polars dataframe output

None: The output container is the same as the given container.

Returns:

self: object: The transformer instance.

set_params(**params)

Set the parameters of an instance or a nested instance.

This method works on simple estimator and transformer instances as well as on nested objects (such as GridSearch instances).

Setting the parameters of simple estimators and transformers is straightforward. Pass the exact parameter name and its value as a keyword argument to the set_params method call. Or use ** dictionary unpacking on a dictionary keyed with exact parameter names and the new parameter values as the dictionary values. Valid parameter keys can be listed with get_params().

Setting the parameters of a GridSearch instance (but not the nested instance) can be done in the same way as above. The parameters of nested instances can be updated using prefixes on the parameter names.

Simple estimators in a GridSearch instance can be updated by prefixing the estimator’s parameters with estimator__. For example, if some estimator has a ‘depth’ parameter, then setting the value of that parameter to 3 would be accomplished by passing estimator__depth=3 as a keyword argument to the set_params method call.

The parameters of a pipeline nested in a GridSearch instance can be updated using the form estimator__<pipe_parameter>. The parameters of the steps of a pipeline have the form <step>__<parameter> so that it’s also possible to update a step’s parameters through the set_params method interface. The parameters of steps in the pipeline can be updated using estimator__<step>__<parameter>.

Parameters:

**params: dict[str, Any]: The parameters to be updated and their new values.

Returns:

self: object: The instance with new parameter values.

transform(X, copy=None)

Remove the duplicate columns from X.

Apply the criteria given by keep, do_not_drop, and conflict to the sets of duplicate columns found during fit.

Parameters:

X: XContainer of shape (n_samples, n_features): The data to be deduplicated.
copy: bool | None, default=None: Whether to make a deepcopy of X before the transform.

Returns:

X_tr: XContainer of shape (n_samples, n_features - n_removed_features): The deduplicated data.