NanStandardizer
- class pybear.preprocessing.NanStandardizer(*, new_value=nan)
Bases:
FitTransformMixin, GetParamsMixin, ReprMixin, SetParamsMixin
Convert all nan-like representations in a dataset to the same value.
Standardize different nan-likes to the same nan-like value, or change them to a non-nan-like value. The “nan-like representations” recognized by this transformer include, at least, np.nan, pandas.NA, None (the None object, not the string “None”), and string representations of “nan”.
For details, see the docs for nan_mask_numerical and nan_mask_string.
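As a rough illustration of the idea (not pybear's implementation), the sketch below detects a subset of the nan-like forms named above using only the standard library and swaps them for a single value. The helper names `is_nan_like` and `standardize` are hypothetical, pandas.NA is not covered, and the case-insensitive matching of the string "nan" is an assumption made for this example.

```python
import math


def is_nan_like(value):
    """Return True for an illustrative subset of nan-like values."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    # Assumption: string forms are matched case-insensitively after stripping.
    if isinstance(value, str) and value.strip().lower() == "nan":
        return True
    return False


def standardize(rows, new_value):
    """Return new rows with every nan-like cell replaced by new_value."""
    return [
        [new_value if is_nan_like(cell) else cell for cell in row]
        for row in rows
    ]


data = [[0, 1, float("nan")], [None, "nan", 5], ["None", "a", 2.5]]
cleaned = standardize(data, 99)
print(cleaned)
```

Note that the string "None" in the last row is left alone, matching the distinction drawn above between the None object and its string spelling.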
This transformer accepts Python built-ins, numpy arrays, pandas dataframes/series, and polars dataframes/series of shape (n_samples, n_features) or (n_samples, ) and returns the same container with the value specified by the new_value parameter in the former positions of nan-like values. Also, when passing numerical data, this transformer accepts scipy sparse matrices / arrays of all formats except dok and lil. In that case, the original container is returned with the replacements made in the data attribute.
NanStandardizer (NS) is a full-fledged scikit-style transformer with partial_fit, fit, transform, fit_transform, get_params, set_params, and score methods. The partial_fit, fit, and score methods are no-ops that are available so that NS can be incorporated into larger workflows like scikit pipelines and dask_ml wrappers. NS is technically always in a fitted state because it does not need to learn anything from data to do transformations; it knows everything it needs to know from its parameters. Tests for fittedness of an NS instance will always return True.
NS does not track the number of features in the data or the feature names. Attributes like n_features_in_, feature_names_in_ and methods like get_feature_names_out are not available. You should be able to pass any valid container at any time, regardless of what containers NS has seen previously.
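The stateless design described above can be sketched with a small stand-in class. `StatelessReplacer` is hypothetical and handles only None in flat lists; it is meant to show the pattern (no-op fitting methods, no learned attributes, any container at any time), not how NS is written.

```python
class StatelessReplacer:
    """Hypothetical transformer that depends only on its parameters."""

    def __init__(self, *, new_value=None):
        self.new_value = new_value

    def partial_fit(self, X, y=None):
        # No-op: nothing is learned from X or y.
        return None

    def fit(self, X, y=None):
        # No-op: nothing is learned from X or y.
        return None

    def score(self, X, y=None):
        # No-op: exists only so wrappers that call score() do not fail.
        return None

    def transform(self, X):
        # Usable without a prior fit because only new_value is needed.
        return [self.new_value if cell is None else cell for cell in X]


replacer = StatelessReplacer(new_value=0)
first = replacer.transform([1, None, 3])   # no fit call beforehand
replacer.fit(["ignored"])                  # accepted, changes nothing
second = replacer.transform(["a", None])   # different length and element type
print(first, second)
```

Because nothing is stored during fitting, the second call succeeds even though its input differs in length and type from the first.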
- Parameters:
- new_value : Any, default=np.nan
The new value to put in place of the nan-like values. This value is not validated; the user is free to enter whatever they like. If there is a casting problem, i.e., the receiving object (the data) cannot accept the given value, any exception is raised by the receiving object.
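What "raised by the receiving object" means can be seen with a typed container from the standard library. This sketch uses `array.array` as a stand-in for a typed data container; it does not call pybear, and the variable names are illustrative only.

```python
from array import array

# A typed container that stores C doubles only.
values = array("d", [1.0, float("nan"), 3.0])

# A compatible replacement value is accepted (the int is stored as 99.0).
values[1] = 99

# An incompatible replacement is rejected by the container itself.
error = None
try:
    values[1] = "missing"
except TypeError as exc:
    error = exc
    print(f"container refused the value: {exc}")
```

The error originates in the container's own assignment logic, which is the situation the parameter description refers to.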
Methods
fit(X[, y]): No-op one-shot fit of the NanStandardizer instance.
fit_transform(X[, y]): Fit to data, then transform it.
get_metadata_routing(): Get metadata routing is not implemented in NanStandardizer.
get_params([deep]): Get parameters for this instance.
partial_fit(X[, y]): No-op batch-wise fit of the NanStandardizer instance.
score(X[, y]): No-op score method.
set_params(**params): Set the parameters of an instance or a nested instance.
transform(X[, copy]): Map the nan-like representations in X to new values.
See also
pybear.utilities.nan_mask_numerical
pybear.utilities.nan_mask_string
numpy.nan
pandas.NA
Notes
Type Aliases
- PythonTypes:
Sequence | Sequence[Sequence]
- NumpyTypes:
numpy.ndarray
- PandasTypes:
pandas.DataFrame | pandas.Series
- PolarsTypes:
polars.DataFrame | polars.Series
- SparseTypes:
ss._csr.csr_matrix | ss._csc.csc_matrix | ss._coo.coo_matrix | ss._dia.dia_matrix | ss._bsr.bsr_matrix | ss._csr.csr_array | ss._csc.csc_array | ss._coo.coo_array | ss._dia.dia_array | ss._bsr.bsr_array
- XContainer:
PythonTypes | NumpyTypes | PandasTypes | PolarsTypes | SparseTypes
Examples
>>> from pybear.preprocessing import NanStandardizer as NS
>>> import numpy as np
>>> import pandas as pd
>>>
>>> trfm = NS(new_value=99)
>>> X1 = np.array([[0, 1, np.nan], [np.nan, 4, 5]], dtype=np.float64)
>>> trfm.fit_transform(X1)
array([[ 0.,  1., 99.],
       [99.,  4.,  5.]])
>>> trfm.set_params(new_value=pd.NA)
NanStandardizer(new_value=<NA>)
>>> X2 = pd.DataFrame([['a', 'b', np.nan], ['c', None, 'd']], dtype='O')
>>> X2.columns = list('xyz')
>>> trfm.fit_transform(X2)
   x     y     z
0  a     b  <NA>
1  c  <NA>     d
- fit(X, y=None)
No-op one-shot fit of the NanStandardizer instance.
- Parameters:
- X : XContainer of shape (n_samples, n_features) or (n_samples,)
The object for which to replace nan-like representations. Ignored.
- y : Any, default=None
The target for the data. Ignored.
- Returns:
- None
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
- X : array_like of shape (n_samples, n_features)
Required. The data.
- y : array_like of shape (n_samples, n_outputs) or (n_samples,)
Optional, default=None. Target values (None for unsupervised transformations).
- **fit_params : dict
Additional fit parameters.
- Returns:
- X_tr : array_like of shape (n_samples, n_features_new)
Transformed array.
- get_metadata_routing()
Get metadata routing is not implemented in NanStandardizer.
- get_params(deep=True)
Get parameters for this instance.
The ‘instance’ may be a pybear estimator, transformer, or a gridsearch module that wraps a nested estimator or pipeline.
- Parameters:
- deep : bool, default=True
For instances that do not have nested instances in an estimator attribute (such as estimators or transformers), this parameter is ignored and the same (full) set of parameters for the instance is returned regardless of the value of this parameter.
For instances that have nested instances (such as a GridSearch instance with a nested estimator or pipeline) in the estimator attribute, deep=False will only return the parameters for the wrapping instance. For example, a GridSearch module wrapping an estimator will only return parameters for the GridSearch instance, ignoring the parameters of the nested instance. When deep=True, this method returns the parameters of the wrapping instance as well as the parameters of the nested instance. When the nested instance is a single estimator, the full set of parameters for the single estimator are returned in addition to the parameters of the wrapping instance. If the nested object is a pipeline, the parameters of the pipeline and the parameters of each of the steps in the pipeline are returned in addition to the parameters of the wrapping instance. The estimator’s parameters are prefixed with estimator__.
- Returns:
- params : dict[str, Any]
Parameter names mapped to their values.
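The effect of deep on simple versus nested instances can be sketched with two hypothetical classes, `Leaf` (a simple estimator) and `Wrapper` (a GridSearch-like object with an `estimator` attribute). Neither is a pybear class, and the choice to list the nested object itself under the key `estimator` follows the common scikit-learn convention and is an assumption here.

```python
class Leaf:
    """Hypothetical simple estimator with two parameters."""

    def __init__(self, depth=2, rate=0.1):
        self.depth = depth
        self.rate = rate

    def get_params(self, deep=True):
        # No nested instance, so `deep` has no effect.
        return {"depth": self.depth, "rate": self.rate}


class Wrapper:
    """Hypothetical GridSearch-like wrapper holding a nested estimator."""

    def __init__(self, estimator, cv=5):
        self.estimator = estimator
        self.cv = cv

    def get_params(self, deep=True):
        params = {"cv": self.cv, "estimator": self.estimator}
        if deep:
            # Nested parameters are exposed under the estimator__ prefix.
            for name, value in self.estimator.get_params(deep=True).items():
                params[f"estimator__{name}"] = value
        return params


search = Wrapper(Leaf())
shallow_keys = sorted(search.get_params(deep=False))
deep_keys = sorted(search.get_params(deep=True))
print(shallow_keys)
print(deep_keys)
```

For a transformer like NanStandardizer, which has no nested instance, the behaviour corresponds to `Leaf`: the same parameters come back for either value of deep.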
- partial_fit(X, y=None)
No-op batch-wise fit of the NanStandardizer instance.
- Parameters:
- X : XContainer of shape (n_samples, n_features) or (n_samples,)
The object for which to replace nan-like representations. Ignored.
- y : Any, default=None
The target for the data. Ignored.
- Returns:
- None
- score(X, y=None)
No-op score method.
Present so that NS can be used with dask_ml wrappers.
- Parameters:
- X : Any
The data. Ignored.
- y : Any, default=None
The target for the data. Ignored.
- Returns:
- None
- set_params(**params)
Set the parameters of an instance or a nested instance.
This method works on simple estimator and transformer instances as well as on nested objects (such as GridSearch instances).
Setting the parameters of simple estimators and transformers is straightforward. Pass the exact parameter name and its value as a keyword argument to the set_params method call. Or use ** dictionary unpacking on a dictionary keyed with exact parameter names and the new parameter values as the dictionary values. Valid parameter keys can be listed with get_params().
Setting the parameters of a GridSearch instance (but not the nested instance) can be done in the same way as above. The parameters of nested instances can be updated using prefixes on the parameter names.
Simple estimators in a GridSearch instance can be updated by prefixing the estimator’s parameters with estimator__. For example, if some estimator has a ‘depth’ parameter, then setting the value of that parameter to 3 would be accomplished by passing estimator__depth=3 as a keyword argument to the set_params method call.
The parameters of a pipeline nested in a GridSearch instance can be updated using the form estimator__<pipe_parameter>. The parameters of the steps of a pipeline have the form <step>__<parameter> so that it’s also possible to update a step’s parameters through the set_params method interface. The parameters of steps in the pipeline can be updated using estimator__<step>__<parameter>.
- Parameters:
- **params : dict[str, Any]
The parameters to be updated and their new values.
- Returns:
- self : object
The instance with new parameter values.
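The keyword, dictionary-unpacking, and prefixed forms described above can be sketched as follows. `Leaf` and `Wrapper` are hypothetical stand-ins for a simple estimator and a GridSearch-like wrapper; the routing logic is a minimal illustration and not pybear's code.

```python
class Leaf:
    """Hypothetical simple estimator with a single parameter."""

    def __init__(self, depth=2):
        self.depth = depth

    def set_params(self, **params):
        for name, value in params.items():
            if not hasattr(self, name):
                raise ValueError(f"invalid parameter {name!r}")
            setattr(self, name, value)
        return self


class Wrapper:
    """Hypothetical GridSearch-like wrapper holding a nested estimator."""

    def __init__(self, estimator, cv=5):
        self.estimator = estimator
        self.cv = cv

    def set_params(self, **params):
        nested = {}
        for name, value in params.items():
            prefix, sep, rest = name.partition("__")
            if sep and prefix == "estimator":
                # estimator__<name> is routed to the nested instance.
                nested[rest] = value
            elif hasattr(self, name):
                setattr(self, name, value)
            else:
                raise ValueError(f"invalid parameter {name!r}")
        if nested:
            self.estimator.set_params(**nested)
        return self


search = Wrapper(Leaf())
search.set_params(cv=3)                       # plain keyword argument
search.set_params(**{"estimator__depth": 3})  # dictionary unpacking with prefix
print(search.cv, search.estimator.depth)
```

For NanStandardizer itself only the simple form applies, as in `set_params(new_value=pd.NA)` from the Examples section.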
- transform(X, copy=False)
Map the nan-like representations in X to new values.
- Parameters:
- X : XContainer of shape (n_samples, n_features) or (n_samples,)
The object for which to replace nan-like representations.
- copy : bool, default=False
Whether to replace the values directly in the original X or in a deep copy of X.
- Returns:
- X_tr : XContainer of shape (n_samples, n_features) or (n_samples,)
The data with new values in the locations previously occupied by nan-like values.
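The difference between the two copy settings can be sketched with nested lists and `copy.deepcopy`. The function `replace_none` is hypothetical and only handles None; it mirrors the in-place versus deep-copy semantics described for the copy parameter rather than reproducing the transformer.

```python
from copy import deepcopy


def replace_none(X, new_value, copy=False):
    """Replace None cells, either in X itself or in a deep copy of X."""
    target = deepcopy(X) if copy else X
    for row in target:
        for i, cell in enumerate(row):
            if cell is None:
                row[i] = new_value
    return target


original = [[1, None], [None, 4]]

# copy=True: the caller's data is left untouched.
safe = replace_none(original, 0, copy=True)
original_was_kept = original == [[1, None], [None, 4]]

# copy=False (the default): the caller's data is modified directly.
in_place = replace_none(original, 0)
print(original_was_kept, in_place is original, original)
```

With the default copy=False, callers who still need the untransformed data should keep their own copy before calling transform.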