TextNormalizer

class pybear.feature_extraction.text.TextNormalizer(*, upper=True)

Bases: FitTransformMixin, GetParamsMixin, ReprMixin, SetParamsMixin

Normalize all text in a dataset to upper-case, lower-case, or leave unchanged.

The data can only contain strings.

TextNormalizer (TN) accepts 1D list-like vectors of strings, such as Python lists, tuples, and sets, numpy vectors, pandas series, and polars series. TN also accepts 2D array-like containers such as (possibly ragged) nested 2D Python objects, numpy arrays, pandas dataframes, and polars dataframes. Feature names on dataframes are not retained. The returned objects are always built from Python lists and have the same shape as the input data.

TN is a scikit-style transformer with partial_fit, fit, transform, fit_transform, get_params, set_params, and score methods. An instance is always in a ‘fitted’ state, and checks for fittedness always return True. TN does not need to be fit because the single parameter, upper, already tells it everything it needs to do transforms. The partial_fit, fit, and score methods are no-ops; they exist to fulfill the API and to allow TN to be used in workflows such as scikit pipelines and dask_ml wrappers.
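The always-fitted design described above can be sketched with a small stand-in class. This is a minimal illustration only, not pybear's implementation: the class name StatelessCaseTransformer is hypothetical, and it handles 1D input only.

```python
class StatelessCaseTransformer:
    """Hypothetical stand-in (not pybear code) showing the no-op fit pattern."""

    def __init__(self, *, upper=True):
        self.upper = upper

    def partial_fit(self, X, y=None):
        return self  # nothing is learned from X

    def fit(self, X, y=None):
        return self  # one-shot form of the same no-op

    def score(self, X, y=None):
        return None  # present only to satisfy the API

    def transform(self, X):
        # Everything needed comes from the 'upper' parameter alone.
        if self.upper is None:
            return list(X)
        return [s.upper() if self.upper else s.lower() for s in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


trfm = StatelessCaseTransformer(upper=False)
# transform works without any prior call to fit or partial_fit
print(trfm.transform(["ThE", "cAt"]))  # ['the', 'cat']
```

Because fit and partial_fit return the instance unchanged, such a transformer can sit anywhere in a pipeline that calls fit on every step.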

Parameters:

upper (bool | None, default = True): If True, convert all text in X to upper-case; if False, convert to lower-case; if None, leave the text unchanged.
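The three settings of upper can be illustrated with plain Python. The helper normalize_text below is a hypothetical stand-in written for this sketch, not part of pybear; it assumes the input is a 1D sequence of strings or a (possibly ragged) sequence of such sequences, and it returns Python lists of the same shape.

```python
from typing import Optional, Sequence, Union

Text1D = Sequence[str]
Text2D = Sequence[Sequence[str]]


def normalize_text(X: Union[Text1D, Text2D], upper: Optional[bool] = True):
    """Hypothetical stand-in for the normalization rule; not pybear code."""

    def _case(s: str) -> str:
        if upper is None:  # None: leave the string unchanged
            return s
        return s.upper() if upper else s.lower()

    out = []
    for item in X:
        if isinstance(item, str):  # 1D input: each item is a string
            out.append(_case(item))
        else:  # 2D input: each item is a (possibly ragged) row
            out.append([_case(s) for s in item])
    return out


ragged = [["One", "Two", "Three"], ["Ichi", "Ni"]]
print(normalize_text(["ThE", "cAt"], upper=False))  # ['the', 'cat']
print(normalize_text(ragged, upper=True))   # [['ONE', 'TWO', 'THREE'], ['ICHI', 'NI']]
print(normalize_text(ragged, upper=None))   # [['One', 'Two', 'Three'], ['Ichi', 'Ni']]
```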

Methods

`fit`(X[, y])	No-op one-shot fit.
`fit_transform`(X[, y])	Fit to data, then transform it.
`get_metadata_routing`()	get_metadata_routing is not implemented in TextNormalizer.
`get_params`([deep])	Get parameters for this instance.
`partial_fit`(X[, y])	No-op batch-wise fit.
`score`(X[, y])	No-op score method.
`set_params`(**params)	Set the parameters of an instance or a nested instance.
`transform`(X[, copy])	Normalize the text in a dataset.

See also

str.lower
str.upper

Notes

Type Aliases

PythonTypes: Sequence[str] | Sequence[Sequence[str]]
NumpyTypes: numpy.ndarray[str]
PandasTypes: pandas.Series | pandas.DataFrame
PolarsTypes: polars.Series | polars.DataFrame
XContainer: PythonTypes | NumpyTypes | PandasTypes | PolarsTypes
XWipContainer: list[str] | list[list[str]]
UpperType: bool | None

Examples

>>> from pybear.feature_extraction.text import TextNormalizer as TN
>>> trfm = TN(upper=False)
>>> X1 = ['ThE', 'cAt', 'In', 'ThE', 'hAt']
>>> trfm.fit_transform(X1)
['the', 'cat', 'in', 'the', 'hat']
>>> trfm.set_params(upper=True)
TextNormalizer()
>>> X2 = [['One', 'Two', 'Three'], ['Ichi', 'Ni', 'Sa']]
>>> trfm.fit_transform(X2)
[['ONE', 'TWO', 'THREE'], ['ICHI', 'NI', 'SA']]

fit(X, y=None)

No-op one-shot fit.

Parameters:

X (XContainer): The text data to normalize.
y (Any, default = None): The target for the data. Always ignored.

Returns:

self (object): The TextNormalizer instance.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array_like of shape (n_samples, n_features)): Required. The data.
y (array_like of shape (n_samples, n_outputs) or (n_samples,)): Optional, default=None. Target values (None for unsupervised transformations).
**fit_params (dict): Additional fit parameters.

Returns:

X_tr (array_like of shape (n_samples, n_features_new)): Transformed array.

get_metadata_routing()

get_metadata_routing is not implemented in TextNormalizer.

get_params(deep=True)

Get parameters for this instance.

The ‘instance’ may be a pybear estimator, transformer, or a gridsearch module that wraps a nested estimator or pipeline.

Parameters:

deep (bool, default = True)

For instances that do not have nested instances in an estimator attribute (such as estimators or transformers), this parameter is ignored and the same (full) set of parameters for the instance is returned regardless of the value of this parameter.

For instances that have nested instances (such as a GridSearch instance with a nested estimator or pipeline) in the estimator attribute, deep=False returns only the parameters of the wrapping instance. For example, a GridSearch module wrapping an estimator returns only the parameters of the GridSearch instance and ignores the parameters of the nested instance. When deep=True, this method returns the parameters of the wrapping instance as well as the parameters of the nested instance. When the nested instance is a single estimator, the full set of parameters for that estimator is returned in addition to the parameters of the wrapping instance. If the nested object is a pipeline, the parameters of the pipeline and the parameters of each step in the pipeline are returned in addition to the parameters of the wrapping instance. The estimator’s parameters are prefixed with estimator__.

Returns:

params (dict[str, Any]): Parameter names mapped to their values.
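The effect of deep on a wrapping instance can be sketched as follows. Leaf and Wrapper are hypothetical classes invented for this illustration, not pybear objects; the sketch assumes the scikit-learn convention that the shallow result includes the nested estimator itself under the key estimator.

```python
class Leaf:
    """Hypothetical simple estimator with a single parameter."""

    def __init__(self, depth=2):
        self.depth = depth

    def get_params(self, deep=True):
        # Nothing is nested here, so 'deep' has no effect.
        return {"depth": self.depth}


class Wrapper:
    """Hypothetical GridSearch-like wrapper holding a nested estimator."""

    def __init__(self, estimator, cv=5):
        self.estimator = estimator
        self.cv = cv

    def get_params(self, deep=True):
        params = {"estimator": self.estimator, "cv": self.cv}
        if deep:
            # Nested parameters are reported with the 'estimator__' prefix.
            for name, value in self.estimator.get_params(deep=True).items():
                params[f"estimator__{name}"] = value
        return params


gs = Wrapper(Leaf(depth=3))
print(sorted(gs.get_params(deep=False)))  # ['cv', 'estimator']
print(sorted(gs.get_params(deep=True)))   # ['cv', 'estimator', 'estimator__depth']
```

For a simple transformer such as TextNormalizer, which has no estimator attribute, the first case applies and deep changes nothing.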

partial_fit(X, y=None)

No-op batch-wise fit.

Parameters:

X (XContainer): The text data to normalize.
y (Any, default = None): The target for the data. Always ignored.

Returns:

self (object): The TextNormalizer instance.

score(X, y=None)

No-op score method.

Parameters:

X (Any): The data. Ignored.
y (Any, default = None): The target for the data. Ignored.

Returns:

None

set_params(**params)

Set the parameters of an instance or a nested instance.

This method works on simple estimator and transformer instances as well as on nested objects (such as GridSearch instances).

Setting the parameters of simple estimators and transformers is straightforward. Pass the exact parameter name and its value as a keyword argument to the set_params method call. Or use ** dictionary unpacking on a dictionary keyed with exact parameter names and the new parameter values as the dictionary values. Valid parameter keys can be listed with get_params().

Setting the parameters of a GridSearch instance (but not the nested instance) can be done in the same way as above. The parameters of nested instances can be updated using prefixes on the parameter names.

Simple estimators in a GridSearch instance can be updated by prefixing the estimator’s parameters with estimator__. For example, if some estimator has a ‘depth’ parameter, then setting the value of that parameter to 3 would be accomplished by passing estimator__depth=3 as a keyword argument to the set_params method call.

The parameters of a pipeline nested in a GridSearch instance can be updated using the form estimator__<pipe_parameter>. The parameters of the steps of a pipeline have the form <step>__<parameter> so that it’s also possible to update a step’s parameters through the set_params method interface. The parameters of steps in the pipeline can be updated using estimator__<step>__<parameter>.

Parameters:

**params (dict[str, Any]): The parameters to be updated and their new values.

Returns:

self (object): The instance with new parameter values.
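The keyword, dictionary-unpacking, and estimator__ prefix forms described for set_params can be sketched with two hypothetical classes. Leaf and Wrapper are invented for this illustration and are not pybear code; the sketch routes only one level of nesting, and a pipeline would apply the same split again for <step>__<parameter>.

```python
class Leaf:
    """Hypothetical simple estimator with a single 'depth' parameter."""

    def __init__(self, depth=2):
        self.depth = depth

    def set_params(self, **params):
        for name, value in params.items():
            if not hasattr(self, name):
                raise ValueError(f"invalid parameter {name!r}")
            setattr(self, name, value)
        return self


class Wrapper:
    """Hypothetical GridSearch-like wrapper; not pybear code."""

    def __init__(self, estimator, cv=5):
        self.estimator = estimator
        self.cv = cv

    def set_params(self, **params):
        nested = {}
        for name, value in params.items():
            prefix, sep, rest = name.partition("__")
            if sep:
                # 'estimator__depth' is routed to the nested instance as 'depth'
                if prefix != "estimator":
                    raise ValueError(f"invalid parameter {name!r}")
                nested[rest] = value
            elif hasattr(self, name):
                setattr(self, name, value)
            else:
                raise ValueError(f"invalid parameter {name!r}")
        if nested:
            self.estimator.set_params(**nested)
        return self


gs = Wrapper(Leaf())
gs.set_params(cv=3, estimator__depth=3)   # keyword form
print(gs.cv, gs.estimator.depth)          # 3 3
gs.set_params(**{"estimator__depth": 4})  # dictionary-unpacking form
print(gs.estimator.depth)                 # 4
```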

transform(X, copy=False)

Normalize the text in a dataset.

Parameters:

X (XContainer): The text data to normalize.
copy (bool, default = False): Whether to normalize the text in the original X object (False) or in a deepcopy of X (True).

Returns:

X_tr (list[str] | list[list[str]]): The data with normalized text.
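The general difference between working on the original object and on a deep copy can be sketched in plain Python. The helper normalize_rows is hypothetical and assumes a mutable list of lists; it illustrates the distinction only and makes no claim about how TN treats each supported container type.

```python
from copy import deepcopy


def normalize_rows(X, upper=True, copy=False):
    """Hypothetical helper: normalize a list of lists in place or on a deep copy."""
    target = deepcopy(X) if copy else X
    for row in target:
        for i, s in enumerate(row):
            row[i] = s.upper() if upper else s.lower()
    return target


X = [["One", "Two"], ["Ichi"]]

out = normalize_rows(X, copy=True)
print(X)    # [['One', 'Two'], ['Ichi']]  (original untouched)
print(out)  # [['ONE', 'TWO'], ['ICHI']]

normalize_rows(X, copy=False)
print(X)    # [['ONE', 'TWO'], ['ICHI']]  (original modified)
```

When the input must be preserved for later steps, passing copy=True is the safer choice.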