TextStripper

class pybear.feature_extraction.text.TextStripper

Bases: FitTransformMixin, GetParamsMixin, ReprMixin, SetParamsMixin

Strip leading and trailing spaces from 1D or 2D text data.

The data can only contain strings.

TextStripper is a scikit-style transformer that has partial_fit, fit, transform, fit_transform, set_params, get_params, and score methods.

TextStripper technically does not need fitting as it already has all the information it needs to perform transforms. Checks for fittedness will always return True. The partial_fit, fit, and score methods are no-ops that allow TextStripper to be incorporated into larger workflows such as scikit pipelines or dask_ml wrappers. The get_params, set_params, transform, and fit_transform methods are fully functional, but get_params and set_params are trivial because TextStripper has no parameters and no attributes.
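
The no-op pattern described above can be sketched with a small stand-in class. This is an illustration only: StatelessStripper is a hypothetical name, it is not pybear's implementation, and it assumes stripping behaves like Python's str.strip().

```python
class StatelessStripper:
    """Hypothetical stand-in for a transformer that needs no fitting."""

    def partial_fit(self, X, y=None):
        # Nothing to learn from the batch; return self so calls can be chained.
        return self

    def fit(self, X, y=None):
        # Nothing to learn from the data; return self so calls can be chained.
        return self

    def score(self, X, y=None):
        # Present only for interface compatibility.
        return None

    def transform(self, X):
        # Assumption: stripping is equivalent to str.strip() on each string.
        return [text.strip() for text in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


trfm = StatelessStripper()
print(trfm.fit(["  a "]) is trfm)       # fit hands back the same instance
print(trfm.transform([" a ", "b  "]))   # works without any prior fit
```

Because fit and partial_fit return the instance and hold no state, such a transformer can sit in a workflow that insists on calling fit before transform.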

TextStripper can transform 1D list-likes of strings and (possibly ragged) 2D array-likes of strings. Accepted 1D containers include Python lists, tuples, and sets, numpy vectors, pandas series, and polars series. Accepted 2D containers include embedded Python sequences, numpy arrays, pandas dataframes, and polars dataframes. When passed a 1D list-like, a single Python list of strings is returned. When passed a possibly ragged 2D array-like of strings, TextStripper will return an equally sized and also possibly ragged Python list of Python lists of strings.
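
The input and output shapes described above can be sketched in plain Python. The helper strip_sketch is hypothetical and covers built-in Python containers only; it is not how pybear handles numpy, pandas, or polars objects, and it assumes stripping behaves like str.strip().

```python
def strip_sketch(X):
    """Hypothetical helper mirroring the documented 1D and 2D return shapes."""
    rows = list(X)
    # 1D case: every element is a string, so a flat list of strings is returned.
    if all(isinstance(item, str) for item in rows):
        return [item.strip() for item in rows]
    # 2D case: each element is itself a sequence of strings; rows may differ in length.
    return [[item.strip() for item in row] for row in rows]


print(strip_sketch(("  a   ", "b")))             # ['a', 'b']
print(strip_sketch([["w   ", ""], ["  y  "]]))   # [['w', ''], ['y']]
```

The ragged 2D input keeps its row lengths: a row of two strings and a row of one string come back as rows of two and one.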

TextStripper has no parameters and no attributes.

Methods

fit(X[, y])

No-op one-shot fit.

fit_transform(X[, y])

Fit to data, then transform it.

get_metadata_routing()

'get_metadata_routing' is not implemented in TextStripper.

get_params([deep])

Get parameters for this instance.

partial_fit(X[, y])

No-op batch-wise fit.

score(X[, y])

No-op score method.

set_params(**params)

Set the parameters of an instance or a nested instance.

transform(X[, copy])

Remove the leading and trailing spaces from 1D or 2D text data.

Notes

Type Aliases

PythonTypes:

Sequence[str] | Sequence[Sequence[str]]

NumpyTypes:

numpy.ndarray[str]

PandasTypes:

pandas.Series | pandas.DataFrame

PolarsTypes:

polars.Series | polars.DataFrame

XContainer:

PythonTypes | NumpyTypes | PandasTypes | PolarsTypes

XWipContainer:

list[str] | list[list[str]]

Examples

>>> from pybear.feature_extraction.text import TextStripper as TS
>>> trfm = TS()
>>> X = ['  a   ', 'b', '   c', 'd   ']
>>> trfm.fit_transform(X)
['a', 'b', 'c', 'd']
>>> X = [['w   ', '', 'x   '], ['  y  ', 'z   ']]
>>> trfm.fit_transform(X)
[['w', '', 'x'], ['y', 'z']]

fit(X, y=None)

No-op one-shot fit.

Parameters:
X : XContainer

The data whose text will be stripped of leading and trailing spaces. Ignored.

y : Any, default = None

The target for the data. Always ignored.

Returns:
self : object

The TextStripper instance.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : array_like of shape (n_samples, n_features)

Required. The data.

y : array_like of shape (n_samples, n_outputs) or (n_samples,)

Optional, default=None. Target values (None for unsupervised transformations).

**fit_params : dict

Additional fit parameters.

Returns:
X_tr : array_like of shape (n_samples, n_features_new)

Transformed array.

get_metadata_routing()

‘get_metadata_routing’ is not implemented in TextStripper.

get_params(deep=True)

Get parameters for this instance.

The ‘instance’ may be a pybear estimator, transformer, or a gridsearch module that wraps a nested estimator or pipeline.

Parameters:
deep : bool, default = True

For instances that do not have nested instances in an estimator attribute (such as estimators or transformers), this parameter is ignored and the same (full) set of parameters for the instance is returned regardless of the value of this parameter.

For instances that have nested instances (such as a GridSearch instance with a nested estimator or pipeline) in the estimator attribute, deep=False returns only the parameters of the wrapping instance. For example, a GridSearch module wrapping an estimator will return only the parameters of the GridSearch instance, ignoring the parameters of the nested instance. When deep=True, this method returns the parameters of the wrapping instance as well as the parameters of the nested instance. When the nested instance is a single estimator, the full set of parameters for that estimator is returned in addition to the parameters of the wrapping instance. If the nested object is a pipeline, the parameters of the pipeline and the parameters of each step in the pipeline are returned in addition to the parameters of the wrapping instance. The estimator’s parameters are prefixed with estimator__.
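
The effect of deep on a wrapping instance can be sketched with plain dictionaries. This is an illustration only: get_params_sketch and the parameter names cv and depth are hypothetical, the inputs are ordinary dicts rather than pybear objects, and only the estimator__ prefix is taken from the description above.

```python
def get_params_sketch(own_params, nested_params=None, deep=True):
    """Hypothetical model of how deep selects which parameters are reported."""
    params = dict(own_params)
    # With deep=True, the nested instance's parameters are added under a prefix.
    if deep and nested_params is not None:
        for name, value in nested_params.items():
            params[f"estimator__{name}"] = value
    return params


wrapper = {"cv": 5}       # hypothetical parameter of the wrapping instance
nested = {"depth": 3}     # hypothetical parameter of the nested estimator

print(get_params_sketch(wrapper, nested, deep=False))  # {'cv': 5}
print(get_params_sketch(wrapper, nested, deep=True))   # {'cv': 5, 'estimator__depth': 3}
print(get_params_sketch(wrapper, None, deep=True))     # {'cv': 5}
```

The last call corresponds to an instance with no nested estimator, where the value of deep makes no difference.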

Returns:
params : dict[str, Any]

Parameter names mapped to their values.

partial_fit(X, y=None)

No-op batch-wise fit.

Parameters:
X : XContainer

The data whose text will be stripped of leading and trailing spaces. Ignored.

y : Any, default = None

The target for the data. Always ignored.

Returns:
self : object

The TextStripper instance.

score(X, y=None)

No-op score method.

Parameters:
X : Any

The data. Ignored.

y : Any, default = None

The target for the data. Ignored.

Returns:
None

set_params(**params)

Set the parameters of an instance or a nested instance.

This method works on simple estimator and transformer instances as well as on nested objects (such as GridSearch instances).

Setting the parameters of simple estimators and transformers is straightforward. Pass the exact parameter name and its value as a keyword argument to the set_params method call. Or use ** dictionary unpacking on a dictionary keyed with exact parameter names and the new parameter values as the dictionary values. Valid parameter keys can be listed with get_params().

Setting the parameters of a GridSearch instance (but not the nested instance) can be done in the same way as above. The parameters of nested instances can be updated using prefixes on the parameter names.

Simple estimators in a GridSearch instance can be updated by prefixing the estimator’s parameters with estimator__. For example, if some estimator has a ‘depth’ parameter, then setting the value of that parameter to 3 would be accomplished by passing estimator__depth=3 as a keyword argument to the set_params method call.

The parameters of a pipeline nested in a GridSearch instance can be updated using the form estimator__<pipe_parameter>. The parameters of the steps of a pipeline have the form <step>__<parameter> so that it’s also possible to update a step’s parameters through the set_params method interface. The parameters of steps in the pipeline can be updated using estimator__<step>__<parameter>.
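
The double-underscore naming scheme above can be read as a path from the outermost instance down to one parameter. The sketch below only parses such keys; route_param and the step name stripper are hypothetical, and this is not how pybear resolves parameters internally.

```python
def route_param(key):
    """Split a set_params key into its chain of owners and the final parameter name."""
    *owners, name = key.split("__")
    return owners, name


# A parameter of the instance itself has no owners in front of it.
print(route_param("depth"))                        # ([], 'depth')
# A parameter of the estimator nested in a GridSearch instance.
print(route_param("estimator__depth"))             # (['estimator'], 'depth')
# A parameter of a step ("stripper", hypothetical) inside a nested pipeline.
print(route_param("estimator__stripper__depth"))   # (['estimator', 'stripper'], 'depth')
```

Each extra prefix adds one level of nesting, which is why step parameters inside a wrapped pipeline take the form estimator__<step>__<parameter>.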

Parameters:
**params : dict[str, Any]

The parameters to be updated and their new values.

Returns:
self : object

The instance with new parameter values.

transform(X, copy=False)

Remove the leading and trailing spaces from 1D or 2D text data.

Parameters:
X : XContainer

The data whose text will be stripped of leading and trailing spaces.

copy : bool, default = False

Whether to strip the text in the original X object or a deepcopy of X.

Returns:
X_tr : XWipContainer

The data with stripped text.
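
The general difference between working on the original object and on a deepcopy can be sketched for a nested Python list. The helper strip_2d_sketch is hypothetical and assumes str.strip() semantics; the documentation above does not say how each accepted container behaves when copy=False, so this shows the distinction only, not pybear's behavior.

```python
from copy import deepcopy


def strip_2d_sketch(X, copy=False):
    """Hypothetical helper: strip a list of lists in place or on a deep copy."""
    target = deepcopy(X) if copy else X
    for row in target:
        for i, text in enumerate(row):
            row[i] = text.strip()
    return target


original = [["  a ", "b  "]]

out = strip_2d_sketch(original, copy=True)
print(original)  # [['  a ', 'b  ']]  (untouched, the copy was stripped)
print(out)       # [['a', 'b']]

strip_2d_sketch(original, copy=False)
print(original)  # [['a', 'b']]  (the original object was modified)
```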