TextStripper
- class pybear.feature_extraction.text.TextStripper
Bases: FitTransformMixin, GetParamsMixin, ReprMixin, SetParamsMixin
Strip leading and trailing spaces from 1D or 2D text data.
The data can only contain strings.
TextStripper is a scikit-style transformer that has partial_fit, fit, transform, fit_transform, set_params, get_params, and score methods.
TextStripper does not need fitting because it already has all the information it needs to perform transforms. Checks for fittedness will always return True. The partial_fit, fit, and score methods are no-ops that allow TextStripper to be incorporated into larger workflows such as scikit-learn pipelines or dask_ml wrappers. The get_params, set_params, transform, and fit_transform methods are fully functional, but get_params and set_params are trivial because TextStripper has no parameters and no attributes.
TextStripper can transform 1D list-likes of strings and (possibly ragged) 2D array-likes of strings. Accepted 1D containers include Python lists, tuples, and sets, numpy vectors, pandas series, and polars series. Accepted 2D containers include embedded Python sequences, numpy arrays, pandas dataframes, and polars dataframes. When passed a 1D list-like, a single Python list of strings is returned. When passed a possibly ragged 2D array-like of strings, TextStripper returns an equally sized and also possibly ragged Python list of Python lists of strings.
TextStripper has no parameters and no attributes.
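The 1D and 2D handling described above can be sketched in plain Python. This is an illustration only, not pybear's implementation: `strip_text` is a hypothetical helper, and it assumes that stripping behaves like Python's `str.strip()`, which removes all leading and trailing whitespace rather than only space characters.

```python
def strip_text(X):
    """Hypothetical helper, not part of pybear: strip 1D or ragged 2D text."""
    rows = list(X)
    # 1D case: every element is a string, so return a flat list of strings.
    if all(isinstance(item, str) for item in rows):
        return [item.strip() for item in rows]
    # 2D case: each element is a sequence of strings; row lengths may differ.
    return [[item.strip() for item in row] for row in rows]


# A tuple goes in, a Python list comes out.
flat = strip_text((' a ', 'b', ' c', 'd '))
# Ragged rows keep their individual lengths.
ragged = strip_text([['w ', '', 'x '], [' y ', 'z ']])

print(flat)    # ['a', 'b', 'c', 'd']
print(ragged)  # [['w', '', 'x'], ['y', 'z']]
```

Whatever container is passed in, the result in this sketch is a Python list (or list of lists), which matches the return types stated above.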
Methods
fit(X[, y]): No-op one-shot fit.
fit_transform(X[, y]): Fit to data, then transform it.
get_metadata_routing(): 'get_metadata_routing' is not implemented in TextStripper.
get_params([deep]): Get parameters for this instance.
partial_fit(X[, y]): No-op batch-wise fit.
score(X[, y]): No-op score method.
set_params(**params): Set the parameters of an instance or a nested instance.
transform(X[, copy]): Remove the leading and trailing spaces from 1D or 2D text data.
Notes
Type Aliases
- PythonTypes:
Sequence[str] | Sequence[Sequence[str]]
- NumpyTypes:
numpy.ndarray[str]
- PandasTypes:
pandas.Series | pandas.DataFrame
- PolarsTypes:
polars.Series | polars.DataFrame
- XContainer:
PythonTypes | NumpyTypes | PandasTypes | PolarsTypes
- XWipContainer:
list[str] | list[list[str]]
Examples
>>> from pybear.feature_extraction.text import TextStripper as TS
>>> trfm = TS()
>>> X = [' a ', 'b', ' c', 'd ']
>>> trfm.fit_transform(X)
['a', 'b', 'c', 'd']
>>> X = [['w ', '', 'x '], [' y ', 'z ']]
>>> trfm.fit_transform(X)
[['w', '', 'x'], ['y', 'z']]
- fit(X, y=None)
No-op one-shot fit.
- Parameters:
- X: XContainer
The data whose text will be stripped of leading and trailing spaces. Ignored.
- y: Any, default = None
The target for the data. Always ignored.
- Returns:
- self: object
The TextStripper instance.
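A no-op fit that returns the instance is what lets a stateless transformer sit inside a chained workflow. The sketch below shows the pattern with a hypothetical stand-in class, `NoOpFitStripper`; it is not pybear code, and it assumes stripping behaves like `str.strip()`.

```python
class NoOpFitStripper:
    """Hypothetical stand-in for a stateless transformer; not pybear code."""

    def fit(self, X, y=None):
        # Nothing is learned from X or y; returning self allows chaining.
        return self

    def transform(self, X):
        # All the information needed is already available without fitting.
        return [s.strip() for s in X]


stripper = NoOpFitStripper()
same = stripper.fit(["ignored"])
chained = stripper.fit([" x "]).transform([" a ", "b "])

print(same is stripper)  # True
print(chained)           # ['a', 'b']
```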
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
- X: array_like of shape (n_samples, n_features)
Required. The data.
- y: array_like of shape (n_samples, n_outputs) or (n_samples,)
Optional, default=None. Target values (None for unsupervised transformations).
- **fit_params: dict
Additional fit parameters.
- Returns:
- X_tr: array_like of shape (n_samples, n_features_new)
Transformed array.
- get_metadata_routing()
‘get_metadata_routing’ is not implemented in TextStripper.
- get_params(deep=True)
Get parameters for this instance.
The ‘instance’ may be a pybear estimator, transformer, or a gridsearch module that wraps a nested estimator or pipeline.
- Parameters:
- deep: bool, default = True
For instances that do not have nested instances in an estimator attribute (such as estimators or transformers), this parameter is ignored and the same (full) set of parameters for the instance is returned regardless of the value of this parameter.
For instances that have nested instances (such as a GridSearch instance with a nested estimator or pipeline) in the estimator attribute, deep=False will only return the parameters for the wrapping instance. For example, a GridSearch module wrapping an estimator will only return parameters for the GridSearch instance, ignoring the parameters of the nested instance. When deep=True, this method returns the parameters of the wrapping instance as well as the parameters of the nested instance. When the nested instance is a single estimator, the full set of parameters for the single estimator are returned in addition to the parameters of the wrapping instance. If the nested object is a pipeline, the parameters of the pipeline and the parameters of each of the steps in the pipeline are returned in addition to the parameters of the wrapping instance. The estimator’s parameters are prefixed with estimator__.
- Returns:
- params: dict[str, Any]
Parameter names mapped to their values.
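The effect of `deep` on a wrapping instance can be illustrated with two hypothetical classes, `Leaf` and `Wrapper`. They are not pybear classes; the `depth` and `cv` parameters are invented for the example, and listing the nested object itself under an `estimator` key is an assumption borrowed from the scikit-learn convention.

```python
class Leaf:
    """Hypothetical simple estimator with one parameter."""

    def __init__(self, depth=2):
        self.depth = depth

    def get_params(self, deep=True):
        # No nested instance, so `deep` makes no difference here.
        return {"depth": self.depth}


class Wrapper:
    """Hypothetical wrapper holding a nested instance in `estimator`."""

    def __init__(self, estimator, cv=5):
        self.estimator = estimator
        self.cv = cv

    def get_params(self, deep=True):
        params = {"estimator": self.estimator, "cv": self.cv}
        if deep:
            # Nested parameters are reported with the estimator__ prefix.
            for name, value in self.estimator.get_params(deep=True).items():
                params[f"estimator__{name}"] = value
        return params


wrapper = Wrapper(Leaf(depth=4), cv=3)
shallow_keys = sorted(wrapper.get_params(deep=False))
deep_keys = sorted(wrapper.get_params(deep=True))

print(shallow_keys)  # ['cv', 'estimator']
print(deep_keys)     # ['cv', 'estimator', 'estimator__depth']
```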
- partial_fit(X, y=None)
No-op batch-wise fit.
- Parameters:
- X: XContainer
The data whose text will be stripped of leading and trailing spaces. Ignored.
- y: Any, default = None
The target for the data. Always ignored.
- Returns:
- self: object
The TextStripper instance.
- score(X, y=None)
No-op score method.
- Parameters:
- X: Any
The data. Ignored.
- y: Any, default = None
The target for the data. Ignored.
- Returns:
- None
- set_params(**params)
Set the parameters of an instance or a nested instance.
This method works on simple estimator and transformer instances as well as on nested objects (such as GridSearch instances).
Setting the parameters of simple estimators and transformers is straightforward. Pass the exact parameter name and its value as a keyword argument to the set_params method call. Or use ** dictionary unpacking on a dictionary keyed with exact parameter names and the new parameter values as the dictionary values. Valid parameter keys can be listed with get_params().
Setting the parameters of a GridSearch instance (but not the nested instance) can be done in the same way as above. The parameters of nested instances can be updated using prefixes on the parameter names.
Simple estimators in a GridSearch instance can be updated by prefixing the estimator’s parameters with estimator__. For example, if some estimator has a ‘depth’ parameter, then setting the value of that parameter to 3 would be accomplished by passing estimator__depth=3 as a keyword argument to the set_params method call.
The parameters of a pipeline nested in a GridSearch instance can be updated using the form estimator__<pipe_parameter>. The parameters of the steps of a pipeline have the form <step>__<parameter> so that it’s also possible to update a step’s parameters through the set_params method interface. The parameters of steps in the pipeline can be updated using estimator__<step>__<parameter>.
- Parameters:
- **params: dict[str, Any]
The parameters to be updated and their new values.
- Returns:
- self: object
The instance with new parameter values.
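The double-underscore prefix scheme described for set_params can be pictured as splitting each key at its first `__`. The helper below, `split_nested_params`, is hypothetical and is not how pybear is known to implement routing; the parameter names `cv`, `depth`, `scaler`, and `with_mean` are invented for illustration.

```python
def split_nested_params(params):
    """Hypothetical helper: separate top-level keys from prefixed keys."""
    own, nested = {}, {}
    for key, value in params.items():
        # Split on the first "__" only, so deeper prefixes stay intact.
        prefix, sep, rest = key.partition("__")
        if sep:
            nested.setdefault(prefix, {})[rest] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params(
    {"cv": 3, "estimator__depth": 3, "estimator__scaler__with_mean": False}
)

print(own)     # {'cv': 3}
print(nested)  # {'estimator': {'depth': 3, 'scaler__with_mean': False}}
```

In this sketch the remainder `scaler__with_mean` would be handed to the nested pipeline, which could split it again to reach the step's parameter.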
- transform(X, copy=False)
Remove the leading and trailing spaces from 1D or 2D text data.
- Parameters:
- X: XContainer
The data whose text will be stripped of leading and trailing spaces.
- copy: bool, default = False
Whether to strip the text in the original X object or a deepcopy of X.
- Returns:
- X_tr: XWipContainer
The data with stripped text.
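The difference between operating on the original object and on a deepcopy can be sketched for the simplest case, a mutable list of lists. `strip_rows` is a hypothetical function, not pybear's transform; how pybear treats immutable or non-Python containers when copy=False is not covered here, and stripping is assumed to behave like `str.strip()`.

```python
from copy import deepcopy


def strip_rows(X, copy=False):
    """Hypothetical sketch of the copy flag for a list of lists of strings."""
    # With copy=True the caller's object is left alone.
    target = deepcopy(X) if copy else X
    for row in target:
        for i, text in enumerate(row):
            row[i] = text.strip()
    return target


original = [[" a ", "b "], [" c"]]

stripped_copy = strip_rows(original, copy=True)
# Snapshot of the input after the copy=True call, before any in-place call.
after_copy = [list(row) for row in original]

stripped_inplace = strip_rows(original, copy=False)

print(after_copy)                    # [[' a ', 'b '], [' c']]
print(stripped_copy)                 # [['a', 'b'], ['c']]
print(stripped_inplace is original)  # True
```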