TextRemover

class pybear.feature_extraction.text.TextRemover(*, remove=None, case_sensitive=True, remove_empty_rows=False, flags=None)

Bases: FitTransformMixin, GetParamsMixin, ReprMixin, SetParamsMixin

Remove full strings (not substrings) from text data.

Identify full strings to remove by literal string equality or by regular expression fullmatch. Remove any and all matches completely from the data.

One particularly useful application is to take out empty or gibberish strings in data read in from a file. Another is to remove strings that have become empty or have only non-alphanumeric characters after replacing values (see pybear TextReplacer).

TextRemover (TR) always matches against entire strings; it never does partial matches. Tell TR which strings to remove by passing literal strings or regular expressions in re.compile objects to remove. Pass literal strings or re.compile objects that are intended to match entire strings. DO NOT PASS A REGEX PATTERN AS A LITERAL STRING. YOU WILL NOT GET THE CORRECT RESULT. ALWAYS PASS REGEX PATTERNS IN A re.compile OBJECT. DO NOT ESCAPE LITERAL STRINGS, TextRemover WILL DO THAT FOR YOU. If you don’t know what any of that means, then you don’t need to worry about it.
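The difference between a literal string and a compiled pattern can be sketched with the standard library alone. The helper is_full_match below is hypothetical and is not pybear code; it only assumes, as described above, that literal strings are escaped before a full-string match is attempted.

```python
import re


def is_full_match(pattern, string):
    """Hypothetical helper, not pybear's implementation.

    Literal strings are escaped so regex metacharacters lose their
    special meaning; compiled patterns are used exactly as given.
    """
    if isinstance(pattern, str):
        pattern = re.compile(re.escape(pattern))
    return pattern.fullmatch(string) is not None


# A literal string only matches an identical full string.
print(is_full_match("cat", "cat"))      # True
print(is_full_match("cat", "concat"))   # False: no partial matches

# "a.c" as a literal string: the dot is just a dot.
print(is_full_match("a.c", "abc"))      # False
# The same text in a compiled pattern: the dot is a wildcard.
print(is_full_match(re.compile("a.c"), "abc"))  # True
```

This is why a regex pattern passed as a plain string does not behave like a regex: under the escaping assumption it is compared character for character.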

TR searches are case-sensitive by default but can be made case-insensitive. You can set this behavior globally via the case_sensitive parameter. If you know regex, you can also put flags in the re.compile objects passed to remove, or set flags globally via flags. Case-sensitivity is generally controlled by case_sensitive, but IGNORECASE flags passed via re.compile objects or flags always overrule case_sensitive.
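The precedence rule can be illustrated with a small standard-library sketch. The functions resolve_flags and search_ignores_case are hypothetical names invented for this illustration; how pybear combines these settings internally is an assumption here, and only the outcome (IGNORECASE wins over case_sensitive=True) is taken from the description above.

```python
import re


def resolve_flags(pattern, case_sensitive=True, flags=None):
    """Hypothetical helper: merge every source of flags into one integer."""
    combined = 0 if flags is None else int(flags)
    if isinstance(pattern, re.Pattern):
        # Flags baked into a compiled pattern are carried along.
        combined |= int(pattern.flags)
    if not case_sensitive:
        combined |= int(re.IGNORECASE)
    return combined


def search_ignores_case(pattern, case_sensitive=True, flags=None):
    """True if the resulting search would be case-insensitive."""
    return bool(resolve_flags(pattern, case_sensitive, flags) & re.IGNORECASE)


print(search_ignores_case("one"))                        # False
print(search_ignores_case("one", case_sensitive=False))  # True
# IGNORECASE from either source overrules case_sensitive=True.
print(search_ignores_case("one", flags=re.IGNORECASE))            # True
print(search_ignores_case(re.compile("one", re.IGNORECASE)))      # True
```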

So why not just use ordinary literal string matching or re.fullmatch to find and remove strings? Unlike those, TR accepts multiple patterns to search for and remove. TR can remove multiple strings in one call when a tuple of literal strings and/or re.compile objects is passed to remove. And if you need fine-grained control over certain rows of data, remove, case_sensitive, and/or flags can be passed as lists that give specific instructions for individual rows. When any of these is passed as a list, the number of entries in the list must equal the number of rows in the data. What may be put in the lists is dictated by the allowed global values for each respective parameter.
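The row-level semantics can be sketched in plain Python. The functions to_patterns and remove_per_row are hypothetical stand-ins, not pybear's implementation; they assume 2D data given as a list of lists and follow the documented rule that a None entry leaves its row untouched.

```python
import re


def to_patterns(entry):
    """Normalize one hypothetical `remove` entry to a tuple of compiled patterns."""
    if entry is None:
        return ()
    if isinstance(entry, (str, re.Pattern)):
        entry = (entry,)
    # Literal strings are escaped; compiled patterns pass through unchanged.
    return tuple(
        re.compile(re.escape(p)) if isinstance(p, str) else p for p in entry
    )


def remove_per_row(X, remove):
    """Apply remove[i] to row i of 2D data; a None entry skips that row."""
    if len(remove) != len(X):
        raise ValueError("a list-type 'remove' needs one entry per row")
    out = []
    for row, entry in zip(X, remove):
        patterns = to_patterns(entry)
        out.append([s for s in row if not any(p.fullmatch(s) for p in patterns)])
    return out


X = [["a", "b", ""], ["a", "b", ""], ["a", "b", ""]]
remove = ["a", None, ("", re.compile("[ab]"))]
print(remove_per_row(X, remove))
# [['b', ''], ['a', 'b', ''], []]
```

Each row gets its own instruction: a single literal, nothing at all, and a tuple mixing a literal with a compiled pattern.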

TextRemover is a full-fledged scikit-style transformer. It has fully functional get_params, set_params, transform, and fit_transform methods. It also has no-op partial_fit and fit methods to allow for integration into larger workflows, like scikit-learn pipelines. Technically TR does not need to be fit and is always in a fitted state (any ‘is_fitted’ check of an instance will always return True) because TR knows everything it needs to know to transform data from its parameters. It also has a no-op score method to allow wrapping by dask_ml wrappers.

Accepts 1D list-likes and (possibly ragged) 2D array-likes of strings. Accepted 1D containers include Python lists, tuples, and sets, numpy vectors, pandas series, and polars series. Accepted 2D containers include embedded Python sequences, numpy arrays, pandas dataframes, and polars dataframes. When passed a 1D list-like, returns a Python list of strings. When passed a 2D array-like, returns a Python list of Python lists of strings. If you pass your data as a dataframe with feature names, the feature names are not preserved.

By definition, a row is removed from 1D data when an entire string is removed. This behavior is unavoidable; in this case TextRemover must mutate along the example axis. However, the user can control this behavior for 2D containers. remove_empty_rows is a boolean that tells TR whether to remove any rows that have become (or were given as) empty after removing unwanted strings. If True, TR will remove any empty rows from the data and those rows will be indicated in the row_support_ mask by a False in their respective positions. It is possible that empty 1D lists are returned. If False, empty rows are not removed from the data.
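For 2D data, the effect of remove_empty_rows can be sketched as a post-processing step. The function drop_empty_rows is a hypothetical illustration, not pybear code, and it returns the mask as a plain Python list rather than the numpy vector that TR exposes.

```python
def drop_empty_rows(X, remove_empty_rows=False):
    """Hypothetical sketch: optionally drop empty rows from 2D data.

    Returns the (possibly shortened) data and a boolean mask with one
    entry per input row: True for kept rows, False for removed rows.
    """
    row_support = [True] * len(X)
    if remove_empty_rows:
        row_support = [len(row) > 0 for row in X]
        X = [row for row, keep in zip(X, row_support) if keep]
    return X, row_support


# Data as it might look after unwanted strings were removed.
after_removal = [["One"], [], ["Three"]]

print(drop_empty_rows(after_removal, remove_empty_rows=False))
# ([['One'], [], ['Three']], [True, True, True])
print(drop_empty_rows(after_removal, remove_empty_rows=True))
# ([['One'], ['Three']], [True, False, True])
```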

TextRemover instances that have undergone a transform operation expose 2 attributes. n_rows_ is the number of rows in the data last passed to transform, which may be different from the number of rows returned. row_support_ is a boolean numpy vector indicating which rows were kept (True) and which were removed (False) from the data during the last transform. This mask can be applied to a target for the data (if any) so that the rows in the target match the rows in the data after transform. The length of row_support_ always equals n_rows_. Neither of these attributes is cumulative; they only reflect the last dataset passed to transform().
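Applying the mask to a target might look like the following sketch. The row_support list here is a hand-written stand-in for the row_support_ attribute after transforming the 1D data [' ', 'One', 'Two', '', 'Three', ' '] with remove=(' ', ''), and y is a made-up target used only for illustration.

```python
from itertools import compress

# Stand-in for trfm.row_support_ (assumed values, written out by hand).
row_support = [False, True, True, False, True, False]
# Hypothetical target with one entry per original row.
y = [0, 1, 0, 1, 1, 0]

# The mask has one entry per row passed to transform.
assert len(row_support) == len(y)

# Keep only the target entries whose rows survived the transform.
y_kept = list(compress(y, row_support))
print(y_kept)  # [1, 0, 1]
```

If the target is a numpy array, boolean indexing with the mask (y[row_support]) gives the same selection.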

Parameters:
remove : RemoveType, default = None

The literal strings or regex patterns to remove from the data. When passed as a single literal string or re.compile object, that is applied to every string in the data, and every full string that matches exactly will be removed. When passed as a Python tuple of character strings and/or re.compile objects, each pattern is searched against all the strings in the data and any exact matches are removed. If passed as a list, the number of entries must match the number of rows in X, and each string, re.compile, or tuple is applied to the corresponding row in the data. If any entry in the list is None, the corresponding row in the data is ignored.

case_sensitive : CaseSensitiveType, default = True

Global setting for case-sensitivity. If True (the default) then all searches are case-sensitive. If False, TR will look for matches regardless of case. This setting is overridden when IGNORECASE flags are passed in re.compile objects or to flags.

remove_empty_rows : bool, default = False

Whether to remove rows that become empty when data is passed in a 2D container. This does not apply to 1D data. If True, TR will remove any empty rows from the data and each removed row will be indicated in the row_support_ mask by a False in its position. If False, empty rows are not removed from the data.

flags : FlagsType, default = None

The flags value(s) for the full string searches. Internally, TR does all its searching for strings with re.fullmatch, therefore flags can be passed whether you are searching for literal strings or regex patterns. If you do not know regular expressions, then you do not need to worry about this parameter. If None, the default flags for re.fullmatch are used globally. If a single flags object, that is applied globally. If passed as a list, the number of entries must match the number of rows in the data. Flags objects and Nones in the list follow the same rules stated above, but at the row level. If IGNORECASE is passed here as a global setting or in a list it overrides the case_sensitive ‘True’ setting.

Attributes:
n_rows_

Get the n_rows_ attribute.

row_support_

Get the row_support_ attribute.

Methods

fit(X[, y])

One-shot no-op fit operation.

fit_transform(X[, y])

Fit to data, then transform it.

get_metadata_routing()

Metadata routing is not implemented in TextRemover.

get_params([deep])

Get parameters for this instance.

partial_fit(X[, y])

Batch-wise no-op fit operation.

score(X[, y])

No-op score method to allow wrapping by dask_ml wrappers.

set_params(**params)

Set the parameters of an instance or a nested instance.

transform(X[, copy])

Remove unwanted strings from the data.

See also

list.remove
re.fullmatch

Notes

Type Aliases

PythonTypes:

Sequence[str] | Sequence[Sequence[str]] | set[str]

NumpyTypes:

numpy.ndarray[str]

PandasTypes:

pandas.Series | pandas.DataFrame

PolarsTypes:

polars.Series | polars.DataFrame

XContainer:

PythonTypes | NumpyTypes | PandasTypes | PolarsTypes

XWipContainer:

list[str] | list[list[str]]

PatternType:

None | str | re.Pattern[str] | tuple[str | re.Pattern[str], ...]

RemoveType:

PatternType | list[PatternType]

WipPatternType:

None | re.Pattern[str] | tuple[re.Pattern[str], ...]

WipRemoveType:

WipPatternType | list[WipPatternType]

CaseSensitiveType:

bool | list[bool | None]

RemoveEmptyRowsType:

bool

FlagType:

None | int

FlagsType:

FlagType | list[FlagType]

RowSupportType:

numpy.ndarray[bool]

Examples

>>> import re
>>> from pybear.feature_extraction.text import TextRemover as TR
>>> trfm = TR(remove=(' ', ''))
>>> X = [' ', 'One', 'Two', '', 'Three', ' ']
>>> trfm.fit_transform(X)
['One', 'Two', 'Three']
>>> trfm.set_params(**{'remove': re.compile('[bcdei]')})
TextRemover(remove=re.compile('[bcdei]'))
>>> X = [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]
>>> trfm.fit_transform(X)
[['a'], ['f'], ['g', 'h']]

fit(X, y=None)

One-shot no-op fit operation.

Parameters:
X : XContainer

The data. Ignored.

y : Any, default=None

The target for the data. Always ignored.

Returns:
self : object

The TextRemover instance.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : array_like of shape (n_samples, n_features)

Required. The data.

y : array_like of shape (n_samples, n_outputs) or (n_samples,)

Optional, default=None. Target values (None for unsupervised transformations).

**fit_params : dict

Additional fit parameters.

Returns:
X_tr : array_like of shape (n_samples, n_features_new)

Transformed array.

get_metadata_routing()

Metadata routing is not implemented in TextRemover.

get_params(deep=True)

Get parameters for this instance.

The ‘instance’ may be a pybear estimator, transformer, or a gridsearch module that wraps a nested estimator or pipeline.

Parameters:
deep : bool, default = True

For instances that do not have nested instances in an estimator attribute (such as estimators or transformers), this parameter is ignored and the same (full) set of parameters for the instance is returned regardless of the value of this parameter.

For instances that have nested instances (such as a GridSearch instance with a nested estimator or pipeline) in the estimator attribute, deep=False will only return the parameters for the wrapping instance. For example, a GridSearch module wrapping an estimator will only return parameters for the GridSearch instance, ignoring the parameters of the nested instance. When deep=True, this method returns the parameters of the wrapping instance as well as the parameters of the nested instance. When the nested instance is a single estimator, the full set of parameters for the single estimator are returned in addition to the parameters of the wrapping instance. If the nested object is a pipeline, the parameters of the pipeline and the parameters of each of the steps in the pipeline are returned in addition to the parameters of the wrapping instance. The estimator’s parameters are prefixed with estimator__.

Returns:
params : dict[str, Any]

Parameter names mapped to their values.

property n_rows_

Get the n_rows_ attribute.

The number of rows in the data passed to transform(). This reflects the data that is passed, not the data that is returned, which may not have the same number of rows as the original data. Only available if a transform has been performed. It reflects only the results of the last transform done; it is not cumulative.

Returns:
n_rows_ : int

The number of rows in the data passed to transform.

partial_fit(X, y=None)

Batch-wise no-op fit operation.

Parameters:
X : XContainer

The data. Ignored.

y : Any, default=None

The target for the data. Always ignored.

Returns:
self : object

The TextRemover instance.

property row_support_

Get the row_support_ attribute.

A boolean vector indicating which rows were kept (True) or removed (False) during the transform process. Only available if a transform has been performed. It reflects only the results of the last transform done; it is not cumulative.

Returns:
row_support_ : numpy.ndarray[bool]

A boolean vector indicating which rows were kept in the data during the transform process.

score(X, y=None)

No-op score method to allow wrapping by dask_ml wrappers.

Parameters:
X : XContainer

The data. Ignored.

y : Any, default=None

The target for the data. Ignored.

Returns:
None

set_params(**params)

Set the parameters of an instance or a nested instance.

This method works on simple estimator and transformer instances as well as on nested objects (such as GridSearch instances).

Setting the parameters of simple estimators and transformers is straightforward. Pass the exact parameter name and its value as a keyword argument to the set_params method call. Or use ** dictionary unpacking on a dictionary keyed with exact parameter names and the new parameter values as the dictionary values. Valid parameter keys can be listed with get_params().

Setting the parameters of a GridSearch instance (but not the nested instance) can be done in the same way as above. The parameters of nested instances can be updated using prefixes on the parameter names.

Simple estimators in a GridSearch instance can be updated by prefixing the estimator’s parameters with estimator__. For example, if some estimator has a ‘depth’ parameter, then setting the value of that parameter to 3 would be accomplished by passing estimator__depth=3 as a keyword argument to the set_params method call.

The parameters of a pipeline nested in a GridSearch instance can be updated using the form estimator__<pipe_parameter>. The parameters of the steps of a pipeline have the form <step>__<parameter> so that it’s also possible to update a step’s parameters through the set_params method interface. The parameters of steps in the pipeline can be updated using estimator__<step>__<parameter>.
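The double-underscore naming scheme can be illustrated with a small sketch that splits parameter names at the first delimiter. The function route_params and the parameter names cv, depth, clf, and C are hypothetical and chosen only for illustration; this is not how pybear is known to implement set_params.

```python
def route_params(params):
    """Hypothetical sketch: group keyword arguments by the object they address.

    Names without '__' belong to the wrapping instance. Names with '__'
    are split at the first delimiter: the head names the nested object,
    the tail is passed down to it (and may itself contain '__').
    """
    own, nested = {}, {}
    for name, value in params.items():
        head, sep, tail = name.partition("__")
        if sep:
            nested.setdefault(head, {})[tail] = value
        else:
            own[name] = value
    return own, nested


own, nested = route_params(
    {"cv": 5, "estimator__depth": 3, "estimator__clf__C": 1.0}
)
print(own)     # {'cv': 5}
print(nested)  # {'estimator': {'depth': 3, 'clf__C': 1.0}}
```

The tail clf__C keeps its own prefix, so the nested pipeline can route it to the clf step in the same way.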

Parameters:
**params : dict[str, Any]

The parameters to be updated and their new values.

Returns:
self : object

The instance with new parameter values.

transform(X, copy=False)

Remove unwanted strings from the data.

Parameters:
X : XContainer

The data.

copy : bool, default=False

Whether to remove unwanted strings directly from the original X or from a deepcopy of the original X.

Returns:
X : XWipContainer

The data with unwanted strings removed.