StopRemover

class pybear.feature_extraction.text.StopRemover(match_callable=None, remove_empty_rows=True, exempt=None, supplemental=None, n_jobs=-1)

Bases: FitTransformMixin, GetParamsMixin, ReprMixin, SetParamsMixin

Remove stop words from text data.

StopRemover (SR) uses the stop words defined in the pybear Lexicon stop_words_ attribute to locate and remove stop words from a 2D array-like body of text data.

pybear aims to deliver robust and predictable output for your inputs, and recommends that SR be used only on highly processed data. SR should not be the first (or even an early) step in a complex text wrangling workflow; it should be one of the last. For example, the final steps of a workflow using pybear text wrangling modules could be:

… > TextLookup > StopRemover > NGramMerger > TextJustifier.

To this end, SR accepts only tokenized text in 2D array-like format. Managing the contingencies of removing stop words from long 1D text strings, given individual user preferences about adjoining characters and the effect on white space, is intractable. Therefore, pybear requires that the data be processed at least to the point where you know what your separators are and can split your data into tokens. If you have 1D data and know your separators as either string literals or regex patterns, use pybear TextSplitter to convert your data to 2D before using SR.

Accepted 2D objects include Python lists/tuples of lists/tuples, numpy arrays, pandas dataframes, and polars dataframes. Results are always returned as a Python list of lists of strings.
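To make the expected 2D shape concrete, the following is a minimal plain-Python sketch of turning 1D strings into rows of tokens. It is a stand-in for illustration only and does not use or reproduce pybear TextSplitter; the whitespace separator pattern and the sample strings are assumptions about the data.

```python
import re

# 1D data: one string per row (sample text, for illustration only).
X_1d = [
    "said the cat in the hat",
    "to the fish in the pot",
]

# Assumed separator for this data: one or more whitespace characters.
SEP = re.compile(r"\s+")

# Each string becomes a list of tokens, giving a (possibly ragged) 2D list.
X_2d = [SEP.split(line.strip()) for line in X_1d]
```

The resulting list of lists of strings is one of the 2D formats that SR accepts.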

The default text comparer in SR does a case-insensitive, exact character-to-character match of each token in the text body against the stop words, and removes a word from the text when there is a match. To override the default case-insensitive behavior, pass a callable to the match_callable parameter. The callable can contain any logic, as long as it accepts two strings and returns a boolean. If you would like to do your stop word matching with regular expressions, put that logic in your callable.
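As a hedged sketch of what such a callable might look like, the two functions below follow the documented signature (word from the text first, stop word second, boolean out). The function names are hypothetical, and the case-insensitive version only approximates the default behavior described above; it is not pybear's internal comparer.

```python
def case_sensitive_match(word: str, stop_word: str) -> bool:
    """Match only when the token equals the stop word exactly, including case."""
    return word == stop_word


def case_insensitive_match(word: str, stop_word: str) -> bool:
    """Approximation of the described default: exact match, ignoring case."""
    return word.casefold() == stop_word.casefold()
```

Either function could then be passed as `StopRemover(match_callable=case_sensitive_match)`. Note that a case-sensitive callable only removes tokens whose capitalization matches the stop words exactly as they are stored.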

Optionally, you can instruct SR to remove any empty rows that may be left after the stop word removal process. After transform, SR exposes the row_support_ attribute which is a boolean vector that shows which rows in the data were kept (True) and which ones were removed (False). The only way an entry in this vector could become False is if the remove_empty_rows parameter is True and a row became empty during the stop word removal process. row_support_ only reflects the last dataset passed to transform.
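The relationship between remove_empty_rows and row_support_ can be sketched in plain Python. This is an illustration of the documented semantics under simplifying assumptions, not pybear's implementation: the function name and the tiny stop word list are hypothetical, the matching is a simple case-insensitive comparison, and the mask is a plain list rather than a numpy vector.

```python
def remove_stop_words(X, stop_words, remove_empty_rows=True):
    """Illustrative only: return (X_tr, row_support) for a 2D list of tokens."""
    stops = {w.upper() for w in stop_words}
    # Drop every token that matches a stop word, ignoring case.
    cleaned = [[tok for tok in row if tok.upper() not in stops] for row in X]
    if remove_empty_rows:
        # A row is kept (True) only if something is left in it.
        row_support = [len(row) > 0 for row in cleaned]
    else:
        # No rows are removed, so every entry stays True.
        row_support = [True] * len(cleaned)
    X_tr = [row for row, keep in zip(cleaned, row_support) if keep]
    return X_tr, row_support


X = [["to", "the"], ["the", "cat"], ["fish", "pot"]]
X_tr, row_support = remove_stop_words(X, ["to", "the"])
X_all, support_all = remove_stop_words(X, ["to", "the"], remove_empty_rows=False)
```

In this sketch the mask always has one entry per original row, which mirrors the documented requirement that n_rows_ match the number of entries in row_support_.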

SR is a full-fledged scikit-style transformer. It has fully functional get_params, set_params, transform, and fit_transform methods. It also has partial_fit, fit, and score methods, which are no-ops. SR technically does not need to be fit because it already knows everything it needs to do transformations from the parameters and the stop words in the pybear Lexicon. These no-op methods are available to fulfill the scikit transformer API and make SR suitable for incorporation into larger workflows, such as Pipelines and dask_ml wrappers.

Because SR doesn’t need any information from partial_fit() and fit(), it is technically always in a ‘fitted’ state and ready to transform data. Checks for fittedness will always return True.

SR has an n_rows_ attribute which is only available after data has been passed to transform(). n_rows_ is the number of rows of text seen in the original data, and must match the number of entries in row_support_.

Parameters:

match_callable (Callable[[str, str], bool] | None, default = None): None to use the default StopRemover matching criteria, or a custom callable that defines what constitutes a match between a word in the text and a stop word. In pre-run validation, SR only checks that match_callable is None or a callable; no validation is done on the callable itself. Validating a user-defined callable on every call, across a search of the entire text body for every stop word, would be a heavy computational burden, so SR does not validate any of it. If the user-defined callable is ill-formed, SR could break in unpredictable ways or, perhaps worse, complete the search without error and yield nonsensical results. It is up to the user to validate the accuracy of their callable and ensure that the output is a boolean. When designing the callable, the first string in the signature is the word from the text and the second string is a stop word. If you have modified your local copy of the Lexicon and/or the stop words and you intend to use regex in your callable, remember that it may be important to use re.escape.
remove_empty_rows (bool, default = True): Whether to remove any rows that are left empty by the stop word removal process.
exempt (list[str] | None, default = None): Stop words that are exempted from the search. Text that matches these words will not be removed. Ensure that the capitalization of the word(s) that you enter exactly matches that of the word(s) in the Lexicon. Always enter words in majuscule (upper case) if working with the default pybear Lexicon.
supplemental (list[str] | None, default = None): Words to be removed in addition to the stop words. If you intend to do a case-sensitive search, then the capitalization of these words matters.
n_jobs (int | None, default = -1): The number of cores/threads to use when parallelizing the search for stop words in the rows of X. By default the search is parallelized with processes; this can be changed by running StopRemover under a joblib parallel_config context manager. None uses the default number of cores/threads. -1 uses all available cores/threads.
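The re.escape advice in the match_callable description can be illustrated with a small sketch. The function name and the stop word "A.M." are hypothetical and used only to show the effect of regex metacharacters; nothing here claims that such a word is in the pybear Lexicon.

```python
import re


def regex_match(word: str, stop_word: str) -> bool:
    """Case-insensitive full match that treats the stop word as literal text."""
    # re.escape neutralizes metacharacters such as "." inside the stop word.
    pattern = re.escape(stop_word)
    return re.fullmatch(pattern, word, flags=re.IGNORECASE) is not None


# Without escaping, "." matches any character, so an unrelated token matches.
unescaped_hit = re.fullmatch("A.M.", "AXMX") is not None
# With escaping, only the literal text matches.
escaped_hit = regex_match("AXMX", "A.M.")
```

A callable like this would be passed through the match_callable parameter in the same way as any other two-string, boolean-returning function.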

Attributes:

n_rows_: Get the n_rows_ attribute.
row_support_: Get the row_support_ attribute.

Methods

`fit`(X[, y])	No-op one-shot fit method.
`fit_transform`(X[, y])	Fit to data, then transform it.
`get_metadata_routing`()	get_metadata_routing is not implemented in StopRemover.
`get_params`([deep])	Get parameters for this instance.
`partial_fit`(X[, y])	No-op batch-wise fit method.
`score`(X[, y])	No-op score method.
`set_params`(**params)	Set the parameters of an instance or a nested instance.
`transform`(X[, copy])	Scan X and remove any stop words as defined in the pybear Lexicon stop_words_ attribute.

Notes

Type Aliases

PythonTypes: Sequence[Sequence[str]]
NumpyTypes: numpy.ndarray
PandasTypes: pandas.DataFrame
PolarsTypes: polars.DataFrame
XContainer: PythonTypes | NumpyTypes | PandasTypes | PolarsTypes
XWipContainer: list[list[str]]
RowSupportType: numpy.ndarray[bool]

Examples

>>> from pybear.feature_extraction.text import StopRemover as SR
>>> trfm = SR(remove_empty_rows=True, n_jobs=1)
>>> X = [
...     ['but', 'I', 'like', 'to', 'be', 'here'],
...     ['oh', 'I', 'like', 'it', 'a', 'lot'],
...     ['said', 'the', 'cat', 'in', 'the', 'hat'],
...     ['to', 'the', 'fish', 'in', 'the', 'pot']
... ]
>>> trfm.transform(X)
[['oh', 'lot'], ['cat', 'hat'], ['fish', 'pot']]

fit(X, y=None)

No-op one-shot fit method.

Parameters:

X (XContainer): The (possibly ragged) 2D container of text from which to remove stop words. Ignored.
y (Any, default = None): The target for the data. Always ignored.

Returns:

self (object): The StopRemover instance.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array_like of shape (n_samples, n_features)): Required. The data.
y (array_like of shape (n_samples, n_outputs) or (n_samples,)): Optional, default=None. Target values (None for unsupervised transformations).
**fit_params (dict): Additional fit parameters.

Returns:

X_tr (array_like of shape (n_samples, n_features_new)): Transformed array.

get_metadata_routing()

get_metadata_routing is not implemented in StopRemover.

get_params(deep=True)

Get parameters for this instance.

The ‘instance’ may be a pybear estimator, transformer, or a gridsearch module that wraps a nested estimator or pipeline.

Parameters:

deep (bool, default = True)

For instances that do not have nested instances in an estimator attribute (such as estimators or transformers), this parameter is ignored and the same (full) set of parameters for the instance is returned regardless of the value of this parameter.

For instances that have nested instances (such as a GridSearch instance with a nested estimator or pipeline) in the estimator attribute, deep=False returns only the parameters of the wrapping instance. For example, a GridSearch module wrapping an estimator will return only the parameters of the GridSearch instance, ignoring the parameters of the nested instance. When deep=True, this method returns the parameters of the wrapping instance as well as the parameters of the nested instance. When the nested instance is a single estimator, the full set of parameters for that estimator is returned in addition to the parameters of the wrapping instance. If the nested object is a pipeline, the parameters of the pipeline and the parameters of each of the steps in the pipeline are returned in addition to the parameters of the wrapping instance. The estimator’s parameters are prefixed with estimator__.

Returns:

params (dict[str, Any]): Parameter names mapped to their values.

property n_rows_

Get the n_rows_ attribute.

The number of rows in the data passed to transform().

Returns:

n_rows_ (int): The number of rows in the data passed to transform().

partial_fit(X, y=None)

No-op batch-wise fit method.

Parameters:

X (XContainer): The (possibly ragged) 2D container of text from which to remove stop words. Ignored.
y (Any, default = None): The target for the data. Always ignored.

Returns:

self (object): The StopRemover instance.

property row_support_

Get the row_support_ attribute.

A 1D boolean numpy vector indicating which rows have been kept (True) after the stop word removal process. Entries in this vector could only become False if remove_empty_rows is True and one or more rows became empty during the transform process. The row_support_ attribute is only available if a transform has been performed, and only reflects the last dataset passed to transform().

Returns:

row_support_ (numpy.ndarray[bool] of shape (n_original_rows,)): A 1D boolean numpy vector indicating which rows have been kept (True) after the stop word removal process.

score(X, y=None)

No-op score method.

This method exists for compatibility with dask_ml wrappers.

Parameters:

X (Any): The data. Ignored.
y (Any, default = None): The target for the data. Ignored.

Returns:

None

set_params(**params)

Set the parameters of an instance or a nested instance.

This method works on simple estimator and transformer instances as well as on nested objects (such as GridSearch instances).

Setting the parameters of simple estimators and transformers is straightforward. Pass the exact parameter name and its value as a keyword argument to the set_params method call. Alternatively, use ** dictionary unpacking on a dictionary whose keys are exact parameter names and whose values are the new parameter values. Valid parameter keys can be listed with get_params().

Setting the parameters of a GridSearch instance (but not the nested instance) can be done in the same way as above. The parameters of nested instances can be updated using prefixes on the parameter names.

Simple estimators in a GridSearch instance can be updated by prefixing the estimator’s parameters with estimator__. For example, if some estimator has a ‘depth’ parameter, then setting the value of that parameter to 3 would be accomplished by passing estimator__depth=3 as a keyword argument to the set_params method call.

The parameters of a pipeline nested in a GridSearch instance can be updated using the form estimator__<pipe_parameter>. The parameters of the steps of a pipeline have the form <step>__<parameter> so that it’s also possible to update a step’s parameters through the set_params method interface. The parameters of steps in the pipeline can be updated using estimator__<step>__<parameter>.

Parameters:

**params (dict[str, Any]): The parameters to be updated and their new values.

Returns:

self (object): The instance with new parameter values.

transform(X, copy=False)

Scan X and remove any stop words as defined in the pybear Lexicon stop_words_ attribute.

Optionally removes any empty rows left by the stop word removal process. Once data has been passed, the n_rows_ and row_support_ attributes are exposed. The row_support_ attribute is a boolean numpy vector that indicates which rows in the original X were kept during transform (True); entries could only become False if the remove_empty_rows parameter is True and at least one row became empty during the stop word removal process. The row_support_ attribute only reflects the last dataset passed to transform().

Parameters:

X (XContainer): The (possibly ragged) 2D container of text from which to remove stop words.
copy (bool, default = False): Whether to remove stop words from a deepcopy of X (True) or directly from the passed X (False).

Returns:

X_tr (list[list[str]]): The data with stop words removed.