AutoTextCleaner

class pybear.feature_extraction.text.AutoTextCleaner(*, global_sep=' ', case_sensitive=True, global_flags=None, remove_empty_rows=False, return_dim=None, strip=False, replace=None, remove=None, normalize=None, lexicon_lookup=None, remove_stops=False, ngram_merge=None, justify=None, get_statistics=None)

Bases: FitTransformMixin, GetParamsMixin, ReprMixin, SetParamsMixin

A quick, convenient transformer that streamlines basic everyday data cleaning needs.

This module is not meant to do highly specialized text cleaning operations. (A transformer designed to do that using the same underlying pybear sub-transformers might have 50 parameters; 14 is enough.) If you cannot accomplish what you are trying to do with this module out of the box, then you will need to construct your own workflow piece by piece with the individual pybear text modules.

AutoTextCleaner (ATC) combines the functionality of the pybear text transformers into one module. In one shot you can strip, normalize, replace, remove, and justify text. You can also cross-reference the text against the pybear Lexicon, handle unknown words, remove stops, and merge n-grams. All the while, ATC is capable of compiling statistics about the incoming and outgoing text.

ATC adds no new functionality beyond what is in the other pybear text transformers; it simply lines them up and runs them all at once with one call to transform(). All the information about the inner workings of this module is available in the docs for the submodules.

This module does have parameters and attributes that are unique to it. The documentation here mostly highlights these unique characteristics and points the reader to other documentation for more information.

Parameters that require information about text patterns to search, such as remove, replace, and ngram_merge, can take literal strings or regular expression patterns in re.compile objects. If you don’t know regex, don’t worry about the references to it in these docs; you can still use all the functionality of ATC. For the super-users, you can get more control over ATC’s operations with regex patterns in re.compile objects and the global_flags parameter. All users should know that flags passed to global_flags will also apply to any literal strings used as search criteria. When using regex, ALWAYS pass your regex patterns in a re.compile object. DO NOT PASS A REGEX PATTERN AS A LITERAL STRING. YOU WILL NOT GET THE CORRECT RESULT. DO NOT ESCAPE LITERAL STRINGS; ATC WILL DO THAT FOR YOU. If you don’t know what any of that means, then you don’t need to worry about it.
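The difference between a literal string and a compiled pattern can be illustrated with the standard library alone. The sketch below is not ATC's implementation: to_pattern is a hypothetical helper, and it assumes that literals are escaped with re.escape before compilation and that any extra flags are applied to literals and compiled patterns alike, as described above.

```python
import re


def to_pattern(criterion, flags=0):
    """Hypothetical helper: turn a search criterion into a compiled pattern.

    Literal strings are escaped so regex metacharacters match themselves;
    compiled patterns are recompiled with the extra flags added.
    """
    if isinstance(criterion, re.Pattern):
        return re.compile(criterion.pattern, criterion.flags | flags)
    return re.compile(re.escape(criterion), flags)


text = "Price: $5.00 (approx.)"

# A literal string: '$' and '.' only match themselves after escaping.
literal = to_pattern("$5.00")
# A regex inside re.compile: '\d' and '\.' keep their regex meaning.
regex = to_pattern(re.compile(r"\$\d+\.\d{2}"))
# Flags also apply to literal strings.
caseless = to_pattern("price", flags=re.IGNORECASE)
# The same literal compiled WITHOUT escaping: '$' now means end-of-string.
unescaped = re.compile("$5.00")

print(literal.sub("", text))        # Price:  (approx.)
print(regex.sub("", text))          # Price:  (approx.)
print(caseless.sub("COST", text))   # COST: $5.00 (approx.)
print(unescaped.search(text))       # None
```

The last line shows why a regex-looking literal and a true regex must be kept apart: the unescaped version silently matches nothing.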

IMPORTANT: if you want to use the lexicon_lookup parameter and check your text against the pybear Lexicon, remember that the Lexicon is majuscule and has no non-alpha characters. You MUST set normalize to True to get meaningful results, or you risk losing content that is not in the correct case. Also, when you are in the manual text lookup process and are entering words at the prompts to replace unknown words in your text, whatever you enter is inserted into your text exactly as entered. You must enter the text exactly as you want it in the cleaned output. If normalizing the text is important to you, you must enter the text in the case that you want in the output; ATC will not do it for you.
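Why normalization matters before a lookup can be seen with a toy example. The sketch below does not use the pybear Lexicon; TOY_LEXICON and unknown_words are hypothetical stand-ins, assuming only that the lookup is an exact, case-sensitive membership test against upper-case words.

```python
# Toy stand-in for the pybear Lexicon: upper-case, alphabetic words only.
TOY_LEXICON = {"CLEAN", "TEXT", "DATA"}


def unknown_words(words, lexicon):
    """Return the words that are not found in the lexicon (exact match)."""
    return [word for word in words if word not in lexicon]


raw = ["Clean", "text", "DATA"]

# Without normalization, correctly spelled words look unknown.
print(unknown_words(raw, TOY_LEXICON))          # ['Clean', 'text']

# After upper-casing (the effect of normalize=True), every word is found.
normalized = [word.upper() for word in raw]
print(unknown_words(normalized, TOY_LEXICON))   # []
```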

ATC is a full-fledged scikit-style transformer. It has functional get_params, set_params, transform, and fit_transform methods. It also has no-op partial_fit() and fit() methods to allow for integration into larger workflows, like scikit pipelines. Technically ATC does not need to be fit and is always in a fitted state (any ‘is_fitted’ checks of an instance will always return True) because ATC knows everything it needs to know to transform data from the parameters. It also has a no-op score method to allow dask_ml wrappers.

When using set_params() to change the ATC instance’s parameters away from those passed at instantiation, always follow it with a call to the no-op fit() to reset the instance. The submodules are instantiated when ATC is instantiated, so when parameters that affect the submodules change, the submodules must be instantiated again.

ATC accepts 1D list-likes and (possibly ragged) 2D array-likes of strings. Accepted 1D containers include Python lists, tuples, and sets, numpy vectors, pandas series, and polars series. Accepted 2D containers include embedded Python sequences, numpy arrays, pandas dataframes, and polars dataframes. The dimensionality of the output can be controlled by the return_dim parameter. When data is returned in 1D format, the output is a Python list of strings. When the data is returned in 2D format, the output is a Python list of Python lists of strings. If you pass your data as a dataframe with feature names, the feature names are not preserved.
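The relationship between the input shape and return_dim can be sketched in plain Python. This is an illustration, not ATC's code: to_dim is a hypothetical helper, and it assumes that 1D data becomes 2D by splitting each string on the separator and that 2D data becomes 1D by joining each row with it.

```python
def to_dim(X, return_dim=None, sep=" "):
    """Hypothetical helper: coerce 1D or 2D text data to the requested dimension."""
    if return_dim not in (None, 1, 2):
        raise ValueError("return_dim must be None, 1, or 2")
    rows = list(X)
    is_1d = all(isinstance(row, str) for row in rows)
    if is_1d:
        # 1D input: split into tokens only when 2D output is requested.
        return [row.split(sep) for row in rows] if return_dim == 2 else rows
    # 2D (possibly ragged) input: join rows only when 1D output is requested.
    if return_dim == 1:
        return [sep.join(row) for row in rows]
    return [list(row) for row in rows]


print(to_dim(["a b", "c"], return_dim=2))        # [['a', 'b'], ['c']]
print(to_dim((("a", "b"), ("c",)), return_dim=1))  # ['a b', 'c']
print(to_dim(["a b", "c"]))                      # ['a b', 'c']
```

Whatever the input container, the output is always a Python list of strings or a list of lists of strings, which matches the behavior described above.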

Parameters:
global_sep : str, default=' '

The single literal character sequence that is used throughout the text cleaning process for joining 1D data, splitting 2D data, and identifying wrap points when justifying. A common separator (and the default) is ' '.

case_sensitive : bool, default=True

Whether searches for the things to replace, things to remove, etc., are case-sensitive. This generally controls case-sensitivity globally, but for those of you that know regex, an IGNORECASE flag passed to global_flags will always overrule this parameter.

global_flags : int | None, default=None

The regex flags for operations that do searches within the text, like replace and remove. If you do not know regex, then you don’t need to worry about this; just pass literal strings to the other parameters. While case_sensitive generally controls case-sensitivity, an IGNORECASE flag passed here will always overrule it.

remove_empty_rows : bool, default=False

Some operations during the cleaning process, such as removing character patterns and/or stop words, ngram merge, and Lexicon lookup, may leave some rows with no strings in them. If this happens and this parameter is True, then that empty row is removed from the data.

return_dim : ReturnDimType, default=None

The desired dimension of the outputted data. If None (default), then the outputted container has the same dimensionality as the given container. If 1 or 2, then that is the dimensionality of the outputted container.

strip : bool, default=False

Whether to remove leading and trailing spaces from strings in the text.

replace : ReplaceType, default=None

The search and replace strategy. Pass search and replace pairs in tuples, with a literal string or re.compile object as the search criteria, and a literal string or callable as the replace criteria. Pass multiple search and replace tuples in a single enveloping tuple. See the docs for pybear TextReplacer for more information about this parameter.

remove : RemoveType, default=None

The literal strings or regex patterns to remove from the data. When passed as a single literal string or re.compile object, that is applied to every string in the data, and every full string that matches exactly will be removed. When passed as a Python tuple of character strings and/or re.compile objects, each pattern is searched against all the strings in the data and any exact matches are removed. See the docs for pybear TextRemover for more information.

normalize : bool | None, default=None

If True, convert all text in X to upper-case; if False, convert to lower-case; if None, do a no-op.

lexicon_lookup : LexiconLookupType | None, default=None

Remember that the pybear Lexicon is majuscule, so your text should be also if you choose to use this. When None, skip the Lexicon lookup process. Otherwise, must be a dictionary of parameters for TextLookupRealTime. If remove_empty_rows is passed here, it will override the ATC remove_empty_rows parameter; otherwise, the ATC remove_empty_rows parameter will be used. See lexicon_lookup_ for more information. Also see the docs for pybear TextLookupRealTime for information about the parameters and the Lexicon lookup process.

remove_stops : bool, default=False

Whether to remove pybear-defined stop words from the text.

ngram_merge : NGramsType | None, default=None

When None, do not merge ngrams. To pass parameters to this, pass a dictionary with the keys ‘ngrams’ and ‘wrap’. Set the value of ‘ngrams’ with a sequence of sequences, where each inner sequence holds a series of string literals and/or re.compile objects that specify an n-gram. The outer sequence cannot be empty, and no n-gram pattern can have fewer than 2 entries. The ‘wrap’ key takes a boolean value. True will look for ngram merges around the beginnings and ends of adjacent lines; False will only look for ngrams within the contiguous text of one line. See pybear NGramMerger for more information.

justify : int | None, default=None

When None, do not justify the text. Otherwise, pass an integer to tell ATC to justify the data to that character width. When this is not None, i.e., an integer is passed, ATC does not expose the row_support_ attribute.

get_statistics : GetStatisticsType | None, default=None

None or a dictionary keyed with ‘before’ and ‘after’. When None, do not accumulate statistics about the incoming and outgoing text. When passed as a dictionary, both keys must be present. With these keys, you are able to enable or disable statistics logging for both incoming and outgoing text. To disable either of the statistics, pass None to that key. Otherwise, pass a boolean. False does not disable the statistics! The boolean indicates to the respective TextStatistics instance whether to retain unique strings seen within itself to provide the full statistics it is capable of. If True, retain uniques seen by that respective TextStatistics instance. This may lead to a RAM limiting situation, especially for dirty incoming text. To not retain the uniques seen within the TextStatistics instance, set this to False, and some, but not all, statistics will still be tracked. See pybear TextStatistics for more information.
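The precedence between case_sensitive and global_flags described in the parameters above can be sketched as a small flag-merging function. This is an assumption about how the two parameters combine, not ATC's actual code; effective_flags is a hypothetical helper built only on the standard re module.

```python
import re


def effective_flags(case_sensitive=True, global_flags=None):
    """Hypothetical helper: merge case_sensitive and global_flags into one value.

    An IGNORECASE flag in global_flags always wins, even when
    case_sensitive is True.
    """
    flags = global_flags or 0
    if not case_sensitive:
        flags |= re.IGNORECASE
    return flags


# case_sensitive=True alone keeps searches case-sensitive.
print(bool(effective_flags(True, None) & re.IGNORECASE))           # False
# case_sensitive=False turns on case-insensitive matching.
print(bool(effective_flags(False, None) & re.IGNORECASE))          # True
# IGNORECASE in global_flags overrules case_sensitive=True.
print(bool(effective_flags(True, re.IGNORECASE) & re.IGNORECASE))  # True

# The merged flags apply to an (escaped) literal search string as well.
pattern = re.compile(re.escape("atc"), effective_flags(True, re.IGNORECASE))
print(pattern.search("ATC is here") is not None)                   # True
```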

Attributes:
n_rows_

Get the n_rows_ attribute.

row_support_

Get the row_support_ attribute.

before_statistics_

Get the before_statistics_ attribute.

after_statistics_

Get the after_statistics_ attribute.

lexicon_lookup_

Get the lexicon_lookup_ attribute.

Methods

fit(X[, y])

No-op one-shot fitting of the AutoTextCleaner instance.

fit_transform(X[, y])

Fit to data, then transform it.

get_metadata_routing()

get_metadata_routing is not implemented in AutoTextCleaner.

get_params([deep])

Get parameters for this instance.

partial_fit(X[, y])

No-op batch-wise fitting of the AutoTextCleaner instance.

score(X[, y])

No-op score method.

set_params(**params)

Set the parameters of an instance or a nested instance.

transform(X[, copy])

Process the data as per the parameters.

Notes

Type Aliases

PythonTypes:

Sequence[str] | set[str] | Sequence[Sequence[str]]

NumpyTypes:

numpy.ndarray[str]

PandasTypes:

pandas.Series | pandas.DataFrame

PolarsTypes:

polars.Series | polars.DataFrame

XContainer:

PythonTypes | NumpyTypes | PandasTypes | PolarsTypes

XWipContainer:

list[str] | list[list[str]]

ReturnDimType:

None | Literal[1, 2]

FindType:

str | re.Pattern[str]

SubstituteType:

str | Callable[[str], str]

PairType:

tuple[FindType, SubstituteType]

ReplaceType:

None | PairType | tuple[PairType, ...]

RemoveType:

None | FindType | tuple[FindType, ...]

class LexiconLookupType(TypedDict):
    update_lexicon: NotRequired[bool]
    skip_numbers: NotRequired[bool]
    auto_split: NotRequired[bool]
    auto_add_to_lexicon: NotRequired[bool]
    auto_delete: NotRequired[bool]
    DELETE_ALWAYS: NotRequired[Sequence[str | re.Pattern[str]] | None]
    REPLACE_ALWAYS: NotRequired[dict[str | re.Pattern[str], str] | None]
    SKIP_ALWAYS: NotRequired[Sequence[str | re.Pattern[str]] | None]
    SPLIT_ALWAYS: NotRequired[dict[str | re.Pattern[str], Sequence[str]] | None]
    remove_empty_rows: NotRequired[bool]
    verbose: NotRequired[bool]

class NGramsType(TypedDict):
    ngrams: Required[Sequence[Sequence[FindType]]]
    wrap: Required[bool]

class GetStatisticsType(TypedDict):
    before: Required[None | bool]
    after: Required[None | bool]

Examples

>>> from pybear.feature_extraction.text import AutoTextCleaner as ATC
>>> import re
>>>
>>> Trfm = ATC(case_sensitive=False, strip=True, remove_empty_rows=True,
...     replace=(re.compile('[^a-z]'), ''), remove='', normalize=True,
...     global_sep=' ', get_statistics={'before': None, 'after':False},
...     lexicon_lookup={'auto_delete':True}, justify=30)
>>> X = [
...       r' /033[91](tHis)i@s# S@o#/033[0m$e$tERR#I>B<Le te.X@t###dAtA. ',
...       r'@c.lE1123,AnIt up R3eal33nIcE-|-|-|- sEewHat it$S$a$ys$>>>>>>',
...       r'   *f%^&*()%^q*()%^&*m%^&*(l%^&*r%^&r($%^,m,*($9^&@*$%^&*$%^&',
...       r'    (p[^rOb]A.bL(y)N0OtH1InG I1Mp-oRt-Ant.iT" "nEvEr1is1!1!    ',
...     ]
>>> out = Trfm.transform(X)
>>> for line in out:
...     print(line)
THIS IS SOME TERRIBLE TEXT
DATA CLEAN IT UP REAL NICE
SEE WHAT IT SAYS PROBABLY
NOTHING IMPORTANT IT NEVER IS

property after_statistics_

Get the after_statistics_ attribute.

If the ‘after’ key of the get_statistics parameter has a value of True or False, then statistics about the transformed data were compiled in a TextStatistics instance after the transformation. This exposes that TextStatistics instance (which is different from the before_statistics_ TextStatistics instance). The exposed instance has attributes that contain information about the transformed data. See the documentation for TextStatistics to learn about what attributes are exposed. The statistics in this attribute are reset when the AutoTextCleaner instance is reset by calls to fit().

Returns:
_after_statistics : TextStatistics instance

A TextStatistics instance that contains statistics about the processed data after the transformation.

property before_statistics_

Get the before_statistics_ attribute.

If the ‘before’ key of the get_statistics parameter has a value of True or False, then statistics about the raw data were compiled in a TextStatistics instance before the transformation. This exposes that TextStatistics instance (which is different from the after_statistics_ TextStatistics instance). The exposed instance has attributes that contain information about the raw data. See the documentation for TextStatistics to learn about what attributes are exposed. The statistics in this attribute are reset when the AutoTextCleaner instance is reset by calls to fit().

Returns:
_before_statistics : TextStatistics instance

A TextStatistics instance that contains statistics about the raw data before the transformation.

fit(X, y=None)

No-op one-shot fitting of the AutoTextCleaner instance.

Parameters:
X : XContainer

The 1D or (possibly ragged) 2D text data. Ignored.

y : Any, default=None

The target for the data. Always ignored.

Returns:
self : object

The AutoTextCleaner instance.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : array_like of shape (n_samples, n_features)

Required. The data.

y : array_like of shape (n_samples, n_outputs) or (n_samples,)

Optional, default=None. Target values (None for unsupervised transformations).

**fit_params : dict

Additional fit parameters.

Returns:
X_tr : array_like of shape (n_samples, n_features_new)

Transformed array.

get_metadata_routing()

get_metadata_routing is not implemented in AutoTextCleaner.

get_params(deep=True)

Get parameters for this instance.

The ‘instance’ may be a pybear estimator, transformer, or a gridsearch module that wraps a nested estimator or pipeline.

Parameters:
deep : bool, default=True

For instances that do not have nested instances in an estimator attribute (such as estimators or transformers), this parameter is ignored and the same (full) set of parameters for the instance is returned regardless of the value of this parameter.

For instances that have nested instances (such as a GridSearch instance with a nested estimator or pipeline) in the estimator attribute, deep=False will only return the parameters for the wrapping instance. For example, a GridSearch module wrapping an estimator will only return parameters for the GridSearch instance, ignoring the parameters of the nested instance. When deep=True, this method returns the parameters of the wrapping instance as well as the parameters of the nested instance. When the nested instance is a single estimator, the full set of parameters for the single estimator are returned in addition to the parameters of the wrapping instance. If the nested object is a pipeline, the parameters of the pipeline and the parameters of each of the steps in the pipeline are returned in addition to the parameters of the wrapping instance. The estimator’s parameters are prefixed with estimator__.

Returns:
params : dict[str, Any]

Parameter names mapped to their values.

property lexicon_lookup_

Get the lexicon_lookup_ attribute.

If the lexicon_lookup parameter has a non-None value, then information about the text-lookup process is stored in a TextLookupRealTime (TLRT) instance within ATC. This attribute exposes that TLRT instance, which has attributes that contain information about the handling of words not in the pybear Lexicon. If you ran lexicon_lookup in manual mode, you may have put a lot of effort into handling the unknown words and want access to that information. You may have instructed TLRT to queue words that you want to add to the Lexicon so that you can access them later and put them in the Lexicon. See the documentation for TextLookupRealTime to learn about what attributes are exposed. The information in TLRT is reset when AutoTextCleaner is reset by calls to fit().

Returns:
_lexicon_lookup : TextLookupRealTime instance

A TextLookupRealTime instance that contains information about the text-lookup process, if applicable.

property n_rows_

Get the n_rows_ attribute.

The total number of rows in data passed to transform() between resets. This may not be the number of rows in the outputted data. Unlike most other pybear text transformers that expose an n_rows_ attribute that is not cumulative, this particular attribute is cumulative across multiple calls to transform. The reason for the different behavior is that the cumulative behavior here aligns this attribute with the behavior of before_statistics_ and after_statistics_, which compile statistics cumulatively across multiple calls to transform. This number is reset when the AutoTextCleaner instance is reset by calls to fit().

Returns:
_n_rows : int

The total number of rows seen by AutoTextCleaner.
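The cumulative behavior of n_rows_ and its reset by fit() can be mimicked with a toy class. RowCounter below is a hypothetical stand-in and not part of pybear; it only illustrates the bookkeeping described above.

```python
class RowCounter:
    """Toy stand-in (not AutoTextCleaner) that mimics the n_rows_ bookkeeping."""

    def __init__(self):
        self._n_rows = 0

    def fit(self, X=None, y=None):
        # Resetting the counter is the only effect of this otherwise no-op fit.
        self._n_rows = 0
        return self

    def transform(self, X):
        # Count the incoming rows; the data itself passes through untouched.
        self._n_rows += len(X)
        return X

    @property
    def n_rows_(self):
        return self._n_rows


counter = RowCounter()
counter.transform(["first batch", "of text"])
counter.transform(["second batch"])
print(counter.n_rows_)   # 3, accumulated over both calls

counter.fit()
print(counter.n_rows_)   # 0, reset by fit()
```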

partial_fit(X, y=None)

No-op batch-wise fitting of the AutoTextCleaner instance.

Parameters:
X : XContainer

The 1D or (possibly ragged) 2D text data. Ignored.

y : Any, default=None

The target for the data. Always ignored.

Returns:
self : object

The AutoTextCleaner instance.

property row_support_

Get the row_support_ attribute.

A 1D boolean numpy vector indicating which rows of the data, if any, were removed during the cleaning process. The length must equal the number of rows in the data originally passed to transform(). A row that was removed is indicated by a False in the corresponding position in the vector, and a row that remains is indicated by True. This attribute only reflects the last batch of data passed to transform; it is not cumulative. This attribute is not available if ATC parameter justify is enabled.

Returns:
_row_support : np.ndarray[bool]

A 1D boolean numpy vector indicating which rows of the data, if any, were removed during the cleaning process.
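A typical use of a row-support mask is to keep other data, such as a target vector, aligned with the rows that survive cleaning. The sketch below uses plain Python lists rather than a numpy vector, and drop_empty_rows is a hypothetical helper, not a pybear function.

```python
def drop_empty_rows(X):
    """Hypothetical helper: drop empty rows and report which rows were kept."""
    row_support = [len(row) > 0 for row in X]
    kept = [row for row, keep in zip(X, row_support) if keep]
    return kept, row_support


# Pretend the cleaning steps emptied the second row.
X_cleaned = [["GOOD", "ROW"], [], ["ANOTHER"]]
y = [0, 1, 1]

X_kept, row_support = drop_empty_rows(X_cleaned)
# Apply the same mask to the target so that X and y stay aligned.
y_kept = [label for label, keep in zip(y, row_support) if keep]

print(row_support)   # [True, False, True]
print(X_kept)        # [['GOOD', 'ROW'], ['ANOTHER']]
print(y_kept)        # [0, 1]
```

The mask has one entry per row of the data originally passed in, so it can index anything that shares that row order.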

score(X, y=None)

No-op score method.

Parameters:
X : Any

The data. Ignored.

y : Any, default=None

The target for the data. Ignored.

Returns:
None

set_params(**params)

Set the parameters of an instance or a nested instance.

This method works on simple estimator and transformer instances as well as on nested objects (such as GridSearch instances).

Setting the parameters of simple estimators and transformers is straightforward. Pass the exact parameter name and its value as a keyword argument to the set_params method call. Or use ** dictionary unpacking on a dictionary keyed with exact parameter names and the new parameter values as the dictionary values. Valid parameter keys can be listed with get_params().

Setting the parameters of a GridSearch instance (but not the nested instance) can be done in the same way as above. The parameters of nested instances can be updated using prefixes on the parameter names.

Simple estimators in a GridSearch instance can be updated by prefixing the estimator’s parameters with estimator__. For example, if some estimator has a ‘depth’ parameter, then setting the value of that parameter to 3 would be accomplished by passing estimator__depth=3 as a keyword argument to the set_params method call.

The parameters of a pipeline nested in a GridSearch instance can be updated using the form estimator__<pipe_parameter>. The parameters of the steps of a pipeline have the form <step>__<parameter> so that it’s also possible to update a step’s parameters through the set_params method interface. The parameters of steps in the pipeline can be updated using estimator__<step>__<parameter>.

Parameters:
**params : dict[str, Any]

The parameters to be updated and their new values.

Returns:
self : object

The instance with new parameter values.

transform(X, copy=False)

Process the data as per the parameters.

Parameters:
X : XContainer

The 1D or (possibly ragged) 2D text data.

copy : bool, default=False

Whether to perform the text cleaning operations directly on the passed X or on a deepcopy of X.

Returns:
X_tr : list[str] | list[list[str]]

The processed data.