TextLookupRealTime

class pybear.feature_extraction.text.TextLookupRealTime(*, update_lexicon=False, skip_numbers=True, auto_split=True, auto_add_to_lexicon=False, auto_delete=False, DELETE_ALWAYS=None, REPLACE_ALWAYS=None, SKIP_ALWAYS=None, SPLIT_ALWAYS=None, remove_empty_rows=False, verbose=False)

Bases: _TextLookupMixin

Handle words in a 2D array-like body of text that are not in the pybear Lexicon.

Options include replacing, removing, splitting, or skipping the word, or staging it to add to the pybear Lexicon.

TextLookupRealTime (TLRT) has a dual-functionality transform() method. TLRT can operate autonomously on your data for a completely hands-free experience, or it can be driven interactively. The interactive mode is a menu-driven process that prompts the user for a decision about each word that is not in the Lexicon and makes the edits to the data in real time (hence the name).

The real-time interactive mode is most useful when your data contains many words that are not in the Lexicon. Manually cleaning text is labor-intensive, and there is always the risk of losing your work. After every 20 manual edits, TLRT asks in-situ whether you want to save your work to the hard drive, so if your session is disrupted midstream you won’t lose all of your work.

This is the key difference between TLRT and TextLookup (TL): TLRT works on your data in real time, meaning that the data is modified in-situ as soon as you indicate an action. TL is a more conventional scikit-style transformer: the learning for both autonomous and manual modes happens in partial_fit / fit, the information is stored in ‘holder’ attributes, and that information is then applied blindly to any data passed to transform. TL does not mutate your data during fitting, so the changes to your data do not happen in ‘real time’, and TL therefore cannot save changes to your data in-situ. If you log a lot of changes during partial_fit / fit and then the program terminates for whatever reason, you lose all your work. TLRT affords you the opportunity to save your work in-situ, making your changes permanent. Another benefit of operating directly on the data is that, unlike with TL, you can perform a different operation on each occurrence of a particular word.

To run TLRT in autonomous mode, set either auto_delete or auto_add_to_lexicon to True; both cannot simultaneously be True. auto_add_to_lexicon can only be True if the update_lexicon parameter is True.

When auto_add_to_lexicon is True, if TLRT encounters a word that is not in the Lexicon it will automatically stage the word in the LEXICON_ADDENDUM_ and go to the next word until all the words in the text are exhausted. When auto_delete is True, if TLRT encounters a word that is not in the Lexicon, it will automatically delete the word from the text body and go to the next word, until all the words in the text are exhausted. In these cases, TLRT can never proceed into manual mode. To allow TLRT to go into manual mode, both auto_delete and auto_add_to_lexicon must be False.
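The rules above, together with the auto_delete constraint stated in the Parameters section, can be summarized in a small decision helper. This is only an illustrative sketch: resolve_mode is a hypothetical function that is not part of pybear, and raising ValueError is an assumption about how invalid combinations might be rejected.

```python
def resolve_mode(update_lexicon=False, auto_add_to_lexicon=False, auto_delete=False):
    """Hypothetical helper: map the three flags to a mode name."""
    if auto_add_to_lexicon and auto_delete:
        raise ValueError("auto_add_to_lexicon and auto_delete cannot both be True")
    if auto_add_to_lexicon and not update_lexicon:
        raise ValueError("auto_add_to_lexicon requires update_lexicon=True")
    if auto_delete and update_lexicon:
        # Constraint taken from the auto_delete parameter description.
        raise ValueError("auto_delete cannot be True when update_lexicon is True")
    if auto_add_to_lexicon:
        return "auto_add"
    if auto_delete:
        return "auto_delete"
    # Both auto flags are False, so the interactive menu is available.
    return "manual"


print(resolve_mode())                                               # manual
print(resolve_mode(update_lexicon=True))                            # manual
print(resolve_mode(update_lexicon=True, auto_add_to_lexicon=True))  # auto_add
print(resolve_mode(auto_delete=True))                               # auto_delete
```

Note that update_lexicon=True on its own leaves TLRT in manual mode; it only adds the ‘add to lexicon’ choice to the menu.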

In manual mode, when TLRT encounters a word that is not in the Lexicon, the user is prompted with an interactive menu for an action. Choices include ‘skip once’, ‘skip always’, ‘delete once’, ‘delete always’, ‘replace once’, ‘replace always’, ‘split once’, ‘split always’, and, if update_lexicon is True, an ‘add to lexicon’ option. Notice that the operations fall into 2 groups, the ‘once’ group and the ‘always’ group. A choice from the ‘once’ group is a one-time operation on that word; TLRT will not remember what to do the next time it sees this exact word. If you choose something from the ‘always’ group, the word and its action go into a ‘holder’ object so that TLRT remembers what to do the next time it sees the word. In this way, a tedious interactive session can become more automated as the session proceeds.

The holder objects are all accessible attributes in the TLRT public API. See the Attributes section for more details. These holder objects can also be passed at instantiation to give TLRT a head-start on words that aren’t in the Lexicon and to help make a manual session more automated. Let’s say, for example, that you know your text is full of proper names that aren’t in the Lexicon, you don’t want to add them permanently, and you don’t want to tell TLRT what to do with them every time they come up. You decide to leave them in the text body and have TLRT ignore them. At instantiation, pass a list of these strings to the SKIP_ALWAYS parameter. So you might pass ['ALICE', 'BOB', 'CARL', 'DIANE', ...] to SKIP_ALWAYS. TLRT will always skip these words without asking. The passed SKIP_ALWAYS becomes the starting seed of the SKIP_ALWAYS_ attribute. Any other manual inputs during the session that say to always skip certain other words are added to this list, so that at the end of the session the SKIP_ALWAYS_ attribute contains your originally passed words and the words added during the session.

TLRT always looks for special instructions before checking whether a word is in the Lexicon. If TLRT checked the Lexicon first and found the word there, it would go to the next word automatically and the special instructions would never be applied. Checking in this order lets users give special instructions for words that are already in the Lexicon. For example, if a word is in the Lexicon but you want to delete it from your text, pass it to DELETE_ALWAYS and TLRT will remove it regardless of what the Lexicon says.
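The effect of checking special instructions before the Lexicon can be sketched in a few lines. This is an illustration of the ordering only, not pybear's implementation: lookup is a hypothetical function, the lexicon is a small stand-in set, and only string literals are handled.

```python
def lookup(word, lexicon, skip_always=(), delete_always=()):
    """Hypothetical helper: return the action for one token."""
    # Special instructions are consulted first ...
    if word in skip_always:
        return "skip"
    if word in delete_always:
        return "delete"
    # ... and only then is the word compared against the lexicon.
    if word in lexicon:
        return "known"
    return "unknown"  # would go on to auto or manual handling


lexicon = {"THE", "QUICK", "FOX"}  # stand-in for the pybear Lexicon

print(lookup("THE", lexicon, delete_always=["THE"]))    # delete
print(lookup("QUICK", lexicon))                         # known
print(lookup("ALICE", lexicon, skip_always=["ALICE"]))  # skip
print(lookup("ZZYZX", lexicon))                         # unknown
```

Had the lexicon check come first, 'THE' would have been reported as known and the DELETE_ALWAYS entry would never have been reached.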

The update_lexicon parameter does not cause TLRT to directly update the Lexicon. If the user opts to stage a word for addition to the Lexicon, the word is added to the LEXICON_ADDENDUM_ attribute. This is a deliberate design choice to stage the words rather than silently modify the Lexicon. This gives the user a layer of protection where they can review the words staged to go into the Lexicon, make any changes needed, then manually pass them to the Lexicon add_words method.

TLRT requires (possibly ragged) 2D data formats. Accepted objects include python built-in lists and tuples, numpy arrays, pandas dataframes, and polars dataframes. Use pybear TextSplitter to convert 1D text to 2D tokens. Results are always returned as a 2D python list of lists of strings.

Your data should be in a highly processed state before using TLRT. This should be one of the last steps in a text wrangling workflow because the content of your text will be compared directly against the words in the Lexicon, and all the words in the pybear Lexicon have no non-alpha characters and are all majuscule. All junk characters should be removed and clear separators established. A pybear text wrangling workflow might look like: TextStripper > TextReplacer > TextSplitter > TextNormalizer > TextRemover > TextLookupRealTime > TextJoiner > TextJustifier

Every operation in TLRT is case-sensitive. Remember that the formal pybear Lexicon is majuscule; use pybear TextNormalizer to make all your text majuscule before using TLRT. Otherwise, TLRT will always flag every valid word that is not majuscule because it doesn’t exactly match the Lexicon. If you alter your local copy of the pybear Lexicon with your own words of varying capitalization, TLRT honors your capitalization scheme.
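A minimal sketch of why case matters, using a small stand-in set for the Lexicon and Python's str.upper as a substitute for TextNormalizer (both are assumptions made for illustration; no pybear code is called here):

```python
lexicon = {"FOUR", "SCORE", "AND"}  # stand-in for the majuscule pybear Lexicon
row = ["Four", "score", "AND"]

# A case-sensitive comparison flags every token that is not majuscule.
flagged_raw = [word for word in row if word not in lexicon]
print(flagged_raw)  # ['Four', 'score']

# After normalizing to upper case, the same tokens match.
normalized = [word.upper() for word in row]  # str.upper stands in for TextNormalizer
flagged_normalized = [word for word in normalized if word not in lexicon]
print(flagged_normalized)  # []
```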

When you are in the manual text lookup process and entering words at the prompts to replace unknown words in your text, whatever you enter is inserted into your text exactly as entered. You must enter the text exactly as you want it in the cleaned output. If normalizing the text is important to you, enter the text in the case that you want in the output; TLRT will not do it for you.

TLRT is a full-fledged scikit-style transformer. It has fully functional get_params, set_params, transform, and fit_transform methods. It also has partial_fit, fit, and score methods, which are no-ops. TLRT technically does not need to be fit for 2 reasons. First, in autonomous mode, TLRT already knows everything it needs to do transformations from the parameters and the Lexicon. Secondly, in manual mode the user interacts with the data during transform, not partial_fit / fit. These no-op methods are available to fulfill the scikit transformer API and make TLRT suitable for incorporation into larger workflows, such as Pipelines and dask_ml wrappers.

Because TLRT doesn’t need any information from partial_fit / fit, it is technically always in a ‘fitted’ state and ready to transform data. Checks for fittedness will always return True.

TLRT exposes 2 attributes after a call to transform. First, the n_rows_ attribute is the number of rows of text seen in the original data but is not necessarily the number of rows in the outputted data. TLRT also has a row_support_ attribute that is a boolean vector of shape (n_rows, ) that indicates which rows of the original data were kept during the transform process (True) and which were deleted (False). The only way that an entry could become False is if remove_empty_rows is True and a row becomes empty when handling unknown words. row_support_ only reflects the last dataset passed to transform.
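The relationship between n_rows_, row_support_, and the returned rows can be sketched as follows. This is a plain-Python illustration, not pybear code: drop_empty_rows is a hypothetical helper, and it returns the mask as a list of booleans where TLRT exposes a numpy boolean vector.

```python
def drop_empty_rows(rows):
    """Hypothetical helper: return (kept_rows, n_rows, row_support)."""
    n_rows = len(rows)                            # rows seen in the original data
    row_support = [len(row) > 0 for row in rows]  # True = kept, False = removed
    kept_rows = [row for row, keep in zip(rows, row_support) if keep]
    return kept_rows, n_rows, row_support


# Rows as they might look after unknown words were deleted; the middle row is empty.
rows_after_lookup = [["FOUR", "SEVEN", "YEARS"], [], ["AND", "THE"]]
labels = ["doc-a", "doc-b", "doc-c"]  # side data aligned with the original rows

kept, n_rows, support = drop_empty_rows(rows_after_lookup)
kept_labels = [label for label, keep in zip(labels, support) if keep]

print(n_rows)       # 3
print(support)      # [True, False, True]
print(kept)         # [['FOUR', 'SEVEN', 'YEARS'], ['AND', 'THE']]
print(kept_labels)  # ['doc-a', 'doc-c']
```

A mask like this is useful for keeping other row-aligned data, such as labels, in step with the transformed text.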

Parameters:

update_lexicon (bool, default = False): Whether to queue words that are not in the pybear Lexicon for later addition to the Lexicon. This applies to both autonomous and interactive modes. If False, TLRT will never put a word in LEXICON_ADDENDUM_ and will never prompt you with the option.
skip_numbers (bool, default = True): When True, TLRT will try to do Python float(word) on the word and, if it can be cast to a float, TLRT will skip it and go to the next word. If False, TLRT handles numbers like any other word: because there are no numbers in the formal pybear Lexicon, TLRT will always flag them and either handle them autonomously or prompt the user for an action. Since they are handled like any other word, it would be possible to stage them for addition to your local copy of the Lexicon.
auto_split (bool, default = True): TLRT first looks for the word in the holder objects for special instructions, then checks whether the word is in the Lexicon. If it is not, the next step would otherwise be auto-add to Lexicon, auto-delete, or manual mode. auto_split is a last-ditch effort to see if the word is an erroneous compounding of 2 words that are in the Lexicon. If auto_split is True, TLRT will iteratively split any word of 4 or more characters, from after the second character to before the second to last character, and check whether both halves are in the Lexicon. If a match is found, TLRT stops at the first one, removes the original word, and inserts in its place the 2 halves that were found in the Lexicon. If False, TLRT will skip this process and go straight to auto-add, auto-delete, or manual mode.
auto_add_to_lexicon (bool, default = False): update_lexicon must be True to use this. Cannot be True if auto_delete is True. When this parameter is True, TLRT operates in ‘auto-mode’, where the user will not be prompted for decisions. When TLRT encounters a word that is not in the Lexicon, the word will silently be staged in the LEXICON_ADDENDUM_ attribute to be added to the Lexicon later.
auto_delete (bool, default = False): If update_lexicon is True then this cannot be set to True. When this parameter is True, TLRT operates in ‘auto-mode’, where the user will not be prompted for decisions. When TLRT encounters a word that is not in the Lexicon, the word will be silently deleted from the text body.
DELETE_ALWAYS (Sequence[MatchType] | None, default = None): A list of words and/or full-word regex patterns that will always be deleted by TLRT, even if they are in the Lexicon. For both auto and manual modes, when a word in the text body is a case-sensitive match against a string literal in this list, or is a full-word match against a regex pattern in this list, TLRT will not prompt the user for any more information; it will silently delete the word. What is passed here becomes the seed for the DELETE_ALWAYS_ attribute, which may have more words added to it during run-time in manual mode. Auto-mode will never add more words to this list.
REPLACE_ALWAYS (dict[MatchType, str] | None, default = None): A dictionary with words and/or full-word regex patterns as keys and their respective single-word replacement strings as values. For both auto and manual modes, when a word in the text body is a case-sensitive match against a string literal key, or is a full-word match against a regex pattern key, TLRT will not prompt the user for any more information; it will silently make the replacement. TLRT will replace these words even if they are in the Lexicon. What is passed here becomes the seed for the REPLACE_ALWAYS_ attribute, which may have more word/replacement pairs added to it during run-time in manual mode. Auto-mode will never add more entries to this dictionary.
SKIP_ALWAYS (Sequence[MatchType] | None, default = None): A list of words and/or full-word regex patterns that will always be ignored by TLRT, even if they are not in the Lexicon. For both auto and manual modes, when a word in the text body is a case-sensitive match against a string literal in this list, or is a full-word match against a regex pattern in this list, TLRT will not prompt the user for any more information; it will silently skip the word. What is passed here becomes the seed for the SKIP_ALWAYS_ attribute, which may have more words added to it during run-time in manual mode. Auto-mode will never add more words to this list.
SPLIT_ALWAYS (dict[MatchType, Sequence[str]] | None, default = None): A dictionary with words and/or full-word regex patterns as keys and their respective multi-word lists of replacement strings as values. TLRT will remove the original word and insert these words into the text body starting in its position, even if the original word is in the Lexicon. For both auto and manual modes, TLRT will not prompt the user for any more information; it will silently split the word. What is passed here becomes the seed for the SPLIT_ALWAYS_ attribute, which may have more word/replacement pairs added to it during run-time in manual mode. Auto-mode will never add entries to this dictionary.
remove_empty_rows (bool, default = False): Whether to remove any rows that may have been made empty during the lookup process. If remove_empty_rows is True and rows are deleted, the user can find supplemental information in row_support_, which indicates through booleans which rows were kept (True) and which rows were removed (False).
verbose (bool, default = False): Whether to display helpful information during the transform process. This applies to both auto and manual modes.
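The skip_numbers check and the auto_split search described in the parameter list can be sketched in plain Python. This is one reading of the description, not pybear's implementation: is_number and auto_split are hypothetical helpers, the lexicon is a stand-in set, and the exact split boundaries used by TLRT are assumed to leave at least 2 characters on each side.

```python
def is_number(word):
    """True if Python's float() accepts the token."""
    try:
        float(word)
        return True
    except ValueError:
        return False


def auto_split(word, lexicon):
    """Return the first (left, right) split with both halves in lexicon, else None."""
    if len(word) < 4:
        return None
    # Split points run from after the 2nd character to before the 2nd-to-last one.
    for i in range(2, len(word) - 1):
        left, right = word[:i], word[i:]
        if left in lexicon and right in lexicon:
            return left, right
    return None


lexicon = {"ICE", "CREAM", "AB", "CD"}  # stand-in for the pybear Lexicon

print(is_number("2"))                   # True
print(is_number("TWO"))                 # False
print(auto_split("ICECREAM", lexicon))  # ('ICE', 'CREAM')
print(auto_split("ABCD", lexicon))      # ('AB', 'CD')
print(auto_split("CAT", lexicon))       # None
```

Python's float() also accepts strings such as 'INF' and 'NAN', so under this reading those tokens would be treated as numbers when skip_numbers is True.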

Attributes:

n_rows_: Get the n_rows_ attribute.
row_support_: Get the row_support_ attribute.
DELETE_ALWAYS_: Return the DELETE_ALWAYS_ attribute.
KNOWN_WORDS_: A WIP object used by TL(RT) to determine “what is in the Lexicon.”
LEXICON_ADDENDUM_: Words queued for entry into the pybear Lexicon.
REPLACE_ALWAYS_: Return the REPLACE_ALWAYS_ attribute.
SKIP_ALWAYS_: Return the SKIP_ALWAYS_ attribute.
SPLIT_ALWAYS_: Return the SPLIT_ALWAYS_ attribute.

Methods

dump_to_csv(X, filename): Dump X to csv.
dump_to_txt(X, filename): Dump X to txt.
fit(X[, y]): No-op one-shot fit method.
fit_transform(X[, y]): Fit to data, then transform it.
get_lexicon(): Create a singleton Lexicon instance as class attribute.
get_metadata_routing(): get_metadata_routing is not implemented in TextLookup.
get_params([deep]): Get parameters for this instance.
partial_fit(X[, y]): No-op batch-wise fit method.
reset(): Reset the TextLookup instance.
score(X[, y]): No-op score method.
set_params(**params): Set the parameters of an instance or a nested instance.
transform(X[, copy]): Scan tokens in X and either autonomously handle tokens not in the pybear Lexicon or prompt for handling.

Notes

When passing regex patterns to DELETE_ALWAYS, REPLACE_ALWAYS, SKIP_ALWAYS, and SPLIT_ALWAYS, the regex patterns must be designed to match full words in the text body and must be passed in re.compile objects. Do not pass regex patterns as string literals; you will not get the correct result. String literals must also be designed to match full words in the text body. You do not need to escape string literals.

If the same literal is passed to multiple ‘ALWAYS’ parameters, TLRT will detect this conflict and raise an error. If a word in the text body causes a conflict between a literal and a re.compile object, or between two re.compile objects, within the same ‘ALWAYS’ parameter, TLRT will raise an error. However, TLRT cannot detect conflicts between re.compile objects across multiple ‘ALWAYS’ parameters, where a word in a text body could be indicated for two different operations, such as SKIP and DELETE. TLRT will not resolve the conflict but will simply perform whichever operation is matched first. The order of match searching within TLRT is SKIP_ALWAYS, DELETE_ALWAYS, REPLACE_ALWAYS, and finally SPLIT_ALWAYS. It is up to the user to avoid these conflict conditions with careful regex pattern design.
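The matching rules in these notes can be sketched with Python's re module. This is an illustration under assumptions, not pybear's code: matches and first_action are hypothetical helpers, re.Pattern.fullmatch is assumed to be a fair model of a ‘full-word match’, and the same-parameter conflict checks that raise errors are not modeled.

```python
import re


def matches(word, entry):
    """True if word equals a literal entry or fully matches a compiled pattern."""
    if isinstance(entry, re.Pattern):
        return entry.fullmatch(word) is not None
    return word == entry  # literals are compared as-is, never as regex


def first_action(word, skip=(), delete=(), replace=None, split=None):
    """Return the first holder that matches, searched in the documented order."""
    holders = [
        ("skip", skip),
        ("delete", delete),
        ("replace", replace or {}),
        ("split", split or {}),
    ]
    for action, holder in holders:
        if any(matches(word, entry) for entry in holder):
            return action
    return None


# A literal containing regex metacharacters only matches itself.
print(matches("A.B", "A.B"))  # True
print(matches("AXB", "A.B"))  # False

# A compiled pattern must match the whole word, not a substring.
print(matches("CATS", re.compile("CAT")))  # False

# Two patterns in different holders both match 'X12'; the earlier holder wins.
skip = [re.compile(r"[A-Z]+\d+")]
delete = [re.compile(r"X\d+")]
print(first_action("X12", skip=skip, delete=delete))  # skip
```

The last call shows the cross-parameter situation that TLRT cannot detect: the word is a candidate for both SKIP and DELETE, and the search order decides the outcome.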

Type Aliases

MatchType: str | re.Pattern[str]
PythonTypes: Sequence[Sequence[str]]
NumpyTypes: numpy.ndarray[str]
PandasTypes: pandas.DataFrame
PolarsTypes: polars.DataFrame
XContainer: PythonTypes | NumpyTypes | PandasTypes | PolarsTypes
WipXContainer: list[list[str]]
RowSupportType: numpy.ndarray[bool]

Examples

>>> from pybear.feature_extraction.text import TextLookupRealTime as TLRT
>>> trfm = TLRT(skip_numbers=True, auto_delete=True, auto_split=False)
>>> X = [
...    ['FOUR', 'SKORE', '@ND', 'SEVEN', 'YEARS', 'ABO'],
...    ['OUR', 'FATHERS', 'BROUGHT', 'FOTH', 'UPON', 'THIZ', 'CONTINENT'],
...    ['A', 'NEW', 'NETION', 'CONCEIVED', 'IN', 'LOBERTY'],
...    ['AND', 'DEDICORDED', '2', 'THE', 'PRAPISATION'],
...    ['THAT', 'ALL', 'MEESES', 'ARE', 'CREATED', 'EQUEL']
... ]
>>> out = trfm.transform(X)
>>> for i in out:
...     print(i)
['FOUR', 'SEVEN', 'YEARS']
['OUR', 'FATHERS', 'BROUGHT', 'UPON', 'CONTINENT']
['A', 'NEW', 'CONCEIVED', 'IN']
['AND', '2', 'THE']
['THAT', 'ALL', 'ARE', 'CREATED']

property DELETE_ALWAYS_

Return the DELETE_ALWAYS_ attribute.

A list of words and/or full-word regex patterns that will always be deleted from the text body by TL(RT), even if they are in the Lexicon. This list contains any words and re.compile objects passed to DELETE_ALWAYS at instantiation and any words added in-situ.

Returns:

DELETE_ALWAYS_ (list[str | re.Pattern[str]]): A list of words and/or full-word regex patterns in re.compile objects that will always be deleted from the text body by TL(RT), even if they are in the Lexicon.

property KNOWN_WORDS_

A WIP object used by TL(RT) to determine “what is in the Lexicon.”

At instantiation, this is just a copy of the lexicon_ attribute of the pybear Lexicon class. If update_lexicon is True, any words to be added to the Lexicon are inserted at the front of this list (in addition to being put in LEXICON_ADDENDUM_). If auto_add_to_lexicon is True, words are inserted at the front of this list silently during the auto-lookup process. If auto_add_to_lexicon is False, words are inserted into this list when the user selects ‘add to lexicon’.

Returns:

KNOWN_WORDS_ (list[str]): A WIP object used by TL(RT) to determine “what is in the Lexicon.”

property LEXICON_ADDENDUM_

Words queued for entry into the pybear Lexicon.

Can only have words in it if update_lexicon is True. If in auto mode (auto_add_to_lexicon is True), anything encountered in the text that is not in the Lexicon is added to this list. In manual mode, if the user selects to ‘add to lexicon’ then the word is put in this list. TL(RT) does not automatically add new words to the actual Lexicon directly. TL(RT) stages new words in LEXICON_ADDENDUM_ and at the end of a session prints them to the screen and makes them available in this attribute.

Returns:

LEXICON_ADDENDUM_ (list[str]): Words queued for entry into the pybear Lexicon.

property REPLACE_ALWAYS_

Return the REPLACE_ALWAYS_ attribute.

A dictionary with words and/or full-word regex patterns as keys and their respective single-word replacement strings as values.

TL(RT) will replace these words even if they are in the Lexicon. This holds any words and re.compile objects passed to REPLACE_ALWAYS at instantiation and anything added to it during run-time in manual mode. In manual mode, when the user selects ‘replace always’, the next time TL(RT) sees the word it will not prompt the user for any more information; it will silently replace the word. When in auto mode, TL(RT) will not add any entries to this dictionary.

Returns:

REPLACE_ALWAYS_ (dict[str | re.Pattern[str], str]): A dictionary with words and/or full-word regex patterns in re.compile objects as keys and their respective single-word replacements as values.

property SKIP_ALWAYS_

Return the SKIP_ALWAYS_ attribute.

A list of words and/or full-word regex patterns that are always ignored by TL(RT), even if they are not in the Lexicon.

This list holds any words and re.compile objects passed to the SKIP_ALWAYS parameter at instantiation and any words added to it when the user selects ‘skip always’ in manual mode. In manual mode, the next time TL(RT) sees a word that is in this list it will not prompt the user again; it will silently skip the word.

TL will only make additions to this list in auto mode if skip_numbers is True and a number is found in the training data. TLRT does not make additions to this list in auto mode.

Returns:

SKIP_ALWAYS_ (list[str | re.Pattern[str]]): A list of words and/or full-word regex patterns in re.compile objects that are always ignored by TL(RT), even if they are not in the Lexicon.

property SPLIT_ALWAYS_

Return the SPLIT_ALWAYS_ attribute.

A dictionary with words and/or full-word regex patterns as keys and their respective multi-word lists of replacements as values.

Similar to REPLACE_ALWAYS_. TL(RT) will sub these words in even if the original word is in the Lexicon. This dictionary holds anything passed to SPLIT_ALWAYS at instantiation and any splits made when ‘split always’ is selected in manual mode. In manual mode, the next time TL(RT) sees the same word in the text body it will not prompt the user again, just silently make the split.

The only way TL will add anything to this dictionary in auto mode is if auto_split is True and TL finds a valid split of an unknown word during partial_fit / fit. TLRT does not add anything to this dictionary in auto mode.

Returns:

SPLIT_ALWAYS_ (dict[str | re.Pattern[str], Sequence[str]]): A dictionary with words and/or full-word regex patterns in re.compile objects as keys and their respective multi-word lists of replacements as values.

dump_to_csv(X, filename)

Dump X to csv.

Parameters:

X (list[str]): The data.
filename (str): The name for the saved csv file.

Returns:

None

dump_to_txt(X, filename)

Dump X to txt.

Parameters:

X (list[str]): The data.
filename (str): The name for the saved txt file.

Returns:

None

fit(X, y=None)

No-op one-shot fit method.

Parameters:

X (XContainer): The (possibly ragged) 2D container of text to have its contents cross-referenced against the pybear Lexicon. Ignored.
y (Any | None, default=None): The target for the data. Always ignored.

Returns:

self (object): The TextLookupRealTime instance.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array_like of shape (n_samples, n_features)): Required. The data.
y (array_like of shape (n_samples, n_outputs) or (n_samples,)): Optional, default=None. Target values (None for unsupervised transformations).
**fit_params (dict): Additional fit parameters.

Returns:

X_tr (array_like of shape (n_samples, n_features_new)): Transformed array.

classmethod get_lexicon(): Create a singleton Lexicon instance as a class attribute.

get_metadata_routing(): get_metadata_routing is not implemented in TextLookup.

get_params(deep=True)

Get parameters for this instance.

The ‘instance’ may be a pybear estimator, transformer, or a gridsearch module that wraps a nested estimator or pipeline.

Parameters:

deep (bool, default = True):

For instances that do not have nested instances in an estimator attribute (such as estimators or transformers), this parameter is ignored and the same (full) set of parameters for the instance is returned regardless of the value of this parameter.

For instances that have nested instances (such as a GridSearch instance with a nested estimator or pipeline) in the estimator attribute, deep=False will only return the parameters for the wrapping instance. For example, a GridSearch module wrapping an estimator will only return parameters for the GridSearch instance, ignoring the parameters of the nested instance. When deep=True, this method returns the parameters of the wrapping instance as well as the parameters of the nested instance. When the nested instance is a single estimator, the full set of parameters for the single estimator are returned in addition to the parameters of the wrapping instance. If the nested object is a pipeline, the parameters of the pipeline and the parameters of each of the steps in the pipeline are returned in addition to the parameters of the wrapping instance. The estimator’s parameters are prefixed with estimator__.

Returns:

params (dict[str, Any]): Parameter names mapped to their values.

property n_rows_

Get the n_rows_ attribute.

The number of rows in the last data passed to transform(). Not necessarily the number of rows in the outputted data.

Returns:

n_rows_ (int): The number of rows in the last data passed to transform.

partial_fit(X, y=None)

No-op batch-wise fit method.

Parameters:

X (XContainer): The (possibly ragged) 2D container of text to have its contents cross-referenced against the pybear Lexicon. Ignored.
y (Any, default=None): The target for the data. Always ignored.

Returns:

self (object): The TextLookupRealTime instance.

reset()

Reset the TextLookup instance.

This will remove all attributes that are exposed during transform.

Returns:

self (object): The reset TextLookup instance.

property row_support_

Get the row_support_ attribute.

A 1D boolean vector of shape (n_rows_, ) that indicates which rows were kept in the data during transform. Only available if a transform has been performed, and only reflects the last dataset passed to transform().

Returns:

row_support_ (numpy.ndarray[bool] of shape (n_rows_, )): A boolean vector that indicates which rows were kept in the data during transform.

score(X, y=None)

No-op score method.

Needs to be here for dask_ml wrappers.

Parameters:

X (Any): The data. Ignored.
y (Any, default = None): The target for the data. Ignored.

Returns:

None

set_params(**params)

Set the parameters of an instance or a nested instance.

This method works on simple estimator and transformer instances as well as on nested objects (such as GridSearch instances).

Setting the parameters of simple estimators and transformers is straightforward. Pass the exact parameter name and its value as a keyword argument to the set_params method call. Or use ** dictionary unpacking on a dictionary keyed with exact parameter names and the new parameter values as the dictionary values. Valid parameter keys can be listed with get_params().

Setting the parameters of a GridSearch instance (but not the nested instance) can be done in the same way as above. The parameters of nested instances can be updated using prefixes on the parameter names.

Simple estimators in a GridSearch instance can be updated by prefixing the estimator’s parameters with estimator__. For example, if some estimator has a ‘depth’ parameter, then setting the value of that parameter to 3 would be accomplished by passing estimator__depth=3 as a keyword argument to the set_params method call.

The parameters of a pipeline nested in a GridSearch instance can be updated using the form estimator__<pipe_parameter>. The parameters of the steps of a pipeline have the form <step>__<parameter> so that it’s also possible to update a step’s parameters through the set_params method interface. The parameters of steps in the pipeline can be updated using estimator__<step>__<parameter>.

Parameters:

**params (dict[str, Any]): The parameters to be updated and their new values.

Returns:

self (object): The instance with new parameter values.

transform(X, copy=False)

Scan tokens in X and either autonomously handle tokens not in the pybear Lexicon or prompt for handling.

Parameters:

X (XContainer): The data in (possibly ragged) 2D array-like format.
copy (bool, default=False): Whether to make substitutions and deletions directly on the passed X or on a deepcopy of X.

Returns:

X_tr (list[list[str]]): The data with user-entered, auto-replaced, or deleted tokens in place of tokens not in the pybear Lexicon.