TextLookup#

class pybear.feature_extraction.text.TextLookup(*, update_lexicon=False, skip_numbers=True, auto_split=True, auto_add_to_lexicon=False, auto_delete=False, DELETE_ALWAYS=None, REPLACE_ALWAYS=None, SKIP_ALWAYS=None, SPLIT_ALWAYS=None, remove_empty_rows=False, verbose=False)#

Bases: _TextLookupMixin

Handle words in a 2D array-like body of text that are not in the pybear Lexicon.

Options include replacing, removing, splitting, or skipping the word, or staging it to add to the pybear Lexicon.

TextLookup (TL) has a dual-functionality partial_fit() method. TL can operate autonomously on your data for a completely hands-free experience, or can be driven in a fully interactive process. The interactive mode is a menu-driven process that prompts the user for a decision about a word that is not in the Lexicon and lazily stores edits to the data to be applied later during transform.

TL has a sister module called TextLookupRealTime which is not a typical scikit-style transformer. TL is more conventional in that the learning that takes place for both autonomous and manual modes happens in partial_fit() and fit(), information is stored in ‘holder’ attributes, and then that information is applied blindly to any data that is passed to transform(). TL does not mutate your data during fitting, so the changes to your data do not happen in ‘real time’. Because of this temporal dynamic, TL is not able to save changes to your data in-situ. If you log a lot of changes to your data in partial_fit or fit and then the program terminates for whatever reason, you lose all your work. If you want to make edits to the data in real time and be able to save your changes in-situ then use TextLookupRealTime.

To run TL in autonomous mode, set either auto_delete or auto_add_to_lexicon to True; both cannot simultaneously be True. auto_add_to_lexicon can only be True if the update_lexicon parameter is True.

When auto_add_to_lexicon is True, if TL encounters a word that is not in the Lexicon it will automatically stage the word in the LEXICON_ADDENDUM_ and go to the next word until all the words in the text are exhausted. When auto_delete is True, if TL encounters a word that is not in the Lexicon, it will silently add the word to the DELETE_ALWAYS_ attribute and go to the next word, until all the words in the text are exhausted. In these cases, TL can never proceed into manual mode. To allow TL to go into manual mode, both auto_delete and auto_add_to_lexicon must be False.
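The mode rules in the two paragraphs above can be summarized in a small sketch. This is only an illustration of the stated constraints; `resolve_mode` is a hypothetical helper and is not part of pybear.

```python
def resolve_mode(update_lexicon=False, auto_add_to_lexicon=False,
                 auto_delete=False):
    """Return 'auto_add', 'auto_delete', or 'manual' (hypothetical helper)."""
    # Both auto flags cannot be True at the same time.
    if auto_add_to_lexicon and auto_delete:
        raise ValueError(
            "auto_add_to_lexicon and auto_delete cannot both be True"
        )
    # auto_add_to_lexicon is only allowed when update_lexicon is True.
    if auto_add_to_lexicon and not update_lexicon:
        raise ValueError("auto_add_to_lexicon requires update_lexicon=True")
    if auto_add_to_lexicon:
        return "auto_add"
    if auto_delete:
        return "auto_delete"
    # Manual mode is only reachable when both auto flags are False.
    return "manual"
```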

In manual mode, when TL encounters a word that is not in the Lexicon, the user will be prompted with an interactive menu for an action. Choices that are always presented include: ‘skip always’, ‘delete always’, ‘replace always’, and ‘split always’. Conditionally, if update_lexicon is True, an ‘add to lexicon’ option is also presented. If you choose something from the ‘always’ group, the word goes into a ‘holder’ object for the selected action so that TL knows how to handle it during transform. TL does not have the ability to handle the same word in different ways at different times. Whatever instruction is selected for one occurrence of a word must be applied to all occurrences of the word because the storage mechanisms for the operation/word combinations do not track the exact locations of individual words.
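Because instructions are keyed by the word itself and not by its position, every occurrence of a word is handled the same way. A minimal sketch of that idea, using plain dicts and sets as stand-ins for the holder objects (`apply_instructions` is a hypothetical name, not the pybear implementation):

```python
def apply_instructions(X, delete_always, replace_always, split_always):
    """Apply word-keyed instructions to every occurrence of each word."""
    out = []
    for row in X:
        new_row = []
        for word in row:
            if word in delete_always:
                continue  # drop the word wherever it appears
            if word in replace_always:
                new_row.append(replace_always[word])
            elif word in split_always:
                new_row.extend(split_always[word])
            else:
                new_row.append(word)
        out.append(new_row)
    return out


X = [["A", "BAD", "ICECREAM", "TEH"], ["BAD", "TEH"]]
result = apply_instructions(
    X,
    delete_always={"BAD"},
    replace_always={"TEH": "THE"},
    split_always={"ICECREAM": ["ICE", "CREAM"]},
)
# result == [['A', 'ICE', 'CREAM', 'THE'], ['THE']]
```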

The holder objects are all accessible attributes in the TL public API. See the Attributes section for more details. These holder objects can also be passed at instantiation to give TL a head start on words that aren’t in the Lexicon, which helps make a manual session more automated. Let’s say, for example, that you know that your text is full of some proper names that aren’t in the Lexicon, and you don’t want to add them permanently, and you don’t want to have to always tell TL what to do with these words when they come up. You decide that you want to leave them in the text body and have TL ignore them. At instantiation pass a list of these strings to the SKIP_ALWAYS parameter. So you might pass [‘ZEKE’, ‘YVONNE’, ‘XAVIER’,…] to SKIP_ALWAYS. TL will always skip these words without asking. The passed SKIP_ALWAYS becomes the starting seed of the SKIP_ALWAYS_ attribute. Any other manual inputs during the session that say to always skip certain other words will be added to this list, so that at the end of the session the SKIP_ALWAYS_ attribute will contain your originally passed words and the words added during the session.

TL always looks for special instructions before looking to see if a word is in the Lexicon. If TL checked the Lexicon first, it would move on to the next word as soon as it found the word there, and your special instruction would never be applied. Checking the instructions first allows users to give special instructions for words already in the Lexicon. Let’s say there is a word in the Lexicon but you want to delete it from your text. You could pass it to DELETE_ALWAYS and TL will remove it regardless of what the Lexicon says.

The update_lexicon parameter does not cause TL to directly update the Lexicon. If the user opts to stage a word for addition to the Lexicon, the word is added to the LEXICON_ADDENDUM_ attribute. This is a deliberate design choice to stage the words rather than silently modify the Lexicon. This gives the user a layer of protection where they can review the words staged to go into the Lexicon, make any changes needed, then manually pass them to the Lexicon add_words method.

TL requires (possibly ragged) 2D data formats. Accepted objects include python built-in lists and tuples, numpy arrays, pandas dataframes, and polars dataframes. Use pybear TextSplitter to convert 1D text to 2D tokens. Results are always returned as a 2D python list of lists of strings.
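As a rough illustration of the expected shape, the sketch below turns 1D lines of text into a ragged 2D list of tokens. It uses plain `str.split` as a stand-in for pybear TextSplitter, whose actual behavior and options may differ.

```python
# 1D text: one string per row.
lines = ["FOUR SCORE AND SEVEN", "YEARS AGO", ""]

# Ragged 2D tokens: one list of words per row; rows may differ in length.
X = [line.split() for line in lines]
# X == [['FOUR', 'SCORE', 'AND', 'SEVEN'], ['YEARS', 'AGO'], []]
```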

Your data should be in a highly processed state before using TL. This should be one of the last steps in a text wrangling workflow because the content of your text will be compared directly against the words in the Lexicon, and all the words in the pybear Lexicon have no non-alpha characters and are all majuscule. All junk characters should be removed and clear separators established. A pybear text wrangling workflow might look like: TextStripper > TextReplacer > TextSplitter > TextNormalizer > TextRemover > TextLookup > StopRemover > TextJoiner > TextJustifier

Every operation in TL is case-sensitive. Remember that the formal pybear Lexicon is majuscule; use pybear TextNormalizer to make all your text majuscule before using TL. Otherwise, TL will always flag every valid word that is not majuscule because it doesn’t exactly match the Lexicon. If you alter your local copy of the pybear Lexicon with your own words of varying capitalization, TL honors your capitalization scheme.
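The effect of case-sensitive matching can be seen with a toy example. The three-word set below is a made-up stand-in for the Lexicon, and `str.upper` stands in for TextNormalizer.

```python
toy_lexicon = {"APPLE", "BANANA", "CHERRY"}  # hypothetical stand-in

row = ["Apple", "BANANA", "cherry"]

# Exact, case-sensitive comparison flags anything that is not majuscule.
flagged = [w for w in row if w not in toy_lexicon]
# flagged == ['Apple', 'cherry']

# After normalizing to upper case, nothing is flagged.
normalized = [w.upper() for w in row]
flagged_after = [w for w in normalized if w not in toy_lexicon]
# flagged_after == []
```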

When you are in the manual text lookup process and are entering words at the prompts to replace unknown words in your text, whatever is entered is inserted into your text exactly as entered by you. You must enter the text exactly as you want it in the cleaned output. If normalizing the text is important to you, you must enter the text in the case that you want in the output; TL will not do it for you.

If TL encounters a word during transform that was not seen during fitting and it is not in the Lexicon, the way that it is handled depends on the setting of the auto_delete parameter. If auto_delete is True, the word is deleted from the text body. If False, the word is skipped. In both cases, the word is added to an OOV_ (out of vocabulary) dictionary. OOV_ is only available after data has been passed to transform. The keys are the out-of-vocabulary words and the values are the frequency of each unseen word.
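The out-of-vocabulary handling described above can be sketched as follows. This is an assumption-laden illustration, not the pybear implementation: `apply_with_oov` is a hypothetical helper, and `handled` stands for every word that is either in the Lexicon or was given an instruction during fitting (those words are simply kept here to keep the sketch short).

```python
from collections import Counter


def apply_with_oov(X, handled, auto_delete):
    """Return (transformed rows, OOV frequency dict) for unseen words."""
    oov = Counter()
    out = []
    for row in X:
        new_row = []
        for word in row:
            if word in handled:
                new_row.append(word)
                continue
            oov[word] += 1           # always recorded, in both cases
            if not auto_delete:
                new_row.append(word)  # skipped: left in the text
            # if auto_delete is True the word is dropped
        out.append(new_row)
    return out, dict(oov)


X = [["A", "ZZZ", "B"], ["ZZZ", "QQ"]]
deleted, oov_deleted = apply_with_oov(X, {"A", "B"}, auto_delete=True)
skipped, oov_skipped = apply_with_oov(X, {"A", "B"}, auto_delete=False)
# deleted == [['A', 'B'], []]
# skipped == [['A', 'ZZZ', 'B'], ['ZZZ', 'QQ']]
# oov_deleted == oov_skipped == {'ZZZ': 2, 'QQ': 1}
```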

TL is a full-fledged scikit-style transformer. It has fully functional get_params, set_params, partial_fit, fit, transform, and fit_transform methods. It also has a no-op score method that allows TL to be wrapped by dask_ml wrappers, on the off-chance that you actually have text data in dask format.

TL has an n_rows_ attribute which is only available after data has been passed to partial_fit() or fit(). It is the total number of rows of text seen in the original data and is not necessarily the total number of rows in the outputted data. TL also has a row_support_ attribute that is a boolean vector that indicates which rows of the most-recently transformed data were kept during the transform process (True) and which were deleted (False). The only way that an entry could become False is if the remove_empty_rows parameter is True and a row becomes empty when handling unknown words. row_support_ is only available after something has been passed to transform, and only reflects the last dataset passed to transform.
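The relationship between remove_empty_rows and row_support_ can be sketched like this. The helper name `drop_empty_rows` is hypothetical, and a plain list of booleans is used here in place of the numpy boolean vector that TL exposes.

```python
def drop_empty_rows(X):
    """Return (rows that are not empty, boolean mask over the input rows)."""
    row_support = [len(row) > 0 for row in X]
    kept = [row for row, keep in zip(X, row_support) if keep]
    return kept, row_support


# Rows as they might look after unknown words were deleted.
X_after_lookup = [["FOUR", "SEVEN"], [], ["AND", "THE"]]
kept, row_support = drop_empty_rows(X_after_lookup)
# kept == [['FOUR', 'SEVEN'], ['AND', 'THE']]
# row_support == [True, False, True]
```

The mask has one entry per input row, so it can be used to drop the matching rows from any companion data, such as labels.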

Parameters:
update_lexicon : bool, default = False

Whether to queue words that are not in the pybear Lexicon for later addition to the Lexicon. This applies to both autonomous and interactive modes. If False, TL will never put a word in LEXICON_ADDENDUM_ and will never prompt you with the option.

skip_numbers : bool, default = True

When True, TL will try to do Python float(word) on the word and, if it can be cast to a float, TL will skip it and go to the next word. If False, TL will handle it like any other word. There are no numbers in the formal pybear Lexicon so TL will always flag them and handle them autonomously or prompt the user for an action. Since they are handled like any other word, it would be possible to stage them for addition to your local copy of the Lexicon.

auto_split : bool, default = True

TL will first look if the word is in any of the holder objects for special instructions, then look to see if the word is in the Lexicon. If not, the next step otherwise would be auto-add to Lexicon, auto-delete, or go into manual mode. This functionality is a last-ditch effort to see if a word is an erroneous compounding of 2 words that are in the Lexicon. If auto_split is True, TL will iteratively split any word of 4 or more characters from after the second character to before the second to last character and see if both halves are in the Lexicon. When/if the first match is found, TL will remove the original word, split it, and insert in the original place the 2 halves that were found to be in the Lexicon. If False, TL will skip this process and go straight to auto-add, auto-delete, or manual mode.

auto_add_to_lexicon : bool, default = False

update_lexicon must be True to use this. Cannot be True if auto_delete is True. When this parameter is True, TL operates in ‘auto-mode’, where the user will not be prompted for decisions. When TL encounters a word that is not in the Lexicon, the word will silently be staged in the LEXICON_ADDENDUM_ attribute to be added to the Lexicon later.

auto_delete : bool, default = False

If update_lexicon is True then this cannot be set to True. When this parameter is True, TL operates in ‘auto-mode’, where the user will not be prompted for decisions. When TL encounters a word that is not in the Lexicon, the word will be silently deleted from the text body.

DELETE_ALWAYS : Sequence[MatchType] | None, default = None

A list of words and/or full-word regex patterns that will always be deleted by TL, even if they are in the Lexicon. For both auto and manual modes, when a word in the text body is a case-sensitive match against a string literal in this list, or is a full-word match against a regex pattern in this list, TL will not prompt the user for any more information; it will silently delete the word. What is passed here becomes the seed for the DELETE_ALWAYS_ attribute, which may have more words added to it during run-time in auto and manual modes.

REPLACE_ALWAYS : dict[MatchType, str] | None, default = None

A dictionary with words and/or full-word regex patterns as keys and their respective single-word replacement strings as values. For both auto and manual modes, when a word in the text body is a case-sensitive match against a string literal key, or is a full-word match against a regex pattern key, TL will not prompt the user for any more information; it will silently make the replacement. TL will replace these words even if they are in the Lexicon. What is passed here becomes the seed for the REPLACE_ALWAYS_ attribute, which may have more word/replacement pairs added to it during run-time in manual mode. Auto-mode will never add more entries to this dictionary.

SKIP_ALWAYS : Sequence[MatchType] | None, default = None

A list of words and/or full-word regex patterns that will always be ignored by TL, even if they are not in the Lexicon. For both auto and manual modes, when a word in the text body is a case-sensitive match against a string literal in this list, or is a full-word match against a regex pattern in this list, TL will not prompt the user for any more information; it will silently skip the word. What is passed here becomes the seed for the SKIP_ALWAYS_ attribute, which may have more words added to it during run-time in manual mode. Auto-mode will only add entries to this list if skip_numbers is True and TL finds a number during partial_fit / fit.

SPLIT_ALWAYS : dict[MatchType, Sequence[str]] | None, default = None

A dictionary with words and/or full-word regex patterns as keys and their respective multi-word lists of replacement strings as values. TL will remove the original word and insert these words into the text body starting in its position even if the original word is in the Lexicon. For both auto and manual modes, TL will not prompt the user for any more information; it will silently split the word. What is passed here becomes the seed for the SPLIT_ALWAYS_ attribute, which may have more word/replacement pairs added to it during run-time in manual mode. Auto-mode will only add entries to this dictionary if auto_split is True and TL finds a valid split for an unknown word.

remove_empty_rows : bool, default = False

Whether to remove any rows that may have been made empty during the lookup process. If remove_empty_rows is True and rows are deleted, the user can find supplemental information in row_support_, which indicates through booleans which rows were kept (True) and which rows were removed (False).

verbose : bool, default = False

Whether to display helpful information during the transform process. This applies to both auto and manual modes.
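The number check described under skip_numbers above amounts to attempting a Python float cast. A minimal sketch, where `is_number` is a hypothetical helper name:

```python
def is_number(word):
    """Return True if Python's float() accepts the string."""
    try:
        float(word)
    except ValueError:
        return False
    return True


checks = {w: is_number(w) for w in ["2", "3.14", "1E5", "FOUR", "NAN"]}
# checks == {'2': True, '3.14': True, '1E5': True, 'FOUR': False, 'NAN': True}
```

Note that Python's float() also accepts strings such as 'NAN', 'INF', and '1E5'. Whether TL treats those specially is not stated in this documentation, so check such tokens in your own data if they matter.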

Attributes:
n_rows_

Get the n_rows_ attribute.

row_support_

Get the row_support_ attribute.

DELETE_ALWAYS_

Return the DELETE_ALWAYS_ attribute.

KNOWN_WORDS_

A WIP object used by TL(RT) to determine “what is in the Lexicon.”

LEXICON_ADDENDUM_

Words queued for entry into the pybear Lexicon.

REPLACE_ALWAYS_

Return the REPLACE_ALWAYS_ attribute.

SKIP_ALWAYS_

Return the SKIP_ALWAYS_ attribute.

SPLIT_ALWAYS_

Return the SPLIT_ALWAYS_ attribute.

OOV_

Get the OOV_ attribute.

Methods

dump_to_csv(X, filename)

Dump X to csv.

dump_to_txt(X, filename)

Dump X to txt.

fit(X[, y])

One-shot fit method.

fit_transform(X[, y])

Fit to data, then transform it.

get_lexicon()

Create a singleton Lexicon instance as class attribute.

get_metadata_routing()

get_metadata_routing is not implemented in TextLookup.

get_params([deep])

Get parameters for this instance.

partial_fit(X[, y])

Batch-wise fit method.

reset()

Reset the TextLookup instance.

score(X[, y])

No-op score method.

set_params(**params)

Set the parameters of an instance or a nested instance.

transform(X[, copy])

Apply the handling learned in partial_fit / fit to X.

Notes

When passing regex patterns to DELETE_ALWAYS, REPLACE_ALWAYS, SKIP_ALWAYS, and SPLIT_ALWAYS, the regex patterns must be designed to match full words in the text body and must be passed as re.compile objects. Do not pass regex patterns as string literals; you will not get the correct result. String literals must also be designed to match full words in the text body. You do not need to escape string literals. If the same literal is passed to multiple ‘ALWAYS’ parameters, TL will detect this conflict and raise an error. If a word in the text body causes a conflict between a literal and a re.compile object or between two re.compile objects within the same ‘ALWAYS’ parameter, TL will raise an error. However, TL cannot detect conflicts between re.compile objects across multiple ‘ALWAYS’ parameters, where a word in a text body could possibly be indicated for two different operations, such as SKIP and DELETE. TL will not resolve the conflict but will simply perform whichever operation is matched first. The order of match searching within TL is SKIP_ALWAYS, DELETE_ALWAYS, REPLACE_ALWAYS, and finally SPLIT_ALWAYS. It is up to the user to avoid these conflict conditions with careful regex pattern design.
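The matching rules and search order in this note can be sketched with the standard library. This is an illustration only: `matches` and `first_action` are hypothetical helpers, and the use of `re.Pattern.fullmatch` for "full-word" matching is an assumption about how such a check could be written, not a statement of how TL implements it.

```python
import re


def matches(entry, word):
    """Literal entries need exact equality; patterns need a full match."""
    if isinstance(entry, re.Pattern):
        return entry.fullmatch(word) is not None
    return entry == word


def first_action(word, skip, delete, replace, split):
    """Return the first operation that matches, in the documented order."""
    holders = (("skip", skip), ("delete", delete),
               ("replace", replace), ("split", split))
    for name, holder in holders:
        # Iterating a dict yields its keys, so lists and dicts both work.
        if any(matches(entry, word) for entry in holder):
            return name
    return None


skip = [re.compile(r"[A-Z]+\d+")]   # e.g. part numbers
delete = [re.compile(r"X\d+")]      # overlaps with the skip pattern
replace = {"TEH": "THE"}
split = {}

overlap = first_action("X12", skip, delete, replace, split)
literal = first_action("TEH", skip, delete, replace, split)
partial = first_action("X12Y", skip, delete, replace, split)
# overlap == 'skip'     (both patterns match; SKIP is searched first)
# literal == 'replace'
# partial is None       (no pattern matches the full word)
```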

Type Aliases

MatchType:

str | re.Pattern[str]

PythonTypes:

Sequence[Sequence[str]]

NumpyTypes:

numpy.ndarray[str]

PandasTypes:

pandas.DataFrame

PolarsTypes:

polars.DataFrame

XContainer:

PythonTypes | NumpyTypes | PandasTypes | PolarsTypes

WipXContainer:

list[list[str]]

RowSupportType:

numpy.ndarray[bool]

Examples

>>> from pybear.feature_extraction.text import TextLookup as TL
>>> trfm = TL(skip_numbers=False, auto_delete=True, auto_split=False)
>>> X = [
...    ['FOUR', 'SKORE', '@ND', 'SEVEN', 'YEARS', 'ABO'],
...    ['OUR', 'FATHERS', 'BROUGHT', 'FOTH', 'UPON', 'THIZ', 'CONTINENT'],
...    ['A', 'NEW', 'NETION', 'CONCEIVED', 'IN', 'LOBERTY'],
...    ['AND', 'DEDICORDED', '2', 'THE', 'PRAPISATION'],
...    ['THAT', 'ALL', 'MEESES', 'ARE', 'CREATED', 'EQUEL']
... ]
>>> out = trfm.fit_transform(X)
>>> for i in out:
...     print(i)
['FOUR', 'SEVEN', 'YEARS']
['OUR', 'FATHERS', 'BROUGHT', 'UPON', 'CONTINENT']
['A', 'NEW', 'CONCEIVED', 'IN']
['AND', 'THE']
['THAT', 'ALL', 'ARE', 'CREATED']
property DELETE_ALWAYS_#

Return the DELETE_ALWAYS_ attribute.

A list of words and/or full-word regex patterns that will always be deleted from the text body by TL(RT), even if they are in the Lexicon. This list contains any words and re.compile objects passed to DELETE_ALWAYS at instantiation and any words added in-situ.

Returns:
DELETE_ALWAYS_ : list[str | re.Pattern[str]]

A list of words and/or full-word regex patterns in re.compile objects that will always be deleted from the text body by TL(RT), even if they are in the Lexicon.

property KNOWN_WORDS_#

A WIP object used by TL(RT) to determine “what is in the Lexicon.”

At instantiation, this is just a copy of the lexicon_ attribute of the pybear Lexicon class. If update_lexicon is True, any words to be added to the Lexicon are inserted at the front of this list (in addition to being put in LEXICON_ADDENDUM_). If auto_add_to_lexicon is True, then words are inserted at the front of this list silently during the auto-lookup process. If auto_add_to_lexicon is False, words are inserted into this list if the user selects ‘add to lexicon’.

Returns:
KNOWN_WORDS_ : list[str]

A WIP object used by TL(RT) to determine “what is in the Lexicon.”

property LEXICON_ADDENDUM_#

Words queued for entry into the pybear Lexicon.

Can only have words in it if update_lexicon is True. If in auto mode (auto_add_to_lexicon is True), anything encountered in the text that is not in the Lexicon is added to this list. In manual mode, if the user selects to ‘add to lexicon’ then the word is put in this list. TL(RT) does not automatically add new words to the actual Lexicon directly. TL(RT) stages new words in LEXICON_ADDENDUM_ and at the end of a session prints them to the screen and makes them available in this attribute.

Returns:
LEXICON_ADDENDUM_ : list[str]

Words queued for entry into the pybear Lexicon.

property OOV_#

Get the OOV_ attribute.

Access “Out-of-vocabulary” words that were found during transform but were not seen during fitting. If data that was not seen during partial_fit / fit is passed to transform(), there is the possibility that there are strings that were not previously seen. In this case, TL will not do any more learning and will not prompt for anything from the user. If auto_delete is True, TL will delete this new word; if False, the word is skipped. In both cases, TL will always add all unseen strings as keys in this dictionary. The values are the frequency of each respective string.

Returns:
OOV_ : dict[str, int]

Out-of-vocabulary words found during transform.

property REPLACE_ALWAYS_#

Return the REPLACE_ALWAYS_ attribute.

A dictionary with words and/or full-word regex patterns as keys and their respective single-word replacement strings as values.

TL(RT) will replace these words even if they are in the Lexicon. This holds any words and re.compile objects passed to REPLACE_ALWAYS at instantiation and anything added to it during run-time in manual mode. In manual mode, when the user selects ‘replace always’, the next time TL(RT) sees the word it will not prompt the user for any more information; it will silently replace the word. When in auto mode, TL(RT) will not add any entries to this dictionary.

Returns:
REPLACE_ALWAYS_ : dict[str | re.Pattern[str], str]

A dictionary with words and/or full-word regex patterns in re.compile objects as keys and their respective single-word replacements as values.

property SKIP_ALWAYS_#

Return the SKIP_ALWAYS_ attribute.

A list of words and/or full-word regex patterns that are always ignored by TL(RT), even if they are not in the Lexicon.

This list holds any words and re.compile objects passed to the SKIP_ALWAYS parameter at instantiation and any words added to it when the user selects ‘skip always’ in manual mode. In manual mode, the next time TL(RT) sees a word that is in this list it will not prompt the user again; it will silently skip the word.

TL will only make additions to this list in auto mode if skip_numbers is True and a number is found in the training data. TLRT does not make additions to this list in auto mode.

Returns:
SKIP_ALWAYS_ : list[str | re.Pattern[str]]

A list of words and/or full-word regex patterns in re.compile objects that are always ignored by TL(RT), even if they are not in the Lexicon.

property SPLIT_ALWAYS_#

Return the SPLIT_ALWAYS_ attribute.

A dictionary with words and/or full-word regex patterns as keys and their respective multi-word lists of replacements as values.

Similar to REPLACE_ALWAYS_. TL(RT) will sub these words in even if the original word is in the Lexicon. This dictionary holds anything passed to SPLIT_ALWAYS at instantiation and any splits made when ‘split always’ is selected in manual mode. In manual mode, the next time TL(RT) sees the same word in the text body it will not prompt the user again, just silently make the split.

The only way TL will add anything to this dictionary in auto mode is if auto_split is True and TL finds a valid split of an unknown word during partial_fit / fit. TLRT does not add anything to this dictionary in auto mode.

Returns:
SPLIT_ALWAYS_ : dict[str | re.Pattern[str], Sequence[str]]

A dictionary with words and/or full-word regex patterns in re.compile objects as keys and their respective multi-word lists of replacements as values.
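The auto_split search that can populate this dictionary in auto mode is described under the auto_split parameter: try every split point from after the second character to before the second-to-last character, and accept the first split whose halves are both in the Lexicon. A minimal sketch of that description, with `find_split` as a hypothetical helper and a small set standing in for the Lexicon:

```python
def find_split(word, lexicon):
    """Return [left, right] for the first valid split, else None."""
    if len(word) < 4:
        return None  # only words of 4 or more characters are tried
    # Split points 2 .. len(word) - 2, so each half has at least 2 characters.
    for i in range(2, len(word) - 1):
        left, right = word[:i], word[i:]
        if left in lexicon and right in lexicon:
            return [left, right]
    return None


toy_lexicon = {"ICE", "CREAM", "AB", "CD"}  # hypothetical stand-in
found = find_split("ICECREAM", toy_lexicon)
shortest = find_split("ABCD", toy_lexicon)
too_short = find_split("ICE", toy_lexicon)
no_match = find_split("ICECREAMS", toy_lexicon)
# found == ['ICE', 'CREAM']
# shortest == ['AB', 'CD']
# too_short is None and no_match is None
```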

dump_to_csv(X, filename)#

Dump X to csv.

Parameters:
X : list[str]

The data.

filename : str

The name for the saved csv file.

Returns:
None
dump_to_txt(X, filename)#

Dump X to txt.

Parameters:
X : list[str]

The data.

filename : str

The name for the saved txt file.

Returns:
None
fit(X, y=None)#

One-shot fit method.

TextLookup attributes are reset with each call.

Parameters:
X : XContainer

The (possibly ragged) 2D container of text to have its contents cross-referenced against the pybear Lexicon.

y : Any, default=None

The target for the data. Always ignored.

Returns:
self : object

The TextLookup instance.

fit_transform(X, y=None, **fit_params)#

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : array_like of shape (n_samples, n_features)

Required. The data.

y : array_like of shape (n_samples, n_outputs) or (n_samples,)

Optional, default=None. Target values (None for unsupervised transformations).

**fit_params : dict

Additional fit parameters.

Returns:
X_tr : array_like of shape (n_samples, n_features_new)

Transformed array.

classmethod get_lexicon()#

Create a singleton Lexicon instance as class attribute.

get_metadata_routing()#

get_metadata_routing is not implemented in TextLookup.

get_params(deep=True)#

Get parameters for this instance.

The ‘instance’ may be a pybear estimator, transformer, or a gridsearch module that wraps a nested estimator or pipeline.

Parameters:
deep : bool, default = True

For instances that do not have nested instances in an estimator attribute (such as estimators or transformers), this parameter is ignored and the same (full) set of parameters for the instance is returned regardless of the value of this parameter.

For instances that have nested instances (such as a GridSearch instance with a nested estimator or pipeline) in the estimator attribute, deep=False will only return the parameters for the wrapping instance. For example, a GridSearch module wrapping an estimator will only return parameters for the GridSearch instance, ignoring the parameters of the nested instance. When deep=True, this method returns the parameters of the wrapping instance as well as the parameters of the nested instance. When the nested instance is a single estimator, the full set of parameters for the single estimator are returned in addition to the parameters of the wrapping instance. If the nested object is a pipeline, the parameters of the pipeline and the parameters of each of the steps in the pipeline are returned in addition to the parameters of the wrapping instance. The estimator’s parameters are prefixed with estimator__.

Returns:
params : dict[str, Any]

Parameter names mapped to their values.

property n_rows_#

Get the n_rows_ attribute.

The cumulative number of rows of text passed to partial_fit(). Not necessarily the number of rows in the outputted data.

Returns:
n_rows_ : int

The cumulative number of rows of text passed to partial_fit.

partial_fit(X, y=None)#

Batch-wise fit method.

Scan tokens in X and either log how to autonomously handle tokens not in the pybear Lexicon or prompt for how to handle.

Parameters:
X : XContainer

The (possibly ragged) 2D container of text to have its contents cross-referenced against the pybear Lexicon.

y : Any, default=None

The target for the data. Always ignored.

Returns:
self : object

The TextLookup instance.

reset()#

Reset the TextLookup instance.

This will remove all attributes that are exposed during transform.

Returns:
self : object

The reset TextLookup instance.

property row_support_#

Get the row_support_ attribute.

A 1D boolean vector of shape (n_rows_, ) that indicates which rows were kept in the data during transform. Only available if a transform has been performed, and only reflects the last dataset passed to transform().

Returns:
row_support_ : numpy.ndarray[bool] of shape (n_rows_, )

A boolean vector that indicates which rows were kept in the data during transform.

score(X, y=None)#

No-op score method.

Needs to be here for dask_ml wrappers.

Parameters:
X : Any

The data. Ignored.

y : Any, default = None

The target for the data. Ignored.

Returns:
None
set_params(**params)#

Set the parameters of an instance or a nested instance.

This method works on simple estimator and transformer instances as well as on nested objects (such as GridSearch instances).

Setting the parameters of simple estimators and transformers is straightforward. Pass the exact parameter name and its value as a keyword argument to the set_params method call. Or use ** dictionary unpacking on a dictionary keyed with exact parameter names and the new parameter values as the dictionary values. Valid parameter keys can be listed with get_params().

Setting the parameters of a GridSearch instance (but not the nested instance) can be done in the same way as above. The parameters of nested instances can be updated using prefixes on the parameter names.

Simple estimators in a GridSearch instance can be updated by prefixing the estimator’s parameters with estimator__. For example, if some estimator has a ‘depth’ parameter, then setting the value of that parameter to 3 would be accomplished by passing estimator__depth=3 as a keyword argument to the set_params method call.

The parameters of a pipeline nested in a GridSearch instance can be updated using the form estimator__<pipe_parameter>. The parameters of the steps of a pipeline have the form <step>__<parameter> so that it’s also possible to update a step’s parameters through the set_params method interface. The parameters of steps in the pipeline can be updated using estimator__<step>__<parameter>.

Parameters:
**params : dict[str, Any]

The parameters to be updated and their new values.

Returns:
self : object

The instance with new parameter values.

transform(X, copy=False)#

Apply the handling learned in partial_fit / fit to X.

Parameters:
X : XContainer

The data in (possibly ragged) 2D array-like format.

copy : bool, default=False

Whether to make substitutions and deletions directly on the passed X or a deepcopy of X.

Returns:
X_tr : list[list[str]]

The data with instructions from fitting applied.