TextPadder

class pybear.feature_extraction.text.TextPadder(*, fill='', n_features=None)

Bases: FitTransformMixin, GetParamsMixin, ReprMixin, SetOutputMixin, SetParamsMixin

Map ragged text data to a shaped array, using a fill value to fill out any ragged area.

Why not just use itertools.zip_longest? TextPadder has two benefits not available with zip_longest.

First, TextPadder (TP) can be fit on multiple batches of data and keeps track of which example had the most strings. TP uses that count as the minimum feature axis length for the output of transform. By default the output has exactly that length; the user can override it only with a longer one.

Second, TP can pad beyond the maximum number of features seen in the training data through n_features, whereas zip_longest will always return the tightest shape possible for the data passed.
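For comparison, the following is a minimal sketch of padding ragged rows with itertools.zip_longest alone. It uses only the standard library; the variable names are illustrative. Note that the output width is fixed by the data passed in, with no way to ask for a wider array or to remember earlier batches.

```python
from itertools import zip_longest

X = [
    ["Seven", "ate", "nine."],
    ["You", "eight", "one", "two."],
]

# zip_longest pads column-wise, so transpose once more to get padded rows.
columns = zip_longest(*X, fillvalue="-")
padded = [list(row) for row in zip(*columns)]

# The width is always the length of the longest row in this one batch.
width = len(padded[0])
print(padded)
print(width)
```

Here the result is four columns wide because the longest row has four strings; getting a fifth column would require padding by hand.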

TP is a scikit-style transformer and has the following methods: get_params, set_params, set_output, partial_fit, fit, transform, fit_transform, and score.

TP’s methods require that data be passed as (possibly ragged) 2D array-like containers of string data. Accepted containers include Python sequences of sequences, numpy arrays, pandas dataframes, and polars dataframes. You may not need to use this transformer if your data already fits comfortably in shaped containers like dataframes! If you pass dataframes with feature names, the original feature names are not preserved.

The partial_fit() and fit() methods find the length of the example with the most strings in it and keep that number. This is the minimum length that can be set for the feature axis of the output at transform time. The partial_fit method can fit data batch-wise and does not reset TP when called, meaning that TP can remember the longest example it has seen across many batches of data. The fit method resets the TP instance, causing it to forget any previously seen data, and records the maximum length anew on every call.
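The difference between the two fitting methods can be sketched with a small stand-in class. This is not pybear code: the class name LongestExampleTracker and the attribute longest_ are hypothetical, chosen only to illustrate the reset-versus-accumulate behavior described above.

```python
class LongestExampleTracker:
    """Hypothetical stand-in for TP's fitting state; not the pybear implementation."""

    def __init__(self):
        self.longest_ = 0

    def partial_fit(self, X):
        # Keep a running maximum, so earlier batches are remembered.
        self.longest_ = max([self.longest_, *(len(row) for row in X)])
        return self

    def fit(self, X):
        # Forget everything seen so far, then fit on this data only.
        self.longest_ = 0
        return self.partial_fit(X)


tracker = LongestExampleTracker()
tracker.partial_fit([["a", "b", "c", "d"]])
tracker.partial_fit([["e", "f"]])
after_partial = tracker.longest_  # still 4: the first batch is remembered

tracker.fit([["e", "f"]])
after_fit = tracker.longest_  # 2: fit discarded the earlier batches
```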

During transform, TP always forces the n_features value to be at least the maximum number of strings seen in a single example during fitting. That maximum is the tightest possible wrap on the data without truncating; it is what zip_longest would produce, and what TP produces when n_features is left at its default value of None. If every example passed to transform() is shorter than n_features, then all examples are padded with the fill value to the n_features dimension. If the data to be transformed has an example that is longer than any example seen during fitting (which means that TP was not fitted on this example) and is also longer than the n_features value, then an error is raised.
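The width rule above can be sketched in plain Python. The function pad_rows and its longest_seen argument are hypothetical names used for illustration, and the exception type is an assumption, since this page does not say which error TP raises.

```python
def pad_rows(X, longest_seen, n_features=None, fill=""):
    """Illustrative sketch of the padding rule; not the pybear implementation."""
    # Output width is the greater of n_features and the longest fitted example.
    if n_features is None:
        width = longest_seen
    else:
        width = max(n_features, longest_seen)

    padded = []
    for row in X:
        if len(row) > width:
            # Assumption: ValueError stands in for whatever error TP raises.
            raise ValueError(
                f"example has {len(row)} strings but the output width is {width}"
            )
        padded.append(list(row) + [fill] * (width - len(row)))
    return padded


tight = pad_rows([["a"], ["b", "c"]], longest_seen=3)
wide = pad_rows([["a"], ["b", "c"]], longest_seen=3, n_features=5, fill="-")
too_small = pad_rows([["a"]], longest_seen=3, n_features=2)
```

In the last call, n_features=2 is below the fitted maximum of 3, so the output is still three columns wide.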

By default, transform returns output as a Python list of Python lists of strings. The set_output() method gives some control over the output container for the shaped array. Its transform parameter accepts None, which returns the default Python list; ‘default’, which returns a numpy array; ‘pandas’, which returns a pandas dataframe; and ‘polars’, which returns a polars dataframe.

Other methods, such as fit_transform(), set_params(), and get_params(), behave as expected for scikit-style transformers.

The score() method is a no-op that allows TP to be wrapped by dask_ml ParallelPostFit and Incremental wrappers.

Parameters:
fill : str, default = “”

The character string to pad text sequences with.

n_features : int | None, default = None

The number of features to create when padding the data, i.e., the length of the feature axis. When None, TP pads all examples to match the number of strings in the example with the most strings. If the user enters a number that is less than the number of strings in the longest example, TP increases it to that value. The length of the feature axis of the outputted array is always the greater of this parameter and the number of strings in the example with the most strings.

Attributes:
n_features_

Get the n_features_ attribute.

Methods

fit(X[, y])

One-shot fitting operation.

fit_transform(X[, y])

Fit to data, then transform it.

get_metadata_routing()

Metadata routing is not implemented in TextPadder.

get_params([deep])

Get parameters for this instance.

partial_fit(X[, y])

Batch-wise fitting operation.

score(X[, y])

No-op score method.

set_output([transform])

Set the output container when the transform and fit_transform methods of the transformer are called.

set_params(**params)

Set the parameters of an instance or a nested instance.

transform(X[, copy])

Map ragged text data to a shaped array.

See also

itertools.zip_longest

Notes

Type Aliases

PythonTypes:

Sequence[Sequence[str]]

NumpyTypes:

numpy.ndarray[str]

PandasTypes:

pandas.DataFrame

PolarsTypes:

polars.DataFrame

XContainer:

PythonTypes | NumpyTypes | PandasTypes | PolarsTypes

XWipContainer:

list[list[str]]

Examples

>>> from pybear.feature_extraction.text import TextPadder as TP
>>> Trfm = TP(fill='-', n_features=5)
>>> Trfm.set_output(transform='default')
TextPadder(fill='-', n_features=5)
>>> X = [
...     ['Seven', 'ate', 'nine.'],
...     ['You', 'eight', 'one', 'two.']
... ]
>>> Trfm.fit(X)
TextPadder(fill='-', n_features=5)
>>> Trfm.transform(X)
array([['Seven', 'ate', 'nine.', '-', '-'],
       ['You', 'eight', 'one', 'two.', '-']], dtype='<U5')

fit(X, y=None)

One-shot fitting operation.

Find the largest number of strings in any single example of the passed data.

Parameters:
X : XContainer, (possibly ragged) shape (n_samples, n_features)

The data.

y : Any, default = None

The target for the data. Always ignored.

Returns:
self : object

The TextPadder instance.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : array_like of shape (n_samples, n_features)

Required. The data.

y : array_like of shape (n_samples, n_outputs) or (n_samples,)

Optional, default=None. Target values (None for unsupervised transformations).

**fit_params : dict

Additional fit parameters.

Returns:
X_tr : array_like of shape (n_samples, n_features_new)

Transformed array.

get_metadata_routing()

Metadata routing is not implemented in TextPadder.

get_params(deep=True)

Get parameters for this instance.

The ‘instance’ may be a pybear estimator, transformer, or a gridsearch module that wraps a nested estimator or pipeline.

Parameters:
deep : bool, default = True

For instances that do not have nested instances in an estimator attribute (such as estimators or transformers), this parameter is ignored and the same (full) set of parameters for the instance is returned regardless of the value of this parameter.

For instances that have nested instances (such as a GridSearch instance with a nested estimator or pipeline) in the estimator attribute, deep=False will only return the parameters for the wrapping instance. For example, a GridSearch module wrapping an estimator will only return parameters for the GridSearch instance, ignoring the parameters of the nested instance. When deep=True, this method returns the parameters of the wrapping instance as well as the parameters of the nested instance. When the nested instance is a single estimator, the full set of parameters for the single estimator is returned in addition to the parameters of the wrapping instance. If the nested object is a pipeline, the parameters of the pipeline and the parameters of each of the steps in the pipeline are returned in addition to the parameters of the wrapping instance. The estimator’s parameters are prefixed with estimator__.

Returns:
params : dict[str, Any]

Parameter names mapped to their values.

property n_features_

Get the n_features_ attribute.

The number of features to pad the data to during transform; the number of features in the outputted array. This number is the greater of n_features and the maximum number of strings seen in a single example during fitting.

Returns:
n_features : int

The number of features in the outputted shaped array.

partial_fit(X, y=None)

Batch-wise fitting operation.

Find the largest number of strings in any single example across multiple batches of data. Update the target number of features for transform.

Parameters:
X : XContainer, (possibly ragged) shape (n_samples, n_features)

The data.

y : Any, default = None

The target for the data. Always ignored.

Returns:
self : object

The TextPadder instance.

score(X, y=None)

No-op score method.

Parameters:
X : Any

The data. Ignored.

y : Any, default = None

The target for the data. Ignored.

Returns:
None

set_output(transform=None)

Set the output container when the transform and fit_transform methods of the transformer are called.

Parameters:
transform : Literal[‘default’, ‘pandas’, ‘polars’] | None, default = None

Configure the output of transform and fit_transform.

‘default’: Default output format (numpy array)

‘pandas’: pandas dataframe output

‘polars’: polars dataframe output

None: The output container is the same as the given container.

Returns:
self : object

The transformer instance.

set_params(**params)

Set the parameters of an instance or a nested instance.

This method works on simple estimator and transformer instances as well as on nested objects (such as GridSearch instances).

Setting the parameters of simple estimators and transformers is straightforward. Pass the exact parameter name and its value as a keyword argument to the set_params method call. Or use ** dictionary unpacking on a dictionary keyed with exact parameter names and the new parameter values as the dictionary values. Valid parameter keys can be listed with get_params().

Setting the parameters of a GridSearch instance (but not the nested instance) can be done in the same way as above. The parameters of nested instances can be updated using prefixes on the parameter names.

Simple estimators in a GridSearch instance can be updated by prefixing the estimator’s parameters with estimator__. For example, if some estimator has a ‘depth’ parameter, then setting the value of that parameter to 3 would be accomplished by passing estimator__depth=3 as a keyword argument to the set_params method call.

The parameters of a pipeline nested in a GridSearch instance can be updated using the form estimator__<pipe_parameter>. The parameters of the steps of a pipeline have the form <step>__<parameter> so that it’s also possible to update a step’s parameters through the set_params method interface. The parameters of steps in the pipeline can be updated using estimator__<step>__<parameter>.
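The double-underscore naming convention described above can be illustrated by splitting a key into its path components. This is a sketch of the naming scheme only, not of how pybear resolves parameters; the helper split_param_key and the step name padder are hypothetical.

```python
def split_param_key(key):
    """Split a prefixed parameter key into its path components (illustrative only)."""
    return key.split("__")


# A parameter of a single nested estimator: estimator__<parameter>
simple = split_param_key("estimator__depth")

# A parameter of a pipeline step: estimator__<step>__<parameter>
# 'padder' is a hypothetical step name.
nested = split_param_key("estimator__padder__fill")

print(simple)
print(nested)
```

Read left to right, each component names one level of nesting, and the last component is the parameter being set.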

Parameters:
**params : dict[str, Any]

The parameters to be updated and their new values.

Returns:
self : object

The instance with new parameter values.

transform(X, copy=False)

Map ragged text data to a shaped array.

Parameters:
X : XContainer, (possibly ragged) shape (n_samples, n_features)

The data to be transformed.

copy : bool, default = False

Whether to perform the transformation directly on X or on a deepcopy of X.

Returns:
X_tr : XWipContainer

The padded data.