TextJoiner
- class pybear.feature_extraction.text.TextJoiner(*, sep=' ')
Bases:
FitTransformMixin, GetParamsMixin, ReprMixin, SetParamsMixin
Join a (possibly ragged) 2D array-like of strings across rows with the sep character string(s).
When passed a 2D array-like of strings, TextJoiner (TJ) joins each row-wise sequence of strings on the value given by sep and returns a 1D Python list of joined strings in place of the original inner containers.
The sep parameter can be passed as a single string, in which case every row in the data is joined on that string. sep can also be passed as a 1D sequence of strings whose length must equal the number of rows of text in the data. In that case, TJ uses the string in each position of the sequence to join the corresponding row of text.
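The two forms of sep described above can be illustrated with a small pure-Python sketch. The helper join_rows is hypothetical and is not part of pybear; it only mirrors the documented behavior under the assumption that a length mismatch should raise an error.

```python
from __future__ import annotations

from typing import Sequence


def join_rows(X: Sequence[Sequence[str]], sep: str | Sequence[str] = " ") -> list[str]:
    """Join each row of X on sep (hypothetical helper, not pybear code)."""
    if isinstance(sep, str):
        # A single string is reused for every row.
        seps = [sep] * len(X)
    else:
        # A 1D sequence supplies one separator per row.
        seps = list(sep)
        if len(seps) != len(X):
            raise ValueError("sep must have one entry per row of X")
    return [s.join(row) for s, row in zip(seps, X)]


# Ragged input: the rows have different lengths.
X = [["Brevity", "is", "wit."], ["Less", "is", "more", "here"]]
same_sep = join_rows(X, " ")
per_row_sep = join_rows(X, ["_", ", "])
print(same_sep)
print(per_row_sep)
```

The result always has one string per row of X, regardless of how many strings each row holds.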
TJ is a full-fledged scikit-style transformer. It has fully functional get_params, set_params, transform, and fit_transform methods. It also has partial_fit, fit, and score methods, which are no-ops. TJ technically does not need to be fit because it already knows everything it needs to do transformations from sep. These no-op methods are available to fulfill the scikit transformer API and make TJ suitable for incorporation into larger workflows, such as Pipelines and dask_ml wrappers.
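As a rough illustration of what no-op fitting means, the sketch below defines a hypothetical stand-in class, StatelessJoiner, whose fit methods learn nothing and simply return the instance. It is not pybear's implementation and, for brevity, it assumes sep is a single string.

```python
class StatelessJoiner:
    """Hypothetical stand-in with no-op fit methods; not pybear code."""

    def __init__(self, *, sep=" "):
        self.sep = sep

    def partial_fit(self, X, y=None):
        # Nothing is learned from X; return self so calls can be chained.
        return self

    def fit(self, X, y=None):
        return self

    def score(self, X, y=None):
        # Present only to satisfy wrappers that expect a score method.
        return None

    def transform(self, X):
        # Everything needed for the transformation comes from sep.
        return [self.sep.join(row) for row in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


X = [["Brevity", "is", "wit."]]
joiner = StatelessJoiner(sep=" ")
unfitted = joiner.transform(X)        # works without any prior fit call
fitted = joiner.fit(X).transform(X)   # identical result after the no-op fit
print(unfitted == fitted)
```

Because the fit methods return the instance, such a class can be chained and placed in a workflow that calls fit before transform without special handling.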
Because TJ doesn’t need any information from partial_fit() and fit(), it is technically always in a ‘fitted’ state and ready to transform data. Checks for fittedness will always return True.
TJ has one attribute, n_rows_, which is only available after data has been passed to transform(). n_rows_ is the number of rows of text seen in the original data, and equals the number of strings in the returned 1D Python list.
- Parameters:
- sep : str | Sequence[str], default=' '
The character sequence to insert between individual strings when joining the 2D input data across rows. If a 1D sequence of strings, then the sep value in each position is used to join the corresponding row in X. In that case, the number of entries in sep must equal the number of rows in X.
- Attributes:
n_rows_: Get the n_rows_ attribute.
Methods
fit(X[, y]): No-op one-shot fit method.
fit_transform(X[, y]): Fit to data, then transform it.
get_metadata_routing(): get_metadata_routing is not implemented in TextJoiner.
get_params([deep]): Get parameters for this instance.
partial_fit(X[, y]): No-op batch-wise fit method.
score(X[, y]): No-op score method.
set_params(**params): Set the parameters of an instance or a nested instance.
transform(X[, copy]): Convert each row of tokenized strings in X to a single string.
Notes
Type Aliases
- PythonTypes:
Sequence[Sequence[str]]
- NumpyTypes:
numpy.ndarray[str]
- PandasTypes:
pandas.DataFrame
- PolarsTypes:
polars.DataFrame
- XContainer:
PythonTypes | NumpyTypes | PandasTypes | PolarsTypes
- XWipContainer:
list[str]
- SepType:
str | Sequence[str]
Examples
>>> from pybear.feature_extraction.text import TextJoiner as TJ
>>> trfm = TJ(sep=' ')
>>> X = [['Brevity', 'is', 'wit.']]
>>> trfm.fit_transform(X)
['Brevity is wit.']
>>> # Change the joining separator to 'xyz'
>>> trfm.set_params(sep='xyz')
TextJoiner(sep='xyz')
>>> trfm.fit_transform(X)
['Brevityxyzisxyzwit.']
- fit(X, y=None)
No-op one-shot fit method.
- Parameters:
- X : XContainer
The (possibly ragged) 2D container of text to be joined along rows using the sep character string(s). Ignored.
- y : Any, default=None
The target for the data. Always ignored.
- Returns:
- self : object
The TextJoiner instance.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
- X : array_like of shape (n_samples, n_features)
Required. The data.
- y : array_like of shape (n_samples, n_outputs) or (n_samples,)
Optional, default=None. Target values (None for unsupervised transformations).
- **fit_params : dict
Additional fit parameters.
- Returns:
- X_tr : array_like of shape (n_samples, n_features_new)
Transformed array.
- get_metadata_routing()
get_metadata_routing is not implemented in TextJoiner.
- get_params(deep=True)
Get parameters for this instance.
The ‘instance’ may be a pybear estimator, transformer, or a gridsearch module that wraps a nested estimator or pipeline.
- Parameters:
- deep : bool, default=True
For instances that do not have nested instances in an estimator attribute (such as estimators or transformers), this parameter is ignored and the same (full) set of parameters for the instance is returned regardless of the value of this parameter.
For instances that have nested instances (such as a GridSearch instance with a nested estimator or pipeline) in the estimator attribute, deep=False will only return the parameters for the wrapping instance. For example, a GridSearch module wrapping an estimator will only return parameters for the GridSearch instance, ignoring the parameters of the nested instance. When deep=True, this method returns the parameters of the wrapping instance as well as the parameters of the nested instance. When the nested instance is a single estimator, the full set of parameters for the single estimator are returned in addition to the parameters of the wrapping instance. If the nested object is a pipeline, the parameters of the pipeline and the parameters of each of the steps in the pipeline are returned in addition to the parameters of the wrapping instance. The estimator’s parameters are prefixed with estimator__.
- Returns:
- params : dict[str, Any]
Parameter names mapped to their values.
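The effect of deep on simple and wrapping instances can be sketched with two hypothetical classes, Leaf and Wrapper. Neither is pybear code, and the n_jobs parameter is invented for illustration; the sketch only shows how nested parameters might be exposed under the estimator__ prefix.

```python
class Leaf:
    """Hypothetical simple transformer with no nested instance."""

    def __init__(self, sep=" "):
        self.sep = sep

    def get_params(self, deep=True):
        # No nested instance, so deep has no effect here.
        return {"sep": self.sep}


class Wrapper:
    """Hypothetical wrapper holding a nested instance in `estimator`."""

    def __init__(self, estimator, n_jobs=1):
        self.estimator = estimator
        self.n_jobs = n_jobs  # illustrative parameter only

    def get_params(self, deep=True):
        params = {"estimator": self.estimator, "n_jobs": self.n_jobs}
        if deep:
            # Nested parameters are added under the estimator__ prefix.
            for name, value in self.estimator.get_params(deep=True).items():
                params[f"estimator__{name}"] = value
        return params


leaf = Leaf(sep="xyz")
wrapper = Wrapper(leaf)
shallow_keys = sorted(wrapper.get_params(deep=False))
deep_keys = sorted(wrapper.get_params(deep=True))
print(shallow_keys)
print(deep_keys)
```

For a simple transformer such as TJ, the situation corresponds to Leaf: the same dictionary comes back for either value of deep.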
- property n_rows_
Get the n_rows_ attribute.
The number of rows of text seen during transform and the number of strings in the returned 1D Python list.
- Returns:
- n_rows_ : int
The number of rows in the data passed to transform().
- partial_fit(X, y=None)
No-op batch-wise fit method.
- Parameters:
- X : XContainer
The (possibly ragged) 2D container of text to be joined along rows using the sep character string(s). Ignored.
- y : Any, default=None
The target for the data. Always ignored.
- Returns:
- self : object
The TextJoiner instance.
- score(X, y=None)
No-op score method.
Present only for compatibility with dask_ml wrappers.
- Parameters:
- X : Any
The data. Ignored.
- y : Any, default=None
The target for the data. Ignored.
- Returns:
- None
- set_params(**params)
Set the parameters of an instance or a nested instance.
This method works on simple estimator and transformer instances as well as on nested objects (such as GridSearch instances).
Setting the parameters of simple estimators and transformers is straightforward. Pass the exact parameter name and its value as a keyword argument to the set_params method call. Or use ** dictionary unpacking on a dictionary keyed with exact parameter names and the new parameter values as the dictionary values. Valid parameter keys can be listed with get_params().
Setting the parameters of a GridSearch instance (but not the nested instance) can be done in the same way as above. The parameters of nested instances can be updated using prefixes on the parameter names.
Simple estimators in a GridSearch instance can be updated by prefixing the estimator’s parameters with estimator__. For example, if some estimator has a ‘depth’ parameter, then setting the value of that parameter to 3 would be accomplished by passing estimator__depth=3 as a keyword argument to the set_params method call.
The parameters of a pipeline nested in a GridSearch instance can be updated using the form estimator__<pipe_parameter>. The parameters of the steps of a pipeline have the form <step>__<parameter> so that it’s also possible to update a step’s parameters through the set_params method interface. The parameters of steps in the pipeline can be updated using estimator__<step>__<parameter>.
- Parameters:
- **params : dict[str, Any]
The parameters to be updated and their new values.
- Returns:
- self : object
The instance with new parameter values.
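The prefix routing described for set_params can be sketched as follows. Tree and Search are hypothetical classes, not pybear or scikit-learn code, and the depth and cv parameters are illustrative; the sketch assumes that a key is split at its first double underscore and the remainder is forwarded to the nested instance.

```python
class Tree:
    """Hypothetical simple estimator with a single `depth` parameter."""

    def __init__(self, depth=1):
        self.depth = depth

    def set_params(self, **params):
        for name, value in params.items():
            if not hasattr(self, name):
                raise ValueError(f"invalid parameter: {name!r}")
            setattr(self, name, value)
        return self


class Search:
    """Hypothetical wrapper holding a nested instance in `estimator`."""

    def __init__(self, estimator, cv=5):
        self.estimator = estimator
        self.cv = cv

    def set_params(self, **params):
        for key, value in params.items():
            # Split only at the first "__"; the tail may contain further prefixes.
            head, delim, tail = key.partition("__")
            if delim:
                if head != "estimator":
                    raise ValueError(f"invalid prefix: {head!r}")
                self.estimator.set_params(**{tail: value})
            else:
                if not hasattr(self, key):
                    raise ValueError(f"invalid parameter: {key!r}")
                setattr(self, key, value)
        return self


search = Search(Tree())
# Keyword form: one wrapper parameter and one nested parameter.
search.set_params(cv=3, estimator__depth=3)
# Dictionary-unpacking form of the same kind of update.
search.set_params(**{"estimator__depth": 4})
print(search.cv, search.estimator.depth)
```

Forwarding the tail unchanged is what would let a key of the form estimator__<step>__<parameter> reach a step inside a nested pipeline, since the nested object applies the same splitting rule to what it receives.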
- transform(X, copy=False)
Convert each row of tokenized strings in X to a single string.
Each row is joined on the string character sequence(s) provided by sep. Returns a Python list of strings.
- Parameters:
- X : XContainer
The (possibly ragged) 2D container of text to be joined along rows using the sep character string(s).
- copy : bool, default=False
Whether to operate directly on the original X or a deepcopy of X.
- Returns:
- X_tr : XWipContainer
A single list containing strings, one string for each row in the original X.
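The relationship between transform, the copy flag, and n_rows_ can be sketched with a hypothetical class, RowCountingJoiner. It is not pybear's implementation; it assumes a single-string sep and assumes that n_rows_ is unavailable until transform has been called.

```python
from copy import deepcopy


class RowCountingJoiner:
    """Hypothetical stand-in that records n_rows_ during transform."""

    def __init__(self, *, sep=" "):
        self.sep = sep

    @property
    def n_rows_(self):
        # Raises AttributeError until transform has stored the row count.
        return self._n_rows

    def transform(self, X, copy=False):
        # Work on a deep copy when requested, otherwise on X itself.
        rows = deepcopy(X) if copy else X
        self._n_rows = len(rows)
        return [self.sep.join(row) for row in rows]


X = [["Brevity", "is", "wit."], ["Less", "is", "more"]]
joiner = RowCountingJoiner(sep=" ")
had_attr_before = hasattr(joiner, "n_rows_")
out = joiner.transform(X, copy=True)
print(had_attr_before, joiner.n_rows_, out)
```

The returned list has exactly n_rows_ entries, one joined string for each row of the original X.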