TextSplitter

class pybear.feature_extraction.text.TextSplitter(*, sep=None, case_sensitive=True, maxsplit=None, flags=None)

Bases: FitTransformMixin, GetParamsMixin, ReprMixin, SetParamsMixin

Split a dataset of strings on the given separator(s).

So why not just use str.split or re.split? TextSplitter has some advantages over the built-ins.

First, multiple splitting criteria can be passed to the sep parameter to split on several character sequences at once. str.split cannot do this, and re.split can only do it if you build a combined pattern yourself. For example, consider the string “How, now. brown; cow?”. This can be split on the comma, period, and semicolon by passing a tuple to the sep parameter, such as (',', '.', ';'). The split string in the output will be ['How', ' now', ' brown', ' cow?'].
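For comparison, getting this result from the standard library takes some manual work: each literal separator has to be escaped and the pieces joined into one alternation pattern. The snippet below is only a sketch of the documented behavior using re, not TextSplitter's own implementation; the helper name split_on_any is hypothetical.

```python
import re

def split_on_any(text, seps):
    """Split text on any of the literal separators in seps."""
    # Escape each literal so characters like '.' are not read as regex syntax.
    pattern = "|".join(re.escape(s) for s in seps)
    return re.split(pattern, text)

result = split_on_any("How, now. brown; cow?", (",", ".", ";"))
print(result)  # ['How', ' now', ' brown', ' cow?']
```

The separators themselves are dropped from the output, and the leading spaces are kept because only the listed characters are treated as separators.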

Second, the splitting criteria are mapped over every string in the data, performing many splits in a single operation. Both str.split and re.split operate on only one string per call.

Third, the split criteria and supporting parameters can be tweaked for individual strings in the data by passing them in lists. This allows fine-grained control over splitting every string in the data, if you need it.

Finally, TextSplitter is a scikit-style transformer and can be integrated into larger workflows.

TextSplitter (TS) performs splits by searching for the user-given separators in the text and splitting the string wherever one is found. The matching separator sequence is NOT preserved in the text when the split is made. You tell TextSplitter what separators to split on by passing None, literal strings, or regular expressions in re.compile objects to sep. None does not split. A single literal string or re.compile object splits the text on all occurrences of that pattern in the text body. When using regex, ALWAYS pass your regex patterns in a re.compile object. DO NOT PASS A REGEX PATTERN AS A LITERAL STRING; YOU WILL NOT GET THE CORRECT RESULT. DO NOT ESCAPE LITERAL STRINGS, TextSplitter WILL DO THAT FOR YOU. If you don’t know what any of that means, then you don’t need to worry about it.
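The difference between a literal separator and a regex pattern is easy to see with the standard library alone. This sketch uses plain re calls to show why a literal '.' has to be escaped before it is used as a pattern, which is the step the text above says TextSplitter performs for literal strings; it is an illustration, not the library's code.

```python
import re

text = "a.b.c"

# A literal '.' must be escaped before it is used as a pattern.
literal_split = re.split(re.escape("."), text)

# An unescaped '.' is a regex wildcard that matches every character.
regex_split = re.compile(".").split(text)

print(literal_split)  # ['a', 'b', 'c']
print(regex_split)    # ['', '', '', '', '', '']
```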

You can pass tuples of literal strings and/or re.compile objects to sep to split on multiple separator patterns at the same time. Also, Nones, literal strings, re.compile objects, and tuples of literal strings and/or re.compile objects can be passed in a list. The number of entries in the list must equal the number of strings in the data. Each entry in the list is applied to the corresponding row in the data.
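The per-row form can be pictured as pairing each string with its own separator entry. The following is a minimal standard-library sketch under stated assumptions: the helpers to_pattern and split_rows are hypothetical names, flags attached to compiled patterns are dropped for brevity, and an unsplit string is assumed to come back as a one-element list. It is not TextSplitter's implementation.

```python
import re

def to_pattern(sep):
    """Turn one separator entry into a compiled pattern, or None for no split."""
    if sep is None:
        return None
    parts = sep if isinstance(sep, tuple) else (sep,)
    # Literal strings are escaped; compiled patterns contribute their source.
    # (Flags on compiled patterns are ignored in this simplified sketch.)
    sources = [
        p.pattern if isinstance(p, re.Pattern) else re.escape(p)
        for p in parts
    ]
    return re.compile("|".join(f"(?:{s})" for s in sources))

def split_rows(X, seps):
    """Apply the i-th separator entry to the i-th string."""
    if len(seps) != len(X):
        raise ValueError("seps must have one entry per string in X")
    out = []
    for text, sep in zip(X, seps):
        pattern = to_pattern(sep)
        out.append([text] if pattern is None else pattern.split(text))
    return out

X = ["a,b;c", "keep me whole", "x1y22z"]
seps = [(",", ";"), None, re.compile(r"\d+")]
rows = split_rows(X, seps)
print(rows)  # [['a', 'b', 'c'], ['keep me whole'], ['x', 'y', 'z']]
```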

If no parameters are passed, i.e., all parameters are left to their default values at instantiation, then TextSplitter does a no-op split, but does change your data from 1D to 2D.

Separator searches are case-sensitive by default, but can be made case-insensitive. You can set this behavior globally via the case_sensitive parameter. For those of you who know regex, you can also put flags in the re.compile objects passed to sep, or set flags globally via the flags parameter. Case-sensitivity is generally controlled by case_sensitive, but IGNORECASE flags passed via re.compile objects or flags will ALWAYS overrule case_sensitive. case_sensitive also accepts lists so that you can control this behavior down to the individual string.
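One way to picture the precedence rule is as a flag combination in which IGNORECASE, once set anywhere, stays set. The sketch below assumes that reading of the paragraph above; effective_flags is a hypothetical helper written with the standard library, not part of pybear.

```python
import re

def effective_flags(case_sensitive=True, flags=None):
    """Combine a case_sensitive switch with explicit regex flags."""
    combined = flags or 0
    if not case_sensitive:
        combined |= re.IGNORECASE
    # An explicit IGNORECASE in flags stays set even if case_sensitive is True.
    return combined

text = "oneXtwoxthree"
sensitive = re.split("x", text, flags=effective_flags(case_sensitive=True))
insensitive = re.split("x", text, flags=effective_flags(case_sensitive=False))
overruled = re.split(
    "x", text, flags=effective_flags(case_sensitive=True, flags=re.IGNORECASE)
)

print(sensitive)    # ['oneXtwo', 'three']
print(insensitive)  # ['one', 'two', 'three']
print(overruled)    # ['one', 'two', 'three']
```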

TextSplitter mimics the maxsplit behavior of re.split, so values passed to maxsplit must obey the rules for maxsplit in re.split; see the docs for re.split for more information. When passing multiple split criteria, i.e., you have passed a tuple of literal strings and/or re.compile objects to sep, the maxsplit parameter is applied cumulatively for all separators, working from left to right across a string in the data. For example, consider the string “One, two, buckle my shoe. Three, four, shut the door.”. We are going to split on commas and periods, and perform 4 splits, working from left to right. We enter sep as (',', '.') and pass the number 4 to maxsplit. Then we pass the string in a list to the transform() method of TextSplitter. The split string in the output will be ['One', ' two', ' buckle my shoe', ' Three', ' four, shut the door.']. The maxsplit argument worked from left to right and performed 4 splits, counting the splits on commas and periods together.
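The cumulative count can be checked against re.split directly by combining both separators into one pattern. This is a standard-library sketch of the behavior described above, not a call into TextSplitter.

```python
import re

text = "One, two, buckle my shoe. Three, four, shut the door."

# One combined pattern, so the four splits are shared between ',' and '.'.
pattern = "|".join(re.escape(s) for s in (",", "."))
parts = re.split(pattern, text, maxsplit=4)

print(parts)
# ['One', ' two', ' buckle my shoe', ' Three', ' four, shut the door.']
```

The fifth piece keeps its comma and period because the split budget was used up after “ Three”. In re.split itself, a maxsplit of 0 means no limit.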

TextSplitter is a full-fledged scikit-style transformer. It has functional transform and fit_transform methods, as well as get_params and set_params methods. It has no-op partial_fit, fit, and score methods, so that it integrates into larger workflows like scikit pipelines and dask_ml wrappers.

TextSplitter accepts 1D list-like vectors of strings. Accepted containers include Python lists, tuples, and sets, numpy vectors, pandas series, and polars series. Output is always returned as a Python list of Python lists of strings.

Parameters:

sep (SepsType, default = None): The separator(s) to split the strings in X on. None skips every string in X, performing no splits. When passed as a single literal character string, that is applied to every string in X. If a single regular expression in a re.compile object is passed, that split is performed on every entry in X. When passed as a tuple of literal character strings and/or re.compile objects, each separator in the tuple is applied to every string, subject to the allowance set by maxsplit. If passed as a list of separators, the number of entries must match the number of strings in X, and each literal, re.compile, or tuple of literals/re.compiles is applied to the corresponding string in X. If any entry in the list is None, no split is performed on the corresponding string in X.
case_sensitive (CaseSensitiveType, default = True): Global setting for case-sensitivity. If True (the default) then all searches are case-sensitive. If False, TS will look for matches regardless of case. This setting is overridden when IGNORECASE flags are passed in re.compile objects or to flags.
maxsplit (MaxSplitsType, default = None): The maximum number of splits to perform on a string. Only applies when something is passed to sep. If None, the default number of splits for re.split is used on every string in X. If passed as an integer, that number is applied to every string in X. If passed as a list, the number of entries must match the number of strings in X, and each is applied correspondingly to X. If any entry in the list is None, no split is performed on the corresponding string in X.
flags (FlagsType, default = None): The flags value(s) for the separator searches. If you do not know what this means then ignore this and just use case_sensitive. If None, the default flags for re.split are used on every string in the data. If a single flags object, that is applied to every string in the data. If passed as a list, the number of entries must match the number of strings in X. Flags objects and Nones in the list follow the same rules stated above.

Methods

`fit`(X[, y])	No-op one-shot fitting of TextSplitter.
`fit_transform`(X[, y])	Fit to data, then transform it.
`get_metadata_routing`()	Metadata routing is not implemented in TextSplitter.
`get_params`([deep])	Get parameters for this instance.
`partial_fit`(X[, y])	No-op batch-wise fitting of TextSplitter.
`score`(X[, y])	No-op scorer.
`set_params`(**params)	Set the parameters of an instance or a nested instance.
`transform`(X[, copy])	Split the strings in X on the separator(s).

See also

re.split

Notes

Type Aliases

PythonTypes:: list[str] | tuple[str] | set[str]
NumpyTypes:: numpy.ndarray[str]
PandasTypes:: pandas.Series
PolarsTypes:: polars.Series
XContainer:: PythonTypes | NumpyTypes | PandasTypes | PolarsTypes
XWipContainer:: list[list[str]]
SepType:: None | str | re.Pattern[str] | tuple[str | re.Pattern[str], ...]
SepsType:: SepType | list[SepType]
CaseSensitiveType:: bool | list[bool | None]
MaxSplitType:: int | None
MaxSplitsType:: MaxSplitType | list[MaxSplitType]
FlagType:: int | None
FlagsType:: FlagType | list[FlagType]

Examples

>>> from pybear.feature_extraction.text import TextSplitter
>>> import re
>>>
>>> Trfm = TextSplitter(sep=' ', maxsplit=2)
>>> X = [
...     'This is a test.',
...     'This is only a test.'
... ]
>>> Trfm.fit(X)
TextSplitter(maxsplit=2, sep=' ')
>>> Trfm.transform(X)
[['This', 'is', 'a test.'], ['This', 'is', 'only a test.']]

>>> Trfm = TextSplitter(sep=re.compile('s'), maxsplit=2)
>>> X = [
...     'This is a test.',
...     'This is only a test.'
... ]
>>> Trfm.fit(X)
TextSplitter(maxsplit=2, sep=re.compile('s'))
>>> Trfm.transform(X)
[['Thi', ' i', ' a test.'], ['Thi', ' i', ' only a test.']]

fit(X, y=None)

No-op one-shot fitting of TextSplitter.

Parameters:

X (XContainer): A 1D sequence of strings to be split. Ignored.
y (Any, default = None): The target for the data. Always ignored.

Returns:

self (object): The TextSplitter instance.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array_like of shape (n_samples, n_features)): Required. The data.
y (array_like of shape (n_samples, n_outputs) or (n_samples,)): Optional, default = None. Target values (None for unsupervised transformations).
**fit_params (dict): Additional fit parameters.

Returns:

X_tr (array_like of shape (n_samples, n_features_new)): Transformed array.

get_metadata_routing()

Metadata routing is not implemented in TextSplitter.

get_params(deep=True)

Get parameters for this instance.

The ‘instance’ may be a pybear estimator, transformer, or a gridsearch module that wraps a nested estimator or pipeline.

Parameters:

deep (bool, default = True)

For instances that do not have nested instances in an estimator attribute (such as estimators or transformers), this parameter is ignored and the same (full) set of parameters for the instance is returned regardless of the value of this parameter.

For instances that have nested instances (such as a GridSearch instance with a nested estimator or pipeline) in the estimator attribute, deep=False will only return the parameters for the wrapping instance. For example, a GridSearch module wrapping an estimator will only return parameters for the GridSearch instance, ignoring the parameters of the nested instance. When deep=True, this method returns the parameters of the wrapping instance as well as the parameters of the nested instance. When the nested instance is a single estimator, the full set of parameters for the single estimator is returned in addition to the parameters of the wrapping instance. If the nested object is a pipeline, the parameters of the pipeline and the parameters of each of the steps in the pipeline are returned in addition to the parameters of the wrapping instance. The estimator’s parameters are prefixed with estimator__.

Returns:

params (dict[str, Any]): Parameter names mapped to their values.

partial_fit(X, y=None)

No-op batch-wise fitting of TextSplitter.

Parameters:

X (XContainer): A 1D sequence of strings to be split. Ignored.
y (Any, default = None): The target for the data. Always ignored.

Returns:

self (object): The TextSplitter instance.

score(X, y=None)

No-op scorer.

Parameters:

X (XContainer): A 1D sequence of strings. Ignored.
y (Any, default = None): The target for the data. Ignored.

Returns:

None

set_params(**params)

Set the parameters of an instance or a nested instance.

This method works on simple estimator and transformer instances as well as on nested objects (such as GridSearch instances).

Setting the parameters of simple estimators and transformers is straightforward. Pass the exact parameter name and its value as a keyword argument to the set_params method call. Or use ** dictionary unpacking on a dictionary keyed with exact parameter names and the new parameter values as the dictionary values. Valid parameter keys can be listed with get_params().

Setting the parameters of a GridSearch instance (but not the nested instance) can be done in the same way as above. The parameters of nested instances can be updated using prefixes on the parameter names.

Simple estimators in a GridSearch instance can be updated by prefixing the estimator’s parameters with estimator__. For example, if some estimator has a ‘depth’ parameter, then setting the value of that parameter to 3 would be accomplished by passing estimator__depth=3 as a keyword argument to the set_params method call.

The parameters of a pipeline nested in a GridSearch instance can be updated using the form estimator__<pipe_parameter>. The parameters of the steps of a pipeline have the form <step>__<parameter> so that it’s also possible to update a step’s parameters through the set_params method interface. The parameters of steps in the pipeline can be updated using estimator__<step>__<parameter>.

Parameters:

**params (dict[str, Any]): The parameters to be updated and their new values.

Returns:

self (object): The instance with new parameter values.

transform(X, copy=False)

Split the strings in X on the separator(s).

Parameters:

X (XContainer): A 1D sequence of strings to be split.
copy (bool, default = False): Whether to perform the splits directly on X or on a deepcopy of X.

Returns:

X_tr (XWipContainer): The split strings.