TextJustifier

class pybear.feature_extraction.text.TextJustifier(*, n_chars=79, sep=' ', sep_flags=None, line_break=None, line_break_flags=None, case_sensitive=True, backfill_sep=' ', join_2D=' ')

Bases: FitTransformMixin, GetParamsMixin, ReprMixin, SetParamsMixin

Justify text as closely as possible to the number of characters per line given by the user.

This is not designed for making final drafts of highly formatted business letters. This is a tool designed to turn highly ragged text into block form that is more easily ingested and manipulated by humans and machines. Consider lines read in from text files or scraped from the internet. Often there is a large disparity in the number of characters per line: some lines may have a few characters, and other lines may have thousands of characters (or more). This tool will square up the text for you.

The cleaner your data is, the more powerful this tool is, and the more predictable the results are. TextJustifier (TJ) is in no way designed to do any cleaning. See the other pybear text wrangling modules for that. While TJ will handle any text passed to it and blindly apply the instructions given to it, results are better when this is used toward the end of a text processing workflow. For best results, pybear recommends that removal of junk characters (pybear TextReplacer), empty strings (pybear TextRemover), and leading and trailing spaces (pybear TextStripper) be done before using TJ.

There are 3 operative parameters for justifying text in this module: n_chars, sep, and line_break. n_chars is the target number of characters per line. The minimum allowed value is 1, and there is no maximum value. The sep parameter is the string sequence(s) or regex pattern(s) that tell TJ where it is allowed to wrap text. It does not mean that TJ WILL wrap at that particular text, but that it can if it needs to when near the n_chars limit on a line. The wrap occurs AFTER the sep sequence. A common sep is a single space. The line_break parameter is the string sequence(s) or regex pattern(s) that tell TJ where it MUST wrap text. When TJ finds a line_break sequence, it will force a new line. The break occurs AFTER the line_break sequence. A typical line_break might be a period.
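The interplay of the three parameters can be illustrated with a minimal greedy sketch. This is not pybear's implementation: `sketch_justify` is a hypothetical helper that handles one string, one literal sep, and one literal line_break, and it assumes that trailing separators count toward the line length.

```python
import re


def sketch_justify(text, n_chars=79, sep=" ", line_break=None):
    """Greedy wrap: may cut only AFTER sep, must cut AFTER line_break."""
    # One pattern that finds either literal; re.escape keeps them literal.
    literals = [sep] + ([line_break] if line_break else [])
    pattern = re.compile("|".join(re.escape(s) for s in literals))

    # Cut the text into chunks that each end right after a match.
    chunks, start = [], 0
    for m in pattern.finditer(text):
        chunks.append(text[start:m.end()])
        start = m.end()
    if start < len(text):
        chunks.append(text[start:])

    lines, current = [], ""
    for chunk in chunks:
        # Wrap before this chunk if it would push the line past n_chars.
        if current and len(current) + len(chunk) > n_chars:
            lines.append(current)
            current = ""
        current += chunk
        # A line_break forces a new line regardless of length.
        if line_break and chunk.endswith(line_break):
            lines.append(current)
            current = ""
    if current:
        lines.append(current)
    return lines


wrapped = sketch_justify("aaa bbb ccc", n_chars=8)
forced = sketch_justify("Hi. Go on now.", n_chars=79, line_break=".")
```

In this sketch a chunk that is longer than n_chars and has no wrap point is emitted as-is, which mirrors the documented behavior that TJ does not hyphenate words.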

When TJ is instantiated, there must be at least one sep for TJ to wrap on but there do not need to be any specified line breaks. This means that sep must be passed but line_break can be left to the default value of None. Both sep and line_break can accept patterns as literal strings or regular expressions. Also, both parameters can accept multiple patterns to wrap/break on via 1D sequences of patterns.

To identify wrap points and line breaks using literal strings, pass a string or 1D sequence of strings to sep and line_break. The two parameters do not need to use the same container type; for example, one could be a sequence and the other a single string. To identify wrap points and line breaks using regex patterns, pass a re.compile object with the regex pattern (and flags, if desired) or pass a 1D sequence of such objects. DO NOT PASS A REGEX PATTERN AS A LITERAL STRING. YOU WILL NOT GET THE CORRECT RESULT. ALWAYS PASS REGEX PATTERNS IN A re.compile OBJECT. DO NOT ESCAPE LITERAL STRINGS, TextJustifier WILL DO THAT FOR YOU. If you don’t know what any of that means, then you don’t need to worry about it, just use literal strings.
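To see why the distinction between literals and regex patterns matters, the following sketch uses only the standard library. It assumes escaping behaves like `re.escape`; whether TJ uses that exact function internally is not stated here.

```python
import re

literal = "."  # the user means "a period"

# Escaped: the pattern matches only the period character.
as_literal = re.compile(re.escape(literal))
# Unescaped: "." is the regex wildcard and matches any character.
as_regex = re.compile(literal)

text = "a.b"
literal_hits = as_literal.findall(text)
regex_hits = as_regex.findall(text)
```

Here `literal_hits` contains only the period, while `regex_hits` contains every character of the text, which is the kind of incorrect result the warning above is about.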

Literal strings and re.compile objects cannot be mixed. You must go all-in on literals or all-in on regex. This means that you cannot pass 1D lists containing a mix of literals and re.compile objects to sep and line_break. Additionally, whatever wrap-point identification method is used in sep must also be used for line_break. That is, if you use re.compile objects to indicate wrap points for sep, then you must also use re.compile objects to indicate break points for line_break.

Literal string mode has validation and protections in place that prevent conflicts that could lead to undesired results. These safeguards make for a predictable tool. The rules for literal strings are: No sep can be a substring of another sep. No sep can be identical to a line_break entry and no sep can be a substring of a line_break. No line_break can be a substring of another line_break. No line_break can be identical to a sep entry and no line_break can be a substring of a sep.

These safeguards are not in place in regex mode, and the rules above do not apply when using regex. The reason is that the exact behavior of literal strings, as opposed to regex, can be predicted before ever seeing any text. Conflicts are impossible to predict when using regex unless you know beforehand the text it will be applied to. In regex mode, a conflict exists when both the sep pattern and the line_break pattern identify the same location in text as the first character of a match. In that case, TJ applies sep. It is up to the user to assess the pitfalls and the likelihood of error when using regex on their data. The user should inspect their results to ensure the desired outcome.
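The literal-mode rules can be expressed as a small checker. This is only a sketch of the stated rules; `find_literal_conflicts` is a hypothetical helper, and pybear's own validation may differ in wording and behavior.

```python
def find_literal_conflicts(seps, line_breaks=()):
    """Return descriptions of literal-mode rule violations (hypothetical helper)."""
    seps, line_breaks = list(seps), list(line_breaks)
    problems = []

    # Within each group, no entry may be a substring of another entry.
    for name, group in (("sep", seps), ("line_break", line_breaks)):
        for i, a in enumerate(group):
            for j, b in enumerate(group):
                if i != j and a in b:
                    problems.append(f"{name} {a!r} is inside {name} {b!r}")

    # Across groups, no entry may equal or be a substring of the other kind.
    for s in seps:
        for lb in line_breaks:
            if s in lb:
                problems.append(f"sep {s!r} equals or is inside line_break {lb!r}")
            elif lb in s:
                problems.append(f"line_break {lb!r} is inside sep {s!r}")
    return problems


ok = find_literal_conflicts([" ", ","], ["."])
clash = find_literal_conflicts([" "], [". "])
```

In this sketch `ok` is empty, while `clash` reports that the space is contained in the `'. '` line break.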

TJ searches are case-sensitive by default, but can be made case-insensitive. You can set this behavior globally via the case_sensitive parameter. If you know regex, you can also put flags in the re.compile objects passed to sep and line_break. Flags can also be set globally for each of those parameters via sep_flags and line_break_flags, respectively. Case-sensitivity is generally controlled by case_sensitive, but IGNORECASE flags passed via re.compile objects or to the ‘flags’ parameters will ALWAYS overrule case_sensitive.
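One way to picture the precedence is to OR the flags together, so that an explicit IGNORECASE can never be switched off by case_sensitive=True. The helpers `effective_flags` and `compile_literal` below are hypothetical and only model the rule as described; they are not pybear functions.

```python
import re


def effective_flags(case_sensitive=True, flags=None):
    """Combine the global switch with explicit flags (hypothetical helper)."""
    combined = flags or 0
    if not case_sensitive:
        combined |= re.IGNORECASE
    return combined


def compile_literal(literal, case_sensitive=True, flags=None):
    """Escape a literal and compile it with the combined flags."""
    return re.compile(re.escape(literal), effective_flags(case_sensitive, flags))


# IGNORECASE wins even though case_sensitive is left at True.
overruled = compile_literal("and", case_sensitive=True, flags=re.IGNORECASE)
# No flags and case_sensitive=True: an exact-case search.
strict = compile_literal("and")
```

With these definitions, `overruled` finds "AND" in a text and `strict` does not.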

Some lines in the text may not have any of the given wrap separators or line breaks at the end of the line. When justifying text and there is a shortfall of characters in a line, TJ will look to the next line to backfill strings. In the case where the line being backfilled onto does not have a separator at the end of the string, the character string given by backfill_sep will separate the otherwise separator-less string from the string being backfilled onto it.
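The role of backfill_sep can be sketched as follows. `backfill` is a hypothetical helper that shows only the joining rule; it ignores the n_chars check that decides whether a merge happens at all.

```python
def backfill(line, next_line, seps=(" ",), backfill_sep=" "):
    """Merge next_line onto line; add backfill_sep only if line lacks a trailing sep."""
    if line.endswith(tuple(seps)):
        # The line already ends with a wrap separator, so nothing is inserted.
        return line + next_line
    return line + backfill_sep + next_line


no_sep_at_end = backfill("Old Mother Hubbard", "Went to the cupboard")
sep_at_end = backfill("there,", "The cupboard", seps=(" ", ","))
```

The second call is consistent with the regex example in the Examples section, where a comma is a separator and the merged text reads "there,The cupboard".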

As simple as the tool is in concept, there are some nuances. Here is a non-exhaustive list of quirks that may help the user understand some edge cases and explain why TJ returns what it does.

1) TJ will not autonomously hyphenate words.
2) If a line has no wrap points or line breaks in it, then TJ can only do 2 things with it. If the line is longer than n_chars, TJ will return the line as given, regardless of what n_chars is set to. But if the line is shorter than n_chars, it may have text from the next line(s) backfilled onto it.
3) If n_chars is set very low, perhaps lower than the length of words (tokens) that may normally be encountered, then those words/lines will extend beyond the n_chars margin.

Cool trick: if you want an itemized list of all the tokens in your text, set n_chars to 1.
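The effect of a very small n_chars can be approximated with the standard library by cutting the text immediately after every separator. `tokens_after_sep` is a hypothetical helper, and keeping the trailing separator attached to each token is an assumption of this sketch.

```python
import re


def tokens_after_sep(text, sep=" "):
    """Split text at every position that immediately follows sep."""
    # A zero-width lookbehind keeps the separator attached to the preceding token.
    pieces = re.split(f"(?<={re.escape(sep)})", text)
    return [p for p in pieces if p]


tokens = tokens_after_sep("Old Mother Hubbard")
clean = [t.strip() for t in tokens]
```

Stripping afterwards, as in `clean`, gives the bare token list.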

TJ accepts 1D and 2D data formats. Accepted objects include Python built-in lists, tuples, and sets, numpy arrays, pandas series and dataframes, and polars series and dataframes. When data is passed in a 1D container, results are always returned as a 1D Python list of strings. When data is passed in a 2D container, TJ uses pybear TextJoiner and the join_2D parameter to convert it to a 1D list for processing. Then, once the processing is done, TJ uses pybear TextSplitter and the join_2D parameter again to convert it back to 2D. The 2D results are always returned in a Python list of Python lists of strings. See TextJoiner and TextSplitter.
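The 2D round trip can be pictured with plain string methods. `join_rows` and `split_rows` are hypothetical stand-ins for what TextJoiner and TextSplitter do with join_2D; their real signatures are not shown here.

```python
def join_rows(X_2d, join_2D=" "):
    """Collapse each row of tokens into one string (stand-in for TextJoiner)."""
    return [join_2D.join(row) for row in X_2d]


def split_rows(X_1d, join_2D=" "):
    """Split each justified line back into tokens (stand-in for TextSplitter)."""
    return [line.split(join_2D) for line in X_1d]


X = [["Old", "Mother"], ["Hubbard"]]
as_1d = join_rows(X)
back_to_2d = split_rows(["Old Mother Hubbard"])
```

Because justification happens between the two steps, the 2D output can have a different number of rows and a different row length than the input.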

TJ is a full-fledged scikit-style transformer. It has fully functional get_params, set_params, transform, and fit_transform methods. It also has partial_fit, fit, and score methods, which are no-ops. TJ technically does not need to be fit because it already knows everything it needs to do transformations from the parameters. These no-op methods are available to fulfill the scikit transformer API and make TJ suitable for incorporation into larger workflows, such as Pipelines and dask_ml wrappers.

Because TJ doesn’t need any information from partial_fit() and fit(), it is technically always in a ‘fitted’ state and ready to transform data. Checks for fittedness will always return True.
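The no-op fit pattern can be shown with a toy class. `StatelessStripper` is hypothetical and not part of pybear; it only shows how a transformer that needs no fitted state can still expose the scikit-style methods.

```python
class StatelessStripper:
    """Toy stateless transformer; everything it needs comes from its parameters."""

    def partial_fit(self, X, y=None):
        return self  # no-op: nothing to learn from the data

    def fit(self, X, y=None):
        return self  # no-op: returning self still allows method chaining

    def transform(self, X):
        return [s.strip() for s in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

    def score(self, X, y=None):
        return None  # no-op, present only for API compatibility


t = StatelessStripper()
unfitted_result = t.transform([" a "])  # works without calling fit first
```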

TJ has one attribute, n_rows_, which is only available after data has been passed to transform(). n_rows_ is the number of rows of text seen in the original data. The outputted data may not have the same number of rows as the inputted data. This number is not cumulative and only reflects the last batch of data passed to transform.

Parameters:

n_chars (int, default = 79): The target number of characters per line when justifying the given text. Minimum allowed value is 1; there is no maximum value. Under normal expected operation with reasonable margins, the outputted text will not exceed this number but can fall short. If margins are unusually small, the output can exceed the given margins (e.g., the margin is set lower than an individual word’s length).
sep (SepType, default = ' '): The literal string(s) or re.compile object(s) that indicate to TextJustifier where it is allowed to wrap a line. When passed as a 1D sequence, TJ will consider any of those patterns as a place where it can wrap a line. If a sep pattern is in the middle of a sequence that might otherwise be expected to be contiguous, TJ will wrap a new line AFTER the sep indiscriminately if proximity to the n_chars limit dictates to do so. Cannot be an empty string or a regex pattern that blatantly returns zero-span matches. Cannot be an empty sequence. When passed as re.compile object(s), it is only validated to be an instance of re.Pattern and that it is not likely to return zero-span matches. TJ does not assess the validity of the expression itself. Any exceptions would be raised by re.search. See the main docs for more discussion about limitations on what can be passed here.
sep_flags (int | None, default = None): The flags for the sep parameter. THIS WILL APPLY EVEN IF YOU PASS LITERAL STRINGS TO sep. IGNORECASE flags passed to this will overrule case_sensitive for sep. This parameter is only validated by TJ to be an instance of numbers.Integral or None. TJ does not assess the validity of the value. Any exceptions would be raised by re.search.
line_break (LineBreakType, default = None): Literal string(s) or re.compile object(s) that indicate to TJ where it MUST end a line. TJ will start a new line immediately AFTER the occurrence of the pattern regardless of the number of characters in the line. When passed as a 1D sequence of literals or re.compile objects, TJ will start a new line immediately after all occurrences of the patterns given. If None, do not force any line breaks. If there are no patterns in the data that match the given strings, then there are no forced line breaks. If a line_break pattern is in the middle of a sequence that might otherwise be expected to be contiguous, TJ will force a new line after the line_break indiscriminately. Cannot be an empty string or a regex pattern that blatantly returns zero-span matches. Cannot be an empty 1D sequence. When passed as re.compile object(s), it is only validated to be an instance of re.Pattern and that it is not likely to return zero-span matches. TJ does not assess the validity of the expression itself. Any exceptions would be raised by re.search. See the main docs for more discussion about limitations on what can be passed here.
line_break_flags (int | None, default = None): The flags for the line_break parameter. THIS WILL APPLY EVEN IF YOU PASS LITERAL STRINGS TO line_break. IGNORECASE flags passed to this will overrule case_sensitive for line_break. This parameter is only validated by TJ to be an instance of numbers.Integral or None. TJ does not assess the validity of the value. Any exceptions would be raised by re.search.
case_sensitive (bool, default = True): Whether searches for sep and line_break are case-sensitive. IGNORECASE flags passed via re.compile objects or via sep_flags and line_break_flags overrule this setting.
backfill_sep (str, default = ' '): In the case where a line is shorter than n_chars, DOES NOT END WITH A WRAP SEPARATOR, and the following line is short enough to be merged with it, this character string will separate the two strings when merged. If you do not want a separator in this case, pass an empty string to this parameter.
join_2D (str, default = ' '): Ignored if the data is given as a 1D sequence. For 2D containers of strings, this is the character string sequence that is used to join the strings within rows to convert the data to 1D for processing. The single string value is used to join the strings within the rows for all rows in the data.
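The restriction on zero-span matches for sep and line_break can be pictured with a simple heuristic: a pattern that matches the empty string can match without consuming any text. `looks_zero_span` is a hypothetical check written for illustration; the test TJ actually applies is not documented here, and this heuristic does not catch every zero-width pattern (for example, pure lookarounds or word-boundary anchors).

```python
import re


def looks_zero_span(pattern):
    """Heuristic: True if the compiled pattern matches the empty string."""
    return pattern.search("") is not None


star = looks_zero_span(re.compile("a*"))    # "a*" matches zero characters
space = looks_zero_span(re.compile(" "))    # a space must consume one character
```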

Attributes:

n_rows_ (int): The number of rows of text in the data last passed to transform().

Methods

`fit`(X[, y])	No-op one-shot fit operation.
`fit_transform`(X[, y])	Fit to data, then transform it.
`get_metadata_routing`()	get_metadata_routing is not implemented in TextJustifier.
`get_params`([deep])	Get parameters for this instance.
`partial_fit`(X[, y])	No-op batch-wise fit operation.
`score`(X[, y])	No-op score method.
`set_params`(**params)	Set the parameters of an instance or a nested instance.
`transform`(X[, copy])	Justify the text in a 1D sequence of strings or a (possibly ragged) 2D array-like of strings.

Notes

Type Aliases

PythonTypes: Sequence[str] | Sequence[Sequence[str]] | set[str]
NumpyTypes: numpy.ndarray
PandasTypes: pandas.Series | pandas.DataFrame
PolarsTypes: polars.Series | polars.DataFrame
XContainer: PythonTypes | NumpyTypes | PandasTypes | PolarsTypes
XWipContainer: list[str] | list[list[str]]
NCharsType: int
CoreSepBreakTypes: str | Sequence[str] | re.Pattern[str] | Sequence[re.Pattern[str]]
SepType: CoreSepBreakTypes
LineBreakType: CoreSepBreakTypes | None
CoreSepBreakWipType: re.Pattern[str] | tuple[re.Pattern[str], …]
SepWipType: CoreSepBreakWipType
LineBreakWipType: CoreSepBreakWipType | None
CaseSensitiveType: bool
SepFlagsType: int | None
LineBreakFlagsType: int | None
BackfillSepType: str
Join2DType: str

Examples

>>> from pybear.feature_extraction.text import TextJustifier as TJ
>>> trfm = TJ(n_chars=70, sep=' ', backfill_sep=' ')
>>> X = [
...     'Old Mother Hubbard',
...     'Went to the cupboard',
...     'To get her poor dog a bone;',
...     'But when she got there,',
...     'The cupboard was bare,',
...     'And so the poor dog had none.',
...     'She went to the baker’s',
...     'To buy him some bread;',
...     'And when she came back,',
...     'The poor dog was dead.'
... ]
>>> out = trfm.fit_transform(X)
>>> out = list(map(str.strip, out))
>>> for _ in out:
...     print(_)
Old Mother Hubbard Went to the cupboard To get her poor dog a bone;
But when she got there, The cupboard was bare, And so the poor dog
had none. She went to the baker’s To buy him some bread; And when she
came back, The poor dog was dead.
>>>
>>> # Demonstrate regex and do a different justify on the same data
>>> import re
>>> trfm.set_params(n_chars=45, sep=[re.compile(' '), re.compile(',')])
TextJustifier(n_chars=45, sep=[re.compile(' '), re.compile(',')])
>>> out = trfm.fit_transform(X)
>>> out = list(map(str.strip, out))
>>> for _ in out:
...     print(_)
Old Mother Hubbard Went to the cupboard To
get her poor dog a bone; But when she got
there,The cupboard was bare,And so the poor
dog had none. She went to the baker’s To buy
him some bread; And when she came back,The
poor dog was dead.

fit(X, y=None)

No-op one-shot fit operation.

Parameters:

X (XContainer): The data to justify. Ignored.
y (Any, default = None): The target for the data. Always ignored.

Returns:

self (object): The TextJustifier instance.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array_like of shape (n_samples, n_features)): Required. The data.
y (array_like of shape (n_samples, n_outputs) or (n_samples,)): Optional, default=None. Target values (None for unsupervised transformations).
**fit_params (dict): Additional fit parameters.

Returns:

X_tr (array_like of shape (n_samples, n_features_new)): Transformed array.

get_metadata_routing()

get_metadata_routing is not implemented in TextJustifier.

get_params(deep=True)

Get parameters for this instance.

The ‘instance’ may be a pybear estimator, transformer, or a gridsearch module that wraps a nested estimator or pipeline.

Parameters:

deep (bool, default = True)

For instances that do not have nested instances in an estimator attribute (such as estimators or transformers), this parameter is ignored and the same (full) set of parameters for the instance is returned regardless of the value of this parameter.

For instances that have nested instances (such as a GridSearch instance with a nested estimator or pipeline) in the estimator attribute, deep=False will only return the parameters for the wrapping instance. For example, a GridSearch module wrapping an estimator will only return parameters for the GridSearch instance, ignoring the parameters of the nested instance. When deep=True, this method returns the parameters of the wrapping instance as well as the parameters of the nested instance. When the nested instance is a single estimator, the full set of parameters for the single estimator are returned in addition to the parameters of the wrapping instance. If the nested object is a pipeline, the parameters of the pipeline and the parameters of each of the steps in the pipeline are returned in addition to the parameters of the wrapping instance. The estimator’s parameters are prefixed with estimator__.

Returns:

params (dict[str, Any]): Parameter names mapped to their values.

property n_rows_

Get the n_rows_ attribute.

The number of rows in data passed to transform(); may not be the same as the number of rows in the outputted data. This number is not cumulative and only reflects the last batch of data passed to transform.

Returns:

n_rows_ (int): The number of rows in the data passed to transform.

partial_fit(X, y=None)

No-op batch-wise fit operation.

Parameters:

X (XContainer): The data to justify. Ignored.
y (Any, default = None): The target for the data. Always ignored.

Returns:

self (object): The TextJustifier instance.

score(X, y=None)

No-op score method.

Needs to be here for dask_ml wrappers.

Parameters:

X (Any): The data. Ignored.
y (Any, default = None): The target for the data. Ignored.

Returns:

None

set_params(**params)

Set the parameters of an instance or a nested instance.

This method works on simple estimator and transformer instances as well as on nested objects (such as GridSearch instances).

Setting the parameters of simple estimators and transformers is straightforward. Pass the exact parameter name and its value as a keyword argument to the set_params method call. Or use ** dictionary unpacking on a dictionary keyed with exact parameter names and the new parameter values as the dictionary values. Valid parameter keys can be listed with get_params().

Setting the parameters of a GridSearch instance (but not the nested instance) can be done in the same way as above. The parameters of nested instances can be updated using prefixes on the parameter names.

Simple estimators in a GridSearch instance can be updated by prefixing the estimator’s parameters with estimator__. For example, if some estimator has a ‘depth’ parameter, then setting the value of that parameter to 3 would be accomplished by passing estimator__depth=3 as a keyword argument to the set_params method call.

The parameters of a pipeline nested in a GridSearch instance can be updated using the form estimator__<pipe_parameter>. The parameters of the steps of a pipeline have the form <step>__<parameter> so that it’s also possible to update a step’s parameters through the set_params method interface. The parameters of steps in the pipeline can be updated using estimator__<step>__<parameter>.
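The double-underscore naming scheme can be pictured as peeling one prefix off at a time. `route_param` is a hypothetical helper that only shows how such keys decompose; it is not pybear's set_params logic.

```python
def route_param(key):
    """Split a parameter key into (owner, remainder) at the first '__'."""
    head, delim, rest = key.partition("__")
    # Without a delimiter the key belongs to the instance itself.
    return (head, rest) if delim else (None, key)


simple = route_param("n_chars")
nested = route_param("estimator__depth")
pipeline_step = route_param("estimator__<step>__<parameter>")
```

Applying the same split again to the remainder of `pipeline_step` would yield the step name and then the step's parameter name.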

Parameters:

**params (dict[str, Any]): The parameters to be updated and their new values.

Returns:

self (object): The instance with new parameter values.

transform(X, copy=False)

Justify the text in a 1D sequence of strings or a (possibly ragged) 2D array-like of strings.

Parameters:

X (XContainer): The data to justify.
copy (bool, default = False): Whether to directly operate on the passed X or on a deepcopy of X.

Returns:

X_tr (XWipContainer): The justified data. 1D input is returned as a 1D Python list of strings; 2D input is returned as a Python list of Python lists of strings.