TextStatistics#

class pybear.feature_extraction.text.TextStatistics(*, store_uniques=True)#

Bases: FitTransformMixin, GetParamsMixin, ReprMixin

Generate summary information about the strings and characters in text data.

Statistics include:

  • size (number of strings fitted)

  • unique strings count

  • average length and standard deviation of all strings

  • max string length

  • min string length

  • string frequencies

  • ‘starts with’ frequency

  • single character frequency

  • longest strings

  • shortest strings

TextStatistics (TS) has two functional scikit-style transformer methods, fit() and partial_fit(). The transform() method is a no-op because TS does not mutate data; it only reports information about the strings and characters in it.

TextStatistics can do one-shot training on a single batch of data via fit, or can be trained on multiple batches via partial_fit. The fit method resets the instance with each call; that is, all information previously held in the instance is deleted and replaced with information from the new fit. The partial_fit method, however, does not reset and accumulates information across all batches seen. This makes TS suitable for streaming data and batch-wise training, such as with a dask_ml Incremental wrapper.
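The difference between resetting and accumulating can be illustrated with a minimal sketch. The TinyCounter class below is a hypothetical stand-in written only for this illustration; it is not pybear code and tracks far less than TS does.

```python
from collections import Counter


class TinyCounter:
    """Hypothetical stand-in (not pybear code): counts strings per batch."""

    def __init__(self):
        self.counts = Counter()
        self.size = 0

    def partial_fit(self, batch):
        # Accumulate: everything seen in earlier batches is kept.
        self.counts.update(batch)
        self.size += len(batch)
        return self

    def fit(self, batch):
        # Reset first, then learn only from this batch.
        self.counts = Counter()
        self.size = 0
        return self.partial_fit(batch)


tc = TinyCounter()
tc.partial_fit(["a", "b"]).partial_fit(["b", "c"])
accumulated = dict(tc.counts)
accumulated_size = tc.size

tc.fit(["z"])  # discards the two earlier batches
after_fit = dict(tc.counts)
after_fit_size = tc.size
```

After the two partial fits the counter holds information from both batches; the later call to fit leaves only the information from its own batch.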

TS does have other methods that allow access to certain functionality, such as conveniently printing summary information from attributes to screen. See the methods section of the docs.

TS accepts 1D list-likes or (possibly ragged) 2D array-likes containing only strings. This includes Python lists, sets, and tuples, numpy vectors, pandas series, polars series, 2D Python built-ins, numpy arrays, pandas dataframes, and polars dataframes.
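As a rough illustration of what a ragged 2D input looks like once reduced to individual strings, here is a small sketch. The flatten_strings helper is hypothetical, and the assumption that every cell of a 2D container counts as one string is this sketch's, not a statement about how TS handles such input internally.

```python
def flatten_strings(X):
    """Hypothetical helper: yield every string from 1D or ragged 2D input."""
    for item in X:
        if isinstance(item, str):
            # 1D case: the item is already a string.
            yield item
        else:
            # 2D case: the item is a row of strings, possibly of any length.
            yield from item


ragged = [["I", "am", "Sam"], ["Sam", "I", "am"], ["That", "Sam-I-am!"]]
flat = list(flatten_strings(ragged))
```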

TS is always case-sensitive during fitting. This is a design choice so that users who want to differentiate between the same characters in different cases can do so. If you want your strings to be treated in a non-case-sensitive way, normalize the case of your strings prior to fitting on TS (hint: use pybear TextNormalizer).
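The effect of normalizing case before counting can be seen with plain Python. This sketch uses collections.Counter and str.lower as stand-ins; it does not use TS or TextNormalizer.

```python
from collections import Counter

strings = ["Sam", "sam", "SAM", "I", "i"]

# Case-sensitive counting: each casing is a distinct string.
case_sensitive = Counter(strings)

# Normalize case first so that the different casings collapse into one entry.
case_insensitive = Counter(s.lower() for s in strings)
```

Without normalization there are five unique strings; after lowercasing there are two.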

The TS class takes only one parameter, store_uniques. More on that below. The store_uniques parameter is intended to be set once at instantiation and not changed thereafter. This protects the integrity of the reported information. As such, TS has a no-op set_params() method. Advanced users may access and set the store_uniques parameter directly on the instance, but the effects of doing so in the midst of a series of partial fits or afterward are not tested. pybear does not recommend this technique; create a new instance with the desired setting and fit your data again. The TS get_params() method is fully functional.

When the store_uniques parameter is True, the TS instance retains a dictionary of all the unique strings it has seen during fitting and their frequencies. In this case, TS is able to yield all the information that it is designed to collect. This is ideal for situations with a ‘small’ number of unique strings, such as when fitting on cleaned tokens, where a recurrence of a unique will simply increment the count of that unique in the dictionary instead of creating a new entry.

When store_uniques is False, however, the unique strings seen during fitting are not stored. In this case, the memory footprint of the TS instance will not grow linearly with the number of unique strings seen during fitting. This enables TS to fit on practically unlimited amounts of text data. This is ideal for situations where the individual strings being fit are phrases, sentences, or even entire books. This comes at a cost, though, because some reporting capability is lost.

When store_uniques is False, the following remain available: size (the number of strings seen by the TS instance), average length, standard deviation of length, maximum length, minimum length, overall character frequency, and first character frequency. Functionality lost includes the unique strings themselves, as would otherwise be available through uniques_ and string_frequency_, and information about the longest and shortest strings. Methods whose reporting is impacted include lookup_string() and lookup_substring(), as well as the associated printing methods.
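To see why the length statistics survive without storing any strings, consider the following sketch, which keeps only a handful of running totals. RunningLengthStats is a hypothetical class written for this illustration, and the use of the population standard deviation is an assumption; the source does not say which estimator TS uses.

```python
import math


class RunningLengthStats:
    """Hypothetical sketch: length statistics without storing the strings."""

    def __init__(self):
        self.size = 0
        self.total = 0       # running sum of string lengths
        self.total_sq = 0    # running sum of squared string lengths
        self.min_length = None
        self.max_length = None

    def update(self, batch):
        for s in batch:
            n = len(s)
            self.size += 1
            self.total += n
            self.total_sq += n * n
            self.min_length = n if self.min_length is None else min(self.min_length, n)
            self.max_length = n if self.max_length is None else max(self.max_length, n)
        return self

    @property
    def average_length(self):
        return self.total / self.size

    @property
    def std_length(self):
        # Population standard deviation (an assumption, see the lead-in).
        mean = self.average_length
        return math.sqrt(max(self.total_sq / self.size - mean * mean, 0.0))


stats = RunningLengthStats()
stats.update(["I am Sam", "Sam I am", "That Sam-I-am!"])
stats.update(["That Sam-I-am!", "I do not like that Sam-I-am!"])
```

Memory use here is constant regardless of how many strings are seen, which is the trade the text describes: the totals are enough for size, mean, spread, minimum, and maximum, but not for recovering any individual string.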

Parameters:
store_uniques : bool, default = True

Whether to retain the unique strings seen by the TextStatistics instance in memory. If True, all attributes and print methods are fully informative. If False, the string_frequency_ and uniques_ attributes are always empty, and functionality that depends on these attributes has reduced capability.

Attributes:
size_

Get the size_ attribute.

uniques_

Get the uniques_ attribute.

overall_statistics_

Get the overall_statistics_ attribute.

string_frequency_

Get the string_frequency_ attribute.

startswith_frequency_

Get the startswith_frequency_ attribute.

character_frequency_

Get the character_frequency_ attribute.

Methods

fit(X[, y])

Single batch training of the TextStatistics instance.

fit_transform(X[, y])

Fit to data, then transform it.

get_longest_strings([n])

The longest strings seen by the TextStatistics instance during fitting.

get_params([deep])

Get parameters for this instance.

get_shortest_strings([n])

The shortest strings seen by the TextStatistics instance during fitting.

lookup_string(pattern[, case_sensitive])

Use string literals or regular expressions to look for whole string matches (not substrings) in the fitted words.

lookup_substring(pattern[, case_sensitive])

Use string literals or regular expressions to look for substring matches in the fitted words.

partial_fit(X[, y])

Batch-wise fitting of the TextStatistics instance.

print_character_frequency()

Print the character_frequency_ attribute to screen.

print_longest_strings([n])

Print the longest strings in string_frequency_ to screen.

print_overall_statistics()

Print overall_statistics_ to screen.

print_shortest_strings([n])

Print the shortest strings in string_frequency_ to screen.

print_startswith_frequency()

Print the startswith_frequency_ attribute to screen.

print_string_frequency([n])

Print the string_frequency_ attribute to screen.

score(X[, y])

No-op score method.

set_params(**params)

No-op set_params method.

transform(X)

A no-op transform method for data processing scenarios that may require the transform method.

Examples

>>> from pybear.feature_extraction.text import TextStatistics as TS
>>> STRINGS = ['I am Sam', 'Sam I am', 'That Sam-I-am!',
...    'That Sam-I-am!', 'I do not like that Sam-I-am!']
>>> trfm = TS(store_uniques=True)
>>> trfm.fit(STRINGS)
TextStatistics()
>>> trfm.size_
5
>>> trfm.overall_statistics_['max_length']
28
>>> trfm.overall_statistics_['average_length']
14.4
>>> STRINGS = ['a', 'a', 'b', 'c', 'c', 'c', 'd', 'd', 'e', 'f', 'f']
>>> trfm = TS()
>>> trfm.fit(STRINGS)
TextStatistics()
>>> trfm.size_
11
>>> trfm.string_frequency_
{'a': 2, 'b': 1, 'c': 3, 'd': 2, 'e': 1, 'f': 2}
>>> trfm.uniques_
['a', 'b', 'c', 'd', 'e', 'f']
>>> trfm.overall_statistics_['max_length']
1
>>> trfm.character_frequency_
{'a': 2, 'b': 1, 'c': 3, 'd': 2, 'e': 1, 'f': 2}

property character_frequency_#

Get the character_frequency_ attribute.

A dictionary that holds all the unique single characters and their frequencies for all the strings fitted on the TextStatistics instance.

Returns:
character_frequency_ : dict[str, int]

The counts of every character seen during fitting.

fit(X, y=None)#

Single batch training of the TextStatistics instance. The instance is reset and the only information retained is that associated with this single batch of data.

Parameters:
X : XContainer

A 1D list-like or 2D array-like of strings to report statistics for. Can be empty. Strings do not need to be in the pybear Lexicon.

y : Any, default = None

A target for the data. Always ignored.

Returns:
self : object

The TextStatistics instance.

fit_transform(X, y=None, **fit_params)#

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : array_like of shape (n_samples, n_features)

Required. The data.

y : array_like of shape (n_samples, n_outputs) or (n_samples,)

Optional, default=None. Target values (None for unsupervised transformations).

**fit_params : dict

Additional fit parameters.

Returns:
X_tr : array_like of shape (n_samples, n_features_new)

Transformed array.

get_longest_strings(n=10)#

The longest strings seen by the TextStatistics instance during fitting.

Only available if store_uniques is True. If False, the uniques seen during fitting are not available and an empty dictionary is always returned.

Parameters:
n : int, default = 10

The number of the top longest strings to return.

Returns:
dict[str, int]:

The top ‘n’ longest strings seen by the TextStatistics instance during fitting. This will always be empty if store_uniques is False.

get_params(deep=True)#

Get parameters for this instance.

The ‘instance’ may be a pybear estimator, transformer, or a gridsearch module that wraps a nested estimator or pipeline.

Parameters:
deep : bool, default = True

For instances that do not have nested instances in an estimator attribute (such as estimators or transformers), this parameter is ignored and the same (full) set of parameters for the instance is returned regardless of the value of this parameter.

For instances that have nested instances (such as a GridSearch instance with a nested estimator or pipeline) in the estimator attribute, deep=False will only return the parameters for the wrapping instance. For example, a GridSearch module wrapping an estimator will only return parameters for the GridSearch instance, ignoring the parameters of the nested instance. When deep=True, this method returns the parameters of the wrapping instance as well as the parameters of the nested instance. When the nested instance is a single estimator, the full set of parameters for the single estimator are returned in addition to the parameters of the wrapping instance. If the nested object is a pipeline, the parameters of the pipeline and the parameters of each of the steps in the pipeline are returned in addition to the parameters of the wrapping instance. The estimator’s parameters are prefixed with estimator__.

Returns:
params : dict[str, Any]

Parameter names mapped to their values.

get_shortest_strings(n=10)#

The shortest strings seen by the TextStatistics instance during fitting.

Only available if store_uniques is True. If False, the uniques seen during fitting are not available and an empty dictionary is always returned.

Parameters:
n : int, default = 10

The number of the top shortest strings to return.

Returns:
dict[str, int]:

The top ‘n’ shortest strings seen by the TextStatistics instance during fitting. This will always be empty if store_uniques is False.
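The selection performed by get_longest_strings and get_shortest_strings can be sketched as a sort by length followed by a cut at n. The two helper functions below are hypothetical. The source gives the return type as dict[str, int] without saying what the integer represents or how ties are ordered, so this sketch assumes the value is the string's length and breaks ties alphabetically.

```python
def longest_strings(uniques, n=10):
    """Hypothetical sketch: map the n longest unique strings to their lengths."""
    # Sort by descending length, then alphabetically (tie-break is an assumption).
    ranked = sorted(set(uniques), key=lambda s: (-len(s), s))
    return {s: len(s) for s in ranked[:n]}


def shortest_strings(uniques, n=10):
    """Hypothetical sketch: map the n shortest unique strings to their lengths."""
    ranked = sorted(set(uniques), key=lambda s: (len(s), s))
    return {s: len(s) for s in ranked[:n]}


uniques = ["I am Sam", "Sam I am", "That Sam-I-am!", "I do not like that Sam-I-am!"]
top2 = longest_strings(uniques, n=2)
bottom1 = shortest_strings(uniques, n=1)
```

With no stored uniques there is nothing to rank, which is consistent with both methods returning an empty dictionary when store_uniques is False.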

lookup_string(pattern, case_sensitive=True)#

Use string literals or regular expressions to look for whole string matches (not substrings) in the fitted words.

pattern can be a literal string or a regular expression in a re.compile object.

If a re.compile object is passed, case_sensitive is ignored and the fitted words are searched with the compile object as given. If a string literal is passed and case_sensitive is True, search for an exact match of the whole passed string; if case_sensitive is False, search without regard to case.

If an exact match is not found, return an empty list. If matches are found, return a 1D list of the matches in their original form from the fitted data.

This is only available if store_uniques is True. If False, the unique strings that have been fitted on the TS instance are not retained and therefore cannot be searched, so an empty list is always returned.

Parameters:
pattern : str | re.Pattern[str]

Character sequence or regular expression in a re.compile object to be looked up against the strings fitted on the TextStatistics instance.

case_sensitive : bool, default = True

Ignored if a re.compile object is passed to pattern. If True, search for the exact pattern in the fitted data. If False, ignore the case of the words in uniques_ while performing the search.

Returns:
list[str]:

If there are any matches, return the matching string(s) from the originally fitted data in a 1D list; if there are no matches, return an empty list.
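The matching rules described for lookup_string can be sketched in a few lines. The lookup_whole function below is a hypothetical re-implementation for illustration only; it operates on a plain list rather than on a fitted TS instance, and the use of str.lower for case-insensitive comparison is an assumption.

```python
import re


def lookup_whole(uniques, pattern, case_sensitive=True):
    """Hypothetical sketch of whole-string lookup over a list of uniques."""
    if isinstance(pattern, re.Pattern):
        # Compiled pattern: used as given; case_sensitive is ignored.
        return [s for s in uniques if pattern.fullmatch(s)]
    if case_sensitive:
        return [s for s in uniques if s == pattern]
    # Compare without regard to case, but return the original strings.
    return [s for s in uniques if s.lower() == pattern.lower()]


uniques = ["Sam", "sam", "Sam-I-am!", "I"]
exact = lookup_whole(uniques, "sam")
any_case = lookup_whole(uniques, "sam", case_sensitive=False)
by_regex = lookup_whole(uniques, re.compile(r"s.m", re.IGNORECASE))
```

Note that "Sam-I-am!" is never returned here, because only whole-string matches count.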

lookup_substring(pattern, case_sensitive=True)#

Use string literals or regular expressions to look for substring matches in the fitted words.

pattern can be a literal string or a regular expression in a re.compile object.

If a re.compile object is passed, case_sensitive is ignored and the fitted words are searched with the compile object as given. If a string is passed and case_sensitive is True, search for an exact substring match of the passed string; if case_sensitive is False, search without regard to case.

If a substring match is not found, return an empty list. If matches are found, return a 1D list of the matches in their original form from the fitted data.

This is only available if store_uniques is True. If False, the unique strings that have been fitted on the TS instance are not retained and therefore cannot be searched, so an empty list is always returned.

Parameters:
pattern : str | re.Pattern[str]

Character sequence or regular expression in a re.compile object to be looked up against the strings fitted on the TextStatistics instance.

case_sensitive : bool, default = True

Ignored if a re.compile object is passed to pattern. If True, search for the exact pattern in the fitted data. If False, ignore the case of the words in uniques_ while performing the search.

Returns:
list[str]:

List of all strings in the fitted data that contain the given character substring. Returns an empty list if there are no matches.
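Substring lookup differs from whole-string lookup only in the matching test. The lookup_sub function below is a hypothetical sketch over a plain list, not the TS method itself; it assumes regular expressions are applied with search semantics and that case-insensitive matching lowercases both sides.

```python
import re


def lookup_sub(uniques, pattern, case_sensitive=True):
    """Hypothetical sketch of substring lookup over a list of uniques."""
    if isinstance(pattern, re.Pattern):
        # Compiled pattern: match anywhere in the string; case_sensitive is ignored.
        return [s for s in uniques if pattern.search(s)]
    if case_sensitive:
        return [s for s in uniques if pattern in s]
    return [s for s in uniques if pattern.lower() in s.lower()]


uniques = ["I am Sam", "Sam I am", "That Sam-I-am!"]
exact = lookup_sub(uniques, "sam")
any_case = lookup_sub(uniques, "sam", case_sensitive=False)
by_regex = lookup_sub(uniques, re.compile(r"^That"))
```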

property overall_statistics_#

Get the overall_statistics_ attribute.

A dictionary that holds information about all the strings fitted on the TextStatistics instance. Available statistics are size (number of strings seen during fitting), uniques count, average string length, standard deviation of string length, maximum string length, and minimum string length. If store_uniques is False, the uniques_count field will always be zero.

Returns:
overall_statistics_ : dict[str, numbers.Real]

Summary information about all the strings seen during fit.

partial_fit(X, y=None)#

Batch-wise fitting of the TextStatistics instance.

The instance is not reset, and information about the strings accumulates across the batches of training data.

Parameters:
X : XContainer

A 1D list-like or 2D array-like of strings to report statistics for. Can be empty. Strings do not need to be in the pybear Lexicon.

y : Any, default = None

A target for the data. Always ignored.

Returns:
self : object

The TextStatistics instance.

print_character_frequency()#

Print the character_frequency_ attribute to screen.

Returns:
None

print_longest_strings(n=10)#

Print the longest strings in string_frequency_ to screen.

Only available if store_uniques is True. If False, uniques are not available for display to screen.

Parameters:
n : int, default = 10

The number of top longest strings to print to screen.

Returns:
None

print_overall_statistics()#

Print overall_statistics_ to screen.

The uniques_count field will always be zero if store_uniques is False.

Returns:
None

print_shortest_strings(n=10)#

Print the shortest strings in string_frequency_ to screen.

Only available if store_uniques is True. If False, uniques are not available for display to screen.

Parameters:
n : int, default = 10

The number of shortest strings to print to screen.

Returns:
None

print_startswith_frequency()#

Print the startswith_frequency_ attribute to screen.

Returns:
None

print_string_frequency(n=10)#

Print the string_frequency_ attribute to screen.

Only available if store_uniques is True. If False, uniques are not available for display to screen.

Parameters:
n : int, default = 10

The number of the most frequent strings to print to screen.

Returns:
None

score(X, y=None)#

No-op score method.

Dummy method to satisfy the dask Incremental and ParallelPostFit wrappers. It has been verified that this method must be present for the dask wrappers to work.

Parameters:
X : Any

The data. Ignored.

y : Any, default = None

The target for the data. Ignored.

set_params(**params)#

No-op set_params method.

Returns:
self : object

The TextStatistics instance.

property size_#

Get the size_ attribute.

The number of strings fitted on the TextStatistics instance.

Returns:
size : int

The number of strings fitted on the TextStatistics instance.

property startswith_frequency_#

Get the startswith_frequency_ attribute.

A dictionary that holds the first characters and their frequencies in the first position for all the strings fitted on the TextStatistics instance.

Returns:
startswith_frequency_ : dict[str, int]

The first characters of every string seen during fit and their respective frequencies.
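The distinction between first-character frequency and overall character frequency can be shown with collections.Counter on the strings from the Examples section. This is a plain-Python sketch of the two quantities, not the way TS computes them; skipping empty strings when taking the first character is an assumption.

```python
from collections import Counter

strings = [
    "I am Sam",
    "Sam I am",
    "That Sam-I-am!",
    "That Sam-I-am!",
    "I do not like that Sam-I-am!",
]

# Count only the character in the first position of each non-empty string.
startswith = Counter(s[0] for s in strings if s)

# Count every character in every string.
characters = Counter("".join(strings))
```

Here "S" starts only one string but appears five times overall, which is why the two attributes are reported separately.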

property string_frequency_#

Get the string_frequency_ attribute.

A dictionary that holds the unique strings and the respective number of occurrences seen during fitting. If the store_uniques parameter is False, this will always be empty.

Returns:
string_frequency_ : dict[str, int]

The unique strings seen during fitting and their frequency.

transform(X)#

A no-op transform method for data processing scenarios that may require the transform method.

X is returned as given.

Parameters:
X : XContainer

The data. Ignored.

Returns:
X : XContainer

The original, unchanged, data.

property uniques_#

Get the uniques_ attribute.

A 1D list of the unique strings fitted on the TextStatistics instance. If store_uniques is False, this will always be empty.

Returns:
uniques_ : list[str]

A 1D list of the unique strings seen during fitting.