Lexicon

class pybear.feature_extraction.text.Lexicon

Bases: TextStatistics

The pybear lexicon of words in the English language.

The Lexicon is not exhaustive, though it aims to be comprehensive. It serves as a list of words in the English language for text-cleaning purposes. Lexicon also has an attribute for pybear-defined stop words.

The published pybear Lexicon allows only the 26 letters of the English alphabet, and all must be upper-case. Other characters, such as numbers, hyphens, and apostrophes, are not allowed. For example, entries one may see in the pybear Lexicon include “APPLE”, “APRICOT”, and “APRIL”. Entries that one will not see in the published version are “AREN’T”, “ISN’T”, and “WON’T” (the entries would be “ARENT”, “ISNT”, and “WONT”). Lexicon has validation in place to enforce these rules and protect the integrity of the published pybear Lexicon. However, this validation can be turned off, and local copies can be updated with any strings that the user likes.
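The rules for published entries can be sketched in a few lines. This is an illustrative stand-alone check, not pybear's own validation code; the names `follows_published_rules` and `to_published_form` are hypothetical.

```python
import re

# A published-style entry: one or more upper-case English letters, nothing else.
PUBLISHED_ENTRY = re.compile(r"[A-Z]+")


def follows_published_rules(word: str) -> bool:
    """Return True if `word` uses only the 26 upper-case English letters."""
    return PUBLISHED_ENTRY.fullmatch(word) is not None


def to_published_form(word: str) -> str:
    """Hypothetical normalizer: upper-case the word and drop every non-letter."""
    return re.sub(r"[^A-Z]", "", word.upper())


print(follows_published_rules("APPLE"))   # True
print(follows_published_rules("AREN'T"))  # False
print(to_published_form("aren't"))        # ARENT
```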

pybear stores its lexicon and stop words in text files that are read from the local disk when a Lexicon class is instantiated, populating the attributes of the instance. There are 26 lexicon files, one for each letter of the English alphabet, and each word is stored in the file that matches its first letter.

The add_words() method allows users to add words to their local copies of the Lexicon, that is, to write new words to the Lexicon text files. The validation protocols protect the integrity of the published version of the pybear Lexicon, and the user must consider them when changing a local copy. When making local additions via add_words(), this validation can be turned off with the file_validation, character_validation, and majuscule_validation keyword arguments. Turning them off lets your Lexicon take non-alpha characters and strings of any case, and lets Lexicon create new text files for itself.

Attributes:

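size_: The number of strings fitted on the TextStatistics instance.
lexicon_: A list of all the words in the pybear Lexicon.
stop_words_: The list of pybear stop words.
overall_statistics_: A dictionary of summary statistics about all the strings fitted on the TextStatistics instance.
string_frequency_: A dictionary of the unique strings and their number of occurrences seen during fitting.
startswith_frequency_: A dictionary of the first characters of the fitted strings and their frequencies in the first position.
character_frequency_: A dictionary of the unique single characters and their frequencies across all the fitted strings.
uniques_: A 1D list of the unique strings fitted on the TextStatistics instance.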

Methods

`add_words`(WORDS[, character_validation, ...])	Silently update the pybear Lexicon text files with the given words.
`check_order`()	Determine if the lexicon files are out of alphabetical order.
`delete_words`(WORDS)	Remove the given word(s) from the pybear Lexicon text files.
`find_duplicates`()	Find any duplicates in the Lexicon.
`fit`(X[, y])	Blocked.
`fit_transform`(X)	Blocked.
`get_longest_strings`([n])	The longest strings seen by the TextStatistics instance during fitting.
`get_params`([deep])	Blocked.
`get_shortest_strings`([n])	The shortest strings seen by the TextStatistics instance during fitting.
`lookup_string`(pattern)	Use string literals or regular expressions to look for full word matches in the Lexicon.
`lookup_substring`(pattern)	Use string literals or regular expressions to look for substring matches in the Lexicon.
`partial_fit`(X[, y])	Blocked.
`print_character_frequency`()	Print the `character_frequency_` attribute to screen.
`print_longest_strings`([n])	Print the longest strings in `string_frequency_` to screen.
`print_overall_statistics`()	Print `overall_statistics_` to screen.
`print_shortest_strings`([n])	Print the shortest strings in `string_frequency_` to screen.
`print_startswith_frequency`()	Print the `startswith_frequency_` attribute to screen.
`print_string_frequency`([n])	Print the `string_frequency_` attribute to screen.
`score`(X[, y])	Blocked.
`set_params`([deep])	Blocked.
`transform`(X)	Blocked.

Examples

>>> from pybear.feature_extraction.text import Lexicon
>>> Lex = Lexicon()
>>> round(Lex.size_, -4)
70000
>>> Lex.lexicon_[:5]
['A', 'AA', 'AAA', 'AARDVARK', 'AARDVARKS']
>>> Lex.stop_words_[:5]
['A', 'ABOUT', 'ACROSS', 'AFTER', 'AGAIN']
>>> round(Lex.overall_statistics_['average_length'], 0)
8.0
>>> Lex.lookup_string('MONKEY')
'MONKEY'
>>> Lex.lookup_string('SUPERCALIFRAGILISTICEXPIALIDOCIOUS')
>>>
>>> Lex.lookup_substring('TCHSTR')
['LATCHSTRING', 'LATCHSTRINGS']

add_words(WORDS, character_validation=True, majuscule_validation=True, file_validation=True)

Silently update the pybear Lexicon text files with the given words.

Words that are already in the Lexicon are silently ignored. This is very much a case-sensitive operation.

The ‘validation’ parameters allow you to disable the pybear Lexicon rules. The published pybear Lexicon accepts only the 26 letters of the English alphabet; numbers, spaces, and punctuation, for example, are not allowed. It also requires that all entries be MAJUSCULE, i.e., upper-case. The published pybear Lexicon will always follow these rules, and the validation, when enabled, enforces them. However, the user can override this validation for local copies of the Lexicon by setting character_validation, majuscule_validation, and/or file_validation to False. If you want your Lexicon to hold strings that contain numbers, spaces, punctuation, or mixed case, set the relevant validation to False and add your strings via this method.

pybear stores words in the Lexicon text files based on the first character of the string. So a word like ‘APPLE’ is stored in a file named ‘lexicon_A’ (this is the default pybear way). A word like ‘apple’ would be stored in a file named ‘lexicon_a’. Keep in mind that the pybear Lexicon is built with all upper-case words and file names, and these are the only ones that exist out of the box. If you turn off the majuscule validation and file validation and pass the word ‘apple’ to this method, it will NOT append ‘APPLE’ to the ‘lexicon_A’ file. Instead, a new Lexicon file called ‘lexicon_a’ will be created and the word ‘apple’ will be put into it.
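The first-character routing described above can be illustrated with a small stand-alone sketch. It only computes target file names in memory and writes nothing to disk; the helper names are hypothetical, and any file extension pybear may use is deliberately left out because it is not stated here.

```python
from collections import defaultdict


def lexicon_file_name(word: str) -> str:
    """Name of the file a word would be routed to, keyed on its first character."""
    if not word:
        raise ValueError("word cannot be an empty string")
    return f"lexicon_{word[0]}"


def group_by_file(words):
    """Map each target file name to the sorted words that would be stored in it."""
    groups = defaultdict(list)
    for word in words:
        groups[lexicon_file_name(word)].append(word)
    return {name: sorted(members) for name, members in groups.items()}


# 'apple' does not land in 'lexicon_A'; it gets its own 'lexicon_a' group.
print(group_by_file(["APPLE", "apple", "APRIL", "BANANA"]))
```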

The Lexicon instance reloads the Lexicon from disk and repopulates its attributes when the update is complete.

Parameters:

WORDS (str | Sequence[str]): The word or words to be added to the pybear Lexicon. Cannot be an empty string or an empty sequence. Words that are already in the Lexicon are silently ignored.
character_validation (bool, default = True): Whether to apply pybear Lexicon character validation to the word or sequence of words. The pybear Lexicon allows only the 26 letters of the English alphabet: no spaces, no hyphens, no apostrophes. If True, any non-alpha characters will raise an exception during validation. If False, any string character is accepted.
majuscule_validation (bool, default = True): Whether to apply pybear Lexicon majuscule validation to the word or sequence of words. The pybear Lexicon requires all characters to be majuscule, i.e., EVERYTHING MUST BE UPPER-CASE. If True, any non-majuscule characters will raise an exception during validation. If False, any case is accepted.
file_validation (bool, default = True): Whether to apply pybear Lexicon file name validation to the word or sequence of words. The formal pybear Lexicon only allows words to start with one of the 26 upper-case letters of the English alphabet (which then dictates the name of the file in which the word is stored). If True, any disallowed character in the first position will raise an exception during validation. If False, any character is accepted, which may then require that a new file be created.

Returns:

None
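To make the three switches concrete, here is a minimal stand-in for the checks described in the parameters above. It is a sketch under stated assumptions, not pybear's implementation: the function name `validate_words` is hypothetical, `ValueError` is an assumed exception type (the text only says an exception is raised), and nothing is written to any Lexicon file.

```python
import string
from typing import List, Sequence, Union

UPPER = set(string.ascii_uppercase)
ALPHA = set(string.ascii_letters)


def validate_words(
    words: Union[str, Sequence[str]],
    character_validation: bool = True,
    majuscule_validation: bool = True,
    file_validation: bool = True,
) -> List[str]:
    """Hypothetical stand-in for the three checks; returns the words as a list."""
    if isinstance(words, str):
        words = [words]
    words = list(words)
    if not words or any(not isinstance(w, str) or w == "" for w in words):
        raise ValueError("WORDS cannot be empty or contain empty strings")
    for word in words:
        # Character check: only the English letters are allowed.
        if character_validation and not set(word) <= ALPHA:
            raise ValueError(f"{word!r} contains non-alpha characters")
        # Majuscule check: the word must already be fully upper-case.
        if majuscule_validation and word != word.upper():
            raise ValueError(f"{word!r} is not fully upper-case")
        # File check: the first character must map to an existing A-Z file.
        if file_validation and word[0] not in UPPER:
            raise ValueError(f"{word!r} would require a new lexicon file")
    return words


print(validate_words(["APPLE", "APRICOT"]))
print(validate_words("apple", majuscule_validation=False, file_validation=False))
print(validate_words("ISN'T", character_validation=False))
```

With every flag left at True, `validate_words("apple")` raises in this sketch because the word is not upper-case.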

property character_frequency_

Get the character_frequency_ attribute.

A dictionary that holds all the unique single characters and their frequencies for all the strings fitted on the TextStatistics instance.

Returns:

character_frequency_ (dict[str, int]): The counts of every character seen during fitting.
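As a rough illustration of what such a dictionary contains, the counts can be reproduced with `collections.Counter`. This is a sketch of the idea, not the way TextStatistics computes it; the function name is hypothetical.

```python
from collections import Counter


def character_frequency(strings):
    """Count every single character across all the given strings."""
    counts = Counter()
    for s in strings:
        counts.update(s)  # updating with a string counts its characters
    return dict(counts)


freq = character_frequency(["APPLE", "APRIL"])
print(freq["A"], freq["P"], freq["L"])  # 2 3 2
```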

check_order()

Determine if the lexicon files are out of alphabetical order.

Compare the words as stored against a sorted vector of the words. Display any out-of-order words to screen and return them as a Python list.

Returns:

list[str]: Vector of any out-of-sequence words in the Lexicon.
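One simple way to picture the comparison is to line the stored words up against their sorted order and keep the ones that differ. How pybear decides exactly which words count as out of sequence is not specified here, so the position-by-position rule below is an assumption, and the function name is hypothetical.

```python
def out_of_order_words(stored):
    """Return words whose stored position differs from their sorted position."""
    return [word for word, expected in zip(stored, sorted(stored)) if word != expected]


print(out_of_order_words(["APPLE", "APRICOT", "APRIL"]))  # []
print(out_of_order_words(["APPLE", "APRIL", "APRICOT"]))  # ['APRIL', 'APRICOT']
```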

delete_words(WORDS)

Remove the given word(s) from the pybear Lexicon text files. Case sensitive! Any words that are not in the pybear Lexicon are silently ignored.

Parameters:

WORDS (str | Sequence[str]): The word or words to remove from the pybear Lexicon. Cannot be an empty string or an empty sequence.

Returns:

None

find_duplicates()

Find any duplicates in the Lexicon.

If there are any, display them to screen and return them as a Python dictionary with their frequencies.

Returns:

dict[str, int]: Any duplicates in the pybear Lexicon and their frequencies.
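The shape of the returned dictionary can be illustrated with a short stand-alone sketch that counts a plain list of words. It mirrors the description above and is not pybear's code.

```python
from collections import Counter


def find_duplicates(words):
    """Return each word that occurs more than once, with its frequency."""
    return {word: count for word, count in Counter(words).items() if count > 1}


print(find_duplicates(["APPLE", "APRIL", "APPLE", "APRICOT", "APPLE"]))  # {'APPLE': 3}
print(find_duplicates(["APPLE", "APRIL"]))                               # {}
```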

fit(X, y=None): Blocked.

fit_transform(X): Blocked.

get_longest_strings(n=10)

The longest strings seen by the TextStatistics instance during fitting.

Only available if store_uniques is True. If False, the uniques seen during fitting are not available and an empty dictionary is always returned.

Parameters:

n (int, default = 10): The number of longest strings to return.

Returns:

dict[str, int]: The top ‘n’ longest strings seen by the TextStatistics instance during fitting. This will always be empty if store_uniques is False.

get_params(deep=True): Blocked.

get_shortest_strings(n=10)

The shortest strings seen by the TextStatistics instance during fitting.

Only available if store_uniques is True. If False, the uniques seen during fitting are not available and an empty dictionary is always returned.

Parameters:

n (int, default = 10): The number of shortest strings to return.

Returns:

dict[str, int]: The top ‘n’ shortest strings seen by the TextStatistics instance during fitting. This will always be empty if store_uniques is False.

property lexicon_

A list of all the words in the pybear Lexicon.

Returns:

uniques (list[str]): A list of all the words in the pybear Lexicon.

lookup_string(pattern)

Use string literals or regular expressions to look for full word matches in the Lexicon.

pattern can be a literal string or a regular expression in a re.compile object. Return a list of all words in the Lexicon that completely match the given pattern. Returns an empty list if there are no matches.

pybear Lexicon forces this search to be case-sensitive. If you pass a re.compile object with an IGNORECASE flag, this method strips that flag and leaves the other flags intact.

Parameters:

pattern (str | re.Pattern[str]): Character sequence or regular expression in a re.compile object to be looked up against the pybear Lexicon.

Returns:

matches (list[str]): List of all full words in the pybear Lexicon that match the pattern. Returns an empty list if there are no matches.
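The full-match and flag-stripping behavior described above can be sketched against a plain list of words. This is an illustration only: the stand-alone function takes the word list as a hypothetical second argument, and treating a plain `str` as an escaped literal is an assumption about how literals are handled.

```python
import re


def lookup_string(pattern, lexicon):
    """Return the words in `lexicon` that the pattern matches in full, case-sensitively."""
    if isinstance(pattern, str):
        # Assumption: a plain string is a literal, so regex metacharacters are escaped.
        compiled = re.compile(re.escape(pattern))
    else:
        # Strip IGNORECASE and keep every other flag of the compiled pattern.
        compiled = re.compile(pattern.pattern, pattern.flags & ~re.IGNORECASE)
    return [word for word in lexicon if compiled.fullmatch(word)]


words = ["DONKEY", "MONKEY", "MONKEYS"]
print(lookup_string("MONKEY", words))                              # ['MONKEY']
print(lookup_string(re.compile("[DM]ONKEY"), words))               # ['DONKEY', 'MONKEY']
print(lookup_string(re.compile("monkey", re.IGNORECASE), words))   # []
```

The last call returns an empty list because the IGNORECASE flag is removed before matching, so the lower-case pattern no longer matches the upper-case entries.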

lookup_substring(pattern)

Use string literals or regular expressions to look for substring matches in the Lexicon.

pattern can be a literal string or a regular expression in a re.compile object. Return a list of all words in the Lexicon that contain the given substring pattern. Returns an empty list if there are no matches.

pybear Lexicon forces this search to be case-sensitive. If you pass a re.compile object with an IGNORECASE flag, this method strips that flag and leaves the other flags intact.

Parameters:

pattern (str | re.Pattern[str]): Character sequence or regular expression in a re.compile object to be looked up against the pybear Lexicon.

Returns:

matches (list[str]): List of all words in the pybear Lexicon that contain the given character substring. Returns an empty list if there are no matches.
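In contrast to a full-word match, a substring lookup keeps any word that contains the pattern anywhere. The sketch below runs against a plain list passed as a hypothetical second argument and is not pybear's implementation; handling a plain `str` with the `in` operator is an assumption.

```python
import re


def lookup_substring(pattern, lexicon):
    """Return the words in `lexicon` that contain the pattern, case-sensitively."""
    if isinstance(pattern, str):
        # Assumption: a plain string is a literal substring.
        return [word for word in lexicon if pattern in word]
    # Drop IGNORECASE, keep the other flags, then search anywhere in each word.
    compiled = re.compile(pattern.pattern, pattern.flags & ~re.IGNORECASE)
    return [word for word in lexicon if compiled.search(word)]


words = ["LATCH", "LATCHSTRING", "LATCHSTRINGS", "STRING"]
print(lookup_substring("TCHSTR", words))               # ['LATCHSTRING', 'LATCHSTRINGS']
print(lookup_substring(re.compile(r"STRINGS?$"), words))
```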

property overall_statistics_

Get the overall_statistics_ attribute.

A dictionary that holds information about all the strings fitted on the TextStatistics instance. Available statistics are size (number of strings seen during fitting), uniques count, average string length, standard deviation of string length, maximum string length, and minimum string length. If store_uniques is False, the uniques_count field will always be zero.

Returns:

overall_statistics_ (dict[str, numbers.Real]): Summary information about all the strings seen during fit.
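The statistics listed above can be reproduced for a small list of strings as follows. This is a sketch, not the TextStatistics implementation. Only the `average_length` key appears in the examples on this page; the other key names, and the use of the population standard deviation, are assumptions made for illustration.

```python
import statistics


def overall_statistics(strings, store_uniques=True):
    """Summarize a non-empty list of strings by count and length."""
    lengths = [len(s) for s in strings]
    return {
        "size": len(strings),
        # Without stored uniques, the unique count is reported as zero.
        "uniques_count": len(set(strings)) if store_uniques else 0,
        "average_length": statistics.fmean(lengths),
        "std_length": statistics.pstdev(lengths),  # assumed: population std
        "max_length": max(lengths),
        "min_length": min(lengths),
    }


stats = overall_statistics(["A", "AA", "AAA", "AA"])
print(stats["size"], stats["uniques_count"], stats["average_length"])  # 4 3 2.0
```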

partial_fit(X, y=None): Blocked.

print_character_frequency()

Print the character_frequency_ attribute to screen.

Returns:

None

print_longest_strings(n=10)

Print the longest strings in string_frequency_ to screen.

Only available if store_uniques is True. If False, uniques are not available for display to screen.

Parameters:

n (int, default = 10): The number of longest strings to print to screen.

Returns:

None

print_overall_statistics()

Print overall_statistics_ to screen.

The uniques_count field will always be zero if store_uniques is False.

Returns:

None

print_shortest_strings(n=10)

Print the shortest strings in string_frequency_ to screen.

Only available if store_uniques is True. If False, uniques are not available for display to screen.

Parameters:

n (int, default = 10): The number of shortest strings to print to screen.

Returns:

None

print_startswith_frequency()

Print the startswith_frequency_ attribute to screen.

Returns:

None

print_string_frequency(n=10)

Print the string_frequency_ attribute to screen.

Only available if store_uniques is True. If False, uniques are not available for display to screen.

Parameters:

n (int, default = 10): The number of most frequent strings to print to screen.

Returns:

None

score(X, y=None): Blocked.

set_params(deep=True): Blocked.

property size_

Get the size_ attribute.

The number of strings fitted on the TextStatistics instance.

Returns:

size (int): The number of strings fitted on the TextStatistics instance.

property startswith_frequency_

Get the startswith_frequency_ attribute.

A dictionary that holds the first characters and their frequencies in the first position for all the strings fitted on the TextStatistics instance.

Returns:

startswith_frequency_ (dict[str, int]): The first characters of every string seen during fit and their respective frequencies.
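For a small word list, a dictionary of this shape can be built by counting first characters. This sketch is for illustration and is not how TextStatistics computes the attribute; the function name is hypothetical.

```python
from collections import Counter


def startswith_frequency(strings):
    """Count how often each character appears in the first position."""
    return dict(Counter(s[0] for s in strings if s))  # empty strings are skipped


print(startswith_frequency(["APPLE", "APRIL", "BANANA"]))  # {'A': 2, 'B': 1}
```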

property stop_words_

The list of pybear stop words.

The words are the most frequent words in an arbitrary multi-million-word corpus scraped from the internet.

Returns:

_stop_words (list[str]): The list of pybear stop words.

property string_frequency_

Get the string_frequency_ attribute.

A dictionary that holds the unique strings and the respective number of occurrences seen during fitting. If the store_uniques parameter is False, this will always be empty.

Returns:

string_frequency_ (dict[str, int]): The unique strings seen during fitting and their frequency.

transform(X): Blocked.

property uniques_

Get the uniques_ attribute.

A 1D list of the unique strings fitted on the TextStatistics instance. If store_uniques is False, this will always be empty.

Returns:

uniques_ (list[str]): A 1D list of the unique strings seen during fitting.