/Ngrams

Code for creating n-grams from a given list of strings

Primary Language: Python

Last Updated: March 20, 2018

NGrams is a simple script that calculates n-grams from a given text corpus (a list of strings). The script does no text processing of its own; it only calculates the n-grams. Its recursive implementation lets the short script scale to any n-gram order.

The projectRun.py file uses NLTK for tokenization, though the NGrams script itself is written in standard Python 3.
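As a rough illustration of the idea, here is a minimal sketch of recursive n-gram counting over an already tokenized corpus. It is not the code from this repository: the function name `count_ngrams`, the use of `collections.Counter`, and the choice to join the words of a gram with a single space are all assumptions made for the example.

```python
from collections import Counter


def count_ngrams(tokens, order):
    """Return a list of Counters; index i holds the counts of (i + 1)-grams.

    Hypothetical helper for illustration only, not the repository's API.
    """
    if order < 1:
        return []
    # Recurse first so that lower orders appear first in the result list.
    lower_orders = count_ngrams(tokens, order - 1)
    # Slide a window of `order` tokens over the corpus and count each gram.
    grams = Counter(
        " ".join(tokens[i:i + order])
        for i in range(len(tokens) - order + 1)
    )
    return lower_orders + [grams]


tokens = "the cat sat on the mat".split()
counts = count_ngrams(tokens, 2)
print(counts[0]["the"])      # unigram count of "the": 2
print(counts[1]["the cat"])  # bigram count of "the cat": 1
```

Because each call only adds one more order on top of the orders below it, the same few lines work for any order, which is the property the description above attributes to the recursive design.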

The class stores the following data which is accessible through simple get methods:

__lst_wordCorpus = ""               # the entire word corpus
__lst_wordCorpusSet = []            # the set of unique words in the corpus
__int_corpusLength = 0              # length of the entire corpus
__int_corpusLengthSet = 0           # length of the set of the corpus (unique words)
__lstmap_gramProbabilities = []     # list of maps, one per gram order, from each gram to its probability
__lstmap_gramCounts = []            # list of maps, one per gram order, from each gram to its count
__lstlst_orderedProbLst = []        # list of (gram, probability) pairs, ordered greatest to least by probability
__lstlst_orderedCountLst = []       # list of (gram, count) pairs, ordered greatest to least by count
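The relationship between the count maps, the probability maps, and the ordered lists can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the helper names `to_probabilities` and `ordered` are hypothetical, and the probability of a gram is assumed to be its relative frequency among grams of the same order, which the description above does not confirm.

```python
def to_probabilities(gram_counts):
    """Turn a {gram: count} map into a {gram: probability} map.

    Assumption: probability = count / total count of grams of that order.
    """
    total = sum(gram_counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in gram_counts.items()}


def ordered(gram_map):
    """Return (gram, value) pairs sorted from greatest to least value."""
    return sorted(gram_map.items(), key=lambda pair: pair[1], reverse=True)


unigram_counts = {"cat": 1, "the": 2, "mat": 1}
unigram_probs = to_probabilities(unigram_counts)
print(ordered(unigram_counts)[0])  # ('the', 2)
print(unigram_probs["the"])        # 0.5
```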

The method list is as follows:

return type			function name			parameters
list<map<string, float>> 	get_gram_probabilities		(void)
list<map<string, int>> 		get_gram_counts			(void)
int 				get_corpus_length		(void)
int 				get_corpus_set_length		(void)
map<string, float> 		get_gram_probability		(int_gram_order)
map<string, int> 		get_gram_count			(int_gram_order)
list<pair<string, float>> 	get_gram_probability_ordered	(int_gram_order)
list<pair<string, int>> 	get_gram_count_ordered		(int_gram_order)
list<map<string, float>> 	calculate			(int_gram_order, int_threshold)
string 				random_sentence_base		(int_gram)
string 				random_sentence_next		(curr_sentence, int_gram)

function name			elaboration
get_gram_probabilities		Returns the list of all <gram, probability> maps (string to float)
get_gram_counts			Returns the list of all <gram, count> maps (string to int)
get_corpus_length		Returns the length of the corpus
get_corpus_set_length		Returns the number of unique words
get_gram_probability		Returns a map of <gram, probability> for the given n-gram order
get_gram_count			Returns a map of <gram, count> for the given n-gram order
get_gram_probability_ordered	Returns a sorted list of <gram, probability> pairs for the given n-gram order
get_gram_count_ordered		Returns a sorted list of <gram, count> pairs for the given n-gram order
calculate			Calculates all n-grams from order 0 up to the given order
random_sentence_base		Returns a valid base phrase for the given n-gram order
random_sentence_next		Returns a possible next word based on the given sentence and n-gram order
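
The two sentence-generation methods can be pictured with the sketch below. It is only a guess at the general technique, not the repository's implementation: the functions `pick_base` and `pick_next`, their parameters, the space-joined gram keys, and the choice to weight candidates by count are all assumptions made for illustration.

```python
import random


def pick_base(gram_counts, rng=random):
    """Pick a starting phrase at random, weighted by how often each gram occurs."""
    grams = list(gram_counts)
    weights = [gram_counts[gram] for gram in grams]
    return rng.choices(grams, weights=weights, k=1)[0]


def pick_next(sentence, gram_counts, order, rng=random):
    """Pick a next word whose gram starts with the last (order - 1) words.

    Returns None when no gram in the map continues the sentence.
    """
    context = sentence.split()[-(order - 1):] if order > 1 else []
    candidates = {}
    for gram, count in gram_counts.items():
        words = gram.split()
        # Keep grams of the right order whose prefix matches the context.
        if len(words) == order and words[:-1] == context:
            candidates[words[-1]] = candidates.get(words[-1], 0) + count
    if not candidates:
        return None
    options = list(candidates)
    weights = [candidates[word] for word in options]
    return rng.choices(options, weights=weights, k=1)[0]


bigram_counts = {"the cat": 2, "cat sat": 1, "sat on": 1}
sentence = "the cat"
sentence += " " + pick_next(sentence, bigram_counts, 2)
print(sentence)  # "the cat sat", since "sat" is the only word seen after "cat"
```

Calling `pick_next` in a loop until it returns None, or until a length limit is reached, would extend a base phrase into a longer random sentence.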