Tokenizer for Hindi
This package implements a tokenizer and a stemmer for the Hindi language.
To import the package,
from HindiTokenizer import Tokenizer
This package implements various functions, which are listed below:
- read_from_file
- generate_sentences
- tokenize
- generate_freq_dict
- generate_stem_word
- generate_stem_dict
- remove_stopwords
- clean_text
- print_sentences
- print_tokens
- print_freq_dict
- print_stem_dict
- len_text
- sentence_count
- tokens_count
- concordance
The Tokenizer can be created in two ways:
t=Tokenizer("यह वाक्य हिन्दी में है।")
Or
t=Tokenizer()
t.read_from_file('filename_here')
A brief description of all the functions:
This function takes the name of a file present in the current directory and reads its contents.
t.read_from_file('hindi_file.txt')
Given a text, this will generate a list of sentences.
t.generate_sentences()
This will print the sentences generated by generate_sentences.
t.generate_sentences()
t.print_sentences()
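Hindi sentence splitting typically keys on the danda (।) rather than the Latin full stop. The sketch below illustrates the idea only; the function name split_sentences is illustrative and this is not the package's actual implementation.

```python
import re

def split_sentences(text):
    # Split on the danda and common sentence-ending punctuation,
    # then drop empty fragments and surrounding whitespace.
    parts = re.split(r"[।?!]", text)
    return [p.strip() for p in parts if p.strip()]

sentences = split_sentences("यह पहला वाक्य है। यह दूसरा वाक्य है।")
# -> ['यह पहला वाक्य है', 'यह दूसरा वाक्य है']
```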
This will generate a list of tokens from the given text.
t.tokenize()
This will print the tokens generated by tokenize.
t.tokenize()
t.print_tokens()
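Tokenization can be sketched as whitespace splitting after separating out the danda. Again, tokenize_text is an illustrative stand-in, not the package's code.

```python
def tokenize_text(text):
    # Treat the danda as a separator, then split on whitespace.
    return text.replace("।", " ").split()

tokens = tokenize_text("यह वाक्य हिन्दी में है।")
# -> ['यह', 'वाक्य', 'हिन्दी', 'में', 'है']
```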
This will generate a dictionary of frequency of words and return it.
freq_dict=t.generate_freq_dict()
This will print the dictionary of frequency of words generated by generate_freq_dict.
freq_dict=t.generate_freq_dict()
t.print_freq_dict(freq_dict)
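A word-frequency dictionary like the one generate_freq_dict returns can be built from a token list with the standard library; this is a sketch of the idea, not the package's implementation.

```python
from collections import Counter

def word_frequencies(tokens):
    # Counter maps each token to the number of times it occurs.
    return Counter(tokens)

freq = word_frequencies(["यह", "वाक्य", "यह"])
# freq["यह"] -> 2
```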
Given a word, this will generate its stem word.
word=t.generate_stem_word("भारतीय")
print(word)
भारत
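Light stemmers for Hindi generally strip the longest matching suffix from a fixed list. The sketch below uses a tiny illustrative subset of suffixes, not the package's actual list or algorithm.

```python
# A tiny illustrative suffix list; real Hindi stemmers use a much
# larger, carefully ordered set.
SUFFIXES = ["ाएं", "ीय", "ों", "ें", "ा", "ी", "े"]

def stem(word):
    # Try longer suffixes first; keep at least two characters of stem.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

print(stem("भारतीय"))  # भारत
```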
This will return the dictionary of stemmed words.
stem_dict=t.generate_stem_dict()
This will print the dictionary of stemmed words generated by generate_stem_dict.
stem_dict=t.generate_stem_dict()
t.print_stem_dict(stem_dict)
This will remove all the stopwords occurring in the given text.
t.remove_stopwords()
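Stopword removal amounts to filtering tokens against a known list. The stopword set below is a tiny illustrative sample; the package ships its own list.

```python
# Illustrative sample only — not the package's stopword list.
STOPWORDS = {"यह", "में", "है", "और", "का", "की", "के"}

def remove_stopwords(tokens):
    # Keep only tokens that are not in the stopword set.
    return [t for t in tokens if t not in STOPWORDS]

remove_stopwords(["यह", "वाक्य", "हिन्दी", "में", "है"])
# -> ['वाक्य', 'हिन्दी']
```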
This will remove all the punctuation symbols occurring in the given text.
t.clean_text()
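Cleaning can be sketched as deleting every character from a punctuation set, including the danda. A minimal illustration, not the package's implementation:

```python
import re

# Punctuation to strip, including the Hindi danda.
PUNCTUATION = "।,.!?;:'\"()-"

def clean(text):
    # re.escape makes every punctuation character safe inside
    # the regex character class.
    return re.sub("[" + re.escape(PUNCTUATION) + "]", "", text)

clean("यह वाक्य हिन्दी में है।")
# -> 'यह वाक्य हिन्दी में है'
```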
Given a text, this will return its length.
print(t.len_text())
Given a text, this will return the number of sentences in it.
print(t.sentence_count())
Given a text, this will return the number of tokens in it.
print(t.tokens_count())
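The three counting functions reduce to simple measurements over the text. The helper below, text_stats, is an illustrative name sketching what len_text, sentence_count and tokens_count compute, under the same danda-splitting assumption as above.

```python
def text_stats(text):
    # Character count, sentence count (split on the danda),
    # and token count (whitespace split).
    sentences = [s for s in text.split("।") if s.strip()]
    tokens = text.replace("।", " ").split()
    return len(text), len(sentences), len(tokens)

text_stats("यह वाक्य हिन्दी में है।")
# -> (23, 1, 5)
```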
Given a text and a word, this will return all the sentences in which that word occurs.
sentences=t.concordance("हिन्दी")
t.print_sentences(sentences)
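A concordance is just the subset of sentences containing the query word. A minimal sketch under the same assumptions as the earlier examples (danda-based sentence splitting; the function name is illustrative):

```python
def concordance(text, word):
    # Split into sentences on the danda, then keep the sentences
    # whose token list contains the query word.
    sentences = [s.strip() for s in text.split("।") if s.strip()]
    return [s for s in sentences if word in s.split()]

concordance("यह वाक्य हिन्दी में है। यह दूसरा वाक्य है।", "हिन्दी")
# -> ['यह वाक्य हिन्दी में है']
```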