Language analyzer for the segmentation of words, acronyms, sentences and paragraphs

Primary language: Python

Linguisticparser

Linguisticparser segments text into paragraphs, sentences and words without treating the periods in acronyms and abbreviations as sentence boundaries.

It can also convert a text into a dataframe with one row per sentence, where each row records the paragraph the sentence belongs to.

Installing

pip install git+https://github.com/ortizfuentes/linguisticparser

Using

from linguisticparser.textparser import TextParser
mytext = 'I am a text example. This is a sentence. This is another sentence. \n\n This is another paragraph. The letters e.g mean for example.'
tp = TextParser(mytext, text_name='example')
df = tp.text2df_tokenize()
print(df)
  text_name  num_paragraph  ...  num_subsentence                           wordings
0   example              1  ...                1               I am a text example.
1   example              1  ...                1                This is a sentence.
2   example              1  ...                1          This is another sentence.
3   example              2  ...                1         This is another paragraph.
4   example              2  ...                1  The letters e.g mean for example.

[5 rows x 5 columns]
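
To make the shape of that table concrete, here is a minimal standard-library sketch that flattens already-segmented paragraphs into one row per sentence. It is not the library's implementation. The function `rows_from_paragraphs` and the `sentence_index` field are hypothetical names chosen for illustration; only `text_name`, `num_paragraph` and `wordings` are taken from the output above.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[str, int]]


def rows_from_paragraphs(paragraphs: List[List[str]], text_name: str) -> List[Row]:
    """Flatten a list of paragraphs (each a list of sentences) into rows."""
    rows: List[Row] = []
    # Paragraphs and sentences are numbered from 1, as in the table above.
    for p_num, sentences in enumerate(paragraphs, start=1):
        for s_num, sentence in enumerate(sentences, start=1):
            rows.append(
                {
                    "text_name": text_name,
                    "num_paragraph": p_num,
                    "sentence_index": s_num,  # hypothetical field name
                    "wordings": sentence,
                }
            )
    return rows


paragraphs = [
    ["I am a text example.", "This is a sentence.", "This is another sentence."],
    ["This is another paragraph.", "The letters e.g mean for example."],
]
rows = rows_from_paragraphs(paragraphs, text_name="example")
for row in rows:
    print(row["num_paragraph"], row["sentence_index"], row["wordings"])
```

A list of dictionaries like this can be passed directly to a dataframe constructor if a tabular view is needed.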

Other functions

clean_text(text)

from linguisticparser.textparser import TextParser
mytext = 'I      am a text example. This is a sentence. This is another sentence. \n\n This is another paragraph. The letters e.g mean for example.'
mytext = TextParser.clean_text(mytext)
print(mytext)
I am a text example. This is a sentence. This is another sentence. \n This is another paragraph. The letters e.g mean for example.
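
Judging only from the sample above, the cleaning step collapses runs of spaces and runs of newlines. The following is a rough sketch of that kind of normalisation using the standard library. The name `collapse_whitespace` is hypothetical, and the real `clean_text` may do more than this.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Collapse repeated spaces/tabs and repeated newlines."""
    # Runs of spaces or tabs become a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Runs of newlines become a single newline.
    text = re.sub(r"\n+", "\n", text)
    return text


sample = "I      am a text example. This is a sentence. \n\n This is another paragraph."
cleaned = collapse_whitespace(sample)
print(repr(cleaned))
```

Printing with `repr` keeps the remaining newline visible as `\n` instead of rendering it as a line break.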

paragraphs_tokenize(text)

from linguisticparser.textparser import TextParser
paragraphs = TextParser.paragraphs_tokenize(mytext)
print(paragraphs)
['I am a text example. This is a sentence. This is another sentence.',
 'This is another paragraph. The letters e.g mean for example.']
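
For comparison, paragraph splitting on newlines can be approximated in a few lines of plain Python. This is an illustrative sketch under the assumption that paragraphs are separated by one or more newlines. `split_paragraphs` is a hypothetical helper, not part of the library.

```python
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Split on newlines, trim each piece and drop empty pieces."""
    pieces = (piece.strip() for piece in text.split("\n"))
    return [piece for piece in pieces if piece]


text = (
    "I am a text example. This is a sentence. This is another sentence. \n"
    " This is another paragraph. The letters e.g mean for example."
)
print(split_paragraphs(text))
```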

sentence_tokenize(paragraph)

from linguisticparser.textparser import TextParser
sentences = TextParser.sentence_tokenize(paragraphs[1])
print(sentences)
['This is another paragraph.', 'The letters e.g mean for example.']
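
The general idea of abbreviation-aware sentence splitting can be sketched as follows: a token that ends in sentence punctuation closes the sentence unless it is a known abbreviation. This is a simplified illustration, not the library's algorithm. The `ABBREVIATIONS` set and the function `split_sentences` are assumptions made for the example.

```python
from typing import List

# Illustrative, hypothetical list; a real tokenizer would need a much larger one.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "dr.", "mr.", "mrs."}


def split_sentences(paragraph: str) -> List[str]:
    """Split a paragraph into sentences, skipping known abbreviations."""
    sentences: List[str] = []
    current: List[str] = []
    for token in paragraph.split():
        current.append(token)
        ends_with_stop = token.endswith((".", "!", "?"))
        # Only close the sentence if the token is not a listed abbreviation.
        if ends_with_stop and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep trailing text that has no closing punctuation.
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("This is another paragraph. The letters e.g. mean for example."))
print(split_sentences("Dr. Smith arrived. He sat down."))
```

A lookup table like this cannot tell when an abbreviation such as `etc.` also ends a sentence, which is one reason dedicated tokenizers use richer rules.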

word_tokenize(sentence)

from linguisticparser.textparser import TextParser
words = TextParser.word_tokenize(sentences[1])
print(words)
['The', 'letters', 'e.g', 'mean', 'for', 'example.']
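
Note that in the output above the final period stays attached to `'example.'`. If bare words are needed, the tokens can be post-processed. The sketch below uses only the standard library. `split_words` and its `strip_punctuation` flag are hypothetical and are not part of linguisticparser.

```python
import string
from typing import List


def split_words(sentence: str, strip_punctuation: bool = False) -> List[str]:
    """Split on whitespace; optionally trim punctuation from token edges."""
    words = sentence.split()
    if strip_punctuation:
        # str.strip only removes leading/trailing characters, so the inner
        # period of 'e.g' is preserved.
        words = [word.strip(string.punctuation) for word in words]
        words = [word for word in words if word]
    return words


sentence = "The letters e.g mean for example."
print(split_words(sentence))
print(split_words(sentence, strip_punctuation=True))
```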