- extract text.py is the main Python file, and the start_work function is the driver that starts the program.
- The lexicon file should be a utf-8 text file that contains the words in the language.
- The text document that requires language text extraction should be a utf-8 text file.
- The code takes a text file as input.
- The text in the file is cleaned and split into sentences.
- The words in each sentence are matched against the language's lexicon, and the sentence is given a score.
- Based on that score, the original (uncleaned) sentence from the input file is written to an output text file.
- The code outputs four text files, each containing the sentences that fall within one score range.
- The four text files correspond to sentence scores of 25, 50, 75, and 100.
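The steps above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the function names (score_sentence, bucket_for, group_sentences) are hypothetical, the cleaning is reduced to lowercasing and keeping letter-only words, and the cut-offs between the four groups are an assumption.

```python
import re


def score_sentence(sentence, lexicon):
    """Return the percentage (0-100) of words in the sentence found in the lexicon."""
    # Keep runs of letters only; digits and punctuation are ignored.
    words = re.findall(r"[^\W\d_]+", sentence.lower())
    if not words:
        return 0.0
    matched = sum(1 for word in words if word in lexicon)
    return 100.0 * matched / len(words)


def bucket_for(score):
    """Map a score to one of the four output groups (assumed cut-offs)."""
    if score > 75:
        return 100
    if score > 50:
        return 75
    if score > 25:
        return 50
    return 25


def group_sentences(text, lexicon):
    """Split text on full stops and group the original sentences by score."""
    buckets = {25: [], 50: [], 75: [], 100: []}
    for raw in text.split("."):
        sentence = raw.strip()
        if sentence:
            score = score_sentence(sentence, lexicon)
            buckets[bucket_for(score)].append(sentence)
    return buckets


if __name__ == "__main__":
    lexicon = {"the", "cat", "sat"}
    text = "The cat sat. Le chat dort. The cat dort bien."
    for label, sentences in group_sentences(text, lexicon).items():
        print(label, sentences)
```

In a real run, each group would be written to its own output file rather than printed.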
Tell me more about the four text files
After running, the code outputs four text files. The files are named according to how closely their sentences match the words in the lexicon.
- The 100 percent text file contains sentences that match the lexicon's language with a score between 100 percent and 74 percent.
- The 75 percent text file contains sentences that match with a score between 75 percent and 51 percent.
- The 50 percent text file tends to contain mixed results.
- The 25 percent text file usually contains sentences that are NOT in the same language as the lexicon.
- Move your lexicon text file and the language document text file to the code's directory.
- Change the string variables lexicon_txt and corpus_txt to the names of your lexicon text file and your language document text file, respectively.
- Run the code.
- The code cleans diacritics and digits from sentences before scoring them. See the cleanText.py file.
- The code identifies sentence boundaries by the full stop (.). Edit sentence_tokenizer.py if the desired language does not use a full stop to mark the end of a sentence.
- The python program is designed to only make use of the python standard libraries so it can be easily ported to another system.
- The program also uses the Python os library so that it can run across platforms (Windows and Linux-based computers) without having to worry about file path differences, i.e. '/' and '\'.
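As a rough illustration of the notes above, the sketch below uses only the standard library to strip diacritics and digits, split text on a configurable sentence terminator, and build a file path portably. The function names and the terminators parameter are hypothetical and are not taken from cleanText.py or sentence_tokenizer.py; the real files may work differently.

```python
import os
import re
import unicodedata


def strip_diacritics_and_digits(text):
    """Remove combining accent marks and digits from text."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return re.sub(r"\d+", "", without_marks)


def split_sentences(text, terminators="."):
    """Split text on any of the given terminator characters."""
    pattern = "[" + re.escape(terminators) + "]"
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


def path_in_directory(directory, filename):
    """Join a directory and file name with the separator of the current OS."""
    return os.path.join(directory, filename)


if __name__ == "__main__":
    print(strip_diacritics_and_digits("café 2024"))
    print(split_sentences("A! B? C.", terminators=".!?"))
    print(path_in_directory("data", "lexicon.txt"))
```

For a language that ends sentences with a different mark, passing that character as terminators would be one way to adapt the splitting step.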
Moses Bankole