A program to extract translatable sentences from a corpus, based on a known vocabulary.
The vocabulary is stored in a CSV file.
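
As a rough illustration of the idea, here is a minimal sketch. It assumes that a sentence counts as "translatable" when every word in it appears in the vocabulary, and that the CSV holds one word per row in its first column. Both are assumptions, as are the function names (`load_vocabulary`, `extract_sentences`, and so on), which are hypothetical and not taken from the project. It also uses a naive regex sentence splitter to stay self-contained, whereas the project itself relies on NLTK.

```python
import csv
import io
import re


def load_vocabulary(csv_file):
    """Collect known words from the first column of each CSV row (assumed layout)."""
    return {
        row[0].strip().lower()
        for row in csv.reader(csv_file)
        if row and row[0].strip()
    }


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def is_translatable(sentence, vocabulary):
    """Assumed rule: every word of the sentence must be in the vocabulary."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return bool(words) and all(word in vocabulary for word in words)


def extract_sentences(text, vocabulary):
    """Keep only the sentences made up entirely of known words."""
    return [s for s in split_sentences(text) if is_translatable(s, vocabulary)]


# Hypothetical vocabulary and corpus text, for illustration only.
vocab = load_vocabulary(io.StringIO("i\nwake\nup\nearly\neat\nbreakfast\n"))
text = "I wake up early. I eat breakfast! Then I commute to the office."
result = extract_sentences(text, vocab)
print(result)
```

The third sentence is dropped because it contains words that are not in the sample vocabulary. A real vocabulary file would be opened with `open(path, newline="")` instead of `io.StringIO`.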
- Project Gutenberg - a great place for public domain works
- Lingua - simple, short English paragraphs for people learning English. The source of some of the texts in the corpus.
- 1000 most common words in English - a possible starting point for a vocabulary. The source of vocabulary-1000.csv.
(NB: you might want to set up a virtualenv first.)
pip install -r requirements.txt
pip install -r requirements_dev.txt
python scripts/nltk_download.py
Run the tests:
flake8
pytest
Run the program on a sample corpus:
python sentencer/main.py my-day