The Monocorpus project aims to provide tools for developing a Tatar language monocorpus. The project includes functionality to extract texts from books and save them in files.
- Extract text from EPUB and PDF files.
- Post-process the extracted text to remove unwanted characters (e.g. OCR artifacts). More precisely, the following steps are performed:
  - Replace stray ASCII characters inside a Tatar word with their Cyrillic look-alikes (e.g. c[0x0063]у --> с[0x0441]у)
  - Replace stray non-ASCII characters inside a non-Tatar word with their ASCII look-alikes (e.g. а[0x0430]rm --> a[0x0061]rm)
  - Unify punctuation marks by replacing look-alikes with a single variant (e.g. '»' | '«' | '“' | '”' | '„' --> '"')
  - Remove unwanted characters (e.g. '•')
  - Remove stray digits at the end of a word (e.g. башына2 --> башына)
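The post-processing steps listed above could be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the look-alike table, the helper names (`fix_mixed_script`, `clean_text`), and the rule that a mixed-script word is normalized to whichever script has more letters (ties going to Cyrillic) are all assumptions made for this example.

```python
import re

# Hypothetical look-alike table (Latin -> Cyrillic), lowercase only.
# The table used by the project itself may differ.
LATIN_TO_CYRILLIC = {
    "a": "\u0430",  # а
    "c": "\u0441",  # с
    "e": "\u0435",  # е
    "o": "\u043e",  # о
    "p": "\u0440",  # р
    "x": "\u0445",  # х
    "y": "\u0443",  # у
}
CYRILLIC_TO_LATIN = {cyr: lat for lat, cyr in LATIN_TO_CYRILLIC.items()}

# Quote look-alikes » « “ ” „ are mapped to a plain double quote.
QUOTES = str.maketrans({ch: '"' for ch in "\u00bb\u00ab\u201c\u201d\u201e"})
# Characters dropped entirely (here only the bullet •).
UNWANTED = str.maketrans("", "", "\u2022")

WORD = re.compile(r"[^\W\d_]+")                      # runs of letters
TRAILING_DIGITS = re.compile(r"(?<=[^\W\d_])\d+\b")  # digits glued to a word's end


def is_cyrillic(ch):
    return "\u0400" <= ch <= "\u04ff"


def fix_mixed_script(match):
    """Normalize a mixed-script word to its majority script (assumed heuristic)."""
    word = match.group(0)
    cyr = sum(is_cyrillic(ch) for ch in word)
    lat = sum(ch.isascii() for ch in word)
    if cyr == 0 or lat == 0:
        return word  # single-script word, nothing to fix
    table = LATIN_TO_CYRILLIC if cyr >= lat else CYRILLIC_TO_LATIN
    return "".join(table.get(ch, ch) for ch in word)


def clean_text(text):
    text = text.translate(UNWANTED).translate(QUOTES)
    text = WORD.sub(fix_mixed_script, text)
    return TRAILING_DIGITS.sub("", text)


print(clean_text("башына2"))  # → башына
```

Standalone numbers such as `2024` are left untouched, because the digit rule only fires when the digits directly follow a letter.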
To get started with the project, follow these steps:
- Clone the Repository:
      git clone https://github.com/neurotatarlar/monocorpus.git
      cd monocorpus
- Prepare Python Environment:
  - Make sure you have Python 3.x installed on your system.
  - Create and activate a virtual environment (optional but recommended):
        python3 -m venv venv
        source venv/bin/activate
  - Install the required dependencies:
        pip install -r requirements.txt
- Extract Texts from Books:
  - Place your book(s) into the workdir/000_entry_point folder. Currently, EPUB and PDF formats are supported.
  - Run the script to extract texts:
        python src/main.py extract
- Process the Extracted Texts Further:
  - Run the script to clean up the extracted texts:
        python src/main.py process
- Explore the Output:
  - Processed text files will be saved in the workdir/900_artifacts directory.
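Once both commands have run, one quick way to inspect the result is to list the output files with a word count for each. The sketch below assumes the processed files are UTF-8 text files with a `.txt` extension, which the project does not state; `summarize_artifacts` is a hypothetical helper and not part of the repository.

```python
from pathlib import Path


def summarize_artifacts(directory="workdir/900_artifacts", pattern="*.txt"):
    """Return (file name, word count) pairs for the text files in a directory."""
    summary = []
    for path in sorted(Path(directory).glob(pattern)):
        # Assumption: output files are UTF-8 encoded plain text.
        text = path.read_text(encoding="utf-8")
        summary.append((path.name, len(text.split())))
    return summary


if __name__ == "__main__":
    for name, words in summarize_artifacts():
        print(f"{name}: {words} words")
```

If the files use a different extension, pass another `pattern`, for example `summarize_artifacts(pattern="*")`.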
- src/: Contains the main script for text extraction and processing.
- workdir/000_entry_point: Place your books here for text extraction.
- workdir/900_artifacts: Processed text files will be saved here.
- requirements.txt: List of required Python dependencies.
Contributions are welcome! If you'd like to contribute to the project, make your changes and submit a pull request detailing the changes made.
This project is licensed under the MIT License - see the LICENSE file for details.