This repository contains scripts to extract text from PDF textbooks.
- Clone the repository:
git clone https://github.com/francoismavunila/content-extraction.git
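- Create a virtual environment:
python -m venv venv
- Activate your virtual environment (the command below is for Windows PowerShell; on macOS or Linux, run source venv/bin/activate instead):
.\venv\Scripts\Activate.ps1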
- Install dependencies:
pip install -r requirements.txt
- Place your PDF files in the `text_books/` directory.
- Run `scripts/extract_text.py` to extract text into the `extracted_texts/` directory.
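
Before running the extraction, it can help to confirm which PDFs will be picked up. The following is a minimal sketch of such a check, not part of the repository: `plan_extraction` is a hypothetical helper, and the output naming (same file stem with a `.txt` extension) is an assumption about how `scripts/extract_text.py` names its results.

```python
from pathlib import Path


def plan_extraction(input_dir="text_books", output_dir="extracted_texts"):
    """Pair each PDF in input_dir with an assumed .txt path in output_dir.

    Hypothetical helper: the real script may name its outputs differently.
    """
    in_path = Path(input_dir)
    out_path = Path(output_dir)
    if not in_path.is_dir():
        # Nothing to do if the input directory does not exist yet.
        return []
    # Match the .pdf suffix case-insensitively and skip subdirectories.
    pdfs = sorted(
        p for p in in_path.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    return [(pdf, out_path / (pdf.stem + ".txt")) for pdf in pdfs]


if __name__ == "__main__":
    plan = plan_extraction()
    if not plan:
        print("No PDF files found in text_books/")
    for pdf, txt in plan:
        print(f"{pdf} -> {txt}")
```

Run from the repository root, this only lists files; it does not read or write any PDF content.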