Content Extraction

Overview

This repository contains scripts to extract text from PDF textbooks.

Setup

  1. Clone the repository: git clone https://github.com/francoismavunila/content-extraction.git
  2. Create a virtual environment: python -m venv venv
  3. Activate the virtual environment: .\venv\Scripts\Activate.ps1 (Windows PowerShell; on macOS or Linux, use source venv/bin/activate)
  4. Install dependencies: pip install -r requirements.txt

Usage

  • Place your PDF files in the text_books/ directory.
  • Run scripts/extract_text.py to extract text into the extracted_texts/ directory.
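
For orientation, the flow described above can be sketched as follows. This is a minimal illustration, not the contents of scripts/extract_text.py: the function names pdf_to_text and extract_all are hypothetical, and the use of the pypdf library is an assumption (check requirements.txt for the library the repository actually uses).

```python
from pathlib import Path
from typing import Callable


def pdf_to_text(pdf_path: Path) -> str:
    """Extract text from one PDF using pypdf (an assumed dependency)."""
    # Imported lazily so extract_all() can be used with another extractor
    # even when pypdf is not installed.
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    # Fall back to "" for pages with no extractable text (e.g. scanned images).
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_all(
    input_dir: Path = Path("text_books"),
    output_dir: Path = Path("extracted_texts"),
    extractor: Callable[[Path], str] = pdf_to_text,
) -> list[Path]:
    """Write one .txt file per PDF found in input_dir; return the paths written."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        # physics.pdf -> extracted_texts/physics.txt
        out_path = output_dir / (pdf_path.stem + ".txt")
        out_path.write_text(extractor(pdf_path), encoding="utf-8")
        written.append(out_path)
    return written
```

Under these assumptions, calling extract_all() from the repository root would read every .pdf file in text_books/ and write a matching .txt file to extracted_texts/. Scanned textbooks without a text layer would produce empty output and would need OCR instead.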

Selected Textbooks

  1. Physics: "BASIC PHYSICS FOR SECONDARY SCHOOLS" link here
  2. Chemistry: "Chemistry for Secondary Schools" link here
  3. Biology: "MODERN BIOLOGY" link here