francoismavunila/content-extraction

Content Extraction

Overview

This repository contains scripts to extract text from PDF textbooks.

Setup

Clone the repository: git clone https://github.com/francoismavunila/content-extraction.git
Create a virtual environment: python -m venv venv
Activate the virtual environment (Windows PowerShell): .\venv\Scripts\Activate.ps1
On macOS or Linux, activate it with: source venv/bin/activate
Install dependencies: pip install -r requirements.txt

Usage

Place your PDF files in the text_books/ directory.
Run scripts/extract_text.py to extract text into the extracted_texts/ directory.
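
The flow above (read every PDF in text_books/, write one text file per book into extracted_texts/) can be sketched roughly as follows. This is an illustrative sketch, not the contents of scripts/extract_text.py: the function names extract_all and extract_with_pypdf are hypothetical, and the use of the pypdf library is an assumption, since the actual dependencies are whatever requirements.txt lists.

```python
from pathlib import Path
from typing import Callable


def extract_with_pypdf(pdf_path: Path) -> str:
    """Return the text of every page, joined by newlines.

    Assumption: pypdf is installed. The real script may use another library.
    """
    from pypdf import PdfReader  # imported lazily so the rest runs without it

    reader = PdfReader(str(pdf_path))
    # Pages with no extractable text (e.g. scanned images) contribute "".
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_all(
    src_dir: Path = Path("text_books"),
    out_dir: Path = Path("extracted_texts"),
    extractor: Callable[[Path], str] = extract_with_pypdf,
) -> list[Path]:
    """Write one .txt file per PDF found in src_dir; return the paths written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(src_dir.glob("*.pdf")):
        # physics.pdf -> extracted_texts/physics.txt
        out_path = out_dir / f"{pdf_path.stem}.txt"
        out_path.write_text(extractor(pdf_path), encoding="utf-8")
        written.append(out_path)
    return written
```

Calling extract_all() from the repository root would process text_books/ with the defaults. Passing the extractor as a parameter keeps the directory handling independent of the PDF library, so a different backend can be swapped in without touching the loop.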

Selected Textbooks

Physics: "BASIC PHYSICS FOR SECONDARY SCHOOLS" link here
Chemistry: "Chemistry for Secondary Schools" link here
Biology: "MODERN BIOLOGY" link here