tombombadilom/S4WM-extract-Python

Language: Python. License: MIT.

S4WM-extract-Python

Python PDF Questions to JSON Parser

This Python script extracts questions and answers from a PDF file and converts them to JSON.

Script Workflow

The script performs the following steps:

1. Imports necessary libraries.
2. Defines constants for page ranges and directories.
3. Initializes logging configuration.
4. Defines functions for:
   - Extracting questions to JSON
   - Reading and processing files
   - Cleaning answers file
   - Creating and clearing directories
   - Extracting text from PDF pages
   - Cleaning HTML tags and page numbers
   - Extracting choices from question text
   - Cleaning redundant or less suitable choice texts
   - Separating question and choices
   - Parsing questions with choices
   - Parsing question items
   - Parsing questions and answers
   - Checking if a choice sequence is valid
   - Parsing choices
   - Saving data to JSON
   - The main function
5. Configures logging and creates/clears output directories.
6. Extracts text from relevant pages of the PDF file.
7. Saves extracted text to files.
8. Cleans the answers text file and saves it to a new file.
9. Extracts answers from the cleaned text file and serializes them to JSON.
10. Saves the serialized JSON to a file.
11. Parses questions and prepares the data structure.
12. Saves questions with matched answers to a JSON file.
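
The text-cleaning and choice-parsing steps listed above could look roughly like the sketch below. It is an illustration only, not the script's actual code: the function names (`clean_text`, `is_valid_choice_sequence`, `split_question_and_choices`) and all three regular expressions are hypothetical, and it assumes choices appear one per line, labelled `A.`, `B.`, ... or `A)`, `B)`, ...

```python
import re

# Every pattern here is an assumption about the extracted text,
# not a rule taken from the script itself.
TAG_RE = re.compile(r"<[^>]+>")              # HTML-like tags such as <p> or </b>
PAGE_NUMBER_RE = re.compile(r"^\s*\d+\s*$")  # a line holding only digits
CHOICE_RE = re.compile(r"^\s*([A-Z])[.)]\s+(.*\S)\s*$")  # "A. text" or "A) text"


def clean_text(text):
    """Strip tags, then drop blank lines and lines that are only a page number."""
    without_tags = TAG_RE.sub("", text)
    lines = [line.strip() for line in without_tags.splitlines()]
    return "\n".join(
        line for line in lines if line and not PAGE_NUMBER_RE.match(line)
    )


def is_valid_choice_sequence(labels):
    """Return True if labels run A, B, C, ... without gaps and there are at least two."""
    expected = [chr(ord("A") + i) for i in range(len(labels))]
    return len(labels) >= 2 and list(labels) == expected


def split_question_and_choices(block):
    """Separate the question stem from its lettered choices."""
    stem_lines = []
    labels = []
    choices = {}
    for line in block.splitlines():
        match = CHOICE_RE.match(line)
        if match:
            labels.append(match.group(1))
            choices[match.group(1)] = match.group(2)
        elif labels:
            # A line after the first choice is treated as a continuation of the last one.
            choices[labels[-1]] += " " + line.strip()
        else:
            stem_lines.append(line.strip())
    if not is_valid_choice_sequence(labels):
        raise ValueError(f"unexpected choice labels: {labels}")
    return {"question": " ".join(stem_lines), "choices": choices}
```

For example, `split_question_and_choices(clean_text(raw))` on a block whose choices are labelled A, B, C returns a dictionary with a `question` string and a `choices` mapping. A block whose labels skip a letter raises `ValueError`, which makes malformed extractions visible early.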

Finally, the script calls the main function to run these steps.
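
As a rough sketch of how the entry point might tie the directory setup, answer matching, and JSON output together, consider the following. It uses only the standard library. The names `reset_directory`, `attach_answers`, `save_json`, the `number` and `answer` keys, and the `s4wm_demo_output` directory are all hypothetical, and hard-coded placeholder data stands in for the text parsed from the PDF.

```python
import json
import logging
import shutil
from pathlib import Path


def reset_directory(path):
    """Create the directory, deleting any previous contents first."""
    path = Path(path)
    if path.exists():
        shutil.rmtree(path)  # destructive: removes everything under path
    path.mkdir(parents=True)
    return path


def attach_answers(questions, answers):
    """Pair each question with its answer, looked up by question number."""
    merged = []
    for question in questions:
        answer = answers.get(question["number"])
        if answer is None:
            logging.warning("No answer found for question %s", question["number"])
        merged.append({**question, "answer": answer})
    return merged


def save_json(data, path):
    """Write data as UTF-8 JSON and return the path that was written."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as handle:
        json.dump(data, handle, ensure_ascii=False, indent=2)
    return path


def main(output_dir="s4wm_demo_output"):
    logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
    out = reset_directory(output_dir)
    # Placeholder data; the real script would build these from the PDF text.
    questions = [
        {"number": 1, "question": "What is 2 + 2?", "choices": {"A": "3", "B": "4"}},
        {"number": 2, "question": "Placeholder question", "choices": {"A": "x", "B": "y"}},
    ]
    answers = {1: "B"}
    return save_json(attach_answers(questions, answers), out / "questions.json")


if __name__ == "__main__":
    main()
```

Keeping a question whose answer is missing (with `answer` set to `null` in the JSON) rather than dropping it is a design choice of this sketch; it makes gaps in the answer key easy to spot in the output.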