This Python script extracts questions and answers from a PDF file and converts them to JSON.
The script performs the following steps:
- Imports necessary libraries.
- Defines constants for page ranges and directories.
- Initializes logging configuration.
- Defines functions for:
  - Extracting questions to JSON
  - Reading and processing files
  - Cleaning the answers file
  - Creating and clearing directories
  - Extracting text from PDF pages
  - Cleaning HTML tags and page numbers
  - Extracting choices from question text
  - Cleaning redundant or less suitable choice texts
  - Separating question and choices
  - Parsing questions with choices
  - Parsing question items
  - Parsing questions and answers
  - Checking if a choice sequence is valid
  - Parsing choices
  - Saving data to JSON
- Defines the main function, which:
  - Configures logging and creates or clears the output directories.
  - Extracts text from the relevant pages of the PDF file.
  - Saves the extracted text to files.
  - Cleans the answers text file and saves the result to a new file.
  - Extracts answers from the cleaned text file and serializes them to JSON.
  - Saves the serialized JSON to a file.
  - Parses the questions and prepares the data structure.
  - Saves the questions with their matched answers to a JSON file.
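A few of the text-processing steps listed above can be sketched as follows. This is a minimal illustration, not the script's actual code: every function name, regular expression, and the sample text are assumptions, and it works on text that has already been extracted from the PDF.

```python
import json
import re
from pathlib import Path

# All names and patterns here are hypothetical; the original identifiers are not shown.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")                      # simple HTML tags
PAGE_NUMBER_RE = re.compile(r"^[ \t]*\d+[ \t]*$", re.MULTILINE)  # lines holding only digits
CHOICE_RE = re.compile(r"^([A-Z])[.)]\s+(.*)$")                # lines like "A. text" or "B) text"


def clean_text(text: str) -> str:
    """Strip HTML tags and lines that contain only a page number."""
    text = TAG_RE.sub("", text)
    text = PAGE_NUMBER_RE.sub("", text)
    return text.strip()


def split_question_and_choices(block: str) -> tuple[str, dict[str, str]]:
    """Separate the question stem from its lettered choices."""
    stem_lines: list[str] = []
    choices: dict[str, str] = {}
    last_label = None
    for line in block.splitlines():
        line = line.strip()
        if not line:
            continue
        match = CHOICE_RE.match(line)
        if match:
            last_label = match.group(1)
            choices[last_label] = match.group(2).strip()
        elif last_label is not None:
            # A wrapped line after a choice is treated as its continuation.
            choices[last_label] += " " + line
        else:
            stem_lines.append(line)
    return " ".join(stem_lines), choices


def is_valid_choice_sequence(labels: list[str]) -> bool:
    """Accept only labels that run A, B, C, ... with no gaps or repeats."""
    expected = [chr(ord("A") + i) for i in range(len(labels))]
    return len(labels) >= 2 and labels == expected


def save_json(data, path: Path) -> None:
    """Write data as UTF-8 JSON, creating parent directories if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")


# Made-up sample input: one question with a stray page number in the middle.
raw = "<p>What is 2 + 2?</p>\n17\nA. 3\nB. 4\nC. 5"
stem, choices = split_question_and_choices(clean_text(raw))
# stem    -> "What is 2 + 2?"
# choices -> {"A": "3", "B": "4", "C": "5"}
```

Checking the label sequence before accepting a parse is one way to catch lines that merely look like choices, such as a sentence that happens to start with "A.".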
The main function is called to execute the script.
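The entry point described here typically follows the standard Python pattern below. This is a hedged sketch: `OUTPUT_DIR`, `reset_directory`, and the log format are assumed names and values, and the extraction and parsing steps are left as a comment rather than guessed at.

```python
import logging
import shutil
from pathlib import Path

# Hypothetical constant; the original script's directory names are not shown.
OUTPUT_DIR = Path("qa_output")

logger = logging.getLogger(__name__)


def reset_directory(path: Path) -> None:
    """Create the directory, deleting any previous contents first."""
    if path.exists():
        shutil.rmtree(path)
    path.mkdir(parents=True)


def main(output_dir: Path = OUTPUT_DIR) -> None:
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    reset_directory(output_dir)
    logger.info("Output directory ready: %s", output_dir)
    # The extraction, cleaning, parsing, and saving steps would follow here.


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    main()
```

Keeping the work inside `main` and behind the `__name__` guard lets the helper functions be imported and tested without triggering a full run.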