PDF submissions auto-checker for teachers. Text can be chimeric at times, often formatted into files ranging from simple text files to complex objects. PDF file is one of the more complex files that may contain additional image objects aside from textual data. The automation of checking of PDF file can drastically improve efficiency, by removing redundancy of work, especially when running over multiple of files.
- cv2
pip install cv2
- Pillow
pip install Pillow
- PyMuPDF
pip install PyMuPDF
- Tesseract version 5.0.0.x or later, download appropriate binary file.
- Python-tesseract
pip install pytesseract
- Arkivist
pip install arkivist
- Maguro
pip install maguro
W3Schools / Python Python 3.9.5 Documentation first20hours / google-10000-english Wikipedia Word Frequency
$ py kymera.py -d "<path_to_pdf_files>" -o "<path_to_tesseract.exe>" -a "<path_to_answerkey.csv> -z 2 -s 1 -t 90 -w 1 -g 1"
1. -d <path>
- Full path of the dirctory containing the PDF files to analyze.
2. -o <*.exe>
- Full path of the tesseract executable file for OCR API.
3. -a <*.csv>
- Full path to the answer key file, must be in csv format.
4. -z <1>
- Zoom factor (1-5; default = 1), affects the quality of the OCR results. Higher value means higher quality, but may slow down the analysis.
5. -s <0>
- Spell check (0-1; default = 0), autocorrects mispellings on the OCR results.
6. -t <0>
- Threshold (1-100; default = 80), adjust tolearance level.
7. -w <0>
- Write to file (0-1; default = 0), write OCR results to file.
8. -g <0>
- Gather data (0-1; default = 0), gather character data from the documents.
"answer, enclosed in double quotes", score
"abcde", 1
"12345", 1
"hello", 1
...
"5 + 2 = 7", 5
- OCR, quality = low
- Spellcheck, ongoing
- Character collector
- Character classifier
- Grader, pending
- Accuracy = 0%
The repository name kymera
was inspired from the mythological creature called Chimera, an animal with body parts from different creatures.