Exams data extractor w/ Machine Learning.
This app uses vanilla Python 3.10, leveraging VS Code's Dev Containers (a.k.a. Remote Containers) and Docker Compose for its environment.
For Machine Learning, this project uses the best pre-trained Portuguese text recognition model from tesseract-ocr, enabled by the `pytesseract` Python wrapper.
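As a rough illustration of how `pytesseract` is commonly called with a Portuguese model, here is a minimal sketch. The `ocr_page` and `tesseract_config` helpers and the chosen page segmentation mode are assumptions made for illustration, not this project's actual code. It also assumes the `por` traineddata file is available to tesseract inside the container.

```python
from pathlib import Path


def tesseract_config(psm: int = 6) -> str:
    """Build the extra CLI flags passed to tesseract (psm = page segmentation mode)."""
    return f"--psm {psm}"


def ocr_page(image_path: Path, lang: str = "por") -> str:
    """Run OCR on a single image and return the recognised text.

    The OCR imports are local so this module can be loaded (and the config
    helper used) even where pytesseract and Pillow are not installed.
    """
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(
            image, lang=lang, config=tesseract_config()
        )
```

Calling `ocr_page(Path("sample.png"))` (a hypothetical file name) would return the recognised text as a single string.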
- Docker version >= 23.0 (using the latest version is recommended if possible)
- VS Code
- That's it!
- Clone the project, run `cp .env.example .env`, then open it in VS Code.
- Install the `ms-vscode-remote.remote-containers` VS Code extension (if not already installed).
- Open the Command Palette (CMD / CTRL + SHIFT + P), then search for and run `Dev Containers: Reopen in Container`.
This will start building the Dev Container using `docker-compose.yml`.
Once it's built, use the terminal within VS Code to execute other commands using `./run`. Ensure that VS Code has selected the correct Python interpreter within your new environment: `/usr/local/bin/python`.
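If you want to confirm from inside the container which interpreter is actually running, a small check along these lines may help. The helper name `uses_container_interpreter` is hypothetical, and the prefix match is an assumption to tolerate versioned binaries such as `python3.10`.

```python
import sys

# Interpreter path that VS Code is expected to select inside the container.
EXPECTED_INTERPRETER = "/usr/local/bin/python"


def uses_container_interpreter(executable: str = sys.executable) -> bool:
    """Return True when the interpreter path points at the container's Python.

    A prefix match is used so that /usr/local/bin/python3.10 also counts.
    """
    return executable.startswith(EXPECTED_INTERPRETER)


# Print the running interpreter and whether it matches the expected path.
print(sys.executable, uses_container_interpreter())
```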
- Use `./run app` to execute the main `src/app.py` script at any time.
Note
Use `./run test` to run tests. You can also append `-k [name]` to run only the tests matching the provided name.
Hint: The `./run app` command essentially does `./run python src/app.py`.
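To make the `-k [name]` filter concrete, here is a small sketch of a test module. It assumes the test runner behind `./run test` is pytest (the `-k` flag suggests it, but that is not confirmed here). The file name and the `normalize_label` helper are hypothetical.

```python
# tests/test_labels.py (hypothetical file name)


def normalize_label(raw: str) -> str:
    """Hypothetical helper: turn an OCR'd option label such as ' a) ' into 'A'."""
    return raw.strip().rstrip(").").upper()


def test_normalize_label_strips_punctuation():
    assert normalize_label(" a) ") == "A"


def test_normalize_label_keeps_uppercase():
    assert normalize_label("B.") == "B"
```

Under that assumption, `./run test -k strips_punctuation` would select only the first test, since pytest matches `-k` expressions against test names.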
Warning
It is recommended to develop within a Dev Container, as you'll be able to use the Python interpreter installed within your container and get correct IDE dependency suggestions and links via the pre-installed `pylance` extension.
As a side note, you can manually build the Docker container using `./run build` if you don't fancy using Dev Containers, and use the same `./run` commands as you otherwise would; the only difference is that your IDE will struggle to find the installed dependencies.
- Download XQuartz, then install it and reboot your machine.
- Open XQuartz.app > Settings (from the menu bar) > Security, and tick both "Authenticate connections" and "Allow connections from other network clients".
- Run `xhost + localhost`.
- Keep XQuartz open (at least whenever you're developing this codebase).
- Install socat via `brew install socat`.
- Run `socat TCP-LISTEN:6000,reuseaddr,fork UNIX-CLIENT:/tmp/.X11-unix/X0` in a separate terminal, and let it listen.
- Any GUI from this codebase should now work.
On Linux, X11 should just work without requiring further configuration.
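When a GUI window fails to appear, one quick thing to check from Python is whether the process has an X11 display to connect to at all. The sketch below only reads the standard `DISPLAY` environment variable. The `display_target` helper is hypothetical, and the value your container receives depends on how `docker-compose.yml` and `.env` are configured.

```python
import os
from typing import Mapping, Optional


def display_target(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the X11 display this process would connect to, or None if unset."""
    value = env.get("DISPLAY", "").strip()
    return value or None


# Report the display seen by the current process.
target = display_target()
if target is None:
    print("DISPLAY is not set; GUI windows will not be able to open.")
else:
    print(f"GUI windows will be sent to X11 display {target!r}.")
```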
- `./run test`: Run tests.
- `./run python`: Execute `python` commands. You can append `-w` to watch the file for changes (hot module reload). It uses `breuleux/jurigged` under the hood.
- `./run pip`: Execute `pip` commands. `./run pip install ...` and `./run pip uninstall ...` are disabled; use the managed `./run pip:install ...` and `./run pip:uninstall ...` instead.
  - To add a new dependency, either add it manually to `requirements.in` and then run `./run pip:install`, or pass the dependencies as arguments if you don't want to modify `requirements.in` by hand. The same applies to `./run pip:uninstall`.
  - This project uses `pip-tools` to manage its dependencies.
- `./run bash`: Enter `bash` within the container when outside of the Dev Container environment.
The goal of this project is to automate the process of extracting questions and answer options from any Mozambican exam paper.
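To give a sense of what "extracting questions and answer options" can look like once OCR text is available, here is a minimal sketch. It is not this project's extraction approach. It assumes a simple layout in which questions start with a number (`1.` or `1)`) and options start with a letter from A to E (`A.` or `A)`). The `Question` class and `parse_questions` function are hypothetical names.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Assumed layout: "12. Question text", then lines like "A. option" or "B) option".
QUESTION_RE = re.compile(r"^\s*(\d+)[.)]\s+(.*)$")
OPTION_RE = re.compile(r"^\s*([A-E])[.)]\s+(.*)$")


@dataclass
class Question:
    number: int
    text: str
    options: Dict[str, str] = field(default_factory=dict)


def parse_questions(ocr_text: str) -> List[Question]:
    """Group OCR'd lines into questions and their lettered answer options."""
    questions: List[Question] = []
    last_option: Optional[str] = None
    for line in ocr_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        question = QUESTION_RE.match(line)
        option = OPTION_RE.match(line)
        if question:
            questions.append(Question(int(question.group(1)), question.group(2).strip()))
            last_option = None
        elif option and questions:
            last_option = option.group(1)
            questions[-1].options[last_option] = option.group(2).strip()
        elif questions:
            # Continuation line: append to whatever was read most recently.
            if last_option is None:
                questions[-1].text += " " + stripped
            else:
                questions[-1].options[last_option] += " " + stripped
    return questions
```

Real exam scans are much noisier than this layout assumes (multi-column pages, figures, mis-read labels), so a line-based pattern like this is only a starting point.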
Full OCR sample (using tesseract-ocr and the best Portuguese pre-trained model) | Current extraction approach sample |
---|---|
This stack is adapted from: