Ground Truth was born out of the need to reduce the lag in the systematization of sentences, so that the orders issued to The Special Administrative Unit for the Management of Restitution of Dispossessed Lands (Unidad Administrativa Especial de Gestión de Restitución de Tierras Despojadas - UAEGRTD) can be followed up properly. By 2020, the unit reported a backlog of around 900 sentences from 2013 to 2015 still waiting to be systematized.
The installation process is described (in Spanish) in the repository file "Guía de instalación SmartScan Ubuntu 1804 LTS.docx".
The elements of our Ground Truth solution are:
- PDF metadata obtained through a web scraping process from URT repositories.
- Downloading the PDF files
- Converting the PDF files
- Extracting the information from the converted files
- Modeling the extracted data
- Dashboard with the Smart Scan interface
- Backend:
1.1 Folders
- Data: where fixed data is stored
- Tmp: working folder where in-process documents are downloaded (subfolder /data) and converted to text using OCR or a text-extraction library (subfolders /text and /ocr); it also holds temporary OCR images (subfolder /images) and temporary output from some subprocesses (subfolder /output)
- Done: finished documents, both the PDFs downloaded through web scraping and the converted TXT files (OCR or text techniques)
1.2 Scripts
- common.py: configuration variables used throughout the process, such as paths, file names, the SQL connection string, the dictionary, the source URL, etc. (a hypothetical sketch follows this list)
- 1_web_scraping.py: explores the source URL looking for newly published sentences
- 2_download.py: downloads the new PDF files and keeps them locally for later processing (steps 1 and 2 are sketched after this list)
- 3_pdf_conversion.py: converts each file to TXT using OCR or text-extraction techniques (the quality of the conversion is calculated) and keeps the result in the tmp folder; see the conversion sketch further below
- 4_table_extraction.py: the main extraction program; using NLP, it retrieves the following information from the text: requester, family group, settled ID, latitude and longitude of the land, town, state and municipality of the land, land ID, real estate registration, resolution items, entities involved in the judge's orders, and benefits for the requester (a simplified extraction sketch follows this list)
- 5_oppositor_extraction.py: extracts from the text whether there are any opponents to the request, and identifies the judge and/or magistrate who dictated the resolution
- 6_full_index_extraction.py: automatically generates an index of the PDF by looking for titles and subtitles and following the numbering conventions of the texts; the index is also sent to the database to make it available to the end user (sketched after this list)
- 7_solution_extraction.py: extracts only the resolution section of the text, where every item is examined to validate whether an entity is involved in it and which benefits were granted to the requester
- 8_solution_index_extraction.py: focused only on the sub-items of the resolution section; the results are stored in the database to make them available to the end user
- 9_prediction_model.py: uses the previously trained models to predict the 13 business replies; these replies (YES/NO) are shown to users in the application
- model_setup_training.py: this script was only used during development of the solution; it persists the models for future use in production. Models are stored as pickled objects in the data folder (see the persistence sketch after this list)
- Frontend: the full front end of our solution; it contains the standard folders (apps, assets, datasets, and images) as well as index.py, the starting script of the application.
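To make the folder layout and the role of common.py concrete, here is a minimal, hypothetical sketch of the kind of configuration module it could be; every path, name, and value below is an assumption for illustration, not the actual repository content:

```python
# common.py -- hypothetical sketch of the shared configuration module.
# All paths, URLs, and names below are illustrative assumptions.
from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent

# Folder layout described above
DATA_DIR = BASE_DIR / "data"            # fixed data (dictionaries, pickled models)
TMP_DIR = BASE_DIR / "tmp"              # in-process documents
TMP_DATA_DIR = TMP_DIR / "data"         # freshly downloaded PDFs
TMP_TEXT_DIR = TMP_DIR / "text"         # text-library conversions
TMP_OCR_DIR = TMP_DIR / "ocr"           # OCR conversions
TMP_IMAGES_DIR = TMP_DIR / "images"     # temporary OCR page images
TMP_OUTPUT_DIR = TMP_DIR / "output"     # temporary subprocess output
DONE_DIR = BASE_DIR / "done"            # finished PDFs and TXT files

# Source and storage settings (placeholders, not the real values)
SOURCE_URL = "https://example.org/urt/sentencias"
SQL_CONNECTION_STRING = "postgresql://user:password@localhost:5432/smartscan"

def ensure_folders() -> None:
    """Create the working folders if they do not exist yet."""
    for folder in (DATA_DIR, TMP_DATA_DIR, TMP_TEXT_DIR, TMP_OCR_DIR,
                   TMP_IMAGES_DIR, TMP_OUTPUT_DIR, DONE_DIR):
        folder.mkdir(parents=True, exist_ok=True)
```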
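Steps 1 and 2 could look roughly like the sketch below, which assumes requests and BeautifulSoup; the source URL, the link pattern, and the absolute-link assumption are all placeholders:

```python
# Hypothetical sketch of steps 1 and 2: find newly published sentence PDFs
# and download them locally. The URL and HTML structure are assumptions.
from pathlib import Path

import requests
from bs4 import BeautifulSoup

SOURCE_URL = "https://example.org/urt/sentencias"  # placeholder source URL
TMP_DATA_DIR = Path("tmp/data")

def find_pdf_links(url: str) -> list[str]:
    """Scrape the source page and return links to PDF files.

    Assumes the page exposes absolute links ending in .pdf.
    """
    page = requests.get(url, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].lower().endswith(".pdf")]

def download_new_pdfs(links: list[str]) -> None:
    """Download each PDF that has not been fetched before."""
    TMP_DATA_DIR.mkdir(parents=True, exist_ok=True)
    for link in links:
        target = TMP_DATA_DIR / link.rsplit("/", 1)[-1]
        if target.exists():  # skip sentences already downloaded
            continue
        response = requests.get(link, timeout=60)
        response.raise_for_status()
        target.write_bytes(response.content)

if __name__ == "__main__":
    download_new_pdfs(find_pdf_links(SOURCE_URL))
```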
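The extraction steps (4, 5, 7, and 8) are NLP-based; as a much simplified illustration, a regex-only sketch might pull a few fields like this (the patterns and field names are assumptions, not the project's actual rules):

```python
# Hypothetical sketch of the kind of pattern matching used in the
# extraction steps. The regular expressions are illustrative only.
import re

def extract_fields(text: str) -> dict:
    """Pull a few example fields out of a converted sentence text."""
    fields = {}
    # Real estate registration ("matrícula inmobiliaria"), e.g. 060-123456
    match = re.search(r"matr[ií]cula\s+inmobiliaria\s+No\.?\s*([\d-]+)",
                      text, re.IGNORECASE)
    if match:
        fields["real_estate_registration"] = match.group(1)
    # Settled ID ("radicado"), typically a long docket number
    match = re.search(r"radicado\s+No\.?\s*([\d-]+)", text, re.IGNORECASE)
    if match:
        fields["settled_id"] = match.group(1)
    # Simple decimal latitude/longitude pair, if present
    match = re.search(r"(-?\d{1,2}\.\d+)[,;\s]+(-?\d{1,3}\.\d+)", text)
    if match:
        fields["latitude"], fields["longitude"] = match.groups()
    return fields
```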
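Step 6 follows the numbering conventions of the texts to build an index. A minimal heading detector could work like this (the heading pattern and the length cutoff are assumptions):

```python
# Hypothetical sketch of index extraction: detect numbered titles and
# subtitles line by line. The pattern and cutoff are assumptions.
import re

HEADING_PATTERN = re.compile(r"^(\d+(\.\d+)*)[.)]?\s+(.+)$")

def build_index(text: str) -> list[tuple[str, str]]:
    """Return (number, title) pairs for lines that look like headings."""
    index = []
    for line in text.splitlines():
        line = line.strip()
        match = HEADING_PATTERN.match(line)
        # Treat short, numbered lines as titles or subtitles
        if match and len(line) < 120:
            index.append((match.group(1), match.group(3)))
    return index
```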
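For model persistence and reuse (model_setup_training.py and 9_prediction_model.py), a minimal sketch assuming scikit-learn-style classifiers and placeholder file names:

```python
# Hypothetical sketch of model persistence (training side) and reuse
# (prediction side). Model type and file names are assumptions.
import pickle
from pathlib import Path

DATA_DIR = Path("data")
MODEL_PATH = DATA_DIR / "reply_models.pkl"  # placeholder file name

def save_models(models: dict) -> None:
    """Persist the trained models as a pickled object (training phase)."""
    DATA_DIR.mkdir(parents=True, exist_ok=True)
    with open(MODEL_PATH, "wb") as handle:
        pickle.dump(models, handle)

def predict_replies(features) -> dict:
    """Load the pickled models and produce the YES/NO replies."""
    with open(MODEL_PATH, "rb") as handle:
        models = pickle.load(handle)
    return {name: ("YES" if model.predict([features])[0] else "NO")
            for name, model in models.items()}
```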
In general, the flow below explains the main features of the solution:
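As a rough illustration of that flow, the numbered scripts could be chained in order; the runner below is an assumption, and only the script names come from the list above:

```python
# Hypothetical runner chaining the pipeline steps in order.
import subprocess
import sys

PIPELINE = [
    "1_web_scraping.py",
    "2_download.py",
    "3_pdf_conversion.py",
    "4_table_extraction.py",
    "5_oppositor_extraction.py",
    "6_full_index_extraction.py",
    "7_solution_extraction.py",
    "8_solution_index_extraction.py",
    "9_prediction_model.py",
]

for script in PIPELINE:
    # Stop the pipeline as soon as one step fails
    result = subprocess.run([sys.executable, script])
    if result.returncode != 0:
        sys.exit(f"Step {script} failed with code {result.returncode}")
```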
The conversion process itself works as follows:
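A minimal sketch of that logic, assuming pdfminer.six for text-based PDFs and pdf2image plus pytesseract for scanned ones; the quality heuristic and threshold are assumptions:

```python
# Hypothetical sketch of step 3: try direct text extraction first and
# fall back to OCR when the result looks too poor. Threshold is assumed.
from pdfminer.high_level import extract_text
from pdf2image import convert_from_path
import pytesseract

def text_quality(text: str) -> float:
    """Rough quality score: share of alphanumeric/whitespace characters."""
    if not text:
        return 0.0
    return sum(c.isalnum() or c.isspace() for c in text) / len(text)

def convert_pdf(path: str, threshold: float = 0.8) -> tuple[str, str]:
    """Return (text, method); method is 'txt' or 'ocr'."""
    text = extract_text(path)
    if text_quality(text) >= threshold:
        return text, "txt"
    # OCR fallback: render each page to an image, then recognize it
    pages = convert_from_path(path)
    ocr_text = "\n".join(pytesseract.image_to_string(page, lang="spa")
                         for page in pages)
    return ocr_text, "ocr"
```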
See the architecture below for a summary, but we also invite you to read this document.
Also check the ER model here (you can open the ER model image to zoom in):