Mill Test Reports (MTRs) Information Extraction
An initial attempt at digitizing an MTR database from hard copies. The ultimate goal is to build comprehensive digital databases from different scanned forms and to make materials quality control and examination tasks easier and more efficient for engineers.
Prerequisites
OCR:
- [Tesseract-OCR]: an installation guide for Windows can be found here.
- [pytesseract]
- [PIL]
PDF & text processing:
- [tabula]
- [PyPDF2]
- [shutil]
- [fuzzywuzzy]
- [re]
- [cv2]
Generic:
- [Numpy]
- [pandas]
- [csv]
- [json]
- [timeit]
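For orientation, here is a minimal sketch of how the OCR pieces fit together, assuming a scanned page has already been exported as an image (the file path and the Windows executable path below are placeholders):

```python
# Run Tesseract OCR on a scanned MTR page via pytesseract and PIL.
from PIL import Image
import pytesseract

# On Windows, point pytesseract at the Tesseract executable if it is
# not on the PATH (the path below is the default install location).
# pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

image = Image.open("page.png")              # placeholder scanned page
text = pytesseract.image_to_string(image)   # OCR the page to plain text
print(text)
```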
How it works
- The idea, simply, is to extract the essential content, i.e. the chemical analysis, and eventually all useful content from MTRs, and organize it into a structured database that can be deployed for multiple purposes.
- We use Tesseract-OCR (optical character recognition) followed by NLP (natural language processing) to build the preliminary database.
- The raw data: scanned MTRs that come in many different formats
- The process:
  - Define the dictionaries of sub-vendor locations used for extraction
  - Split PDFs into single-page files and convert them to text (see the splitting sketch after this list)
  - Based on the converted text, fuzzy-match sub-vendors (see the matching sketch after this list). If a file is top quality (class 1), extract its content directly to text files using PDF tools and move it to the top-class folder; if not, crop the PDF based on the location information
  - Convert the cropped files to text files. Based on the quality of the content, decide whether to move a file to the second-quality folder or leave it as a bottom-quality file
  - Add an 'ID' field with the file name and page information, and a 'class' field based on the folder where the file lies
  - Parse all class 1 files into a preliminary database: split, merge, and match columns; drop useless columns and rows; transpose side-by-side content; and use regular expressions to double-check that the extracted content is a chemical analysis (see the validation sketch after this list)
  - Export the database to a non-coder-friendly .csv file
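The splitting step referenced above might look like the following sketch, using the current PyPDF2 API (the file names are placeholders):

```python
# Split a multi-page vendor PDF into single-page files so each
# sub-vendor MTR can be classified independently.
from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("vendor_mtrs.pdf")  # placeholder input file
for page_number, page in enumerate(reader.pages, start=1):
    writer = PdfWriter()
    writer.add_page(page)
    # one output file per page, e.g. vendor_mtrs_page1.pdf
    with open(f"vendor_mtrs_page{page_number}.pdf", "wb") as handle:
        writer.write(handle)
```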
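The fuzzy-matching step could be sketched as below; the sub-vendor dictionary, score threshold, and folder layout are assumptions for illustration, not the project's actual values:

```python
# Classify a page by fuzzy-matching its OCR text against known
# sub-vendor names, then route the file accordingly.
import shutil
from fuzzywuzzy import process

SUB_VENDORS = {"ACME Steel": "class1", "Example Mills": "class2"}  # illustrative

def classify(page_text, pdf_path):
    """Return the quality class for a page and move top-quality files."""
    match, score = process.extractOne(page_text, list(SUB_VENDORS))
    if score >= 90 and SUB_VENDORS[match] == "class1":
        shutil.move(pdf_path, "output/class1/")  # extract directly from these
        return "class1"
    # lower-quality pages are cropped by location and re-OCRed instead
    return "unclassified"
```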
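Finally, a sketch of the regular-expression validation and the CSV export; the element pattern, the match threshold, and the sample record are illustrative placeholders, not real data:

```python
# Verify that extracted text looks like a chemical analysis (element
# symbols followed by decimal percentages) before it enters the database.
import re
import pandas as pd

CHEM_PATTERN = re.compile(r"\b(C|Mn|P|S|Si|Cr|Ni|Mo|Cu)\b\s*[:=]?\s*\d*\.\d+")

def looks_like_chemistry(text):
    """Require several element/value pairs before accepting the content."""
    return len(CHEM_PATTERN.findall(text)) >= 3

# Placeholder record showing the ID/class fields described above.
records = [{"ID": "vendor_mtrs_page1", "class": 1, "C": 0.21, "Mn": 0.95}]
pd.DataFrame(records).to_csv("mtr_database.csv", index=False)
```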
File Organization
- InformationExtraction: Main folder
- code: python files
- demos: Jupyter notebooks with all content and the corresponding converted HTML files
- dictionaries: location dictionaries defined for information extraction
- output: sequentially processed output from the program
- Data: raw data. MTRs from vendors; each file contains multiple pages, usually multiple MTRs from several sub-vendors that supply materials to the same vendor
- StudyMaterials: project related readings
- PresentationSlides: all touchdown slides
- Experiments: previous explorations, including an IBM Watson test, tests on old samples, etc., that are not included in the final model
- README: this file
- README for non_coders: more detailed documentation in layman's terms
To-dos
- Orientation detection
- Class 2 database clean-up
- Ask for experts' input when confidence in the extracted content is low
Acknowledgments
- Special thanks to Deval for guidance along the way
- To Patrick and Xin for providing domain expertise
- To Ming for providing insights
- And to everyone else here who made my internship at Shell extra wonderful
- To my friends from Microsoft & IBM for suggestions and ideas
Authors
- Ginny Zhu - Initial work - github profile
- Coded by Ginny, with love :)
Due to privacy issues, actual content and data cannot be uploaded.