Paper: Improving accuracy and speeding up Document Image Classification through parallel systems
SmallTobacco files can be downloaded here. In the Data folder we provide the scripts for extracting the OCR .txt files (ocr_tobacco.py) and for creating the .hdf5 files (ST_hdf5_dataset_creation.py) containing both the images and the OCR data.
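As a rough illustration of what such an .hdf5 file looks like, here is a minimal sketch of writing and reading images paired with OCR strings using h5py. The dataset names ("images", "ocr") and shapes are assumptions for illustration, not the exact layout produced by ST_hdf5_dataset_creation.py:

```python
import h5py
import numpy as np

def write_sample_h5(path):
    """Write a toy .hdf5 file pairing dummy images with OCR strings."""
    images = np.zeros((4, 32, 32, 3), dtype=np.uint8)  # 4 dummy RGB images
    texts = ["invoice", "letter", "memo", "report"]    # dummy OCR output
    with h5py.File(path, "w") as f:
        f.create_dataset("images", data=images)
        # variable-length UTF-8 strings for the OCR text
        dt = h5py.special_dtype(vlen=str)
        f.create_dataset("ocr", data=np.array(texts, dtype=object), dtype=dt)

def read_sample_h5(path):
    """Read the images and OCR strings back from the file."""
    with h5py.File(path, "r") as f:
        # h5py 3.x returns bytes for variable-length strings, so decode
        texts = [t.decode() if isinstance(t, bytes) else t for t in f["ocr"][...]]
        return f["images"][...], texts
```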
BigTobacco files can be downloaded here. ./Data/BT_hdf5_dataset_creation.py creates the train, test and validation .hdf5 files following the partition given in the aforementioned link.
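The general idea of partitioning a document list into train/validation/test splits can be sketched as follows. Note this is only an illustration with made-up ratios; the actual script follows the fixed partition published at the link above:

```python
import random

def split_documents(doc_ids, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle document ids and partition them into train/val/test lists."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])
```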
├── image_model
│   ├── eff_big_training.py           # EfficientNet training on BigTobacco
│   ├── eff_small_training.py         # EfficientNet training on SmallTobacco
│   ├── eff_utils.py                  # EfficientNet helpers shared by Small and Big training
│   ├── H5Dataset.py                  # Dataset class that reads the hdf5 file
│   └── tensorflow
│       └── distr_effnet_shear.py     # EfficientNet
├── text_model
│   ├── main.py                       # BERT training on SmallTobacco
│   ├── bert_utils.py                 # BERT helpers
│   └── training_modules
│       ├── data_utils.py             # data cleaning and H5Dataset class
│       ├── finetuned_models.py       # BERT model definition
│       └── model_utils.py            # train and test procedures
└── ensemble
    ├── ensemble.py                   # ensemble of image and text predictions
    ├── bert_utils.py                 # BERT helpers
    └── ensemble_modules
        ├── data_utils2.py            # data cleaning and H5Dataset_ensemble class
        └── model_utils_ensemble.py   # BERT and EfficientNet predictions and ensemble
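To give a feel for the hdf5-backed Dataset classes listed above, here is a hedged sketch of what such a class might look like. In the repo it would subclass torch.utils.data.Dataset; the dataset names ("images", "ocr") and the lazy-open pattern are assumptions, so check H5Dataset.py for the actual implementation:

```python
import h5py

class H5Dataset:
    """Sketch of an hdf5-backed dataset of images and OCR text.

    In the repo this would subclass torch.utils.data.Dataset and return
    tensors; here we keep it dependency-light for illustration.
    """

    def __init__(self, h5_path):
        self.h5_path = h5_path
        self._file = None  # opened lazily, which plays well with DataLoader workers

    def _f(self):
        if self._file is None:
            self._file = h5py.File(self.h5_path, "r")
        return self._file

    def __len__(self):
        return len(self._f()["images"])

    def __getitem__(self, idx):
        f = self._f()
        return f["images"][idx], f["ocr"][idx]
```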
The efficientnet_pytorch library downloads pretrained models to ~/.cache/torch/checkpoints, and pytorch_transformers downloads them to ~/.cache/torch/pytorch_transformers. If your machine has no internet access, make sure to download the models beforehand and store them in those paths.
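A minimal sketch of preparing those cache directories on an offline machine (the paths match the defaults mentioned above; the model names in the comments, such as 'efficientnet-b0' and 'bert-base-uncased', are illustrative):

```python
import os

# Default cache locations used by the two libraries (relative to $HOME).
EFFNET_CACHE = os.path.expanduser("~/.cache/torch/checkpoints")
BERT_CACHE = os.path.expanduser("~/.cache/torch/pytorch_transformers")

def ensure_cache_dirs():
    """Create the cache directories so pre-downloaded weights can be copied in."""
    for path in (EFFNET_CACHE, BERT_CACHE):
        os.makedirs(path, exist_ok=True)
    return EFFNET_CACHE, BERT_CACHE

# On a machine *with* internet access you would instead trigger the
# downloads once, which fills these caches automatically, e.g.:
#   from efficientnet_pytorch import EfficientNet
#   EfficientNet.from_pretrained('efficientnet-b0')
#   from pytorch_transformers import BertModel
#   BertModel.from_pretrained('bert-base-uncased')
```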