Data-Mining-Assignment-TICNN

Repository containing the code for the paper implemented for the Data Mining Assignment.


This document contains instructions to rebuild the preprocessed data, reproduce our results, and test our pre-trained HDF5 models. TICNN_REPORT contains the technical documentation of our implementation, along with a description of the proposed novel TI-CNN-TITLE-1000 model and the experimental results obtained.

To obtain the preprocessed dataset from the original dataset, refer to the INITIAL_PREPROCESSING/ folder. TICNN_Preprocessing.ipynb preprocesses the original dataset and saves final_text_df.pkl (the final dataframe used by the text-only models) and final_image_df.pkl (an intermediary dataframe containing the datapoints whose images were retrieved successfully, along with explicit image features). A pre-trained Caffe model is used to obtain the explicit image attributes (refer to the report).
Preprocessing_image_files.ipynb then preprocesses final_image_df.pkl to produce df_final_new.pkl, a dataframe containing the preprocessed images ready for use in the models that take image data. To skip these steps, download the pickle files from the following links and place them in the TICNN_Implementation/TICNN/ folder so the code runs without path changes.
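
As a quick sanity check before training, the preprocessed pickles can be loaded with pandas. This is only a minimal sketch: the file names match those produced by the preprocessing notebooks, but the directory layout and any column inspection are assumptions, not the exact code in the notebooks.

```python
import pandas as pd

# Folder where the preprocessed pickles are placed (built locally or
# downloaded from the drive links).
base_dir = "TICNN_Implementation/TICNN/"

text_df = pd.read_pickle(base_dir + "final_text_df.pkl")   # text-only models
image_df = pd.read_pickle(base_dir + "df_final_new.pkl")   # image-based models

# Basic sanity checks: dataframe shapes and available columns.
print(text_df.shape, image_df.shape)
print(text_df.columns.tolist())
```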

We present code for four models, namely GRU-400, LSTM-400, CNN-Text-1000, and our novel TICNN-TITLE-1000 model, which is an improved version of the TICNN-1000 model described in the original paper. For further details of these models, refer to the documentation and presentations provided. The GRU-400, CNN-Text-1000, LSTM-400, and TICNN-TITLE-1000 notebooks contain the training code for the respective models. All models require the publicly available GloVe 100-dimensional embeddings (glove-100) file for the embedding layer. The preprocessed pickle files from the previous section are also required, depending on the model's modality. Please place these files in your mounted Google Drive when running the Colab notebooks. HDF5 files for the trained models are saved automatically and can be used for inference. All models have been cross-validated. To skip training, use the already trained HDF5 files provided here. For the directory structure, please refer to the drive link.
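
For reference, below is a minimal sketch of how a GloVe-100 file is typically loaded into an embedding matrix for a Keras Embedding layer. The file name, vocabulary size, and tokenizer word index are assumptions for illustration, not the exact setup used in the notebooks.

```python
import numpy as np

EMBEDDING_DIM = 100                 # GloVe-100 vectors
GLOVE_PATH = "glove.6B.100d.txt"    # assumed name of the public GloVe-100 file

# Parse the GloVe file into a word -> vector lookup.
embeddings_index = {}
with open(GLOVE_PATH, encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

def build_embedding_matrix(word_index, max_words=20000):
    """word_index is the (hypothetical) Keras Tokenizer vocabulary fitted on
    the news text; rows of the returned matrix initialise the Embedding layer,
    e.g. Embedding(max_words, EMBEDDING_DIM, weights=[matrix], trainable=False)."""
    matrix = np.zeros((max_words, EMBEDDING_DIM))
    for word, i in word_index.items():
        if i < max_words and word in embeddings_index:
            matrix[i] = embeddings_index[word]
    return matrix
```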

The GRU-400, CNN-Text-1000, LSTM-400, and TICNN-TITLE-1000 notebooks also contain code for running inference with the respective models on the test set. If only inference is to be verified, the following pre-trained models can be used directly.
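
A minimal inference sketch using a saved HDF5 model is shown below; the model file name, input shape, and decision threshold are illustrative assumptions and should be replaced with those of the model you download.

```python
import numpy as np
from tensorflow.keras.models import load_model

# Load a pre-trained model from its HDF5 file (illustrative file name).
model = load_model("TICNN_TITLE_1000.hdf5")

# X_test would come from the preprocessed pickles: padded token sequences,
# plus image tensors for the TICNN-TITLE-1000 model. Placeholder shape here.
X_test = np.zeros((1, 1000))

probs = model.predict(X_test)
preds = (probs > 0.5).astype(int)   # assumed binary fake/real labelling
print(preds)
```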

Credits

This project was done as a partial requirement for the Data Mining course under Dr. Yashwardhan Sharma, BITS Pilani, Pilani Campus. Contributors: Naman Goenka, Himanshu Pandey, Ayush Singh, and Harshita Gupta (all contributors have contributed equally).