
Dathena Challenge on Kaggle

Primary language: Jupyter Notebook. License: MIT.


To understand our project, please read our PDF report.



Pipeline.ipynb: Main notebook that builds the pipeline for this project (using the other modules) and creates the submission file.

Data_frame_creation.ipynb: Reads and extracts features from all documents and saves them as a pandas DataFrame (in safe/df_file_final.csv).

Test.ipynb: Some unit tests for the utility functions.

Train_sentiment_analyser.ipynb: Notebook that trains our sentiment analyser CNN. Warning: a good GPU is strongly recommended for training. Training uses either Amazon reviews or the IMDB movie reviews from the Keras datasets.

Train_word_embeddings.ipynb: Based on the corpus (all documents), learns a vector representation for each lemma (word) together with a mapping dictionary.
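
The mapping dictionary mentioned for Train_word_embeddings.ipynb can be pictured with a minimal sketch. It assumes the dictionary maps each lemma to an integer id and reserves id 0 for unknown lemmas; the function names `build_lemma_index` and `encode` are hypothetical and not taken from the notebook, and the embedding training itself is not shown.

```python
from collections import Counter

def build_lemma_index(documents, min_count=1):
    """Map each lemma to an integer id (assumption: id 0 is kept for unknown lemmas)."""
    counts = Counter(lemma for doc in documents for lemma in doc)
    # Sort by descending frequency, then alphabetically, so the mapping is deterministic.
    vocab = sorted(
        (lemma for lemma, count in counts.items() if count >= min_count),
        key=lambda lemma: (-counts[lemma], lemma),
    )
    return {lemma: i for i, lemma in enumerate(vocab, start=1)}

def encode(doc, lemma_index):
    """Turn a list of lemmas into a list of ids, using 0 for lemmas not in the mapping."""
    return [lemma_index.get(lemma, 0) for lemma in doc]

# Toy corpus: each document is already a list of lemmas.
corpus = [["company", "report", "profit"], ["company", "loss"]]
lemma_index = build_lemma_index(corpus)
encoded = encode(["company", "unknown", "loss"], lemma_index)
print(lemma_index)
print(encoded)
```

An id sequence of this kind is what an embedding layer typically consumes, which is why a mapping is stored next to the vectors.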

Python code (modules)

extract_text.py: Functions to extract text data from doc(x), pdf and xls(x) files.

sentiment_analyzer.py: The convolutional neural network built with Keras, including methods to train it and to predict.

text_preprocessing.py: Code to preprocess the text: lemmatisation, vectorization, stop-word removal and some regex cleaning.

text_summarization.py: Code to extract interesting sentences about the companies.

util.py: Utility functions that did not fit in any other file.
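
As a rough illustration of the kind of cleaning described for text_preprocessing.py, here is a minimal sketch covering only the regex cleaning and stop-word removal steps. The function name `preprocess`, the regular expressions and the tiny stop-word list are assumptions made for this example, not the module's actual code; lemmatisation and vectorization are left out.

```python
import re

# Tiny illustrative stop-word list (assumption); a real run would use a full list,
# for example the one shipped with NLTK.
STOP_WORDS = {"a", "an", "the", "of", "and", "is", "in", "to"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip URLs and non-letter characters, then drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep ASCII letters and whitespace only
    return [token for token in text.split() if token not in stop_words]

tokens = preprocess("The profit of ACME Corp. rose 12% in 2018, see http://example.com")
print(tokens)
```

Keeping only ASCII letters is a deliberate simplification here; documents with accented names would need a wider character class.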


submission_mapper.csv: The provided file, slightly modified (the names of the .doc files containing parentheses have been changed).


safe: Contains checkpoints for faster execution of the code, stored as pickle files or as csv files for pandas.

dataset: Empty in this repository; meant to contain the Amazon review dataset.

files: Contains the dataset of this project.

models: Contains the models produced by the code.
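
The checkpoint idea behind the safe folder can be sketched as follows. This is only an illustration of the general pattern (reuse a pickle if it exists, otherwise compute and save); the helper `load_or_compute`, the function `expensive_step` and the file name `features.pkl` are hypothetical and do not come from the project code.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the pickled result at `path` if present, else compute it and save it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def expensive_step():
    """Stand-in for a slow step such as feature extraction."""
    calls.append(1)
    return {"n_documents": 3}

# A temporary directory stands in for the project's safe folder.
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "safe", "features.pkl")
    first = load_or_compute(checkpoint, expensive_step)   # computed and saved
    second = load_or_compute(checkpoint, expensive_step)  # loaded from the pickle

print(first, second, len(calls))
```

Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.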


pip install PyPDF2
pip install python-docx
pip install xlrd
pip install pdfrw
pip install sumy
pip install gensim
pip install nltk
pip install glob2
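
Before running the notebooks, it can help to check that the packages listed above are importable. The sketch below is an optional helper, not part of the project; `missing_packages` is a hypothetical name, and the mapping from pip package names to import names (for example python-docx is imported as `docx`) is written out by hand.

```python
import importlib.util

# pip package name -> importable module name
REQUIRED = {
    "PyPDF2": "PyPDF2",
    "python-docx": "docx",
    "xlrd": "xlrd",
    "pdfrw": "pdfrw",
    "sumy": "sumy",
    "gensim": "gensim",
    "nltk": "nltk",
    "glob2": "glob2",
}

def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]

missing = missing_packages()
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All listed packages are importable.")
```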

python -m spacy download en
import nltk