This is the source code used by team French Eagles for the Kaggle "Home Depot" competition. It is also a project for the Information Retrieval and Data Mining 2016 module at UCL.
- Clone this repository
- Create a 'data' directory in the project root directory as well as in AllenLucene.
- Download the data files from the Kaggle competition website and unzip them into the 'data' directory in the project root.
- Run before_lucene.py. It should create the files AllenLucene/data/train.csv and AllenLucene/data/test.csv, and the directory AllenLucene/data/files containing 124428 files.
- Open AllenLucene project in IntelliJ IDEA or other Java IDE.
- Include the following jars in your project: opencsv-3.7, lucene-queryparser-5.4.11, lucene-demo-5.4.11, lucene-core-5.4.11, lucene-analyzers-common-5.4.11.
- Run IndexFiles.java. It should create the directory AllenLucene/data/index.
- Run SearchFiles.java. It should create the files AllenLucene/data/lucene_train.csv and AllenLucene/data/lucene_test.csv.
- Run preprocess_forum_stem.py. It should create the files data/features_train.csv and data/features_test.csv.
- Run learn.py. It should create the file data/my_submission.csv.
- Submit the my_submission.csv file on Kaggle and enjoy your result!
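
Since each step depends on the outputs of the previous one, it can help to check which expected files already exist before moving on. The following is a minimal sketch of such a check, not part of the repository: the script and its function names (`missing_outputs`, `first_incomplete_step`, `files_dir_complete`) are hypothetical, and the paths and the file count are taken from the steps above. It assumes it is run from the project root.

```python
from pathlib import Path

# Expected outputs of each step, copied from the list above.
# This script and its function names are illustrative, not part of the repository.
EXPECTED_OUTPUTS = [
    ("before_lucene.py", ["AllenLucene/data/train.csv",
                          "AllenLucene/data/test.csv",
                          "AllenLucene/data/files"]),
    ("IndexFiles.java", ["AllenLucene/data/index"]),
    ("SearchFiles.java", ["AllenLucene/data/lucene_train.csv",
                          "AllenLucene/data/lucene_test.csv"]),
    ("preprocess_forum_stem.py", ["data/features_train.csv",
                                  "data/features_test.csv"]),
    ("learn.py", ["data/my_submission.csv"]),
]

# Number of files before_lucene.py should write to AllenLucene/data/files.
EXPECTED_FILE_COUNT = 124428


def missing_outputs(root="."):
    """Return (step, relative_path) pairs for every expected output not found."""
    root = Path(root)
    missing = []
    for step, paths in EXPECTED_OUTPUTS:
        for rel in paths:
            if not (root / rel).exists():
                missing.append((step, rel))
    return missing


def first_incomplete_step(root="."):
    """Return the name of the first step with a missing output, or None."""
    missing = missing_outputs(root)
    return missing[0][0] if missing else None


def files_dir_complete(root="."):
    """Check that AllenLucene/data/files holds the expected number of files."""
    files_dir = Path(root) / "AllenLucene" / "data" / "files"
    if not files_dir.is_dir():
        return False
    count = sum(1 for p in files_dir.iterdir() if p.is_file())
    return count == EXPECTED_FILE_COUNT


if __name__ == "__main__":
    step = first_incomplete_step()
    if step is None:
        print("All expected outputs are present.")
    else:
        print("Next step to run:", step)
        for _, rel in missing_outputs():
            print("  missing:", rel)
```

The check only tests for existence (plus the file count in AllenLucene/data/files), so it will not catch outputs that are present but stale or truncated.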