Kaggle_AllenAIscience: A Python repository from se4u

#################################################################################################
# Kaggle "The Allen AI Science Challenge" competition
# Oct 2015 - Feb 2016
# Kaggle username: "Cardal" 
#################################################################################################

This file explains how to run the ai2_Cardal.py code, which generates my solution for the
Kaggle AI2 competition.

1. System and Dependencies
   The code is written in Python. I used Python 2.7.7 on Windows 7 Professional (8GB RAM) for
   most of the runs.
   The following Python libraries are used by the script:
     numpy (1.8.1), scipy (0.14.0), pandas (0.14.0), sklearn (0.15.2), nltk (3.0.5)
   Since I couldn't get PyLucene to work on my system, I ran the parts that use Lucene on 
   a Linux machine (16GB RAM) with Python 2.6.6 and the following libraries:
     numpy (1.8.1), scipy (0.14.0), pandas (0.15.2), sklearn (0.16.0), nltk (3.1), lucene (pylucene-4.7.2-1)
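   To see how a local environment compares with the Windows versions listed above, a small
   check along these lines may help. It is only a sketch: the EXPECTED table and the
   installed_version/report helpers are illustrative names, not part of ai2_Cardal.py, and it
   assumes each library exposes a __version__ attribute (lucene is left out because it may
   report its version differently).

```python
import importlib

# Versions listed above for the Windows setup (illustrative table, not read by ai2_Cardal.py).
EXPECTED = {
    "numpy": "1.8.1",
    "scipy": "0.14.0",
    "pandas": "0.14.0",
    "sklearn": "0.15.2",
    "nltk": "3.0.5",
}

def installed_version(name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)

def report():
    """Map each library name to an (expected, installed) pair."""
    return dict((name, (want, installed_version(name)))
                for name, want in EXPECTED.items())
```

   Calling report() and printing the pairs shows at a glance which libraries differ from the
   versions above; a mismatch does not necessarily mean the script will fail.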
   
2. Files
   Extract the archive to a directory on your system. 
   Copy the test set file to the "input" folder and change the settings as described below.
   
3. Resources
   The model uses several resources that were approved in the competition forum (in
   the "External Data Repository" thread).
   All files are supplied in the "corpus" folder, except the Wikipedia dumps and the pdf
   text books from www.ck12.org (see below).
   (a) AI2 - data downloaded from the AI2 web-site: http://allenai.org/data.html
       I extracted all the text from these datasets and organized it into a single
       corpus file at corpus/AI2_data/ai2_corpus.txt 
   (b) CK12 - I downloaded two types of resources from http://www.ck12.org:
       First, I used the HTML files in the EPUB file "Concepts_b_v8_vdt.epub".
       Second, I downloaded the science-related books in pdf format ("FlexBooks"), and converted them
       into text files using MS Word (with the "Save as" option). The converted files are
       under corpus/CK12 with the ".text" suffix (I kept the original name and replaced only the
       suffix); the original pdf books are quite large, so they aren't included here.
       I prepared four corpus files from the CK12 books:
       corpus/CK12/OEBPS/ck12.txt - based on the HTML files in "Concepts_b_v8_vdt.epub";
       corpus/CK12/OEBPS/ck12_paragraphs.txt - as above, but using a different criterion
          for splitting the text into sections (paragraphs);
       corpus/CK12/ck12_text.txt - based on the pdf files;
       corpus/CK12/ck12_text_sentences.txt - as above, but using a different criterion
          for splitting the text into sections (sentences).
   (c) Quizlet - study cards from quizlet.com. I parsed the cards (e.g., in order to remove
       wrong answers from multiple-choice questions), and manually edited some of them (e.g.,
       to fix format problems). The full corpus is in corpus/quizlet/quizlet_corpus.txt
   (d) Saylor and OpenStax - books in pdf format from Saylor Academy (http://www.saylor.org/books/) 
       and OpenStax College (https://www.openstaxcollege.org/books).
       I extracted the text from the pdf files using MS Word or (for large files) the pdfminer tool ("pdf2txt.py").
       The full corpus is in corpus/Saylor/saylor_text.txt
   (e) SimpleWiki - I downloaded the SimpleWiki dump file "simplewiki-20151102-pages-articles.xml" (496MB)
       from https://dumps.wikimedia.org/simplewiki/20151102/, and constructed three corpus files by extracting 
       some sections from subsets of the pages. The files are under corpus/simplewiki:
          simplewiki_1.0000_0.0500_0_5_True_True_True_corpus.txt (117MB)
          simplewiki_1.0000_0.1000_0_3_True_True_False_corpus.txt (114MB)
          simplewiki_1.0000_0.0100_0_3_True_True_False_pn59342_corpus.txt (4.4MB)
   (f) StudyStack - study cards from http://www.studystack.com. I parsed the cards (e.g., to
       remove sentences marked as False), and saved four versions under corpus/StudyStack, each 
       containing a different subset of the cards:
          studystack_corpus.txt  (72MB)
          studystack_corpus2.txt  (4MB)
          studystack_corpus3.txt  (9MB)
          studystack_corpus4.txt (17MB)
   (g) UtahOER - Utah Science Open Educational Resources textbooks downloaded from
       http://www.schools.utah.gov/CURR/science/OER.aspx. I extracted the text from the pdf
       files using the pdfminer tool. The entire corpus is in corpus/UtahOER/oer_text.txt
   (h) Wiki - I downloaded the Wiki dump file "enwiki-20150901-pages-articles.xml" (50.6GB)
       from https://dumps.wikimedia.org/enwiki/20150901/. I created two corpus files, containing
       different sections from subsets of the pages, under corpus/wiki:
          wiki_1.0000_0.0200_0_5_True_True_False_corpus.txt (161MB)
          wiki_0.5000_0.1000_0_5_True_True_False_pn59342_corpus.txt (33MB)
   (i) Wikibooks - I downloaded the WikiBooks dump file "enwikibooks-20151102-pages-articles.xml" (560MB)
       from https://dumps.wikimedia.org/enwikibooks/20151102/, and created a corpus file with
       some sections from a subset of the pages:
          wikibooks_1.0000_0.0200_0_10_True_True_False_corpus.txt (126MB)

4. Settings
   The SETTINGS.json file contains the following parameters:
   "BASE_DIR"   - the directory into which you extracted the archive
   "INPUT_DIR"  - directory with input files (training, validation, test sets)
   "CORPUS_DIR" - directory with corpus files, organized in sub-folders
   "SUBMISSION_DIR"  - directory for the submission (solution) files
   "TRAINING_FILE"   - training set file
   "VALIDATION_FILE" - validation set file
   "TESTING_FILE"    - test set file

=> IMPORTANT: Please change "BASE_DIR" and "TESTING_FILE" to the actual names of the directory in
              which you extracted the files and the test set file, respectively. The other
              settings should probably not be changed.
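   Before a long run it can be worth confirming that SETTINGS.json parses and contains all
   seven parameters listed above. The sketch below is one way to do that; the helper names
   (missing_keys, load_settings) are made up for this example, and it does not claim to mirror
   how ai2_Cardal.py itself reads the file.

```python
import json

# The seven parameters described in the Settings section.
REQUIRED_KEYS = [
    "BASE_DIR",
    "INPUT_DIR",
    "CORPUS_DIR",
    "SUBMISSION_DIR",
    "TRAINING_FILE",
    "VALIDATION_FILE",
    "TESTING_FILE",
]

def missing_keys(settings):
    """Return the required parameter names that are absent from the settings dict."""
    return [key for key in REQUIRED_KEYS if key not in settings]

def load_settings(path="SETTINGS.json"):
    """Parse the settings file and fail early if a parameter is missing."""
    with open(path) as handle:
        settings = json.load(handle)
    missing = missing_keys(settings)
    if missing:
        raise KeyError("SETTINGS.json is missing: " + ", ".join(missing))
    return settings
```

   Running load_settings() from the directory that holds SETTINGS.json either returns the
   parsed dictionary or raises an error naming the absent parameters.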
   
5. Execution
   The entire code is in the "ai2_Cardal.py" script. There are two phases for the execution.

   Phase A: Feature preparation
     In this phase, the script computes features for all questions and answers. There are 28 sets
     of features. Set 0 is "basic", in the sense that its features are extracted from the questions
     and answers themselves (in the entire training set), without any additional corpus.
     Each of the other 27 feature sets consists of one or two features that are computed
     by searching for words from the questions and answers in one or more corpora.
     Feature sets 1-20 use my search functions, and some of them are quite slow.
     Feature sets 21-27 use PyLucene with the index files supplied in corpus/lucene_idx[1-7] (if you
     remove these files, the code will prepare the indexes from the corpus files).
     To reduce the total runtime by utilizing multiple CPUs, each set of features can
     be computed separately (and in parallel), by running: "python ai2_Cardal.py prep [0-27]"
     Thus, you should run: 
        python ai2_Cardal.py prep 0
        python ai2_Cardal.py prep 1 
          ...and so on until:
        python ai2_Cardal.py prep 27
     Some features take minutes to calculate; others can take several hours or more.
     The model archive contains features computed for the training and validation sets (in the 
     folders "funcs_train" and "funcs_valid", respectively), so running "prep" as above will
     calculate the features only for the test set. In order to re-compute features, simply
     delete the relevant files from the "funcs_" directories.
     Note: In order to save computations, the code checks whether a question in the test set
           appears in the validation set; if so, it copies the scores each answer attained,
           instead of re-computing them. This assumes that the common questions share the same
           ID in both the validation and test files.
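     Since each "prep" invocation is independent, the 28 commands can be launched from a small
     driver script instead of by hand. The following is a minimal sketch, not part of the
     archive: the helper names and the default of 4 workers are arbitrary choices for this
     example, and it assumes that "python" on the PATH is the interpreter you want and that
     the driver is started from the directory containing ai2_Cardal.py.

```python
import subprocess
from multiprocessing.pool import ThreadPool

FEATURE_SETS = range(28)  # feature sets 0..27, as described above

def prep_command(set_id, python="python", script="ai2_Cardal.py"):
    """Build the command line for one feature set."""
    return [python, script, "prep", str(set_id)]

def run_prep(set_ids=FEATURE_SETS, workers=4):
    """Run the prep commands, at most `workers` at a time; return exit codes by set."""
    set_ids = list(set_ids)
    pool = ThreadPool(workers)
    try:
        # Each thread only waits for its child process, so threads are sufficient here.
        codes = pool.map(lambda i: subprocess.call(prep_command(i)), set_ids)
    finally:
        pool.close()
        pool.join()
    return dict(zip(set_ids, codes))
```

     For example, run_prep(range(0, 21), workers=4) would cover the sets that use the built-in
     search functions, leaving sets 21-27 for a machine where PyLucene is available. A non-zero
     exit code in the returned dictionary marks a set that should be re-run.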

   Phase B: Model construction
     After all the features have been calculated, you should train the model and build the
     solution using:
        python ai2_Cardal.py run
     (Note: This will also compute any features that weren't prepared in the
      previous phase.)
     The "run" command first performs cross-validation on the training set and reports the
     scores (AUC and accuracy). It then trains the model on the complete training data and
     predicts the answers for both the validation and test sets. These solutions are 
     written to files in the "submission" folder.
     The training cross-validation scores should be as follows (by default, the code runs 
     only one CV iteration):
           CV scores   (mean = 0.83054): 0.83054
           CV accuracy (mean = 0.61600): 0.61600
     The model's predictions for the validation set should be:
         validation predictions (8132):
           102501: C (1.34e-01, 6.04e-02, 6.57e-01, 6.17e-02)
           102502: D (1.18e-01, 2.73e-01, 9.23e-02, 4.63e-01)
           ...
           102529: B (7.29e-02, 7.93e-01, 1.11e-01, 6.46e-02)
           102530: A (3.88e-01, 1.36e-01, 1.17e-01, 2.93e-01)
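     In the sample lines above, the printed letter coincides with the largest of the four
     scores when the scores are read in the order A, B, C, D. That reading is inferred from the
     sample output rather than from the script, so treat the helper below (a hypothetical
     name, not a function in ai2_Cardal.py) as an illustration of how to interpret a line.

```python
LETTERS = "ABCD"  # assumed order of the four scores on each line

def predicted_letter(scores):
    """Return the letter whose score is the largest."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return LETTERS[best]
```

     For instance, predicted_letter([1.34e-01, 6.04e-02, 6.57e-01, 6.17e-02]) gives "C", which
     matches the line for question 102501.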
     The solution generated for the validation set in the "submission" directory should be 
     identical to the one supplied in the archive (submission/ai2_cardal_validation_20160204_1846.csv).
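     One way to check that the two validation files match is to compare them line by line
     while ignoring line-ending differences, which can appear when files move between Windows
     and Linux. This is a sketch with made-up helper names; the name of the file that your own
     run produces is not stated here, so the path in the usage note is a placeholder.

```python
def read_lines(path):
    """Read a text file into a list of lines without their line endings."""
    with open(path) as handle:
        return [line.rstrip("\r\n") for line in handle]

def differing_lines(generated, reference):
    """Return (line_number, generated_line, reference_line) for each mismatch."""
    diffs = []
    for i in range(max(len(generated), len(reference))):
        a = generated[i] if i < len(generated) else None
        b = reference[i] if i < len(reference) else None
        if a != b:
            diffs.append((i + 1, a, b))
    return diffs
```

     Usage: differing_lines(read_lines("submission/<your generated file>.csv"),
     read_lines("submission/ai2_cardal_validation_20160204_1846.csv")) returns an empty list
     when the files agree, and otherwise lists the first place to look.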