InclusionCriteria

Instruction Creator: Xiaoru Dong, Linh Hoang
Date of preparation: 12-14-2018
Manuscript working title: Machine classification of inclusion criteria from Cochrane systematic reviews
Manuscript authors: Xiaoru Dong, Jingyi Xie, Linh Hoang, and Jodi Schneider

INSTRUCTIONS

Description:

These instructions describe the steps needed to replicate the results in the manuscript. There are 2 main sections:

  1. Python program: Steps to run the Python script. Python is used to generate features and to create the Weka input files corresponding to 3 feature extraction and selection approaches that we implemented in this study:
    • Features generated by the bag of words feature extraction strategy.
    • Features selected by the information gain feature selection strategy.
    • Features selected by a manual analysis feature selection strategy.
  2. Weka: Steps to run Weka to build the classifiers using three algorithms: Naive Bayes, Random Forest, and J48.
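
As an illustration of what the Python section produces, here is a minimal sketch of bag of words feature extraction written out as a Weka ARFF file. It is not the code from the notebook: the function name, the tokenization rule, the "word_" attribute prefix, and the "yes"/"no" class labels are all assumptions made for this example.

```python
import re


def bag_of_words_arff(sentences, labels, relation="inclusion_criteria"):
    """Build an ARFF string with one binary attribute per distinct word.

    Illustrative only: the tokenizer and attribute naming are assumptions,
    not the choices made in the study's notebook.
    """
    # Tokenize each sentence into a set of lowercase alphabetic words.
    tokenized = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    vocabulary = sorted(set().union(*tokenized))
    classes = sorted(set(labels))

    lines = [f"@relation {relation}", ""]
    # Prefix attribute names so that no word collides with "class".
    lines += [f"@attribute word_{w} {{0,1}}" for w in vocabulary]
    lines.append(f"@attribute class {{{','.join(classes)}}}")
    lines += ["", "@data"]
    # One data row per sentence: 1 if the word occurs, 0 otherwise.
    for words, label in zip(tokenized, labels):
        row = ["1" if w in words else "0" for w in vocabulary]
        lines.append(",".join(row + [label]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up sentences and placeholder labels, for demonstration only.
    print(bag_of_words_arff(
        ["Adults with asthma", "Children with asthma"],
        ["yes", "no"],
    ))
```

Each distinct word becomes one nominal {0,1} attribute, and the last attribute holds the class, which is the layout Weka uses by default when it picks the class attribute.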

1. Python program:

  • Programming Language: Python (version 3.0)

  • Please make sure that you have Python 3 and Jupyter Notebook on your machine in order to run the script.

  • Please follow these steps in order to run the Python script:

    • Step 1: Download the source code from the GitHub site: https://github.com/XiaoruDong/InclusionCriteria/blob/master/code_classification_fall2018.ipynb

    • Step 2: Download the input file “Inclusion_Criteria_Annotation.csv” (one of the study’s data files), which is available at: https://doi.org/10.13012/B2IDB-5958960_V1 . Note where you store the file.

    • Step 3: Open the source code in Jupyter Notebook.

    • Step 4: Change the “path” variable in the source code to the path of your own folder where you stored the input file.

    • Step 5: Run the whole script to get all of the output files.

    • Step 6: Check the output files, which will be in the same folder where you stored the input file. There should be 7 output files:
      --> 2 data files (2 of the 5 data files that we deposited and reported in the manuscript):
      “AllWords.csv”: list of all words (features) generated by the bag of words feature extraction strategy.
      “WordsSelectedByInformationGain.csv”: list of words (features) selected by the Information Gain feature selection strategy.
      --> 3 Weka input files (which will be used to run classifiers in Weka later):
      “AllWords_weka_input.arff”: Weka input file to run classifiers with “All words” features.
      “WordsSelectedByInformationGain_weka_input.arff”: Weka input file to run classifiers with “Words selected by Information Gain” features.
      “ManualAnalysis_Words_weka_input.arff”: Weka input file to run classifiers with “Words selected by Manual Analysis” features.
      --> 2 temporary output files:
      “AllWord_Noredundant.csv”: a temporary file that contains the list of words after eliminating words with the same meaning.
      “AllWord_Noredundant_weka_input.arff”: Weka input file for the Information Gain feature selection.
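
The check on the output files can also be scripted. The sketch below lists which of the 7 expected files are missing from a folder. The file names come from the list above; the function name and the "folder" argument are assumptions made for this example, and "folder" should be the same location you assigned to the “path” variable.

```python
from pathlib import Path

# File names taken from the list of expected output files above.
EXPECTED_OUTPUTS = [
    "AllWords.csv",
    "WordsSelectedByInformationGain.csv",
    "AllWords_weka_input.arff",
    "WordsSelectedByInformationGain_weka_input.arff",
    "ManualAnalysis_Words_weka_input.arff",
    "AllWord_Noredundant.csv",
    "AllWord_Noredundant_weka_input.arff",
]


def missing_outputs(folder):
    """Return the expected output files that are not present in folder."""
    folder = Path(folder)
    return [name for name in EXPECTED_OUTPUTS if not (folder / name).is_file()]


if __name__ == "__main__":
    # "." is a placeholder; replace it with your own output folder.
    missing = missing_outputs(".")
    print("All 7 output files found." if not missing else f"Missing: {missing}")
```

An empty list means that all 7 files were written.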

  • Notes

    • The other data file reported in the manuscript, “WordSelectedByManualAnalysis.csv”, was created manually, not generated by the Python script. Therefore, it is not in the list of output files.
    • The two output files “AllWord_Noredundant.csv” and “AllWord_Noredundant_weka_input.arff” are temporary files and are not reported in the manuscript, because they are just versions of “AllWords” in which words with the same meaning were eliminated. We used them as the input for Weka to run Information Gain feature selection, not for running classifiers.

2. Weka program:

  • Please make sure that you have Weka on your machine in order to implement the classifiers: https://www.cs.waikato.ac.nz/ml/weka/downloading.html

  • Steps to run Weka:

    • Step 1: Open Weka on your machine and select “Explorer” mode.
    • Step 2: On the “Preprocess” tab:
      --> Click “Open file” and select the Weka input file that you want to classify with. For example: if you want to build a classifier with all features, select the “AllWords_weka_input.arff” Weka input file as shown in the screenshot below.
      [Screenshot 1]
      --> Click “All” to choose all of the words and use them as features to train the classifier as shown in the screenshot below.
      [Screenshot 2]
    • Step 3: On the “Classify” tab:
      --> Click “Choose” to select the algorithm that you want to run. For example: if you want to run a classifier using the “Random Forest” algorithm, select RandomForest as shown in the screenshot below:
      [Screenshot 3]
      --> Select “Percentage split” under “Test options” and enter 90 (this uses 90% of the data set for training and the remaining 10% for testing).
      --> Click “Start” to run the classifier:
      [Screenshot 4]
    • Step 4: Get the classifier results. Three measurements were reported in our manuscript: Precision, Recall, and F-Measure, as shown in the screenshot below.
      [Screenshot 5]
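
Weka computes these three measurements itself. The sketch below only illustrates how they are defined for one class, so that the numbers in the Weka output are easier to read. The function name and the example labels are assumptions made for this illustration, and the standard definitions are used (F-Measure as the harmonic mean of Precision and Recall).

```python
def precision_recall_f_measure(true_labels, predicted_labels, positive):
    """Compute Precision, Recall and F-Measure for the class `positive`."""
    pairs = list(zip(true_labels, predicted_labels))
    # True positives: predicted as `positive` and actually `positive`.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: predicted as `positive` but actually another class.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: actually `positive` but predicted as another class.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # Made-up labels for five test instances, for demonstration only.
    actual = ["a", "a", "a", "b", "b"]
    predicted = ["a", "a", "b", "a", "b"]
    print(precision_recall_f_measure(actual, predicted, positive="a"))
```

Weka reports these values per class and as a weighted average across classes in the “Detailed Accuracy By Class” part of its output.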
  • Notes

    • For each Weka input file that we generated from the Python program section and each algorithm (Random Forest, Naïve Bayes, J48), we built one classifier. Therefore, in total, we implemented 9 classifiers as reported in detail in the manuscript.
    • We also used Weka to run Information Gain feature selection. To do so:
      --> On the “Select attributes” tab:
      Click “Choose” and select “InfoGainAttributeEval” as shown in the screenshot below.
      [Screenshot 6]
      Click “Start” to run the Information Gain feature selection.
      --> Weka generated a list of informative words selected by the Information Gain feature selection strategy. We then used the Python script (above) to generate the data file “WordsSelectedByInformationGain.csv” and the Weka input file “WordsSelectedByInformationGain_weka_input.arff” accordingly.
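
For readers who want to see what Information Gain measures, here is a minimal sketch of the standard definition: the entropy of the class minus the entropy of the class once the value of a feature (here, whether a word is present) is known. It is an illustration under that assumption, not Weka's implementation, and the function names and example data are made up for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Entropy of the class minus its entropy conditioned on the feature."""
    total = len(labels)
    # Group the class labels by the value of the feature.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(
        (len(group) / total) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - conditional


if __name__ == "__main__":
    labels = ["a", "a", "b", "b"]
    # A word that appears only in class "a" sentences separates the classes.
    print(information_gain([1, 1, 0, 0], labels))
    # A word spread evenly over both classes tells us nothing about the class.
    print(information_gain([1, 0, 1, 0], labels))
```

Words with a higher value are more informative about the class, which is the basis on which an Information Gain ranking orders the features.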

For any questions about these instructions, please contact:
Linh Hoang - lhoang2@illinois.edu.