v2 -- Instead, I am going to do this a bit differently and create individual PredictionIO applications, each dealing with a single question/answer pairing. This seems like a much more straightforward method. There are several benefits to this approach:
- A user can mark the fairness of each answer individually, which is the most accurate way to make this sort of judgment
- A user can see the feedback for each answer individually and understand why their dealing is not fair. It is much stronger to say that a dealing is not fair because of xx and yy.
v1 -- The original version, in which I attempted to assess the fairness of the dealing in its entirety in PredictionIO.
- Install Docker
- Get the Community PredictionIO Docker image
- Install the Community Docker image (instructions at the above link)
- Clone this repo inside the Docker image. You will need to make six copies, one for each question.
- Edit engine.json and update the eventName for each engine (a sketch of the relevant fragment appears after this list)
- Import the stop words and sample_dealing JSON files for each engine.
- Install the PredictionIO engines. You will need to do this six times, one for each question (see the build/train/deploy sketch after this list).
- Import some test data, repeating the following for each of the six engines:
  - pio import --appid fairdeal-answer1 --input data/stopwords.json
  - pio import --appid fairdeal-answer1 --input data/sample_dealing.json
- Test the engines (see the query example after this list)
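
For the engine.json edit above, the relevant fragment might look roughly like the sketch below. The surrounding structure follows the PredictionIO text classification template, and the appName/eventName values (fairdeal-answer1, answer1, and so on for each question) are illustrative, not necessarily the exact ones this repo expects:

```json
{
  "id": "default",
  "description": "Engine for the first question/answer pairing",
  "engineFactory": "org.template.textclassification.TextClassificationEngine",
  "datasource": {
    "params": {
      "appName": "fairdeal-answer1",
      "eventName": "answer1"
    }
  }
}
```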
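
Installing the engines means building, training, and deploying each copy. A minimal sketch, assuming the six copies live in sibling directories named fairdeal-answer1 through fairdeal-answer6 (the directory names and ports are illustrative assumptions):

```sh
# Build, train, and deploy each engine copy on its own port.
# Directory names and ports are illustrative assumptions.
for i in 1 2 3 4 5 6; do
  (
    cd "fairdeal-answer$i"
    pio build                    # compile the engine
    pio train                    # train the model on the imported events
    pio deploy --port "800$i" &  # pio deploy blocks, so run it in the background
  )
done
```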
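
To test a deployed engine, you can POST a query to its queries.json endpoint. The "text" field follows the text classification template; the port and sample text are illustrative:

```sh
# Query the engine deployed on port 8001; the payload is an illustrative example.
curl -H "Content-Type: application/json" \
     -d '{ "text": "I copied a single chapter for private study" }' \
     http://localhost:8001/queries.json
```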
PredictionIO has excellent documentation that I recommend you read:
Text Classification Tutorial has a Quick Start guide and implementation details.
Event API has a good guide on how to send event data and queries to PredictionIO through the API. This is how the Rails application communicates with PredictionIO for fairdeal.
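
As a rough sketch of what that communication looks like, a single answer could be sent to the PredictionIO Event Server as an HTTP event. The access key, event name, entity values, and properties below are illustrative assumptions, not the exact ones fairdeal uses:

```sh
# POST one event to the PredictionIO Event Server (default port 7070).
# ACCESS_KEY, the event name, and the property names are assumptions.
curl -i -X POST "http://localhost:7070/events.json?accessKey=$ACCESS_KEY" \
     -H "Content-Type: application/json" \
     -d '{
           "event": "answer1",
           "entityType": "dealing",
           "entityId": "d42",
           "properties": { "text": "I copied a single chapter for private study" }
         }'
```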
Handling multiple events was my initial plan for sending a dealing through to PredictionIO and handling everything at once. This tutorial discusses how different factors can be weighted.
Restructured and redesigned the preparator and algorithm; memory usage is lower and run time is faster.
Moved BIDMach, VW & SPPMI algorithm changes to the bidmach branch temporarily.
Fixed DataSource to read "content" and "e-mail", and to use the label "spam", for the tutorial data. Fixed engine.json for the default algorithm setting.
Modified PreparedData to use MLlib hashing and tf-idf implementations.
Fixed dot product implementation in the predict methods to work with batch predict method for evaluation.
Included three different data sets: e-mail spam, 20 newsgroups, and the Rotten Tomatoes sentiment analysis set. Included a multinomial logistic regression algorithm for text classification.
Fixed an import script bug occurring with Python 2.
Changed data import Python script to pull straight from the 20 newsgroups page.