personalized_search_challenge

An attempt at the Kaggle competition Personalized Web Search Challenge (hosted by Yandex).

URL: http://www.kaggle.com/c/yandex-personalized-web-search-challenge

Deadline: Friday, January 10, 2014

Team Members

  • Yosuke Sugishita
  • David Kim
  • Possibly Brendan and David Hsiao
  • Idea: Should we open this up to people in the Data Science Club and local data meetups on Meetup.com? If we end up with too many people (I think 4-5 people per team is the limit), we can form multiple teams and still work together.

Ideas on our team name

  • Asian Revolution
  • West Coasters
  • Canadian Kimchi Roll

File structure

  • script
    • file_manupulation
    • analysis
  • lib
    • Functions / classes to use in other scripts.
  • test
    • Scripts to test functions.
  • data
    • Contains all the data, such as the train and test sets. Not committed because the files are large; download them directly from Kaggle.

About branches / pull requests

All code must be reviewed by at least one other person before being merged into master. Make a branch, write code, test it, and open a pull request. Use short, descriptive names for branches.

Never work directly on master.

Tools

Notes on possible strategies (more on the wiki)

Two ways to look at this problem:

  1. Collaborative filtering (recommender) problem, as in the Netflix Prize winners' solution: http://www2.research.att.com/~volinsky/papers/ieeecomputer.pdf (see the sketch after this list)
  2. Looking at the past clicks a given user has performed. The user is probably more (or less) likely to click pages they have already clicked and liked. => Need to test this.
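For reference, the heart of the matrix factorization approach in the linked Netflix Prize paper is: predicted affinity = global mean + user bias + item bias + dot product of the user's and item's latent factor vectors. A toy sketch (the names here are ours, and whether this transfers to search logs is exactly the open question):

```python
import numpy as np

def predict_affinity(mu, b_user, b_item, p_user, q_item):
    """Latent-factor prediction, as in the linked IEEE Computer paper:
    global mean + user bias + item bias + user/item factor dot product."""
    return mu + b_user + b_item + np.dot(p_user, q_item)

# Toy example with made-up 3-dimensional factor vectors.
p_user = np.array([0.5, 0.1, -0.2])
q_item = np.array([0.3, -0.4, 0.1])
print(predict_affinity(3.6, 0.2, -0.1, p_user, q_item))
```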

Our first strategy is based on 2. (Low-hanging fruit! Yay!)

Here is the paper that inspired this strategy: http://people.csail.mit.edu/teevan/work/publications/papers/wsdm11-pnav.pdf
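A minimal sketch of that idea (the function name and the exact promotion rule are ours, not from the paper): given a SERP as an ordered list of URL IDs plus the set of URLs the user clicked in earlier sessions, move the previously clicked URLs to the top while keeping the original order otherwise.

```python
def rerank_serp(serp_urls, past_clicked_urls):
    """Promote URLs this user has clicked before.

    serp_urls: list of URL IDs in their original (Yandex) order.
    past_clicked_urls: set of URL IDs the user clicked in earlier sessions.
    Returns previously clicked URLs first (keeping their relative order),
    then everything else in the original order.
    """
    clicked = [u for u in serp_urls if u in past_clicked_urls]
    rest = [u for u in serp_urls if u not in past_clicked_urls]
    return clicked + rest

# Example: URL 42 was clicked before, so it moves to the top.
print(rerank_serp([7, 42, 13], {42}))  # -> [42, 7, 13]
```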

Some notes on the data

The train file is big (16 GB when uncompressed).

We need to think about how to handle this. Perhaps use a database, like SQLite or MySQL? I (Yosuke) suspect we can try our first strategies on a randomly sampled subset of the data. How would we go about that? One possibility is sketched below.
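One possible way to do the sampling without loading the whole file into memory (a sketch, assuming every log line starts with a tab-separated SessionID as the data-format description on Kaggle suggests, so that hashing that field keeps whole sessions together):

```python
import zlib

def sample_sessions(in_path, out_path, keep_percent=1):
    """Stream the raw log and write out a deterministic ~keep_percent%
    sample of whole sessions, so the 16 GB file never sits in memory."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            session_id = line.split("\t", 1)[0]  # first field on every line
            # Deterministic hash: a session is always either fully in or out.
            if zlib.crc32(session_id.encode()) % 100 < keep_percent:
                fout.write(line)

# e.g. sample_sessions("data/train", "data/train_1pct", keep_percent=1)
```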

Train and test

In the competition, the first 27 days are used as train data, and the last 3 days as test data. (http://www.kaggle.com/c/yandex-personalized-web-search-challenge/data)

Perhaps we can test our model locally, using the first 24 days as train and the next 3 days as test. A sketch of that split follows.
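A sketch of that local split, assuming (per the data-format description) that each session starts with a metadata line of the form SessionID TAB M TAB Day TAB UserID, that days are numbered from 1, and that all of a session's lines follow its metadata line:

```python
def split_by_day(in_path, train_path, test_path, train_days=24):
    """Route whole sessions into a local train or test file by day."""
    with open(in_path) as fin, \
         open(train_path, "w") as ftrain, \
         open(test_path, "w") as ftest:
        out = ftrain
        for line in fin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 2 and fields[1] == "M":  # session metadata line
                # Every line until the next metadata line belongs to this
                # session, so switching the output file once per session works.
                out = ftrain if int(fields[2]) <= train_days else ftest
            out.write(line)

# e.g. split_by_day("data/train_1pct", "data/local_train", "data/local_test")
```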