/Insight-Data-Challenge-2015

For insight code challenge

Primary LanguageOpenEdge ABL

Overview

The program is designed for Insight Data Engineering Code Challenge, which is implemented in python mainly

#Software requirement:

Require python 2.7, collections, requests.
The program is implemented in linux, need "split" and "wc" command. For windows user, please use cygwin

#To run the code:

use "run.sh" to run the code and use 'gen_input.sh' to generate fake tweet for testing if necessary. There will be three options
[1] calculate feature 1 and 2 use single thread (for small amount of data,~500,000 lines of tweet)
[2] calculate feature 1 and 2 using multi thread (recommended for large data)
[3] clean all the data (if it says "rm: cannot remove '/*’: No such file or directory", you are still good. It means the file has been removed previously)

#Design Idea:

  • For storage
    use dictionary and list in python to store counted word and median, cPickle is used to store final results
  • For word count
    use collections.Counter for counting, split tweet by space. After parsing, merge two dictionary together
  • Find median
    use bisect.insort to sort the median array [o(nlog(n)) complexity] so that it can find median in o(1)
  • Implementation detail
    calculation, result storage and progress recording are done in feature_manager.py
    It can resume unfinished work (every 500,000 lines by default)
  • Test input
    "gen_test_input.py" is used to generate random strings for testing purpose
  • Parallel computing
    to deal with large data, "multi_worker.py" do things in parallel.(6 thread by default)
    for large file, split it into multiple small chunks for processing (15 small chunks by default)
  • Customize setting
    please edit setting in "common_var.py"

#IMPORTANT NOTE:

  • I assume previous tweets are not modified and every new tweet come in the end of "tweets.txt"
  • The code will record the number of tweet that has been parsed, neglected those parsed
  • It can not deal with modification of previous tweet, but this function can be added easily
  • Please clear data before switching between single worker and multiple workers

#Some test results:

The result is tested on a Lenovo W540 (i7, 8GB). The test inputs are random strings

Number of lines Single worker Multiple worker (6 threads)
200,000 8.5538s 14.0397086s
500,000 36s 26.7449s
1,000,000 144.865358829s 91.3351509571s