Overview

The program is designed for Insight Data Engineering Code Challenge, which is implemented in python mainly

#Software requirement:

Require python 2.7, collections, requests.
The program is implemented in linux, need "split" and "wc" command. For windows user, please use cygwin

#To run the code:

use "run.sh" to run the code and use 'gen_input.sh' to generate fake tweet for testing if necessary. There will be three options
[1] calculate feature 1 and 2 use single thread (for small amount of data,~500,000 lines of tweet)
[2] calculate feature 1 and 2 using multi thread (recommended for large data)
[3] clean all the data (if it says "rm: cannot remove '/*’: No such file or directory", you are still good. It means the file has been removed previously)

#Design Idea:

For storage
use dictionary and list in python to store counted word and median, cPickle is used to store final results
For word count
use collections.Counter for counting, split tweet by space. After parsing, merge two dictionary together
Find median
use bisect.insort to sort the median array [o(nlog(n)) complexity] so that it can find median in o(1)
Implementation detail
calculation, result storage and progress recording are done in feature_manager.py
It can resume unfinished work (every 500,000 lines by default)
Test input
"gen_test_input.py" is used to generate random strings for testing purpose
Parallel computing
to deal with large data, "multi_worker.py" do things in parallel.(6 thread by default)
for large file, split it into multiple small chunks for processing (15 small chunks by default)
Customize setting
please edit setting in "common_var.py"

#IMPORTANT NOTE:

I assume previous tweets are not modified and every new tweet come in the end of "tweets.txt"
The code will record the number of tweet that has been parsed, neglected those parsed
It can not deal with modification of previous tweet, but this function can be added easily
Please clear data before switching between single worker and multiple workers

#Some test results:

The result is tested on a Lenovo W540 (i7, 8GB). The test inputs are random strings

Number of lines	Single worker	Multiple worker (6 threads)
200,000	8.5538s	14.0397086s
500,000	36s	26.7449s
1,000,000	144.865358829s	91.3351509571s

randxie/Insight-Data-Challenge-2015

Overview