Advanced Big Data Analytics

##Homework 1

###Questions

1. Download and install Spark, and learn how to use it.
2. Download the Wikipedia dataset. Extract about 100 pages (items) based on your own interests. You may use the snowball method to crawl a few related or linked pages. Compute the TF-IDF of each page.
3. Use the Twitter Streaming API to receive real-time Twitter data. Collect 30 minutes of tweets on 5 companies using keyword=xxx (e.g., ibm). Treat all tweets about one company as a single document. Compute the TF-IDF of each company’s tweets over those 30 minutes.
4. Use Yahoo Finance to receive stock price data. Collect 30 minutes of price data on 5 companies, one value per minute. Use the outlier function to display outliers that are more than two standard deviations from the mean.
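
The TF-IDF weighting asked for in the Wikipedia and Twitter questions can be illustrated with a minimal plain-Python sketch. It is not the Spark-based solution; it only shows the arithmetic on toy data. The helper names `tokenize` and `tf_idf` are hypothetical, and the weighting variant (term frequency normalized by document length, natural-log inverse document frequency) is an assumption, since the assignment does not specify one.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(documents):
    """Return {doc_name: {term: weight}} for a {doc_name: text} mapping.

    Assumed variant: tf = count / tokens in document, idf = ln(N / df).
    """
    term_counts = {name: Counter(tokenize(text)) for name, text in documents.items()}
    n_docs = len(term_counts)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    weights = {}
    for name, counts in term_counts.items():
        total = sum(counts.values())
        weights[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


# Made-up text standing in for one "document" per company.
docs = {
    "ibm": "ibm watson cloud cloud",
    "apple": "apple iphone cloud",
}
scores = tf_idf(docs)
```

With this variant, a term that occurs in every document (here `cloud`) gets weight zero, while terms unique to one document are weighted by how often they occur in it.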

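The two-standard-deviation rule from the stock price question can be sketched as follows. This is a hypothetical stand-in for "the outlier function" named in the question, not that function itself. The name `find_outliers`, the use of the population standard deviation, and the sample prices are all assumptions made for illustration.

```python
import statistics


def find_outliers(values, num_std=2.0):
    """Return (index, value) pairs more than num_std standard deviations from the mean."""
    if len(values) < 2:
        return []
    mean = statistics.mean(values)
    # Assumption: population standard deviation; use statistics.stdev for the sample version.
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical, so nothing can be an outlier
    return [(i, v) for i, v in enumerate(values) if abs(v - mean) > num_std * std]


# Made-up series: 30 one-minute prices with a single spike at the end.
prices = [100.0] * 29 + [130.0]
outliers = find_outliers(prices)
```

Here the mean is 101.0 and the standard deviation is about 5.39, so only the final value of 130.0 is flagged.
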
###Setup

Create and activate a virtual environment:

    $ virtualenv hw1
    $ source hw1/bin/activate

Install the libraries listed in requirements.txt into the virtual environment:

    $ pip install -r requirements.txt

###Solution

bahuljain/Adv-Big-Data-Analytics