##Homework 1
###Questions
- Download and install Spark, and learn how to use it.
- Download the Wikipedia dataset. Extract about 100 pages (items) based on your own interests. You may use the snowball method to crawl a few related or linked pages. Compute the TF-IDF of each page.
- Use the Twitter Streaming API to receive real-time Twitter data. Collect 30 minutes of tweets on 5 companies using keyword=xxx (e.g., ibm). Treat all tweets about one company as a single document. Compute the TF-IDF of each company’s tweets from those 30 minutes.
- Use Yahoo Finance to receive stock price data. Collect 30 minutes of price data on 5 companies, one value per minute. Use the outlier function to display outliers that are larger than two standard deviations.
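
The TF-IDF computation asked for in the Wikipedia and Twitter questions can be sketched in plain Python as below. This is only an illustration of the arithmetic, not the Spark-based solution: the helper names `tokenize` and `tf_idf`, the toy corpus, and the particular weighting (term count divided by document length, multiplied by `log(N / df)`) are all assumptions, and other TF-IDF variants are equally valid.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tf_idf(documents):
    """Return {doc_name: {term: weight}} for a dict of {doc_name: text}.

    Assumed weighting: tf = count / document length, idf = log(N / df).
    """
    tokens = {name: tokenize(text) for name, text in documents.items()}
    n_docs = len(tokens)

    # Document frequency: the number of documents that contain each term.
    df = Counter()
    for words in tokens.values():
        df.update(set(words))

    scores = {}
    for name, words in tokens.items():
        counts = Counter(words)
        total = len(words)
        scores[name] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores


# Hypothetical toy corpus standing in for pages or per-company tweets.
docs = {
    "ibm": "ibm cloud watson cloud",
    "apple": "apple iphone cloud",
}
weights = tf_idf(docs)
```

With this weighting, a term that occurs in every document (here `cloud`) gets a weight of zero, while terms unique to one document are weighted by how often they occur in it.
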
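
The question does not say which outlier function is meant, so the following is a minimal sketch of one reasonable reading: flag every value that lies more than two standard deviations from the mean of the series. The function name `find_outliers`, the use of the population standard deviation, and the sample prices are assumptions made for illustration.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` standard
    deviations from the mean of `values`.

    Assumption: the population standard deviation is used.
    """
    if len(values) < 2:
        return []
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant series has no outliers.
        return []
    return [
        (i, v) for i, v in enumerate(values)
        if abs(v - mean) > threshold * std
    ]


# Hypothetical data: 30 one-minute prices with a single spike at the end.
prices = [100.0] * 29 + [130.0]
outliers = find_outliers(prices)
```

Running the function once per company on its 30 collected prices yields the minutes at which the price deviates unusually from that company's average.
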
###Setup
- First, set up and activate a virtual environment:

        $ virtualenv hw1
        $ source hw1/bin/activate

- Then install the libraries listed in `requirements.txt` into the virtual environment:

        $ pip install -r requirements.txt

###Solution