# Data Science in 30 Minutes: Spark Streaming and Basic Analysis
To view the talk that goes along with this repo, click here.
You can easily install all of the Python requirements with Continuum Analytics' conda - if you haven't heard of it yet, we'd highly recommend taking a look!
The easiest way to install all these packages is the following, once you've gotten conda installed:
`conda env create --name ds30 --file environment.yml`
More importantly, you'll need a working PySpark install (with `pyspark` on your path). You can download Spark here.
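A quick way to confirm that `pyspark` is reachable from your shell is sketched below. It uses only the Python standard library; the function name `find_pyspark` is ours for illustration and is not part of this repo.

```python
import shutil


def find_pyspark():
    """Return the full path of the pyspark launcher, or None if it is not on PATH."""
    return shutil.which("pyspark")


path = find_pyspark()
if path is None:
    # Assumption: Spark's launcher scripts live in the bin/ directory of the Spark download.
    print("pyspark not found on PATH; add the bin/ directory of your Spark download to PATH")
else:
    print("pyspark found at", path)
```

If the check fails, adding Spark's `bin/` directory to `PATH` in your shell profile is usually enough.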
The presentation uses Jupyter notebooks, with a Scala/Spark kernel for ingesting data and a Python kernel for analysis.
The following will help you duplicate our (admittedly aged) kernel setup. We'll assume that you have already installed a Python environment, IPython, and Jupyter through either Anaconda or some other method.
- You will need a working Java installation with `$JAVA_HOME` set. On Ubuntu, you can, for example, run `sudo apt-get install default-jdk`.
- Download the kernel archive with `wget https://oss.sonatype.org/content/repositories/snapshots/sh/jove/jove-spark-cli_1.3_2.10/0.1.1-1-SNAPSHOT/jove-spark-cli_1.3_2.10-0.1.1-1-SNAPSHOT.tar.gz`
- Unpack with `tar xvf jove-spark-cli....tar.gz`
- `mv jove-spark-cli...SNAPSHOT jove-spark` for convenience
- Run `./jove-spark/bin/jove-spark-1.3 --kernel-spec`
- Check your installed kernels with `jupyter kernelspec list`. You should see the Spark kernel installed.
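If you'd rather check the result of these steps from Python, here is a minimal sketch. It assumes you pass it the output of `jupyter kernelspec list --json`, and it guesses that the Spark kernel has "spark" somewhere in its name or display name. The helper names and the sample kernel names are hypothetical; the name the kernel actually registers under may differ.

```python
import json
import os


def spark_kernels(kernelspec_json):
    """Return names of kernels whose name or display name mentions 'spark'."""
    specs = json.loads(kernelspec_json).get("kernelspecs", {})
    found = []
    for name, info in specs.items():
        display = info.get("spec", {}).get("display_name", "")
        if "spark" in name.lower() or "spark" in display.lower():
            found.append(name)
    return sorted(found)


def java_home_ok(env=os.environ):
    """Return True if JAVA_HOME is set and points at an existing directory."""
    value = env.get("JAVA_HOME", "")
    return bool(value) and os.path.isdir(value)


# Hypothetical sample shaped like the output of `jupyter kernelspec list --json`.
sample = json.dumps({"kernelspecs": {
    "python3": {"resource_dir": "/tmp/python3", "spec": {"display_name": "Python 3"}},
    "spark": {"resource_dir": "/tmp/spark", "spec": {"display_name": "Scala (Spark)"}},
}})
print(spark_kernels(sample))  # ['spark']
print("JAVA_HOME looks valid:", java_home_ok())
```

To run it against your own machine, replace `sample` with the text printed by `jupyter kernelspec list --json`.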
If you choose to use a different (newer) kernel, the setup may vary. The three dependencies you'll need in the Scala/Spark kernel are:
- org.apache.spark %% spark-streaming % 1.3.1
- org.apache.spark %% spark-streaming-twitter % 1.3.1
- com.google.code.gson % gson % 2.4
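The dependencies above are written in sbt style, where `%%` means "append the Scala binary version to the artifact name". Some newer kernels expect Maven-style `group:artifact:version` coordinates instead, so the sketch below shows the translation. The default of Scala 2.10 is an assumption taken from the `_2.10` suffix of the kernel archive name, and `maven_coordinate` is a helper written for illustration only.

```python
def maven_coordinate(sbt_dep, scala_binary_version="2.10"):
    """Convert 'group %% artifact % version' (sbt style) to 'group:artifact:version'."""
    tokens = sbt_dep.split()
    if len(tokens) != 5 or tokens[1] not in ("%", "%%") or tokens[3] != "%":
        raise ValueError("unrecognized dependency: " + sbt_dep)
    group, op, artifact, _, version = tokens
    if op == "%%":
        # %% asks sbt to append the Scala binary version to the artifact name.
        artifact = artifact + "_" + scala_binary_version
    return ":".join([group, artifact, version])


deps = [
    "org.apache.spark %% spark-streaming % 1.3.1",
    "org.apache.spark %% spark-streaming-twitter % 1.3.1",
    "com.google.code.gson % gson % 2.4",
]
for dep in deps:
    print(maven_coordinate(dep))
# org.apache.spark:spark-streaming_2.10:1.3.1
# org.apache.spark:spark-streaming-twitter_2.10:1.3.1
# com.google.code.gson:gson:2.4
```

How you hand these coordinates to the kernel depends on the kernel you choose, so check its documentation for the exact syntax.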
Lastly, for Twitter data, you'll need to register an application and enter your credentials in the `twitter4j.properties` file.
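Twitter4J reads OAuth credentials from four properties: `oauth.consumerKey`, `oauth.consumerSecret`, `oauth.accessToken`, and `oauth.accessTokenSecret`. As a rough sanity check before starting the stream, the sketch below reports which of them are missing or empty. The helper name and the placeholder values are ours, and the parser handles only simple `key=value` lines, not the full Java properties syntax.

```python
REQUIRED_KEYS = (
    "oauth.consumerKey",
    "oauth.consumerSecret",
    "oauth.accessToken",
    "oauth.accessTokenSecret",
)


def missing_twitter_keys(text):
    """Return the required keys that are absent or empty in the properties text."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return [key for key in REQUIRED_KEYS if not values.get(key)]


# Placeholder content; real values come from your registered Twitter application.
example = """
oauth.consumerKey=YOUR_CONSUMER_KEY
oauth.consumerSecret=YOUR_CONSUMER_SECRET
oauth.accessToken=
"""
print(missing_twitter_keys(example))
# ['oauth.accessToken', 'oauth.accessTokenSecret']
```

To check your real file, pass `open("twitter4j.properties").read()` instead of `example`; an empty list means all four credentials are present.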
This talk was created by Ariel M'ndange-Pfupfu, a Data Scientist at The Data Incubator. He has worked on a variety of data science, software engineering, and curriculum development projects and is also a current Bleeker Fellow. He earned his Master’s degree at Stanford and his Ph.D. in Materials Science & Engineering from Northwestern.
Databricks, one of the largest contributors to the Apache Spark project, has been instrumental in developing and supporting Spark education. The reference applications book was very useful for building this talk.
[The Data Incubator](https://www.thedataincubator.com/) is a data science education company based in NYC, DC, and SF with both corporate training and recruiting services. For [data science corporate training](https://www.thedataincubator.com/training.html), we offer customized, in-house training solutions in data and analytics. For [data science hiring](https://www.thedataincubator.com/hiring.html), we run a [free 8-week fellowship](https://www.thedataincubator.com/fellowship.html) training PhDs to become data scientists. The fellowship selects 2% of its 2000+ quarterly applicants and is free for Fellows. Hiring companies (including eBay, Capital One, Pfizer) pay a recruiting fee only if they successfully hire. You can read about us on [Harvard Business Review](https://hbr.org/2014/08/the-question-to-ask-before-hiring-a-data-scientist/), [VentureBeat](http://venturebeat.com/2014/04/15/ny-gets-new-bootcamp-for-data-scientists-its-free-but-harder-to-get-into-than-harvard/), or [The Next Web](http://thenextweb.com/insider/2015/07/02/data-incubator-opens-a-west-coast-campus-to-groom-the-next-generation-of-data-scientists/), or read about our alumni at [LinkedIn](http://blog.thedataincubator.com/2016/05/alumni-spotlight-xia-hong/), [Palantir](http://blog.thedataincubator.com/2015/02/moving-to-palantir-from-mathematics-alumni-spotlight-on-justin-bush/), or the [NYTimes](http://blog.thedataincubator.com/2015/02/alumni-spotlight-dorian-goldman-using-a-pure-math-background-to-solve-problems-for-the-new-york-times/).
For information on upcoming events, visit our [Eventbrite](http://www.eventbrite.com/o/the-data-incubator-8342209540).