In this developer journey we will use Jupyter notebooks to analyze Twitter data and extract interesting insights from tweets. It will easily perform complex computations on a large amount of data in a notebook by using SparkContext, which enables you to run tasks on the Spark cluster. In addition, it will integrate data from dashDB using the Spark connector and use Spark and pandas DataFrames.
When the reader has completed this journey, they will understand how to:
- Create a dashDB warehouse containing Twitter data that includes advanced enrichments like sentiment, gender and location.
- Transform the data using Python code with Apache Spark and pandas data frames.
- Create visualizations using Matplotlib and Google GeoChart.
- Share notebooks and results.
The intended audience for this journey is application developers who need to efficiently build powerful data visualizations, but who may not have an abundance of time, data science experience or a dedicated data science team who could take ownership of such an effort.
- IBM Data Science Experience
- Bluemix Insights for Twitter
- Bluemix dashDB for Analytics
- Bluemix Analytics for Apache Spark
- Jupyter Notebooks
- Python
- pandas (Python Data Analysis Library)
- NumPy
- Wordcloud
- Matplotlib
- Google GeoChart
Follow these steps to setup and run this developer journey. The steps are described in detail below.
- Sign up for the Data Science Experience
- Create Bluemix services
- Configure dashDB
- Create the notebook
- Run the notebook
- Analyze the results
- Save and Share
Sign up for IBM's Data Science Experience. By signing up for the Data Science Experience, two services: DSX-Spark
and DSX-ObjectStore
will be created in your Bluemix account.
Create the following Bluemix services by clicking the Deploy to Bluemix button or by following the links to use the Bluemix UI and create them.
The first step in the analysis process is to harvest the Twitter data. In this sample, we use the dashDB connector to the IBM Insights for Twitter service to seamlessly specify a filter query (tweets about the six leading car manufacturers in 2015) and load the results into dashDB. This specific time frame and filter are being used to demonstrate analysis of a known Twitter spike.
-
Launch the dashDB, from the menu on the left, choose
Load
and selectLoad Twitter Data
. This is a five step process outlined below. -
Connect to Twitter: Select the existing dashDB service you created earlier (if you used the Deploy to Bluemix button it is named "dtaa-dashdb-service").
-
Search: In the Search for Twitter data box, enter the query below and retrieve a sampling of tweets.
posted:2015-01-01,2015-12-31 followers_count:2000 listed_count:1000 (volkswagen OR vw OR toyota OR daimler OR mercedes OR bmw OR gm OR "general motors" OR tesla)
- Select table: In the Load the data into new tables with this prefix field, enter any text you wish for a namespace.
- Load data: This step can take up to 20 minutes, dashDB shows you progress as the tweets load.
- Load complete: Peruse the reports that dashDB shows.
Use the menu on the left to select My Projects
and then Default Project
.
Click on Add notebooks
(upper right) to create a notebook.
- Select the
From URL
tab. - Enter a name for the notebook.
- Optionally, enter a description for the notebook.
- Enter this Notebook URL: https://github.com/IBM/dsx-twitter-auto-analysis/blob/master/notebooks/dsx_twitter_auto_analysis.ipynb
- Use the
Spark Service
pulldown to select yourDSX-Spark
service. - Click the
Create Notebook
button.
At the top of the notebook, provide your dashDB info.
- Make sure you update
SCHEMA
andPREFIX
with values you set when importing Twitter data into dashDB. - Use the
username
,password
, andjdbcurl
from the dashDBService Credentials
tab in BlueMix.
NOTE: You may have to create your own credentials using the New Credential
option.
When a notebook is executed, what is actually happening is that each code cell in the notebook is executed, in order, from top to bottom.
Each code cell is selectable and is preceded by a tag in the left margin. The tag
format is In [x]:
. Depending on the state of the notebook, the x
can be:
- A blank, this indicates that the cell has never been executed.
- A number, this number represents the relative order this code step was executed.
- A
*
, this indicates that the cell is currently executing.
There are several ways to execute the code cells in your notebook:
- One cell at a time.
- Select the cell, and then press the
Play
button in the toolbar.
- Select the cell, and then press the
- Batch mode, in sequential order.
- From the
Cell
menu bar, there are several options available. For example, you canRun All
cells in your notebook, or you canRun All Below
, that will start executing from the first cell under the currently selected cell, and then continue executing all cells that follow.
- From the
- At a scheduled time.
- Press the
Schedule
button located in the top right section of your notebook panel. Here you can schedule your notebook to be executed once at some future time, or repeatedly at your specified interval.
- Press the
The result of running the notebook is a report which may be shared with or without sharing the code. You can share the code for an audience that wants to see how you came your conclusions. The text, code and output/charts are combined in a single web page. For an audience that does not want to see the code, you can share a web page that only shows text and output/charts.
The charts produced in the journey clearly demonstrate the following:
- There was a significant negative peak in tweets about car manufacturers during the period 2015-09-15 to 2015-10-15.
- This spike was clearly dominated by tweets about the Volkswagen emissions scandal.
- The amount of Twitter influence was slightly higher for Volkswagen than other car manufacturers.
- The origin of the Volkswagen tweets was significantly from the United States and Europe.
- The gender of the tweeters for Volkswagen (and all car makers) was significantly unknown. If we were to assume that the known-gender tweets are a good representative, then we could say predominantly male.
Under the File
menu, there are several ways to save your notebook:
Save
will simply save the current state of your notebook, without any version information.Save Version
will save your current state of your notebook with a version tag that contains a date and time stamp. Up to 10 versions of your notebook can be saved, each one retrievable by selecting theRevert To Version
menu item.
You can share your notebook by selecting the “Share” button located in the top right section of your notebook panel. The end result of this action will be a URL link that will display a “read-only” version of your notebook. You have several options to specify exactly what you want shared from your notebook:
Only text and output
: will remove all code cells from the notebook view.All content excluding sensitive code cells
: will remove any code cells that contain a sensitive tag. For example,# @hidden_cell
is used to protect your dashDB credentials from being shared.All content, including code
: displays the notebook as is.- A variety of
download as
options are also available in the menu.
The following is a sample data visualization with code:
For a full example without code see data/examples/sample_output.pdf
.