Spark-based machine learning for capturing word meanings

Data source: tweets

Algorithms: Word2Vec, K-means, PCA

Tools: IBM Watson Studio, Spark, Python

Same techniques can be applied to text documents, product reviews, etc...

This is a tutorial to build Spark-based machine learning models for capturing word meanings. You can learn how to build a word2vec model using Twitter data on IBM's Watson Studio using Apache Spark.

To read the blog, please click here.

Instructions:

Step 1. If you already have an account on IBM's Watson Studio, go to Step 2. If not, you can create an account for free here..

Step 2. Create a project on Watson Studio.

Step 3. Get the data into Watson Studio

Download (without uncompressing) some tweets from here to your lap top. The tweets.gz file contains a 10% sample (using Twitter decahose API) of a 15 minute batch of the public tweets from December 23rd. The size of this compressed file is 116MB (compression ratio is about 10 to 1)
Go to your recently created project on Watson Studio and click on the +New data asset icon
Load the file by clicking on browse or just dropping it
Once the file is loaded, click on Apply to add this file to your project.

You should see your tweets file under the data assets list of your project.

Step 4. Get a python environment and create a jupyter notebook

Go to your environments tab and create a Default Spark Python 3.5 XS environment
Click on From URL (3rd tab), choose a name for your notebook (ex: "Spark-based ML to capture word meaning"), copy and paste this url https://github.com/IBMDataScience/word2vec/blob/master/Spark-based%20machine%20learning%20for%20word%20meanings.ipynb into the Notebook URL rectangle and click on Create Notebook.

You are now in your new notebook and the rest of the instructions are in there.

IBMDataScience/word2vec