Principles of Big Data Management : Disease Analysis

1. About the Project

We chose ‘Diseases’ as the topic for our big data analysis. Working from thousands of tweets posted by different people, we produced several analyses of how diseases are discussed on Twitter. First, we collected tweets from the Twitter API using keywords related to diseases. We then analyzed the collected data and wrote SQL queries that produce a result for each analysis.

2. System Architecture

First, we generated credentials for accessing Twitter. Using these credentials, we wrote a Python program to collect tweets based on keywords related to diseases. The tweets were stored in a text file in JSON format. This JSON file is then passed to SQL queries for analysis, written as a Scala program using Spark and run in IntelliJ.
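The loading step can be sketched in plain Python as follows. This is only an illustration, not the project's collector or Spark code: it assumes the text file holds one JSON tweet per line, and the function name load_tweets and the sample lines are hypothetical.

```python
import json

def load_tweets(lines):
    """Parse one JSON tweet per line, skipping blank or malformed lines."""
    tweets = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweets.append(json.loads(line))
        except json.JSONDecodeError:
            # A partially written line should not abort the whole load.
            continue
    return tweets

# Hypothetical sample standing in for the lines of the tweets file.
sample_lines = [
    '{"text": "flu season", "user": {"screen_name": "a"}}',
    "",
    "not json",
]
tweets = load_tweets(sample_lines)
```

With a real file, the same function could be called as load_tweets(open(path, encoding="utf-8")), where path is the location of the collected tweets file.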

3. Analyzing Twitter Data

Query 1: Popular Tweets on Different Diseases

This query fetches each disease and the number of tweets about it in the file. It is written using RDDs: tweets are filtered by disease hashtag, and the count for each disease is then printed.
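The filter-and-count logic might look like the following plain-Python sketch. It is not the RDD code used in the project. It assumes hashtags are read from the entities.hashtags field of each tweet, and the disease list, function name, and sample tweets are hypothetical.

```python
from collections import Counter

# Hypothetical keyword list; the project's actual list is not given.
DISEASES = ["flu", "cancer", "diabetes"]

def count_by_disease(tweets, diseases=DISEASES):
    """Count tweets that carry a hashtag matching each disease name."""
    counts = Counter()
    for tweet in tweets:
        hashtags = tweet.get("entities", {}).get("hashtags", [])
        tags = {h["text"].lower() for h in hashtags}
        for disease in diseases:
            if disease in tags:
                counts[disease] += 1
    return counts

sample_tweets = [
    {"entities": {"hashtags": [{"text": "Flu"}, {"text": "health"}]}},
    {"entities": {"hashtags": [{"text": "flu"}, {"text": "cancer"}]}},
    {"entities": {"hashtags": []}},
]
disease_counts = count_by_disease(sample_tweets)
```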

Query 2: Countries That Tweeted Most on Diseases (Google Maps)

This query fetches the countries that tweeted most on diseases. First, the location of each tweet is fetched from the tweets file and the count per location is displayed. The data is then stored in .csv format, and the file is read to produce a visualization on Google Maps.
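A minimal sketch of the count-and-export step, in plain Python rather than Spark. It assumes the country is taken from the place.country field of a tweet, which may not be the field the project used; the function names, CSV header, and sample tweets are hypothetical.

```python
import csv
import io
from collections import Counter

def country_counts(tweets):
    """Return (country, count) pairs, most frequent first."""
    counts = Counter()
    for tweet in tweets:
        place = tweet.get("place") or {}  # "place" is often null
        country = place.get("country")
        if country:
            counts[country] += 1
    return counts.most_common()

def to_csv(rows):
    """Serialize the pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["country", "count"])
    writer.writerows(rows)
    return buffer.getvalue()

sample_tweets = [
    {"place": {"country": "India"}},
    {"place": {"country": "Mexico"}},
    {"place": None},
    {"place": {"country": "India"}},
]
rows = country_counts(sample_tweets)
csv_text = to_csv(rows)
```

The resulting CSV text can be written to a file and loaded by whatever mapping front end is used for the visualization.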

Query 3: Popular Hashtags

In this query, we took the popular hashtags text file from Blackboard and performed a JOIN operation with the hashtags from the disease tweets file. The fetched data is stored in .csv format for visualization.
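The join can be illustrated with a plain-Python inner join on hashtag text. This is a sketch under assumptions, not the project's SQL: it assumes the popular hashtags file has one hashtag per line and that tweet hashtags come from entities.hashtags. The function name and sample data are hypothetical.

```python
from collections import Counter

def join_hashtags(popular, tweets):
    """Keep only tweet hashtags that also appear in the popular list."""
    popular_set = {tag.strip().lower().lstrip("#") for tag in popular}
    counts = Counter(
        h["text"].lower()
        for tweet in tweets
        for h in tweet.get("entities", {}).get("hashtags", [])
    )
    # Inner join: a hashtag survives only if both sides contain it.
    return {tag: n for tag, n in counts.items() if tag in popular_set}

popular_lines = ["#Health", "#News", "#Flu"]
sample_tweets = [
    {"entities": {"hashtags": [{"text": "flu"}, {"text": "zika"}]}},
    {"entities": {"hashtags": [{"text": "Flu"}, {"text": "health"}]}},
]
joined = join_hashtags(popular_lines, sample_tweets)
```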

Query 4: Most Popular Tweeted Words

This query fetches the most frequently occurring words in tweets on diseases. The fetched data is then visualized dynamically.
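A word-frequency count of this kind might be sketched as follows in plain Python. The tokenizing regular expression and the small stop-word list are assumptions added for illustration; the source does not say how words were split or filtered.

```python
import re
from collections import Counter

# Hypothetical stop-word list; adjust to the data at hand.
STOPWORDS = {"the", "a", "of", "is", "and", "to", "in", "rt"}

def top_words(tweets, n=10):
    """Return the n most frequent non-stop-words across tweet texts."""
    counts = Counter()
    for tweet in tweets:
        words = re.findall(r"[a-z']+", tweet.get("text", "").lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)

sample_tweets = [
    {"text": "The flu is spreading"},
    {"text": "Flu shots help"},
]
most_common = top_words(sample_tweets, n=1)
```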

Query 5: Day of the Week with the Most Tweets on Diseases

This query determines the day of the week on which the most tweets on diseases are posted. First, created_at is fetched from the tweets file, and the tweets are then counted for each day of the week.
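The per-day count can be sketched in plain Python as below. It assumes created_at uses the standard Twitter API v1.1 timestamp layout (for example "Wed Oct 10 20:19:24 +0000 2018") and an English locale for day and month names; the function name and sample tweets are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Layout of created_at in Twitter API v1.1 tweet objects.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def tweets_per_weekday(tweets):
    """Count tweets by the weekday name derived from created_at."""
    counts = Counter()
    for tweet in tweets:
        created = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
        counts[created.strftime("%A")] += 1
    return counts

sample_tweets = [
    {"created_at": "Wed Oct 10 20:19:24 +0000 2018"},
    {"created_at": "Wed Oct 10 21:00:00 +0000 2018"},
    {"created_at": "Mon Oct 01 08:00:00 +0000 2018"},
]
weekday_counts = tweets_per_weekday(sample_tweets)
busiest_day = weekday_counts.most_common(1)[0][0]
```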

Query 6: Top 10 Users Who Tweeted on Diseases

This query fetches the top 10 users who tweeted most on diseases. It is written using RDDs. First, the top-tweeting user is fetched for each disease, and an RDD UNION is then used to combine the results for all diseases. The results are stored in a .csv file for visualization.
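The per-disease ranking followed by a union might be sketched as follows. This plain-Python version only mirrors the shape of the RDD approach: list concatenation stands in for the RDD union, and matching a disease by substring in the tweet text is an assumption. The function names, disease list, and sample tweets are hypothetical.

```python
from collections import Counter

def top_user_for(tweets, disease):
    """Return [(disease, screen_name, count)] for the most active user."""
    counts = Counter(
        t["user"]["screen_name"]
        for t in tweets
        if disease in t.get("text", "").lower()
    )
    return [(disease, user, n) for user, n in counts.most_common(1)]

def top_users(tweets, diseases, limit=10):
    """Combine the per-disease results and keep the highest counts."""
    rows = []
    for disease in diseases:
        rows.extend(top_user_for(tweets, disease))  # stands in for RDD union
    return sorted(rows, key=lambda row: row[2], reverse=True)[:limit]

sample_tweets = [
    {"text": "Flu shots today", "user": {"screen_name": "alice"}},
    {"text": "flu again", "user": {"screen_name": "alice"}},
    {"text": "Cancer research news", "user": {"screen_name": "bob"}},
]
ranking = top_users(sample_tweets, ["flu", "cancer", "malaria"])
```

A disease with no matching tweets, such as "malaria" in the sample, simply contributes no row to the combined result.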

Query 7: Follower ID Count Using the Twitter API

The Twitter GET followers/ids API is used. A query is written to display five screen names from the tweets file. When the query is executed, a table with ten screen names is displayed.

val request = new HttpGet("https://api.twitter.com/1.1/followers/ids.json?cursor=-1&screen_name=" + name)

First, the user is prompted to enter a screen name of their choice. Once a screen name has been entered, for example RevistaCOFEPRIS, the follower ID count for that user is displayed.
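The request URL and the counting step can be sketched in plain Python without making a network call. The URL and its cursor and screen_name parameters come from the request shown above; the function names are hypothetical, and the response body is a made-up sample that follows the shape of a followers/ids response, with follower IDs in an "ids" array. A real request would also need OAuth-signed credentials.

```python
import json
from urllib.parse import urlencode

FOLLOWERS_IDS_URL = "https://api.twitter.com/1.1/followers/ids.json"

def build_followers_url(screen_name, cursor=-1):
    """Build the followers/ids request URL for a screen name."""
    query = urlencode({"cursor": cursor, "screen_name": screen_name})
    return FOLLOWERS_IDS_URL + "?" + query

def count_follower_ids(response_body):
    """Count the follower IDs in one page of a followers/ids response."""
    return len(json.loads(response_body).get("ids", []))

# Made-up response body for illustration only; these are not real IDs.
sample_body = '{"ids": [101, 102, 103], "next_cursor": 0, "previous_cursor": 0}'
url = build_followers_url("RevistaCOFEPRIS")
follower_count = count_follower_ids(sample_body)
```

This counts only the IDs in a single response page. For accounts with many followers, the next_cursor value would have to be followed to retrieve and count the remaining pages.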

4. Related Links

Phase-1 Document: https://github.com/cmoulika009/Principles-of-Big-Data-Management/blob/master/PB%20Phase-1-%20Team%2011/PRINCIPLES%20OF%20BIG%20DATA%20MANAGEMENT%20PHASE%201.pdf

Phase-2 Document: https://github.com/cmoulika009/Principles-of-Big-Data-Management/blob/master/PB%20Phase-2-%20Team%2011/PB%20Phase-2%20Team-11.pdf

Final Project Document: https://github.com/cmoulika009/Principles-of-Big-Data-Management/blob/master/PB%20Phase-3-%20Team-11/PB%20Phase-3%20Team-11.pdf

Tweet Location: https://www.dropbox.com/s/04zebrisw6jm6n0/Disease_Tweets.json?dl=0

YouTube Video: https://youtu.be/dRO-2chnycM