This repository contains all the documents related to PB-Fall 2016. The course introduces the essential characteristics of Big Data and explains why it demands rethinking how we store, process, and manage massive amounts of structured and unstructured data. It covers the core technical challenges in Big Data management, i.e., the storage, retrieval, and analysis of Big Data. It emphasizes the fundamental concepts, analytical skills, critical thinking, and software skills necessary for solving real-world Big Data problems. Tools such as Apache Hadoop, Pig, Hive, HBase, and IBM Jaql are covered.
Principles of Big Data Management: Disease Analysis
1. About the Project
We chose ‘Diseases’ as our topic for big data analysis. Using thousands of tweets posted by different people, we derived some interesting insights about diseases. First, we collected tweets from the Twitter API based on keywords related to diseases. We then analyzed the collected data and wrote SQL queries that produce the results of the analysis.
2. System Architecture
First, we generated credentials for accessing Twitter. Using these credentials, we wrote a Python program to collect tweets based on keywords related to diseases. The tweets were stored in a text file in JSON format. This JSON file is then loaded into Spark and analyzed with SQL queries from a Scala program developed in IntelliJ.
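The loading step can be illustrated with a minimal plain-Python sketch. It is not the project's collector or its Spark code: it assumes the text file holds one JSON tweet per line, and the keyword list and function names are hypothetical.

```python
import json

# Hypothetical keyword list; the project's actual keywords are not given.
KEYWORDS = ["diabetes", "cancer", "flu"]

def parse_tweets(lines):
    """Parse one JSON tweet per line, skipping blank or malformed lines."""
    tweets = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweets.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # collected files can contain truncated records
    return tweets

def matches_keywords(tweet, keywords=KEYWORDS):
    """True if the tweet text mentions any keyword (case-insensitive)."""
    text = tweet.get("text", "").lower()
    return any(k in text for k in keywords)

# Made-up sample input standing in for the collected text file.
sample_lines = [
    '{"text": "Flu season is here", "user": {"screen_name": "a"}}',
    '',
    '{"text": "Nice weather today", "user": {"screen_name": "b"}}',
    '{"text": "truncated',
]
tweets = parse_tweets(sample_lines)
disease_tweets = [t for t in tweets if matches_keywords(t)]
print(len(tweets), len(disease_tweets))
```

In the project itself this role is played by Spark reading the JSON file; the sketch only shows the shape of the data that the queries work on.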
3. Analyzing Twitter Data
Query 1: Popular Tweets on Different Diseases
In this query, we fetch each disease and its tweet count from the file. The query is written using RDDs: the tweets are filtered by disease hashtag, and the count for each disease is then printed.
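The filter-and-count idea behind this query can be sketched in plain Python (the project uses Spark RDDs in Scala, so this is only an illustration). The disease list and function names are hypothetical; the `entities.hashtags` field follows the standard tweet JSON layout.

```python
from collections import Counter

# Hypothetical disease list; the project's actual list is not given.
DISEASES = ["diabetes", "cancer", "flu"]

def hashtags_of(tweet):
    """Lower-cased hashtag texts from the tweet's entities.hashtags field."""
    tags = tweet.get("entities", {}).get("hashtags", [])
    return {h["text"].lower() for h in tags}

def count_by_disease(tweets, diseases=DISEASES):
    """For each disease, count the tweets whose hashtags include it."""
    counts = Counter()
    for disease in diseases:
        counts[disease] = sum(1 for t in tweets if disease in hashtags_of(t))
    return counts

# Made-up sample tweets.
sample = [
    {"entities": {"hashtags": [{"text": "Flu"}]}},
    {"entities": {"hashtags": [{"text": "cancer"}, {"text": "flu"}]}},
    {"entities": {"hashtags": []}},
]
disease_counts = count_by_disease(sample)
for disease, n in disease_counts.most_common():
    print(disease, n)
```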
Query 2: Countries That Tweeted Most on Diseases (Google Maps)
In this query, the countries that tweeted most on diseases are fetched. First, the location of each tweet is fetched from the tweets file, and the count per location is displayed. The data is stored in .csv format, and the file is then read and visualized on Google Maps.
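A minimal plain-Python sketch of the count-and-export step follows. It assumes the location is taken from the tweet's `place.country` field, which is only one possible choice (the source does not say which location field was used), and the function names are hypothetical.

```python
import csv
import io
from collections import Counter

def country_of(tweet):
    """Country from the tweet's place object; place may be missing or null."""
    place = tweet.get("place") or {}
    return place.get("country")

def count_countries(tweets):
    """Count tweets per country, ignoring tweets without a country."""
    return Counter(c for c in map(country_of, tweets) if c)

def write_counts_csv(counts, out):
    """Write country/count rows, most frequent first, to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["country", "tweets"])
    for country, n in counts.most_common():
        writer.writerow([country, n])

# Made-up sample tweets.
sample = [
    {"place": {"country": "India"}},
    {"place": {"country": "United States"}},
    {"place": None},
    {"place": {"country": "India"}},
]
country_counts = count_countries(sample)
buffer = io.StringIO()  # stands in for the .csv file
write_counts_csv(country_counts, buffer)
print(buffer.getvalue())
```

Passing an open file instead of the `StringIO` buffer would produce the .csv that the map visualization reads.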
Query 3: Popular Hashtags
In this query, we took the popular hashtags text file from Blackboard and performed a JOIN between it and the hashtags from the disease tweets file. The fetched data is stored in .csv format for visualization.
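The join can be sketched in plain Python as an inner join on the hashtag text. This assumes the popular hashtags file lists one hashtag per line, which the source does not confirm; the function name and sample values are hypothetical.

```python
from collections import Counter

def join_with_popular(tweet_hashtags, popular_hashtags):
    """Keep only tweet hashtags that also appear in the popular list,
    together with how often each occurs in the tweets."""
    popular = {
        h.strip().lstrip("#").lower() for h in popular_hashtags if h.strip()
    }
    counts = Counter(h.lower() for h in tweet_hashtags)
    return {tag: n for tag, n in counts.items() if tag in popular}

# Made-up sample data.
tweet_hashtags = ["Flu", "flu", "cancer", "health"]
popular_lines = ["#flu\n", "health\n", "news\n"]
joined = join_with_popular(tweet_hashtags, popular_lines)
print(joined)
```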
Query 4: Most Popular Tweeted Words
In this query, the most frequently occurring words in tweets on diseases are fetched. The fetched data is then visualized dynamically.
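A word-frequency count of this kind might look like the following plain-Python sketch. The stop-word list, the minimum word length, and the link stripping are assumptions made for the illustration, not details from the project.

```python
import re
from collections import Counter

# Illustrative stop words only; a real analysis would use a fuller list.
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in", "rt"}

def top_words(texts, n=10):
    """Return the n most common words across the given tweet texts."""
    counts = Counter()
    for text in texts:
        text = re.sub(r"https?://\S+", " ", text.lower())  # drop links
        words = re.findall(r"[a-z']+", text)
        counts.update(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(n)

# Made-up sample texts.
texts = [
    "RT The flu is spreading http://t.co/x",
    "Flu shots reduce flu risk",
]
print(top_words(texts, 3))
```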
Query 5: Day of the Week with the Most Tweets on Diseases
This query finds the day of the week on which the most tweets on diseases are posted. First, created_at is fetched from the tweets file, and then the tweets are counted for each day of the week.
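As a plain-Python sketch of the same idea, the `created_at` string can be parsed and mapped to a weekday name before counting. The format string matches the usual tweet timestamp layout (for example `Wed Aug 27 13:08:45 +0000 2008`); the function names are hypothetical, and parsing the English day and month names assumes an English or C locale.

```python
from collections import Counter
from datetime import datetime

# Layout of the created_at field in standard tweet JSON.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def day_of_week(created_at):
    """Full weekday name for a tweet's created_at string."""
    return datetime.strptime(created_at, CREATED_AT_FORMAT).strftime("%A")

def tweets_per_day(created_at_values):
    """Count tweets for each day of the week."""
    return Counter(day_of_week(value) for value in created_at_values)

# Made-up sample timestamps.
sample = [
    "Wed Aug 27 13:08:45 +0000 2008",
    "Sat Oct 01 09:00:00 +0000 2016",
    "Sat Oct 01 18:30:00 +0000 2016",
]
per_day = tweets_per_day(sample)
print(per_day.most_common())
```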
Query 6: Top 10 Users Who Tweeted on Diseases
In this query, we fetch the top 10 users who tweeted most on diseases. The query is written using RDDs. First, the top-tweeting user is fetched for each disease, and then an RDD union combines the results for all the diseases. The results are stored in a .csv file for visualization.
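The per-disease ranking followed by a union can be sketched in plain Python as below. This is not the project's RDD code: it matches diseases against the tweet text, which is an assumption, and all names and sample data are hypothetical.

```python
from collections import Counter

def user_counts_for(tweets, disease):
    """Count tweets per user among tweets whose text mentions the disease."""
    return Counter(
        t["user"]["screen_name"]
        for t in tweets
        if disease in t.get("text", "").lower()
    )

def top_users(tweets, diseases, per_disease=1, n=10):
    """Take the top user(s) for each disease, combine the per-disease
    results (the role the RDD union plays), and keep the n largest."""
    rows = []
    for disease in diseases:
        ranked = user_counts_for(tweets, disease).most_common(per_disease)
        rows.extend((disease, user, count) for user, count in ranked)
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:n]

# Made-up sample tweets.
sample = [
    {"text": "Flu again", "user": {"screen_name": "alice"}},
    {"text": "Got my flu shot", "user": {"screen_name": "alice"}},
    {"text": "flu everywhere", "user": {"screen_name": "bob"}},
    {"text": "Cancer research news", "user": {"screen_name": "bob"}},
]
ranking = top_users(sample, ["flu", "cancer", "diabetes"])
print(ranking)
```

A disease with no matching tweets simply contributes no row, so the combined result can hold fewer than `n` entries.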
Query 7: Follower IDs Count Using Twitter API
The Twitter GET followers/ids API is used. A query that displays five screen names from the tweets file is written. When the query is executed, a table with ten screen names is displayed.
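A minimal sketch of the two steps, selecting screen names and preparing the followers/ids request, is given below in plain Python. It only builds the request URL for the v1.1 `followers/ids` endpoint and does not send it, since a real call needs OAuth credentials. The function names and sample data are hypothetical.

```python
from urllib.parse import urlencode

# Twitter REST API v1.1 endpoint for follower IDs (authentication required).
FOLLOWERS_IDS_URL = "https://api.twitter.com/1.1/followers/ids.json"

def distinct_screen_names(tweets, limit):
    """First `limit` distinct screen names, in order of appearance."""
    seen = []
    for tweet in tweets:
        name = tweet.get("user", {}).get("screen_name")
        if name and name not in seen:
            seen.append(name)
        if len(seen) == limit:
            break
    return seen

def followers_ids_request_url(screen_name):
    """URL of the followers/ids request for one user (not sent here)."""
    return FOLLOWERS_IDS_URL + "?" + urlencode({"screen_name": screen_name})

# Made-up sample tweets.
sample = [
    {"user": {"screen_name": "alice"}},
    {"user": {"screen_name": "bob"}},
    {"user": {"screen_name": "alice"}},
    {"user": {"screen_name": "carol"}},
]
names = distinct_screen_names(sample, 2)
for name in names:
    print(name, followers_ids_request_url(name))
```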