This application was developed as an assignment for the course Advanced Data Science and Mining, a postgraduate course at the School of Electrical and Computer Engineering, National Technical University of Athens. The purpose of the assignment is to use services from the IBM Bluemix framework to manage big data.
The application consists of three parts.
The first part collects Twitter Trends for a location.
You can change the location or use your own credentials for the Trends.
The collection interval is adjustable: in the gr.ntua.collector.TrendsCollector class, change the fixedRate value of the @Scheduled(fixedRate = 7200000) annotation.
By default, we collect trends every two hours.
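The fixedRate value is in milliseconds, so 7200000 corresponds to two hours. As a quick standalone check of the conversion (not part of the project code):

```java
import java.util.concurrent.TimeUnit;

public class FixedRateCheck {
    public static void main(String[] args) {
        // @Scheduled(fixedRate = ...) takes a value in milliseconds
        long twoHoursInMillis = TimeUnit.HOURS.toMillis(2);
        System.out.println(twoHoursInMillis); // 7200000
    }
}
```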
The second part finds bursting Trends.
For each collected Trend we search the database for previous occurrences.
If there are none, the trend is marked as firstSeen.
Otherwise, we compute the percent change from the last occurrence.
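As an illustration, the percent-change computation could look like the following sketch (the method and parameter names are illustrative, not the project's actual API):

```java
public class Burstiness {
    /**
     * Percent change of a trend's volume relative to its last occurrence.
     * Illustrative sketch; names do not match the project's actual classes.
     */
    public static double percentChange(long previousVolume, long currentVolume) {
        if (previousVolume == 0) {
            // No meaningful baseline; in the application such a trend
            // would instead be marked as firstSeen.
            throw new IllegalArgumentException("no previous occurrence");
        }
        return 100.0 * (currentVolume - previousVolume) / previousVolume;
    }

    public static void main(String[] args) {
        System.out.println(percentChange(1000, 1250)); // 25.0
    }
}
```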
The third part classifies a number of tweets as fake or real, based on two trained classifiers.
To use the application you have to provide a properties file.
The application expects a file named credentials.properties in the class path (you can safely create it in the src/main/resources directory).
This file must contain the following properties:
twitter.consumer_key=
twitter.consumer_secret=
twitter.access_token=
twitter.access_token.secret=
twitter.trends.place.id=
bluemix.tone_analyzer.username=
bluemix.tone_analyzer.password=
bluemix.natural_language_classifier.classifier_id=
bluemix.natural_language_classifier.username=
bluemix.natural_language_classifier.password=
Naturally, you will provide your own credentials for these services.
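As a sketch, such a properties file can be read with the standard java.util.Properties API; the example below loads from an in-memory string so it is self-contained, while the application would load the credentials.properties file from the class path (the example values are placeholders):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

public class CredentialsSketch {
    // Load from any Reader; in the application the source would be
    // /credentials.properties on the class path (src/main/resources).
    public static Properties load(Reader source) throws IOException {
        Properties props = new Properties();
        props.load(source);
        return props;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder content; real values go in credentials.properties.
        // WOEID 1 is Twitter's "worldwide" place id.
        String example = "twitter.consumer_key=YOUR_KEY\n"
                       + "twitter.trends.place.id=1\n";
        Properties props = load(new StringReader(example));
        System.out.println(props.getProperty("twitter.consumer_key")); // YOUR_KEY
    }
}
```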
The data sets we used to train the Weka classifier and the Bluemix Natural Language Classifier are in the src/main/resources/dataset directory.
There are three files.
The fake_real_news.csv file is the original data set, with id, title, text, and label columns.
The bluemix.csv file contains only the title and label columns from the original data set; it is used to create the tone vectors with the Bluemix Tone Analyzer and to train the Bluemix Natural Language Classifier.
The results from the Tone Analyzer are in the weka.csv file.
We used this service to extract a feature vector for each training instance in the bluemix.csv file.
The service returns float values between zero and one for the attributes: anger, disgust, fear, joy, sadness, analytical, confident, tentative, openness, conscientiousness, extraversion, agreeableness, and neuroticism.
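For illustration, a tone vector in that attribute order can be flattened into a fixed-length feature array (a sketch; the project's internal representation may differ):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ToneVectorSketch {
    // The 13 Tone Analyzer attributes, in the order listed above.
    static final String[] ATTRIBUTES = {
        "anger", "disgust", "fear", "joy", "sadness",
        "analytical", "confident", "tentative",
        "openness", "conscientiousness", "extraversion",
        "agreeableness", "neuroticism"
    };

    // Flatten a map of Tone Analyzer scores into the fixed attribute
    // order, defaulting missing attributes to 0.0.
    public static double[] toFeatures(Map<String, Double> scores) {
        double[] features = new double[ATTRIBUTES.length];
        for (int i = 0; i < ATTRIBUTES.length; i++) {
            features[i] = scores.getOrDefault(ATTRIBUTES[i], 0.0);
        }
        return features;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("anger", 0.12);
        scores.put("joy", 0.80);
        double[] v = toFeatures(scores);
        System.out.println(v.length + " features, joy=" + v[3]); // 13 features, joy=0.8
    }
}
```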
The file with the feature values from the Tone Analyzer is used to train a Random Forest classifier from the Weka classifiers.
This classifier had the best performance compared with the others in the Weka toolbox.
The trained model is provided in the src/main/resources/weka/RF.model file.
To classify a number of tweets, we use the Tone Analyzer to get the feature vectors.
Each vector is given to RF.model, which returns a prediction for the class (fake / real) of the tweet.
To train the Bluemix Natural Language Classifier we used the bluemix.csv file.
This service returns just a classifier id.
To use the service we simply send the text of a tweet and it returns its class.
To run the application you have to change the database credentials in the src/main/resources/application.properties file.
We used MySQL, but you can use any database you like.
In a terminal type:
$ mvn spring-boot:run
and open a browser at the localhost:8080 URL.
Note that port 8080 must not be in use by another process.
It is preferable not to use Firefox, because it does not support the HTML5 date fields.
To see results and be able to use the application, it is recommended to let it run for at least 12 collection intervals.
Each interval is the time you set in the TrendsCollector class.
So, if the interval is two hours, the application should run for 24 hours.
When you have collected some data, select the dates (or leave the form fields empty to use the whole collection) and the burstiness threshold for the Trends.
For example, if you choose 20 for the burstiness, you will see Trends that are firstSeen or whose burstiness is above the selected value (20).
The results are also restricted to the dates you have chosen.
Choose the more info button of a Trend and the application returns a graph of that Trend's volume over the selected period, along with a number of tweets.
The tweets are collected using the Trend as a keyword.
Each tweet is then classified with the two classifiers.
The average score of the two classifiers is given as the final score for the fake / real class of each tweet.
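The score-averaging step can be sketched as follows (the score ranges and the 0.5 decision threshold are illustrative assumptions, not the project's exact values):

```java
public class EnsembleScoreSketch {
    // Average the fake-class scores of the two classifiers and decide.
    // The 0.5 threshold is an illustrative assumption.
    public static String classify(double wekaFakeScore, double nlcFakeScore) {
        double finalScore = (wekaFakeScore + nlcFakeScore) / 2.0;
        return finalScore >= 0.5 ? "fake" : "real";
    }

    public static void main(String[] args) {
        System.out.println(classify(0.7, 0.4)); // fake (average 0.55)
        System.out.println(classify(0.2, 0.6)); // real (average 0.40)
    }
}
```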
The result for the class (fake / real) of each tweet is just the "guess" of the classifiers and you should not take it as fact. Bear in mind that the classifiers were trained on data from news articles, not from tweets. Tweets have quite different language, audience, and philosophy. As a result, the classifiers have been observed to be biased toward the fake class.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.