[Python] A proposed end-to-end (E2E) network analysis process for disrupting the spread of fake news by limiting the activity of infected nodes.
Community Detection (Leading Eigenvector, Asynchronous Label Propagation, Fluid Communities, Kernighan-Lin Maximization), Random and Targeted Network Attacks
Exploratory Data Analysis, Feature Engineering, Data Visualization, Network Analysis, Distributed Network Disruption, Community Detection, Big Data Processing
iGraph, sklearn, seaborn, pyplot, graphframes, networkx, pyspark, Tigergraph, MapReduce
For many people, Twitter has become an alternative source for breaking news. With 59% of its 436M users relying on it for this purpose, ensuring content veracity becomes paramount. This is especially so in the high-stakes area of politics, where misinformation has caused far-reaching polarization, as seen in both the 2020 US presidential election and Brexit.
Hence, a control and mitigation mechanism is required to: (a) identify influencers in powerful communities, (b) determine the nature of those entities and (c) take the required measures against those with malicious intent.
The FakeNewsNet dataset was used, specifically the Politifact subset. The original dataset consisted of user data and news data. We processed it into Tweets and Followers data covering 573,637 unique users, with more than 1 billion potential follower-followee matrix interactions.
The dataset contained 779 news items (47% fake news), 544,027 tweets (28% on fake news) and 68,099 retweets (27% retweeting fake news).
From the initial dataset, two network structures were created, each with its own objective:
(1) A Tweet-Retweet Network, explored to target fake-news-spreading communities based on their retweeting connections.
(2) A Follower-Followee Network, explored to target fake-news-spreading communities based on their follower connections.
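The two structures above can be sketched with networkx as directed graphs. This is a minimal illustration, not the project's actual pipeline: the record layouts, the user ids and the helper names (build_retweet_network, build_follower_network, fake_retweet_ratio) are all hypothetical.

```python
import networkx as nx

# Hypothetical records: (retweeting_user, original_author, is_fake_news)
retweets = [
    ("u2", "u1", True),
    ("u3", "u1", True),
    ("u3", "u4", False),
    ("u5", "u4", False),
]
# Hypothetical records: (follower, followee)
follows = [("u2", "u1"), ("u3", "u1"), ("u5", "u4"), ("u1", "u4")]


def build_retweet_network(records):
    """Directed graph with an edge retweeter -> author.

    Each edge stores the number of retweets ("weight") and how many
    of them were retweets of fake news ("fake").
    """
    g = nx.DiGraph()
    for retweeter, author, is_fake in records:
        if g.has_edge(retweeter, author):
            g[retweeter][author]["weight"] += 1
            g[retweeter][author]["fake"] += int(is_fake)
        else:
            g.add_edge(retweeter, author, weight=1, fake=int(is_fake))
    return g


def build_follower_network(records):
    """Directed graph with an edge follower -> followee."""
    g = nx.DiGraph()
    g.add_edges_from(records)
    return g


def fake_retweet_ratio(g, user):
    """Share of a user's retweets that were of fake news (0.0 if none)."""
    edges = list(g.out_edges(user, data=True))
    total = sum(d["weight"] for _, _, d in edges)
    fake = sum(d["fake"] for _, _, d in edges)
    return fake / total if total else 0.0


retweet_net = build_retweet_network(retweets)
follower_net = build_follower_network(follows)
```

Storing the fake count on each edge keeps the per-user fake retweet ratio a simple sum over outgoing edges, which is convenient when that ratio is later used to rank nodes.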
Because the data was too large to build the full network structure efficiently, we reduced the network size by applying thresholds and keeping only the core data.
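The thresholds actually used are not stated here, so the following is only one plausible reading of the reduction step: drop every user whose degree falls below a cut-off and keep the edges between the remaining users. The function name extract_core and the min_degree parameter are assumptions made for illustration.

```python
from collections import Counter


def extract_core(edges, min_degree=2):
    """Keep edges whose endpoints both have degree >= min_degree.

    Degree is counted once on the full edge list (a single filtering
    pass, not an iterative k-core decomposition).
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    keep = {node for node, d in degree.items() if d >= min_degree}
    return [(u, v) for u, v in edges if u in keep and v in keep]


# Hypothetical toy edge list: "d" and "e" each appear only once.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("e", "a")]
core = extract_core(edges, min_degree=2)
```

On the toy edge list, the two users with a single connection are dropped and the triangle a-b-c remains.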
Four network clustering algorithms - Asynchronous Label Propagation, Kernighan-Lin Maximization, Fluid Communities and Leading Eigenvector - were explored. The chosen algorithm was Leading Eigenvector as it had the best clustering performance.
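A comparison of this kind could look roughly like the sketch below, which runs the three algorithms that ship with networkx on a toy graph and scores each partition by modularity. Using modularity as the performance measure and fixing k=2 for Fluid Communities are assumptions of the sketch. Leading Eigenvector is left out because it is not part of networkx; in python-igraph it is available as Graph.community_leading_eigenvector().

```python
import networkx as nx
from networkx.algorithms import community


def compare_partitions(g, k=2, seed=42):
    """Return {algorithm: (communities, modularity)}.

    Expects an undirected, connected graph: Fluid Communities needs
    connectivity and Kernighan-Lin is defined for undirected graphs.
    """
    partitions = {
        "label_propagation": list(community.asyn_lpa_communities(g, seed=seed)),
        "fluid_communities": list(community.asyn_fluidc(g, k, seed=seed)),
        # Kernighan-Lin always splits the graph into two halves.
        "kernighan_lin": list(community.kernighan_lin_bisection(g, seed=seed)),
    }
    return {
        name: (parts, community.modularity(g, parts))
        for name, parts in partitions.items()
    }


# Toy graph: two 5-node cliques joined by a single edge.
toy = nx.barbell_graph(5, 0)
scores = compare_partitions(toy)
best = max(scores, key=lambda name: scores[name][1])
```

Note that Kernighan-Lin is a bisection method, so it is only directly comparable with the others when two communities are expected.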
For the network disruption simulation, two configurations were tested: random and targeted attacks. In the targeted attack, nodes with the largest bubble size were prioritized, because these were the nodes with the highest fake retweet ratio or a high network degree.
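The two attack configurations can be sketched as node removal on a networkx graph. This is a simplified illustration under stated assumptions: the targeted attack ranks nodes by degree only (the fake retweet ratio is not modelled), the effect is measured as the average degree of the remaining nodes, and the helper names average_degree and attack are hypothetical.

```python
import random

import networkx as nx


def average_degree(g):
    """Average degree over the nodes currently in the graph."""
    n = g.number_of_nodes()
    return 2 * g.number_of_edges() / n if n else 0.0


def attack(g, n_remove, targeted, seed=0):
    """Return a copy of g with n_remove nodes removed.

    targeted=True removes the highest-degree nodes first;
    targeted=False removes nodes chosen uniformly at random.
    """
    h = g.copy()
    if targeted:
        ranked = sorted(h.degree, key=lambda pair: pair[1], reverse=True)
        victims = [node for node, _ in ranked[:n_remove]]
    else:
        victims = random.Random(seed).sample(sorted(h.nodes), n_remove)
    h.remove_nodes_from(victims)
    return h


# Toy graph: one hub (node 0) connected to 9 leaves.
star = nx.star_graph(9)
before = average_degree(star)
after_targeted = average_degree(attack(star, 1, targeted=True))
after_random = average_degree(attack(star, 1, targeted=False))
```

On the star graph, removing the single hub disconnects every remaining node, whereas a random removal most likely hits a leaf and leaves the structure almost intact, which is the contrast the two configurations are meant to expose.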
Four main conclusions can be drawn from the network analysis:
(1) Fake news spread is better represented through Tweet-Retweet relationships.
(2) Twitter communities can be identified using the Leading Eigenvector algorithm, which is based on modularity.
(3) Fake-news communities can be identified based on users with a high percentage of fake news tweets.
(4) Community disruption is possible through a targeted network attack based on vertex degree. The simulated attack reduced the spread of fake news by 40% (average degree 2 → 1.2).
Gino Tiu
Widya Salim
Felipe Chapa
Susan Koruthu