A framework for the analysis of social interaction networks (e.g. induced by Twitter mentions) in time.

community-evolution-analysis

We make available a Twitter interaction network collector and a set of Matlab and Python scripts for analysing social interaction networks, with the goal of uncovering evolving communities. The interaction network collector builds a network between Twitter users from the mentions found in the set of monitored tweets (collected through the Streaming API). The network of interactions is adaptively partitioned into snapshot graphs based on the frequency of interactions. Each graph snapshot is then partitioned into communities using the Louvain method [2]. Dynamic communities are extracted by matching the communities of the current graph snapshot to communities of the previous snapshot.

Finally, these dynamic communities are ranked and presented to the user according to three factors: stability, persistence and community centrality. The user can browse through these communities, in which the users are themselves ranked according to their centrality in each snapshot, browse wordclouds that summarise the most frequently used terms in each dynamic community, and view the most frequent URLs. Centrality is measured with the PageRank algorithm [3].
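The matching of communities between consecutive snapshots can be illustrated with a minimal sketch. The similarity measure (Jaccard overlap of member sets), the `threshold` value and the function names are assumptions made for this example; the matching criterion actually used by the framework is not described here.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of user names."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_communities(previous, current, threshold=0.3):
    """Pair each current community with its most similar previous one.

    `previous` and `current` map a community label to a set of users.
    Returns {current_label: previous_label} for the best matches whose
    similarity reaches `threshold` (a hypothetical cut-off).
    """
    matches = {}
    for cur_label, cur_users in current.items():
        best_label, best_score = None, 0.0
        for prev_label, prev_users in previous.items():
            score = jaccard(cur_users, prev_users)
            if score > best_score:
                best_label, best_score = prev_label, score
        if best_label is not None and best_score >= threshold:
            matches[cur_label] = best_label
    return matches


# Two toy snapshots: community "X" shares two of its users with "A".
snapshot_1 = {"A": {"ann", "bob", "cat"}, "B": {"dan", "eve"}}
snapshot_2 = {"X": {"ann", "bob", "fay"}, "Y": {"gus", "hal"}}
matches = match_communities(snapshot_1, snapshot_2)
print(matches)
```

Chaining such matches over successive snapshots gives one possible representation of a dynamic community as a sequence of snapshot communities.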

  • The master branch of this repository contains the Matlab and Python files that form the current stable version of the framework. The Matlab version has not been developed since the end of 2013 and lacks certain features that the Python version includes.
  • The "pci13" branch contains all the code and data needed to replicate the experiments performed in [1].
  • The "dev" branch contains more advanced but unstable versions of the framework.

[1] K. Konstandinidis, S. Papadopoulos, Y. Kompatsiaris. "Community Structure, Interaction and Evolution Analysis of Online Social Networks around Real-World Social Phenomena". In Proceedings of PCI 2013, Thessaloniki, Greece (to be presented in September).
[2] V. Blondel, J.-L. Guillaume, R. Lambiotte, E. Lefebvre. "Fast unfolding of communities in large networks". In Journal of Statistical Mechanics: Theory and Experiment (10), P10008, 2008.
[3] S. Brin and L. Page. "The anatomy of a large-scale hypertextual web search engine". Comput. Netw. ISDN Syst., 30(1-7):107-117, Apr. 1998.

##Distribution Information##

This distribution contains the following:

  • a readme.txt file with instructions on how to use the different parts of the framework;
  • a set of Python scripts (in the /python folder) that are used to conduct community evolution analysis.
  • a set of Matlab scripts (in the /matlab folder) that are used to conduct community evolution analysis and a set of Python scripts (in the /matlab/python_data_parsing folder) that are used to parse the json files retrieved by the data collector in a "Matlab friendly" form.

##Evolution analysis using Python##

Any new data (json files) to be analysed should be placed in the ../data/json/ folder. In order for the Python files to work, the data should be in a Twitter-like json form (the "entities", "user", "created_at", "id_str" and "text" keys and paths should be identical to Twitter's).
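As a rough illustration of what "Twitter-like" means, the sketch below reads the five keys named above from one tweet object. The nested names `screen_name` and `user_mentions` follow Twitter's tweet format; the function name `extract_mentions` and the shape of the returned record are assumptions for this example, not the framework's own code. Parsing `created_at` this way assumes an English locale for the day and month names.

```python
import json
from datetime import datetime

# Twitter's "created_at" layout, e.g. "Mon Sep 02 10:15:00 +0000 2013".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def extract_mentions(tweet):
    """Return the author, mentioned users, time, id and text of one tweet."""
    return {
        "id": tweet["id_str"],
        "author": tweet["user"]["screen_name"],
        "mentions": [m["screen_name"] for m in tweet["entities"]["user_mentions"]],
        "created_at": datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT),
        "text": tweet["text"],
    }


# A made-up tweet containing only the keys used above.
sample_line = json.dumps({
    "id_str": "1",
    "created_at": "Mon Sep 02 10:15:00 +0000 2013",
    "text": "@bob @carol see you at the conference",
    "user": {"screen_name": "alice"},
    "entities": {"user_mentions": [{"screen_name": "bob"},
                                   {"screen_name": "carol"}]},
})
record = extract_mentions(json.loads(sample_line))
print(record["author"], record["mentions"], record["created_at"].isoformat())
```

A json file from another source should work with the framework as long as each object exposes the same keys at the same paths.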

###Code###

The Python code consists of 8 files containing user-friendly scripts for performing Community Evolution Analysis from json and txt files acquired from the Twitter social network. The framework was implemented using Python 3.3, and the third-party libraries it requires are dateutil (requires pyparsing), numpy, matplotlib, PIL and networkx (http://www.lfd.uci.edu/~gohlke/pythonlibs/).

The python folder contains 8 files:

  • main.py
    This .py is the main framework file needed to run the CommunityRanking.py class file. It includes all the tweakable parameters. Timeslots are adaptively created.
  • CommunityRanking.py
    This .py is the class file that contains the Evolution Analysis Framework.
  • main_NONadaptive.py
    This .py is the main framework file needed to run the CommunityRanking_NONadaptive.py class file. It includes all the tweakable parameters. Timeslots are straightforwardly created.
  • CommunityRanking_NONadaptive.py
    This .py is the class file that contains the non-adaptive version of the Evolution Analysis Framework.
  • community.py
    This is a copy of Aynaud's implementation of the Louvain community detection algorithm.
  • tfidf.py
    This .py file computes the tfidf score of the words contained in each dynamic community.
  • wordcloud.py
    This is an edited copy of A.C. Mueller's wordcloud creation script.
  • query_integral_image.pyd
    Dependency of the wordcloud.py script.
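As an illustration of the kind of tf-idf scoring attributed to tfidf.py above, the sketch below scores the words of each community. It is a stand-alone example, not the code in tfidf.py: the function name `tfidf_per_community`, the whitespace tokenisation and the choice to treat all the text of one community as a single document are assumptions.

```python
import math
from collections import Counter


def tfidf_per_community(community_texts):
    """Map {community: [texts]} to {community: {word: tf-idf score}}.

    Each community is treated as one document (an assumption of this sketch).
    """
    term_counts = {
        label: Counter(word for text in texts for word in text.lower().split())
        for label, texts in community_texts.items()
    }
    n_docs = len(term_counts)
    # Number of communities in which each word appears at least once.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    scores = {}
    for label, counts in term_counts.items():
        total = sum(counts.values())
        scores[label] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


scores = tfidf_per_community({
    "sports": ["goal goal match"],
    "politics": ["vote match"],
})
# Words shared by every community score zero; distinctive words score higher.
print(scores)
```

Words with high scores are the ones that are frequent inside a community but rare elsewhere, which is what makes them useful for a per-community wordcloud.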

###Python Results###

The framework provides the user with 5 pieces of resulting data in the ../data/results/ folder:

  • the user_activity.eps file, which presents the user mentioning activity according to the selected sampling interval;
  • the usersPairs_(num).txt files, which can be used with the Gephi visualization software in order to view the graphs;
  • the rankedcommunities variable (from main.py), which contains all the communities that evolved (and their users), ranked according to the persistence, stability and community centrality triplet;
  • the community size heatmap (communitySizeHeatmap.eps), which visualizes the sizes of the 100 most important communities (ranked from top to bottom), together with a set of wordclouds (in the ../results/wordclouds folder) collaged in a manner similar to the heatmap; separate wordclouds for every ranked dynamic community are also created in the same folder;
  • the rankedCommunities.json file, which contains all the ranked communities along with the specific timeslots of evolution of each community, its persistence, stability and community centrality values, and all the users in each community accompanied by their own centrality measure.

As such, the framework can be used to discover the most important communities along with the most important users inside those communities, to follow the most referenced URLs, and to get an idea of the topics discussed via the wordclouds.

##Evolution analysis using Matlab##

Any new data to be analysed should be placed in the ../data/ folder. If the data come from a different source, then in order for the Python files to work, they should either be in a Twitter-like json form (the "entities", "user" and "created_at" keys and paths should be identical to Twitter's) or in a txt file of the form:

user1 \TAB user2,user3... \TAB "created_at_timestamp" \TAB text \newline  

###Step1: json Parsing (Python)###

The Python parsing code consists of 6 files containing user-friendly scripts for parsing the required data from json files. Four of them are meant for json files from any other Twitter API dependent source.
More specifically, they are used to create txt files which contain the mention entries between Twitter users, the time at which these mentions were made and the context in which they were included.

The json_mention_multifile* files provide as many txt files as there are json files. They contain all the information required from the tweet in a readable form:

user1 \TAB user2,user3... \TAB "created_at_timestamp" \TAB text \newline

The json_mention_matlab_singleFile* files provide a single file which contains only the data required to perform the community analysis efficiently. They contain information in a "Matlab friendly" form:

user1 \TAB user2 \TAB unix_timestamp \TAB \newline
user1 \TAB user3 \TAB unix_timestamp \TAB \newline
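The relation between the readable format and the "Matlab friendly" format can be sketched as follows. This is an illustration only, not one of the parser scripts: the function name `readable_to_pairs` is hypothetical, and the sketch assumes the timestamp field holds a Twitter-style `created_at` string, optionally wrapped in double quotes.

```python
from datetime import datetime

# Assumed layout of the timestamp field (Twitter's "created_at" style).
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def readable_to_pairs(line):
    """Turn one readable line into one pairwise line per mentioned user."""
    author, mentioned, created_at, _text = line.rstrip("\n").split("\t", 3)
    stamp = datetime.strptime(created_at.strip('"'), CREATED_AT_FORMAT)
    unix_time = int(stamp.timestamp())
    return [
        "{}\t{}\t{}\t\n".format(author, user, unix_time)
        for user in mentioned.split(",")
        if user
    ]


readable = 'alice\tbob,carol\t"Mon Sep 02 10:15:00 +0000 2013"\thi @bob @carol\n'
pairs = readable_to_pairs(readable)
for pair in pairs:
    print(repr(pair))
```

Each mention becomes its own row, so a tweet that mentions several users expands into several lines with the same timestamp.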

This folder contains 6 files:

  • json_mention_multifile_parser.py & json_mention_matlab_singleFile_parser.py
    These files are used when the user has *.json files.
  • json_mention_multifile_noDialog_parser.py & json_mention_matlab_singleFile_noDialog_parser.py
    These are similar to the previous files but the dataset folder path has to be inserted manually (Does not require the wxPython GUI toolkit).
  • txt_mention_matlab_singleFile_parser.py & txt_mention_matlab_singleFile_noDialog_parser.py
    These files are used when the user has *.txt files from another source.

The resulting authors_mentions_time.txt file should be added to the ../data/ folder.

###Step2: Evolution Analysis (Matlab)###

The Matlab code consists of 13 files which can work either as standalone scripts or as functions of the main.m script. Switching between the two only requires commenting the function line in each of the m-files (instructions are included in the m-files).
These 13 files are:

  • step1_mentioning_frequency.m
    This .m file extracts the user activity with respect to Twitter mentions.
  • step2_dyn_adj_mat_wr.m
    This .m file extracts the dynamic adjacency matrices for each respective timeslot and saves them in mat format for use with the rest of the code, and also in a Gephi-ready csv format.
  • step3_comm_detect_louvain.m
    This .m file detects the communities in each adjacency matrix, as well as the sizes of the communities and the modularity for each timeslot, using the Louvain method [2].
  • step4_comm_evol_detect.m
    This .m file detects the evolution of the communities between timeslots.
  • step5_commRoutes.m
    This .m file detects the routes created by the evolution of the communities between timeslots.
  • step6_usrCentrality.m
    This .m file extracts the user centrality of all adjacency matrices in between timeslots using the PageRank algorithm.
  • step7_comm_dyn_adj_mat_wr.m
    This .m file extracts the community adjacency matrix in between timeslots (in this case, the communities are treated as users). The commAdjMats are written in mat format and also as csv files for Gephi.
  • step8_commCentralityExtraction.m
    This .m file extracts the centrality of the community adjacency matrices using the PageRank algorithm.
  • step9_commRank_commCentr.m
    This .m file provides an analysis of the communities with respect to their evolution in terms of community centrality.
  • step9_commRank_stability.m
    This .m file provides an analysis of the communities with respect to their evolution in terms of stability.
  • step9_commRank_persist.m
    This .m file provides an analysis of the communities with respect to their evolution in terms of persistence.
  • step9_commRank_synergy.m
    This .m file provides an analysis of the communities with respect to their evolution in terms of persistence, stability and community centrality. A heatmap presenting the evolution and size of all evolving communities is produced, giving an idea of the bigger picture.
  • step9_commRank_comparison.m
    This .m file provides a comparison of the communities perceived as most significant by 3 different community evolution factors: stability, persistence and community centrality. The synergy of the 3 is also available for comparison.
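The idea behind the per-timeslot adjacency matrices of step 2 can be sketched in Python as follows. This is a simplified illustration rather than a port of the m-files: the function name `build_adjacency_matrices` is hypothetical, and the sketch assumes fixed-width timeslots of `slot_seconds` seconds and input rows of the form (user1, user2, unix_timestamp).

```python
def build_adjacency_matrices(mentions, slot_seconds):
    """Return (users, matrices) with one mention-count matrix per timeslot.

    matrices[t][i][j] counts the mentions of users[j] by users[i]
    during timeslot t.
    """
    mentions = sorted(mentions, key=lambda row: row[2])
    if not mentions:
        return [], []
    # Give every user a fixed row/column position shared by all matrices.
    users = sorted({user for a, b, _ in mentions for user in (a, b)})
    position = {user: index for index, user in enumerate(users)}
    start = mentions[0][2]
    n_slots = (mentions[-1][2] - start) // slot_seconds + 1
    matrices = [
        [[0] * len(users) for _ in users]
        for _ in range(n_slots)
    ]
    for author, mentioned, stamp in mentions:
        slot = (stamp - start) // slot_seconds
        matrices[slot][position[author]][position[mentioned]] += 1
    return users, matrices


rows = [("alice", "bob", 0), ("alice", "bob", 30), ("bob", "carol", 100)]
users, matrices = build_adjacency_matrices(rows, slot_seconds=60)
print(users)
for matrix in matrices:
    print(matrix)
```

Keeping one shared user ordering across all timeslots makes the matrices directly comparable from one timeslot to the next.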

There are also 4 helper functions, which are used to extract the position of each user in the adjacency matrix (pos_aloc_of_usrs.m), to create the adjacency matrix (adj_mat_creator.m), to perform the community detection (comm_detect_louvain.m) and to extract the centrality of each user using the PageRank algorithm (mypagerank.m).
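For readers unfamiliar with PageRank, a minimal power-iteration sketch is given below. It is not a translation of mypagerank.m: the function name, the damping factor of 0.85, the convergence tolerance and the handling of users without outgoing mentions are assumptions of this example. An edge (a, b) stands for "a mentions b", so being mentioned raises a user's score.

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Compute PageRank scores for a list of directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        for source in nodes:
            if out_links[source]:
                share = damping * rank[source] / len(out_links[source])
                for target in out_links[source]:
                    new_rank[target] += share
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


ranks = pagerank([("alice", "bob"), ("carol", "bob"), ("bob", "alice")])
for user, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(user, round(score, 4))
```

In this toy network bob is mentioned by two users and therefore ranks first, while carol, who is never mentioned, keeps only the baseline score.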

###Matlab Results###

The final outcome is a cell array in ../data/mats/signifComms.mat containing the most significant dynamic communities, their users and the centrality of the users. The files .../data/mats/commEvol.mat and .../data/mats/commEvolSize.mat are also useful, as they present the evolution of the communities and the community sizes respectively. A heatmap presenting the evolution and size of all evolving communities is also produced (in ../data/figures/), giving the user an idea of the bigger picture.