Twitter-Data-Analysis-with-HPC

2023 S1 Cluster and Cloud Computing Assignment 1

Primary language: Python · License: MIT

Cluster & Cloud Computing
Social Media Analysis

Project overview

The assignment is to implement a parallelized application leveraging the University of Melbourne HPC facility SPARTAN. The application uses a large Twitter dataset and a file containing the suburbs, locations and Greater Capital cities of Australia.

The project objectives are to:

  • count the number of tweets made in the different Greater Capital cities of Australia (see the lookup sketch after this list),
  • identify the Twitter accounts (users) that have made the most tweets, and
  • identify the users that have tweeted from the most different Greater Capital cities.
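
All three objectives rely on resolving a tweet's location to a Greater Capital City (GCC) code via sal.json. The snippet below is a minimal sketch of that lookup; the data path, the lower-cased suburb keys and the "gcc" field name are assumptions for illustration, not the authoritative schema.

import json
from typing import Optional

# Assumed layout of sal.json: {"abbotsbury": {"gcc": "1gsyd", ...}, ...}
# (key casing and the "gcc" field name are guesses, not the real schema)
with open("data/sal.json", encoding="utf-8") as f:   # placeholder path
    sal = json.load(f)

def lookup_gcc(place_full_name: str) -> Optional[str]:
    """Map a tweet's place name (e.g. 'Carlton, Victoria') to a GCC code, if known."""
    suburb = place_full_name.split(",")[0].strip().lower()
    entry = sal.get(suburb)
    return entry.get("gcc") if entry else None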

For more information, please visit the project wiki.

Project Report: Overleaf

Team

Name               Student ID  Email
Sunchuangyu Huang  1118472     sunchuangyuh@student.unimelb.edu.au
Wei Zhao           1118649     weizhao1@student.unimelb.edu.au

Directories

TwitterAnalyser                  # COMP90024 Cluster & Cloud Computing - Assignment 1
├── data                         # store raw data
│   ├── processed                # store processed data
│   └── result                   # store output files
├── notebooks                    # processing and visualisation notebooks
├── scripts                      # main program scripts
├── slurm                        # slurm job scripts
├── doc
│   ├── log                      # program log file
│   └── slurm
│       ├── stderr               # slurm standard error
│       └── stdout               # slurm standard output
├── requirements.txt             # python dependencies
└── README.md

To start the program

For local testing, run the following commands:

# main.py must be executable
mpiexec -n [NUM_PROCESSORS] python main.py -t [TWITTER_FILE] -s [SAL_FILE] -e [EMAIL_TARGET|OPTIONAL]

# to submit a job on the Spartan HPC, run the submission script
./submit.sh

Note: the email target accepts only two valid options: 'rin' or 'eric'.

Assignment Dependencies

Main Python dependencies: python=3.7.4, mpi4py=3.0.4, polars, numpy, pandas.

If running on Spartan, make sure to use a virtualenv with Python 3.7.4, since the mpi4py module that Spartan loads (version 3.0.4) is built against that Python version.

# hpc: load module on spartan
module --force purge
module load mpi4py/3.0.2-timed-pingpong

source ~/virtualenv/python3.7.4/bin/activate

# local: create a conda environment
conda env create --name comp90024 --file environment.yml

# install dependencies, either directly
pip install numpy pandas 'polars[all]'
# or from requirements.txt
pip install -r requirements.txt

Assignment report

The bigTwitter.json file contains 9,092,274 tweets written by 119,439 authors, spanning 2021-07-05 to 2022-12-31.

Processing time on bigTwitter.json:

Job       Nodes  Cores  Job Wall-Clock Time  CPU Efficiency
46094405  1      1      00:11:01             98.34%
46094406  1      8      00:01:41             87.13%
46094407  2      4      00:01:41             87.75%

Task 1 Question: The solution should count the number of tweets made by the same individual based on the bigTwitter.json file and return the top 10 tweeters in terms of the number of tweets made, irrespective of where they tweeted. The result is of the form below (the author IDs and tweet numbers are representative).

Rank  Author Id            Number of Tweets Made
#1    1498063511204760000  68,477
#2    1089023364973210000  28,128
#3    826332877457481000   27,718
#4    1250331934242120000  25,350
#5    1423662808311280000  21,034
#6    1183144981252280000  20,765
#7    1270672820792500000  20,503
#8    820431428835885000   20,063
#9    778785859030003000   19,403
#10   1104295492433760000  18,781
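
The per-author tally behind this table reduces to a plain Counter. A minimal sketch, with the streaming of author IDs out of bigTwitter.json left abstract:

from collections import Counter

def top_tweeters(author_ids, n=10):
    """Return the n most prolific authors as (author_id, tweet_count) pairs."""
    counts = Counter(author_ids)   # tweets per author
    return counts.most_common(n)   # sorted by count, descending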

Task 2 Question: Using the bigTwitter.json and sal.json files, you will then count the number of tweets made in the various capital cities by all users. The result will be a table of the form below (the numbers are representative).

For this task, ignore tweets made by users in rural locations, e.g. 1rnsw (Rural New South Wales), 1rvic (Rural Victoria), etc.

Greater Capital City  Number of Tweets Made
1gsyd                 2,218,689
2gmel                 2,284,909
3gbri                 878,614
4gade                 465,081
5gper                 590,045
6ghob                 91,112
7gdar                 46,772
8acte                 214,347
9oter                 203
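
A minimal sketch of the per-city tally, assuming each tweet has already been resolved to a GCC code (e.g. with the lookup sketched earlier); rural or unknown codes are skipped by keeping only the nine codes listed above:

from collections import Counter

# The nine Greater Capital City codes reported in the table above.
GCC_CODES = {"1gsyd", "2gmel", "3gbri", "4gade", "5gper",
             "6ghob", "7gdar", "8acte", "9oter"}

def count_city_tweets(gcc_codes):
    """Count tweets per Greater Capital City, ignoring rural/unknown codes."""
    return Counter(code for code in gcc_codes if code in GCC_CODES)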

Task 3 Question: The solution should identify those tweeters that have tweeted in the most Greater Capital cities and the number of times they have tweeted from those locations. The top 10 tweeters tweeting from the most different locations should be returned, and where the number of locations is equal, these should be ranked by the number of tweets. Only tweets made in Greater Capital cities should be counted.

Rank  Author Id            Number of Unique City Locations and #Tweets
#1    1429984556451389440  8 (#1920 tweets - #1879gmel, #13acte, #11gsyd, #7gper, #6gbri, #2gade, #1gdar, #1ghob)
#2    702290904460169216   8 (#1231 tweets - #336gsyd, #255gmel, #235gbri, #156gper, #127gade, #56acte, #45ghob, #21gdar)
#3    17285408             8 (#1209 tweets - #1061gsyd, #60gmel, #40gbri, #23acte, #11ghob, #7gper, #4gdar, #3gade)
#4    87188071             8 (#407 tweets - #116gsyd, #86gmel, #68gbri, #52gper, #37acte, #28gade, #15ghob, #5gdar)
#5    774694926135222272   8 (#272 tweets - #38gmel, #37gbri, #37gsyd, #36ghob, #34acte, #34gper, #28gdar, #28gade)
#6    1361519083           8 (#266 tweets - #193gdar, #36gmel, #18gsyd, #9gade, #6acte, #2ghob, #1gbri, #1gper)
#7    502381727            8 (#250 tweets - #214gmel, #10acte, #8gbri, #8ghob, #4gade, #3gper, #2gsyd, #1gdar)
#8    921197448885886977   8 (#207 tweets - #56gmel, #49gsyd, #37gbri, #28gper, #24gade, #8acte, #4ghob, #1gdar)
#9    601712763            8 (#146 tweets - #44gsyd, #39gmel, #19gade, #14gper, #11gbri, #10acte, #8ghob, #1gdar)
#10   2647302752           8 (#80 tweets - #32gbri, #16gmel, #13gsyd, #5ghob, #4gper, #4acte, #3gade, #3gdar)
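
The ordering above (most distinct cities first, ties broken by total tweets) reduces to a single sort key. A minimal sketch, assuming each author's tweets have been aggregated into a Counter keyed by GCC code:

from collections import Counter
from typing import Dict

def rank_by_locations(per_author: Dict[str, Counter], n=10):
    """Rank authors by distinct GCC count, then by total tweets (both descending)."""
    return sorted(
        per_author.items(),
        key=lambda item: (len(item[1]), sum(item[1].values())),
        reverse=True,
    )[:n]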

Conclusion

In this project, we explored Amdahl’s Law by using MPI to process a large JSON file. While parallelism can significantly enhance performance, it is essential to consider potential trade-offs in CPU efficiency. As the processing-time table above indicates, distributing work across multiple cores can reduce job wall-clock time. However, the benefit may diminish when scaling across multiple nodes, due to the increased time required for MPI communication between nodes. Moreover, parallelism may not be suitable for small datasets if a single core can solve the problem efficiently in a short time. Therefore, when designing a parallel program with MPI, programmers need to balance the trade-off between CPU efficiency and overall performance.
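
The parallel pattern discussed above can be reduced to a short mpi4py sketch: each rank scans its own byte range of the input file, builds a local Counter, and the partial counts are gathered and merged on rank 0 (the serial fraction in Amdahl's terms). The file path, the line-by-line layout of bigTwitter.json and the elided parsing step are simplifying assumptions, not the actual program.

import os
from collections import Counter
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

path = "data/bigTwitter.json"            # placeholder path
file_size = os.path.getsize(path)
chunk = file_size // size
start = rank * chunk
end = file_size if rank == size - 1 else (rank + 1) * chunk

local = Counter()
with open(path, "rb") as f:
    f.seek(start)
    if rank > 0:
        f.readline()                     # skip the partial line at the chunk boundary
    while f.tell() < end:
        line = f.readline()
        if not line:
            break
        # ... decode/parse the tweet on this line and update `local` ...

# Gather every rank's partial counts and merge them on rank 0.
partials = comm.gather(local, root=0)
if rank == 0:
    total = sum(partials, Counter())
    print(total.most_common(10))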

For the complete Assignment 1 report, please check Overleaf.

LICENSE

The code will be made public after 25 April 2023. For copyright information, please refer to the MIT License.


© 2023 Wei & Sunchuangyu