/Italian_Twitch_Community_Graph_Database

Implementation of a graph database of the most popular Italian Twitch streamers, based on the number of shared viewers. The information is scraped from multiple sources.

Primary language: Cypher. License: MIT.

Twitch community graph

Abstract

Twitch.tv is a live streaming platform that allows streamers to broadcast and users to enjoy content in real time. The broadcasts cover various categories, mainly related to video games, entertainment, and the arts. Thanks to the platform's great success, especially in the last few years, revenue opportunities have increased for both streamers and companies operating in these sectors. Understanding the market and the platform, however, is crucial to discovering users' interests. This project therefore aims to collect and analyze data about the different streams in order to create an explorable and queryable graph model of the existing communities, thus enabling accurate market analysis.

The project consists of a series of scripts that collect, integrate, analyze, and save data from different sources. It is thus a tool that can be run over any time frame to obtain an up-to-date graph. Data is collected from two distinct sources: Twitch, for live information, through the official Web APIs, and SteamDB, for video game information, through dynamic scraping techniques. The processing phase produces the datasets containing the streamers, the video games streamed, and the related bridge tables that link them. The streamer-game relations are calculated by analyzing the broadcast categories, while the streamer-streamer relations are calculated by evaluating the percentage of common viewers between each pair of streamers.
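The streamer-streamer relation can be illustrated with a minimal sketch. The exact formula is not stated here, so the snippet assumes the overlap is measured as the number of shared viewers divided by the smaller of the two audiences. The function names, the threshold parameter, and the sample data are hypothetical.

```python
from itertools import combinations


def shared_viewer_percentage(viewers_a, viewers_b):
    """Percentage of shared viewers relative to the smaller audience (assumed formula)."""
    a, b = set(viewers_a), set(viewers_b)
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / min(len(a), len(b))


def streamer_edges(audiences, threshold=0.0):
    """Return (streamer_1, streamer_2, percentage) for every pair above the threshold."""
    edges = []
    for s1, s2 in combinations(sorted(audiences), 2):
        pct = shared_viewer_percentage(audiences[s1], audiences[s2])
        if pct > threshold:
            edges.append((s1, s2, pct))
    return edges


# Hypothetical sample: streamer name -> set of viewer logins observed on the stream
audiences = {
    "streamer_a": {"u1", "u2", "u3", "u4"},
    "streamer_b": {"u3", "u4"},
    "streamer_c": {"u9"},
}
print(streamer_edges(audiences, threshold=10.0))
```

Dividing by the smaller audience keeps a small channel fully contained in a large one visible as a strong link; a Jaccard-style ratio would be an equally plausible choice.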

This repository contains data collected over a two-week period in May 2022 on all Italian broadcasts on Twitch, together with SteamDB data on the most played video games. Approximately 2.5 GB of data was collected during this period; after a detailed analysis, it allowed the creation of a graph model in the Neo4j DBMS consisting of 4121 nodes and 54931 edges.

Graph visualization in Gephi (May 2022)

Execution scheme

Pipeline

1. Data Collection

  1. Follow this doc to obtain your Twitch API keys (ClientID and ClientSecret) and paste them into the Twitch_API_keys.txt file
  2. Schedule Twitch_stream_collection.py to run every xx minutes (Windows: Task Scheduler, Linux: crontab)
    • choose the details (e.g. language) of the desired streams
    • this script saves the collected streams in individual JSON files; uploading to a local MongoDB server is also supported: uncomment the import function in the script (this requires MongoDB Community Server)
  3. Run steam_games_scraping.ipynb to scrape the SteamDB website (if the website asks for a CAPTCHA, try clearing the browser cookies)
  4. Download the bot-users dataset from Twitch Insights using a browser extension (e.g. Table Capture for Chrome) and save it as Twitch_bot_list.csv
  5. Run Twitch_social_link.py to obtain the streamers' social links (this can be run only after the collection and processing phases because it requires the complete streamer list)
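As a rough illustration of the stream collection step, the sketch below requests an app access token with the client credentials flow and pages through the Helix "Get Streams" endpoint filtered by language. It is not the code of Twitch_stream_collection.py: the function names are hypothetical, and the credentials are passed as arguments because the layout of Twitch_API_keys.txt is not described here.

```python
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://id.twitch.tv/oauth2/token"
STREAMS_URL = "https://api.twitch.tv/helix/streams"


def build_token_request(client_id, client_secret):
    """Build the POST request for an app access token (client credentials flow)."""
    body = urllib.parse.urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "grant_type": "client_credentials",
    }).encode()
    return urllib.request.Request(TOKEN_URL, data=body, method="POST")


def build_streams_request(client_id, token, language="it", cursor=None):
    """Build the GET request for one page (up to 100 entries) of live streams."""
    params = {"language": language, "first": 100}
    if cursor:
        params["after"] = cursor  # pagination cursor returned by the previous page
    url = STREAMS_URL + "?" + urllib.parse.urlencode(params)
    headers = {"Client-Id": client_id, "Authorization": f"Bearer {token}"}
    return urllib.request.Request(url, headers=headers)


def fetch_json(request):
    """Send a request and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def collect_live_streams(client_id, client_secret, language="it"):
    """Collect every page of live streams for one language."""
    token = fetch_json(build_token_request(client_id, client_secret))["access_token"]
    streams, cursor = [], None
    while True:
        page = fetch_json(build_streams_request(client_id, token, language, cursor))
        streams.extend(page["data"])
        cursor = page.get("pagination", {}).get("cursor")
        if not cursor:
            return streams
```

Calling `collect_live_streams(client_id, client_secret)` once per scheduled run and dumping the result to a timestamped JSON file would give one snapshot per interval.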

2. Data Processing

  1. Run DataProcessing.ipynb selecting the parameters for the analysis in the first block:
    • data source (JSON files or local MongoDB server)
    • acquisition time interval (xx minutes)
    • parameters and thresholds
  2. Run DataEnrichment.ipynb to add game info from SteamDB (manually verify the matches)
  3. Run DataExploration.ipynb and DataQuality.ipynb to obtain data insights
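The role of the acquisition interval can be sketched as follows: if a snapshot is taken every fixed number of minutes, each appearance of a streamer in a category can be counted as one interval of streaming time. This is only an assumed reading of the processing step, not the notebook's actual logic. The function name is hypothetical, and the `user_login` and `game_name` fields are those of the Twitch "Get Streams" response.

```python
from collections import Counter


def streamer_game_minutes(snapshots, interval_minutes):
    """Estimate minutes streamed per (streamer, game) pair.

    Each snapshot is one collection run; a streamer seen in a category
    is assumed to have streamed it for the whole interval.
    """
    minutes = Counter()
    for snapshot in snapshots:
        for stream in snapshot:
            game = stream.get("game_name")
            if game:  # skip streams without a category
                minutes[(stream["user_login"], game)] += interval_minutes
    return minutes


# Hypothetical sample: three collection runs taken 15 minutes apart
snapshots = [
    [{"user_login": "streamer_a", "game_name": "Just Chatting"},
     {"user_login": "streamer_b", "game_name": "Minecraft"}],
    [{"user_login": "streamer_a", "game_name": "Just Chatting"}],
    [{"user_login": "streamer_a", "game_name": "Minecraft"},
     {"user_login": "streamer_b", "game_name": ""}],
]
minutes = streamer_game_minutes(snapshots, interval_minutes=15)
for (streamer, game), total in sorted(minutes.items()):
    print(streamer, game, total)
```

A threshold on the resulting totals would then decide which streamer-game pairs become edges in the graph.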

3. Data Modelling

  1. Install Neo4j Community Server
  2. Copy the CSVs from the output_datasets folder to the Neo4j import folder (neo4j/import/)
  3. Run graph_neo4j.ipynb to load the data into Neo4j
  4. Execute the desired queries
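For orientation, the loading step can be pictured as a pair of Cypher `LOAD CSV` statements, one for nodes and one for relationships. The helpers below only build the query strings; they are a sketch, not the content of graph_neo4j.ipynb. The file names, labels, relationship type, and column names (`name`, `source`, `target`, `weight`) are all hypothetical, and `file:///` paths are assumed to resolve against the Neo4j import folder, which is the default configuration.

```python
def load_nodes_query(csv_file, label, key):
    """Return a Cypher statement that merges one node per CSV row."""
    return (
        f"LOAD CSV WITH HEADERS FROM 'file:///{csv_file}' AS row\n"
        f"MERGE (n:{label} {{{key}: row.{key}}})"
    )


def load_edges_query(csv_file, label, key, rel_type, weight):
    """Return a Cypher statement that links existing nodes listed in a CSV."""
    return (
        f"LOAD CSV WITH HEADERS FROM 'file:///{csv_file}' AS row\n"
        f"MATCH (a:{label} {{{key}: row.source}})\n"
        f"MATCH (b:{label} {{{key}: row.target}})\n"
        f"MERGE (a)-[r:{rel_type}]->(b)\n"
        f"SET r.{weight} = toFloat(row.{weight})"
    )


# Hypothetical file, label, and column names
print(load_nodes_query("streamers.csv", "Streamer", "name"))
print(load_edges_query("streamer_streamer.csv", "Streamer", "name",
                       "SHARES_VIEWERS", "weight"))
```

The generated statements can be pasted into Neo4j Browser or sent through a driver session. `MERGE` is used instead of `CREATE` so that re-running the load does not duplicate nodes or relationships.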

4. Data Visualization

  1. Install Gephi
  2. Import Streamer_dataset_short.csv and Streamer-Streamer_dataset_short.csv
  3. Run a layout algorithm (e.g. Force Atlas), run statistical analyses to detect communities (e.g. Modularity), and edit node and edge colors (more details here)

For additional information on the project, read ProjectReport_ita.pdf (in Italian)