maphel-urbanlingdiv: A Python repository from DigitalGeographyLab

This repository contains the scripts used in the article "Mapping urban linguistic diversity with social media and population register data" published in Computers, Environment and Urban Systems.

Data requirements

Twitter and Instagram data from the Helsinki Metropolitan Area from the year 2015.
- The Instagram data is legacy data that we collected in 2016, before the API was closed down. The Twitter data can be collected with tweetsearcher.
Statistics Finland 250 m grid database from the year 2015
- Apply from here
Individual-level first language information aggregated into a 250 m grid, from the year 2015
- Apply from here
Dynamic population data from the Helsinki Metropolitan Area by Bergroth et al. (2022) from here

Pre-analysis steps

Language detection for the social media data was done with fastText, using scripts from Hiippala et al. (2020).
Linguistic diversity of the register data was calculated in the Statistics Finland secure environment FIONA with a script similar to neighborhood_diversities.py.
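
As a rough illustration of what a per-cell diversity calculation can look like, the sketch below computes language richness and Shannon entropy from a list of (grid cell id, language) pairs. The choice of metrics, the function names and the sample data are assumptions made for illustration only; neighborhood_diversities.py defines the metrics actually used in the article.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(counts):
    """Shannon entropy (natural log) of a mapping of language -> count."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counts.values() if n > 0)


def diversity_by_cell(records):
    """Group (cell_id, language) pairs by cell and compute simple diversity metrics."""
    per_cell = defaultdict(Counter)
    for cell_id, language in records:
        per_cell[cell_id][language] += 1
    return {
        cell_id: {
            "richness": len(counts),            # number of distinct languages
            "shannon": shannon_entropy(counts),  # higher = more even language mix
        }
        for cell_id, counts in per_cell.items()
    }


# Hypothetical sample: grid cell ids and detected language codes.
sample = [
    ("cell_a", "fi"), ("cell_a", "fi"), ("cell_a", "en"), ("cell_a", "sv"),
    ("cell_b", "fi"), ("cell_b", "fi"),
]
metrics = diversity_by_cell(sample)
```

A cell with a single language gets an entropy of zero, and the value grows as more languages appear in more even proportions.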

Suggested running order of scripts

Step	Script	Description	Input	Output
1	combine_twitter_insta.py	Combines Twitter and Instagram data	Instagram and Twitter point features geopackage	Twitter-Instagram combined point features geopackage
2	neighborhood_diversities.py	Calculates linguistic diversities across times of day	Output from step 1 and grid database	Grid database with diversity metrics
3	neighborhood_diversities_no_timeofday.py	Calculates linguistic diversities for social media as a whole	Output from step 1 and grid database	Grid database with diversity metrics
4	clean_socioeco.py	Cleans socio-economic grid database	Raw RTK database file	Cleaned grid database
5	join_dynpop.py	Joins dynamic population to output from step 4	Dynamic population data and output from step 4	Grid database with dynamic population
6	join_regdiv_to_socioeco.py	Joins register-based diversity metrics to the grid database from step 5	Register data with linguistic diversity metrics and output from step 5	Grid database
7	user_langprofiles.py	Calculates social media user linguistic profiles	Output from step 1	Latex-formatted table
8	moran_cluster.py	Calculates clusters in register and social media grid data	Outputs from steps 6 and 3	Geopackage with clusters
9	stability_socioeco.py	Classifies social media clusters based on temporal stability	Output from step 8	Geopackage with stability classifications
10	extract_high_clusters.py	Extracts significant high linguistic diversity clusters	Output from step 9	Geopackage with high diversity clusters
11	kde_plot.py	Plots linguistic diversity across times of day and the register data	Outputs from steps 6 and 3	PNG file
12	regression_ols_timeofday.py	Performs the OLS regression analysis	Outputs from steps 6 and 3	Model files, VIF dataframes, error plots
13	SLM regression in GeoDA	Run SLM regression	Outputs from steps 6 and 3	SLM model summaries
14	OLS_summaries.py	Prints OLS summaries	OLS model files from step 12	Latex-formatted OLS summaries
15	get_coef_table.py	Prints coefficient table from regression analyses	Output from step 14	Latex-formatted table
16	read_vif_files.py	Prints VIF test scores	VIF dataframes from step 12	Latex-formatted table

Note on analysis

The SLM analysis was conducted in GeoDA with 1st order Queen contiguity neighborhoods.
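
For readers unfamiliar with the term, first-order Queen contiguity treats two grid cells as neighbors when they share an edge or a corner. The sketch below builds such a neighbor list for cells on a regular grid and row-standardizes the weights, as is common for spatial lag models. The integer (column, row) cell indices and the function names are hypothetical; this is an illustration of the concept, not a reproduction of the GeoDA workflow used in the article.

```python
from itertools import product


def queen_neighbors(cells):
    """Map each (col, row) cell to the cells sharing an edge or a corner with it."""
    cell_set = set(cells)
    # The eight surrounding offsets; (0, 0) is the cell itself and is excluded.
    offsets = [(dc, dr) for dc, dr in product((-1, 0, 1), repeat=2) if (dc, dr) != (0, 0)]
    return {
        (c, r): sorted((c + dc, r + dr) for dc, dr in offsets if (c + dc, r + dr) in cell_set)
        for c, r in cell_set
    }


def row_standardize(neighbors):
    """Give each neighbor weight 1/k, where k is the cell's neighbor count."""
    return {
        cell: {n: 1.0 / len(ns) for n in ns}
        for cell, ns in neighbors.items()
    }


# Hypothetical 3 x 3 block of grid cells indexed by (column, row).
grid = [(c, r) for c in range(3) for r in range(3)]
neighbors = queen_neighbors(grid)
weights = row_standardize(neighbors)
```

In this toy grid the centre cell has eight neighbors and each corner cell has three; cells without neighbors simply get an empty weight mapping.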
