This repository contains the scripts used in the article "Mapping urban linguistic diversity with social media and population register data" published in Computers, Environment and Urban Systems.
- Twitter and Instagram data from the Helsinki Metropolitan Area from 2015.
  - The Instagram data is legacy data that we collected in 2016, before the API was closed down. The Twitter data can be collected with tweetsearcher.
- Statistics Finland 250 m grid database from 2015
  - Apply from here
- Individual-level first language information aggregated into a 250 m grid from 2015
  - Apply from here
- Dynamic population data from the Helsinki Metropolitan Area by Bergroth et al. (2022) from here
- Language detection of the social media data was done with fastText using scripts from Hiippala et al. (2020).
- Linguistic diversity of the register data was calculated in FIONA, the Statistics Finland secure environment, with a script similar to neighborhood_diversities.py.
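
As a rough illustration of what a per-cell diversity calculation can look like, the sketch below computes Shannon entropy from the language labels of posts in each grid cell. The choice of Shannon entropy, the function names, and the toy data are assumptions made for this example; the metrics actually computed by neighborhood_diversities.py may differ.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(labels):
    """Shannon entropy (natural log) of a sequence of language labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    # sum of p * ln(1 / p), written with total / n to avoid a negative zero
    return sum((n / total) * math.log(total / n) for n in counts.values())


def diversity_by_cell(posts):
    """Group (cell_id, language) pairs by cell and compute entropy per cell."""
    by_cell = defaultdict(list)
    for cell_id, language in posts:
        by_cell[cell_id].append(language)
    return {cell: shannon_entropy(langs) for cell, langs in by_cell.items()}


# Hypothetical toy data: (grid cell id, detected language code)
posts = [
    ("A", "fi"), ("A", "fi"), ("A", "en"), ("A", "sv"),
    ("B", "fi"), ("B", "fi"),
]
diversities = diversity_by_cell(posts)
```

A cell with a single language gets an entropy of zero, and the value grows as posts are spread more evenly over more languages.
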
Step | Script | Description | Input | Output |
---|---|---|---|---|
1 | combine_twitter_insta.py | Combines Twitter and Instagram data | Instagram and Twitter point features geopackage | Twitter-Instagram combined point features geopackage |
2 | neighborhood_diversities.py | Calculates linguistic diversities across times of day | Output from step 1 and grid database | Grid database with diversity metrics |
3 | neighborhood_diversities_no_timeofday.py | Calculates linguistic diversities for social media as a whole | Output from step 1 and grid database | Grid database with diversity metrics |
4 | clean_socioeco.py | Cleans socio-economic grid database | Raw RTK database file | Cleaned grid database |
5 | join_dynpop.py | Joins dynamic population to output from step 4 | Dynamic population data and output from step 4 | Grid database with dynamic population |
6 | join_regdiv_to_socioeco.py | Joins register diversity metrics to the grid database from step 5 | Register data with linguistic diversity metrics and output from step 5 | Grid database |
7 | user_langprofiles.py | Calculates social media user linguistic profiles | Output from step 1 | LaTeX-formatted table |
8 | moran_cluster.py | Calculates clusters in register and social media grid data | Outputs from steps 6 and 3 | Geopackage with clusters |
9 | stability_socioeco.py | Classifies social media clusters based on temporal stability | Output from step 8 | Geopackage with stability classifications |
10 | extract_high_clusters.py | Extracts significant high linguistic diversity clusters | Output from step 9 | Geopackage with high diversity clusters |
11 | kde_plot.py | Plots linguistic diversity across times of day and the register data | Outputs from steps 6 and 3 | PNG file |
12 | regression_ols_timeofday.py | Performs the OLS regression analysis | Outputs from steps 6 and 3 | Model files, VIF dataframes, error plots |
13 | SLM regression in GeoDa | Runs the SLM regression | Outputs from steps 6 and 3 | SLM model summaries |
14 | OLS_summaries.py | Prints OLS summaries | OLS model files from step 12 | LaTeX-formatted OLS summaries |
15 | get_coef_table.py | Prints coefficient table from regression analyses | Output from step 14 | LaTeX-formatted table |
16 | read_vif_files.py | Prints VIF test scores | VIF dataframes from step 12 | LaTeX-formatted table |
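
The input column of the table implies a dependency order between the scripts. The sketch below encodes those dependencies and derives one valid run order with the standard-library `graphlib` module (Python 3.9 or newer). It is only an illustration: the `DEPENDS_ON` mapping and `run_order` helper are names invented for this example, the manual SLM regression step in GeoDa is left out, and external inputs such as the grid database are not modelled.

```python
from graphlib import TopologicalSorter

# Each script maps to the scripts whose output it reads, as listed in the table.
DEPENDS_ON = {
    "combine_twitter_insta.py": [],
    "neighborhood_diversities.py": ["combine_twitter_insta.py"],
    "neighborhood_diversities_no_timeofday.py": ["combine_twitter_insta.py"],
    "clean_socioeco.py": [],
    "join_dynpop.py": ["clean_socioeco.py"],
    "join_regdiv_to_socioeco.py": ["join_dynpop.py"],
    "user_langprofiles.py": ["combine_twitter_insta.py"],
    "moran_cluster.py": [
        "join_regdiv_to_socioeco.py",
        "neighborhood_diversities_no_timeofday.py",
    ],
    "stability_socioeco.py": ["moran_cluster.py"],
    "extract_high_clusters.py": ["stability_socioeco.py"],
    "kde_plot.py": [
        "join_regdiv_to_socioeco.py",
        "neighborhood_diversities_no_timeofday.py",
    ],
    "regression_ols_timeofday.py": [
        "join_regdiv_to_socioeco.py",
        "neighborhood_diversities_no_timeofday.py",
    ],
    "OLS_summaries.py": ["regression_ols_timeofday.py"],
    "get_coef_table.py": ["OLS_summaries.py"],
    "read_vif_files.py": ["regression_ols_timeofday.py"],
}


def run_order(depends_on):
    """Return the scripts in an order where every input is produced first."""
    return list(TopologicalSorter(depends_on).static_order())


order = run_order(DEPENDS_ON)
for position, script in enumerate(order, start=1):
    print(position, script)
```

The printed order is one of several valid orders and need not match the step numbers in the table; any order that respects the listed inputs should work.
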
The SLM analysis was conducted in GeoDa with first-order Queen contiguity neighborhoods.
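
For readers unfamiliar with the term, first-order Queen contiguity treats two cells as neighbors when they share an edge or a corner. The sketch below shows the idea on a regular grid, assuming each cell is identified by hypothetical integer `(column, row)` indices. GeoDa builds its weights from the polygon geometries, so this is an illustration of the concept rather than a reproduction of the GeoDa workflow.

```python
def queen_neighbors(cells):
    """Map each (col, row) cell to the cells touching it by an edge or a corner."""
    cell_set = set(cells)
    # The eight surrounding offsets; (0, 0) is the cell itself and is excluded.
    offsets = [
        (dx, dy)
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    ]
    return {
        (x, y): sorted(
            (x + dx, y + dy)
            for dx, dy in offsets
            if (x + dx, y + dy) in cell_set
        )
        for (x, y) in cell_set
    }


# Hypothetical 3 x 3 block of grid cells
grid = [(x, y) for x in range(3) for y in range(3)]
neighbors = queen_neighbors(grid)
```

In this toy grid the center cell has eight neighbors, edge cells have five, and corner cells have three. Cells missing from the input, such as grid cells without data, are simply not counted as neighbors.
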