A repository for the article "Exploring the linguistic landscape of geotagged social media content in urban environments" published in Digital Scholarship in the Humanities
The article is published under an open license (CC BY 4.0) and is available on the journal website and in this repository.
The repository is organised into directories, which contain scripts related to various analyses conducted in the article.
The scripts are described in greater detail in their respective subfolders.
Directory | Description |
---|---|
langid | Scripts for automatic language identification |
plots | Scripts for analysing and plotting the data |
spatial | Scripts for analysing user mobility |
stats | Scripts for statistical analyses |
topics | Scripts for topic modelling |
utils | Utility scripts and dummy dataset for testing |
To use the scripts, you need Python 3 with the required libraries installed. It is recommended to set up a Python 3 virtual environment and install the required libraries into it:
pip install -r requirements.txt
The topic modelling script requires NLTK's stopwords and spaCy's English language model. After installing the requirements, run:
python -m nltk.downloader stopwords
python -m spacy download en
After installation, run the scripts on your own data or on the provided dummy dataset in the recommended order. For more information on the dummy dataset, see its description further below.
The table below gives the recommended order for running the scripts in this repository. The input/output names are examples; use the correct names for your own data.
Step | Script | Input | Output |
---|---|---|---|
0 | get_fasttext_model.py | -- | langid/models/lid.176.bin |
1 | run_fasttext.py or run_langid.py | your_data.pkl | lid_data.pkl |
2 | location_history_creator.py | lid_data.pkl | lh_data.pkl |
3 | reverse_geocode.py | lh_data.pkl | revgeo_data.pkl |
4 | extract_languages+activities.py | revgeo_data.pkl | lochist_data.pkl |
5 | add_location_hist_to_df.py | lochist_data.pkl + lh_data.pkl | joined_data.pkl |
6 | topic_model_for_language+country.py | joined_data.pkl | LaTeX table |
7 | scripts from stats or plots | joined_data.pkl | outputs vary (images, text) |
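The data flow in the table can be sketched as a small consistency check: each step's input should either be supplied by you or be produced by an earlier step. This is only an illustration built from the example file names in the table; the `missing_inputs` helper is hypothetical and does not run any of the repository's scripts.

```python
# Example file names copied from the table above; replace them with your own.
# Each entry is (script, input files, output file).
STEPS = [
    ("get_fasttext_model.py", [], "langid/models/lid.176.bin"),
    ("run_fasttext.py", ["your_data.pkl"], "lid_data.pkl"),
    ("location_history_creator.py", ["lid_data.pkl"], "lh_data.pkl"),
    ("reverse_geocode.py", ["lh_data.pkl"], "revgeo_data.pkl"),
    ("extract_languages+activities.py", ["revgeo_data.pkl"], "lochist_data.pkl"),
    ("add_location_hist_to_df.py", ["lochist_data.pkl", "lh_data.pkl"], "joined_data.pkl"),
]


def missing_inputs(steps, provided):
    """Return (script, filename) pairs whose input is not available yet.

    `provided` lists the files you supply yourself; outputs of earlier
    steps are treated as available for all later steps.
    """
    available = set(provided)
    problems = []
    for script, inputs, output in steps:
        for name in inputs:
            if name not in available:
                problems.append((script, name))
        available.add(output)
    return problems


if __name__ == "__main__":
    # An empty list means every input is covered by the given order.
    print(missing_inputs(STEPS, provided=["your_data.pkl"]))
```

If a step is skipped or reordered, the function reports which script would be missing which file.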
In step 1, your input data should be a pickled Pandas/GeoPandas DataFrame whose column names match those used in the scripts.
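Before running step 1, it may help to check that your DataFrame has the columns the scripts expect. The sketch below assumes pandas is installed; the column names in `EXPECTED_COLUMNS` and the `missing_columns` helper are hypothetical placeholders, so check the scripts themselves for the actual names.

```python
import pandas as pd

# Hypothetical column names; check the scripts for the names they expect.
EXPECTED_COLUMNS = ["user_id", "photo_id", "text", "timestamp", "geometry"]


def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the DataFrame."""
    return [col for col in expected if col not in df.columns]


if __name__ == "__main__":
    # Small in-memory stand-in for pd.read_pickle("your_data.pkl").
    df = pd.DataFrame({"user_id": [1], "photo_id": [10], "text": ["hello"]})
    print(missing_columns(df))
```

With real data, the DataFrame would be loaded with `pd.read_pickle` instead of being built in memory.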
Compatibility issues: Windows compatibility is an issue. skbio (a library required for diversity indices) does not work on Windows operating systems. pyfasttext and its dependencies (mainly cysignals) can be difficult to get working on Windows operating systems.
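A quick way to see whether these libraries are importable in your environment is sketched below, using only the standard library. The helper name `check_optional_libraries` is hypothetical, and the check only tests whether a module can be found, not whether it works correctly.

```python
import importlib.util
import platform


def check_optional_libraries(names=("skbio", "pyfasttext")):
    """Map each module name to True if it can be found, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    print("Operating system:", platform.system())
    for name, found in check_optional_libraries().items():
        print(name, "found" if found else "not found")
```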
The dummy dataset is intended strictly for testing the scripts. It is a pickled Pandas DataFrame, which contains dummy user_ids, photo_ids, caption texts, timestamps and geometries. The caption texts were generated using a recurrent neural network implemented in Keras, with actual Instagram captions from the Helsinki area as training data. The captions are monolingual, and the languages in question are the 10 most frequently used languages in Instagram captions from the Helsinki area. User_ids, photo_ids, timestamps and geometries are all randomly generated, so any statistical tests or plots produced with the dummy dataset will reflect this randomness.
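To illustrate what randomly generated identifiers, timestamps and locations look like, here is a minimal sketch using only the standard library. It is not the code used to create the dummy dataset: the field names, value ranges, start date and rough Helsinki-area bounding box are all assumptions, and captions are left out because they were generated separately.

```python
import random
from datetime import datetime, timedelta


def make_dummy_records(n, seed=0):
    """Generate n random records with hypothetical field names."""
    rng = random.Random(seed)
    start = datetime(2015, 1, 1)  # arbitrary start date (assumption)
    records = []
    for i in range(n):
        records.append({
            "user_id": rng.randint(1, 50),
            "photo_id": i,
            # Random minute within one year of the start date.
            "timestamp": start + timedelta(minutes=rng.randint(0, 365 * 24 * 60)),
            # Rough bounding box around Helsinki (assumption).
            "lon": rng.uniform(24.8, 25.2),
            "lat": rng.uniform(60.1, 60.3),
        })
    return records


if __name__ == "__main__":
    for record in make_dummy_records(3):
        print(record)
```

Because every field is drawn independently at random, such records carry no real spatial or temporal pattern, which is why results computed from them are not meaningful.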
If you use these scripts in your research, please cite the following reference:
Hiippala, Tuomo, Hausmann, Anna, Tenkanen, Henrikki and Toivonen, Tuuli (2019) Exploring the linguistic landscape of geotagged social media content in urban environments. Digital Scholarship in the Humanities 34(2): 290-309. DOI: 10.1093/llc/fqy049