/some-lingscapes

A repository for the article "Exploring the linguistic landscape of geotagged social media content in urban environments" published in Digital Scholarship in the Humanities.


Introduction

The article is published under an open license (CC BY 4.0) and is available on the journal website and in this repository.

The repository is organised into directories, which contain scripts related to various analyses conducted in the article.

The scripts are described in greater detail in their respective subfolders.

| Directory | Description |
|-----------|-------------|
| langid | Scripts for automatic language identification |
| plots | Scripts for analysing and plotting the data |
| spatial | Scripts for analysing user mobility |
| stats | Scripts for statistical analyses |
| topics | Scripts for topic modelling |
| utils | Utility scripts and dummy dataset for testing |

Usage

To use the scripts, you need Python 3 with the required libraries installed. It is recommended to set up a virtual Python 3 environment and install the required libraries into it:

pip install -r requirements.txt

The topic modelling script requires NLTK's stopwords and spaCy's English language model. After installing the libraries listed in requirements.txt, run:

python -m nltk.downloader stopwords

python -m spacy download en

After installation, run the scripts on your own data or on the provided dummy dataset in the recommended order. For more information about the dummy dataset, see the section "About the dummy dataset" below.

Recommended order of running scripts

The table below gives the recommended order for running the scripts in this repository. The input and output names are examples; replace them with the file names of your own data.

| Step | Script | Input | Output |
|------|--------|-------|--------|
| 0 | get_fasttext_model.py | -- | langid/models/lid.176.bin |
| 1 | run_fasttext.py or run_langid.py | your_data.pkl | lid_data.pkl |
| 2 | location_history_creator.py | lid_data.pkl | lh_data.pkl |
| 3 | reverse_geocode.py | lh_data.pkl | revgeo_data.pkl |
| 4 | extract_languages+activities.py | revgeo_data.pkl | lochist_data.pkl |
| 5 | add_location_hist_to_df.py | lochist_data.pkl + lh_data.pkl | joined_data.pkl |
| 6 | topic_model_for_language+country.py | joined_data.pkl | LaTeX table |
| 7 | scripts from stats or plots | joined_data.pkl | outputs vary (images, text) |
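
The dependencies between steps 1 to 5 can be sketched as a small lookup that reports which step to run next, given the files that already exist. This is only an illustration, not part of the repository: the script names come from the table, the .pkl names are the table's example names, and the helper `next_step` is hypothetical.

```python
# (step, script, inputs, output) for steps 1-5 of the table.
# The .pkl names are the example names from the table, not fixed by the scripts.
PIPELINE = [
    (1, "run_fasttext.py", ["your_data.pkl"], "lid_data.pkl"),
    (2, "location_history_creator.py", ["lid_data.pkl"], "lh_data.pkl"),
    (3, "reverse_geocode.py", ["lh_data.pkl"], "revgeo_data.pkl"),
    (4, "extract_languages+activities.py", ["revgeo_data.pkl"], "lochist_data.pkl"),
    (5, "add_location_hist_to_df.py", ["lochist_data.pkl", "lh_data.pkl"], "joined_data.pkl"),
]


def next_step(existing):
    """Return (step, script, missing_inputs) for the first step whose output
    is not in `existing`, or None if every output is already present."""
    existing = set(existing)
    for step, script, inputs, output in PIPELINE:
        if output not in existing:
            missing = [name for name in inputs if name not in existing]
            return step, script, missing
    return None
```

For example, `next_step({"your_data.pkl", "lid_data.pkl"})` points at step 2 with no missing inputs. A non-empty third element means an earlier step's output has to be produced first.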

In step 1, your input data should be a pickled Pandas/GeoPandas DataFrame whose column names match those used in the scripts.
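
A quick check of the column names before step 1 might look like the sketch below. The names in `REQUIRED_COLUMNS` are assumptions for illustration only; consult the scripts for the names they actually expect. The helper `missing_columns` is hypothetical and not part of the repository.

```python
# Assumed column names, for illustration only. Replace them with the
# names that the scripts in this repository actually use.
REQUIRED_COLUMNS = ("user_id", "photo_id", "text", "timestamp", "geometry")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`,
    in the order they appear in `required`."""
    present = set(columns)
    return [name for name in required if name not in present]
```

With a DataFrame loaded via `pandas.read_pickle`, calling `missing_columns(df.columns)` returns an empty list when all assumed columns are present.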

Compatibility issues: The scripts are not fully compatible with Windows. skbio (a library required for diversity indices) does not work on Windows operating systems, and pyfasttext and its dependencies (mainly cysignals) can be difficult to get working on them.
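
To see in advance whether these libraries are available, a small check along the following lines can help. It is a sketch using only the standard library; the function name `check_environment` is hypothetical, and the check only tests whether the modules can be found, not whether they work correctly.

```python
import importlib.util
import platform


def check_environment(modules=("skbio", "pyfasttext")):
    """Report whether the OS is Windows and whether each module can be located."""
    report = {"windows": platform.system() == "Windows"}
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report
```

A result such as `{"windows": True, "skbio": False, ...}` suggests that the scripts relying on diversity indices will not run in that environment.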

About the dummy dataset

The dummy dataset is intended strictly for testing the scripts. It is a pickled Pandas DataFrame, which contains dummy user_ids, photo_ids, caption texts, timestamps and geometries. The caption texts were generated using a recurrent neural network implemented in Keras, with actual Instagram captions from the Helsinki area as training data. The captions are monolingual, and the languages are the 10 most frequently used languages in Instagram captions from the Helsinki area. User_ids, photo_ids, timestamps and geometries are all randomly generated, so any statistical tests or plots based on the dummy dataset will reflect this randomness.
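
To illustrate what "randomly generated" means for the non-text columns, the sketch below draws random identifiers, timestamps and coordinates. It is not the generator used for the dummy dataset: the field names, value ranges, the year 2015 and the rough bounding box around Helsinki are all assumptions, and captions are left out because they came from the neural network described above.

```python
import random
from datetime import datetime, timedelta


def make_dummy_rows(n, seed=0):
    """Return n dictionaries with random ids, timestamps and coordinates."""
    rng = random.Random(seed)      # seeded so that runs are repeatable
    start = datetime(2015, 1, 1)   # assumed start of the time range
    rows = []
    for _ in range(n):
        rows.append({
            "user_id": rng.randrange(1, 1000),
            "photo_id": rng.randrange(1, 10**9),
            "timestamp": start + timedelta(seconds=rng.randrange(365 * 24 * 3600)),
            "lon": rng.uniform(24.8, 25.2),  # assumed bounding box, roughly Helsinki
            "lat": rng.uniform(60.1, 60.3),
        })
    return rows
```

The resulting list can be passed to `pandas.DataFrame` if a table is needed. Because every value is drawn independently, no spatial or temporal pattern should be expected in the output.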

Reference

If you use these scripts in your research, please cite the following reference:

Hiippala, Tuomo, Hausmann, Anna, Tenkanen, Henrikki and Toivonen, Tuuli (2019) Exploring the linguistic landscape of geotagged social media content in urban environments. Digital Scholarship in the Humanities 34(2): 290-309. DOI: 10.1093/llc/fqy049