Python libraries can be installed with pip using the provided requirements.txt file, preferably inside a virtual environment such as virtualenv or conda:
pip install -r requirements.txt
Prepend sudo to the command if necessary.
Study participants' identifying information will not be released, including their Facebook status updates and user IDs. Feature representations necessary to replicate Analysis 1 and Analysis 2 are provided in the above OSF link.
doc_text_dataset.tsv and subject_text_dataset.tsv are not provided, though each file is assumed to have a text column containing cleaned text (type str).
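Since the text datasets are not distributed, anyone supplying their own files may want to confirm they match the assumed layout before running the scripts. The following is a minimal sketch using only the standard library, not code from this repository; the function name check_text_column and the user_id column in the sample are hypothetical, and only the text column is stated in this README.

```python
import csv
import io


def check_text_column(handle, column="text"):
    """Return the number of data rows after confirming each has non-empty text."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"missing required column: {column!r}")
    n_rows = 0
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        value = row[column]
        if value is None or not value.strip():
            raise ValueError(f"empty {column!r} value on line {line_no}")
        n_rows += 1
    return n_rows


# Tiny in-memory stand-in for doc_text_dataset.tsv (made-up rows;
# "user_id" is a hypothetical identifier column).
sample = io.StringIO(
    "user_id\ttext\nu1\tfirst cleaned post\nu1\tsecond cleaned post\n"
)
n_checked = check_text_column(sample)
```

For a real file, pass an open handle instead, for example open(path, newline="", encoding="utf-8").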
+-- data
| +-- processed
| | +-- doc_text_dataset.tsv # one line per post; multiple posts per participant
| | +-- subject_text_dataset.tsv # one line per participant; posts concatenated
| +-- features
| | +-- public_doc_dataset.tsv # one line per participant; from "make_features.py"
| | +-- public_subject_dataset.tsv # one line per participant; from "make_features.py"
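The relationship between the two processed files (one row per post versus one row per participant with posts concatenated) can be illustrated as follows. This is a sketch under assumptions rather than the code used for the paper: the user_id column name, the single-space separator, and the function name concatenate_posts are all hypothetical.

```python
from collections import defaultdict


def concatenate_posts(doc_rows, id_column="user_id", text_column="text", sep=" "):
    """Group post-level rows by participant and join their posts in input order."""
    posts = defaultdict(list)
    for row in doc_rows:
        posts[row[id_column]].append(row[text_column])
    # One output row per participant, in order of first appearance.
    return [
        {id_column: pid, text_column: sep.join(texts)}
        for pid, texts in posts.items()
    ]


# Made-up post-level rows; "user_id" is an assumed column name.
doc_rows = [
    {"user_id": "u1", "text": "first post"},
    {"user_id": "u2", "text": "only post"},
    {"user_id": "u1", "text": "second post"},
]
subject_rows = concatenate_posts(doc_rows)
```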
Public datasets can be downloaded from OSF.
Though we do not release the original text, we provide the code used to produce the feature sets (from a given text dataset) used in both analyses.
To produce all 9 feature sets (word counting MFD, MFD2, and LIWC; LDA; average GloVe vectors; DDR for MFD, MFD2, and LIWC; average BERT vectors), run make_features.py from the shell, substituting the paths to the dictionary files (*.dic) and the path to the GloVe vectors (*.txt).
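As an illustration of one of these feature sets, the sketch below computes an average GloVe vector for a document. It is not the implementation in make_features.py: the whitespace tokenization, the lowercasing, and the function names load_glove and average_vector are assumptions. The only fact relied on is the GloVe text format, in which each line holds a token followed by space-separated floats.

```python
import io


def load_glove(handle):
    """Parse GloVe's plain-text format: a token followed by space-separated floats."""
    vectors = {}
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def average_vector(text, vectors):
    """Average the vectors of in-vocabulary whitespace tokens; None if none match."""
    found = [vectors[tok] for tok in text.lower().split() if tok in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


# Toy 2-dimensional vectors standing in for a real GloVe file.
toy_glove = io.StringIO("good 1.0 0.0\nharm 0.0 1.0\n")
vectors = load_glove(toy_glove)
doc_vector = average_vector("Good people avoid harm", vectors)
```

With real vectors, a NumPy array would be a more practical container than nested lists; plain lists are used here only to keep the sketch dependency-free.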
Before make_features.py can be called to produce LDA vectors, an LDA model must be fit. This can be done by calling fit_lda.py, again substituting the correct path to a file containing a text column (with cleaned text). This requires Mallet to be downloaded and the environment variable MALLET_PATH to be set in ~/.bashrc or ~/.profile.
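A quick way to confirm the variable is visible to Python before fitting the model is sketched below. This check is not part of fit_lda.py, and the function name get_mallet_path is hypothetical. Whether MALLET_PATH should point to the Mallet executable or to its install directory is not stated here, so the sketch only verifies that the variable is set.

```python
import os


def get_mallet_path(environ=os.environ):
    """Return the value of MALLET_PATH, raising a clear error if unset or empty."""
    path = environ.get("MALLET_PATH", "")
    if not path:
        raise RuntimeError(
            "MALLET_PATH is not set; export it in ~/.bashrc or ~/.profile"
        )
    return path


# Demonstration with an explicit mapping; the path below is a placeholder.
example_path = get_mallet_path({"MALLET_PATH": "/path/to/mallet"})
```

Calling get_mallet_path() with no argument reads the real environment of the current shell session.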
GloVe features require downloading the text file(s) containing GloVe vectors. The vectors from Wikipedia+Gigaword (6B.300d) were used in this paper.
Regressions for each feature set can be run from the shell with python study1_regressions.py. To replicate the analyses in the supplemental materials (varying the minimum number of posts for removing participants; aggregation of posts), uncomment the relevant section in study1_regressions.py.
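The minimum-posts criterion mentioned above amounts to dropping participants who contributed fewer than a chosen number of posts. The sketch below shows that idea on made-up rows; it is not the code in study1_regressions.py, and the user_id column, the function name filter_by_min_posts, and the threshold value are hypothetical.

```python
from collections import Counter


def filter_by_min_posts(doc_rows, min_posts, id_column="user_id"):
    """Keep only rows from participants with at least min_posts posts."""
    counts = Counter(row[id_column] for row in doc_rows)
    return [row for row in doc_rows if counts[row[id_column]] >= min_posts]


# Made-up post-level rows; u1 has two posts, u2 has one.
doc_rows = [
    {"user_id": "u1", "text": "first post"},
    {"user_id": "u2", "text": "only post"},
    {"user_id": "u1", "text": "second post"},
]
kept = filter_by_min_posts(doc_rows, min_posts=2)
```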
Post hoc tests of the resulting explanatory coefficients are contained in analysis1_posthoc.R.
Exploratory analyses of the influence of moral concerns on produced categories of language (MFD2 ~ moral concerns, LIWC ~ moral concerns, LDA ~ moral concerns) are found in analysis2.R.