The LGBTQ+ Minority Stress on Social Media (LGBTQ+ MiSSoM) dataset is the largest text-based, natural language processing (NLP) dataset on expressions of minority stress. The data are posts and comments from Reddit.com.
Some of the initial code in earlier phases of dataset creation, such as downloading the data via PushShift and establishing inter-coder reliability, can be found [here](https://github.com/CJCascalheira/rise-ml-ms).
- MiSSoM = the public dataset with features and labels, but no raw text. You can access the public dataset here.
- MiSSoM+ = the private dataset with raw text. You can access the private dataset by emailing cjcascalheira@gmail.com, registering your study idea, and signing an agreement to keep the private dataset off public-facing servers.
- src/extract_tagtog/
- src/clean/
  - preprocess_binary_v1.R
  - preprocess_binary_v2.R
  - preprocess_annotated.R
- src/machine_annotate/
- src/create_features/
- src/analyze/
- src/util/ utility scripts
- src/pull_subsets/ scripts to manage data for other scientists