Emotion research

Research about emotions in Arabic dialects

Notebooks

Notebook	Description	Notes
generate_embeddings.ipynb	Trains and saves unsupervised embeddings to embeddings/ folder using the SMADC dataset in the data folder
find_centroids.ipynb	Attempts to cluster emotions from embeddings, then generate similar words from each emotion cluster.	Requires running generate_embeddings.ipynb
examine_lexicon.ipynb	Analyzes the lexicon obtained manually in find_centroids.ipynb
mapping_embeddings_to_other_dialects.ipynb	Attempts to map embeddings across dialects	Not supported by the environment in env.yml
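
The idea behind find_centroids.ipynb (cluster emotion words, then pull similar words from each cluster) can be illustrated with a minimal sketch. This is not the notebook's actual code: the function names, the toy vectors, and the choice of cosine similarity to a seed-word centroid are all assumptions made for illustration.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_words(embeddings, seed_words, k=3):
    """Rank non-seed words by similarity to the centroid of the seed words.

    `embeddings` is a hypothetical {word: vector} mapping; `seed_words`
    stands in for the words of one emotion cluster.
    """
    center = centroid([embeddings[w] for w in seed_words])
    candidates = [w for w in embeddings if w not in seed_words]
    ranked = sorted(candidates,
                    key=lambda w: cosine(embeddings[w], center),
                    reverse=True)
    return ranked[:k]

# Toy 2-dimensional vectors, purely illustrative.
toy = {
    "happy": [1.0, 0.0],
    "glad": [0.9, 0.1],
    "joyful": [0.8, 0.2],
    "angry": [0.0, 1.0],
}
closest = nearest_words(toy, ["happy", "glad"], k=1)
```

With real embeddings the vectors would come from the files in the embeddings folder rather than a hand-written dictionary.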

Folder	Description	Notes
data	Data sources
dialect_lexicon	Stores emotion lexicon txt files named in the format "[DIALECT]_[EMOTION].txt"	Incomplete
embeddings	Stores embeddings	Generated by generate_embeddings.ipynb
preprocessed_data	Stores preprocessed (e.g. stemmed) dialect txt files
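
As a rough sketch of how the "[DIALECT]_[EMOTION].txt" naming convention could be consumed, the snippet below splits a file name into its dialect and emotion parts and loads every lexicon in a folder. The helper names, the UTF-8 encoding, the whitespace-separated word layout, and the example name EGY_joy.txt are assumptions, not something the repository specifies.

```python
from pathlib import Path

def parse_lexicon_name(path):
    """Split a '[DIALECT]_[EMOTION].txt' file name into (dialect, emotion)."""
    stem = Path(path).stem
    # Split on the last underscore so a dialect label containing '_' survives.
    dialect, sep, emotion = stem.rpartition("_")
    if not sep or not dialect or not emotion:
        raise ValueError(f"unexpected lexicon file name: {path}")
    return dialect, emotion

def load_lexicons(folder="dialect_lexicon"):
    """Return {(dialect, emotion): [words]} for every .txt file in `folder`.

    Assumes UTF-8 files with whitespace-separated words.
    """
    lexicons = {}
    for file in sorted(Path(folder).glob("*.txt")):
        words = file.read_text(encoding="utf-8").split()
        lexicons[parse_lexicon_name(file)] = words
    return lexicons

# Hypothetical file name, used only to show the expected shape.
example = parse_lexicon_name("dialect_lexicon/EGY_joy.txt")
```

Splitting on the last underscore is a design choice: it tolerates underscores in the dialect label but would misread an emotion label that contains one.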

To set up the environment with conda, run conda env create --file env.yml, then activate it with conda activate emotion_research

Windows users who have trouble installing fasttext can download prebuilt fasttext binaries from https://www.lfd.uci.edu/~gohlke/pythonlibs/#_fasttext and then run pip install [WHEEL_FILE]

Dataset	Source
SMADC	Areej Alshutayri and Eric Atwell. Classifying Arabic dialect text in the Social Media Arabic Dialect Corpus (SMADC). January 2021.
AOC-dialectal-annotations	Ryan Cotterell and Chris Callison-Burch. A multi-dialect, multigenre corpus of informal written Arabic. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pages 241–245, Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA).
annotated_data	Omar F. Zaidan and Chris Callison-Burch. The Arabic online commentary dataset: an annotated dataset of informal Arabic with high dialectal content. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 37–41, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.
Dart	Israa Alsarsour, Esraa Mohamed, Reem Suwaileh, and Tamer Elsayed. DART: A large dataset of dialectal Arabic tweets. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 2018. European Language Resources Association (ELRA).
extra_data	Collected by the authors of this repository

This research is an extension of this research