A bilingual goal-oriented agent that can converse with human users in Spanish–English code-switching, and an accompanying dataset.
Ahn, E., Jimenez, C., Tsvetkov, Y., & Black, A. (2020). What Code-Switching Strategies are Effective in Dialogue Systems? Proceedings of the Society for Computation in Linguistics, 3(1), 308-318.
Email Emily at eahn [at] uw.edu with questions.
The data/ folder contains 10 folders (each corresponding to 1 batch released on crowdsourcing platforms).
Each folder contains the following 6 files:
- chat.json: pure chats
- surv.json: qualitative survey (key to questions given in file surv_questions_list.txt)
- qual.tsv: qualitative survey in tsv form
- lid.tsv: hand-annotated Language ID of each token from the user.
  - {0 = SP, 1 = EN, 2 = neither}
  - NOTE: all tokens in these LID files have been lower-cased.
- .html: visualization of chats and surveys
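
The label codes listed for lid.tsv can be turned into simple code-switching statistics. The following is a minimal sketch, not part of processing_tools.py: the function name `summarize_lid` is hypothetical, the example tokens are made up for illustration, and the input is assumed to be already-parsed (token, label) pairs, since the column layout of lid.tsv is not described here.

```python
from collections import Counter

# Label codes as documented for lid.tsv.
LID_LABELS = {0: "SP", 1: "EN", 2: "neither"}


def summarize_lid(tagged_tokens):
    """Count tokens per language and count SP<->EN switch points.

    tagged_tokens: iterable of (token, label) pairs, label in {0, 1, 2}.
    Tokens labelled 2 ("neither") are counted but skipped when
    detecting switches, so punctuation does not create a switch.
    """
    counts = Counter()
    switches = 0
    prev_lang = None
    for _token, label in tagged_tokens:
        lang = LID_LABELS[int(label)]
        counts[lang] += 1
        if lang == "neither":
            continue
        if prev_lang is not None and lang != prev_lang:
            switches += 1
        prev_lang = lang
    return dict(counts), switches


# Made-up utterance (lower-cased, like the tokens in the LID files).
example = [("tengo", 0), ("un", 0), ("friend", 1), ("que", 0),
           ("estudia", 0), ("!", 2), ("math", 1)]
counts, switches = summarize_lid(example)
print(counts, switches)
```

Skipping "neither" tokens when counting switches is a design choice of this sketch; counting them as boundaries would give different numbers.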
To begin processing and exploring the data with Python, use the methods defined in processing_tools.py, mainly load_all_data("data/files_list_com.txt").
The Python file was written with Python 2.7 but should be compatible with Python 3.
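
As a starting point, the call mentioned above might be wrapped as follows. This is a sketch under assumptions: `try_load_all_data` is a hypothetical helper (not part of the repository), the script is assumed to run from the repository root so that processing_tools.py is importable, and the structure of the returned object is not documented here, so inspect it before relying on it.

```python
import os


def try_load_all_data(list_path="data/files_list_com.txt"):
    """Call processing_tools.load_all_data if the module and file exist.

    Returns None when processing_tools.py cannot be imported or when
    list_path is missing, instead of raising.
    """
    try:
        # Module shipped with this repository; not a pip package.
        from processing_tools import load_all_data
    except ImportError:
        return None
    if not os.path.exists(list_path):
        return None
    return load_all_data(list_path)


data = try_load_all_data()
if data is None:
    print("Run this from the repository root so the data can be found.")
else:
    # The return type is not specified here; look at it first.
    print(type(data))
```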
To visualize all data according to Agent Strategy, see the HTML files in viz_batches/. The chats duplicate those in data/*/*.html, but are organized differently.
- strategy_map.txt maps each Agent strategy ("style") from the paper to these files.
  - We originally had a different naming convention for these strategies.
  - For example, Insertional EN > SP (Spanish as matrix language, with insertions of English, the embedded language) was "SP lex", under the broader strategy of "Content" code-switching.
- data/bot_lid_tags.tsv and data/files_list_bot.txt are provided if you want to account for the words generated by the agent. The second file can be loaded with the processing tools.
Emily's Spanish-English fork of the original English-only MutualFriends task is still under construction. Please reach out if you plan to use it! We have collaborators working on a Hindi-English extension of this system!