/code-switching-papers

A curated list of research papers and resources on code-switching

Apache License 2.0

Code-switching Research Resources

This is a list of tutorials, workshops, papers, and resources on computational linguistic approaches to code-switching research. The list will be updated over time. You are welcome to send a pull request to update the list and become one of the contributors!

📌 I plan to collect theses and books on code-switching and list them here. If you have one, don't hesitate to contact me or send a pull request!

🚀 Highlights

  • If you are new to code-switching or looking for a new research direction, we have written a comprehensive survey paper on code-switching: The Decades Progress on Code-Switching Research in NLP: A Systematic Survey on Trends and Challenges [Paper]. Feel free to read it and let us know if you have any suggestions! Thanks to Alham Fikri Aji, Zheng-Xin Yong, and Thamar Solorio for making this possible 😊
  • We are organizing the code-switching workshop at EMNLP 2023! [Website]
  • We (I, Marina Zhukova, and Sudipta Kar) organized a birds-of-a-feather session at EMNLP 2022 in Abu Dhabi. Around 30 people joined (in person and online). Thanks for coming!
  • 📔 There was a comprehensive tutorial about code-mixing by Microsoft Research (Monojit Choudhury, Kalika Bali, Anirudh Srinivasan, and Sandipan Dandapat) at EMNLP 2019. You can check it at the following link.

🏫 Workshops

This is the list of the code-switching workshop series:

  • First Workshop on Computational Approaches to Code-switching, EMNLP 2014 [Website]
  • Second Workshop on Computational Approaches to Code-switching, EMNLP 2016
  • Third Workshop on Computational Approaches to Linguistic Code-switching, ACL 2018 [Website]
  • Fourth Workshop on Computational Approaches to Linguistic Code-switching, LREC 2020 [Website]
  • First Workshop on Speech Technologies for Code-switching in Multilingual Communities, Interspeech 2020 [Website]
  • Fifth Workshop on Computational Approaches to Linguistic Code-switching, NAACL 2021 [Website]
  • Sixth Workshop on Computational Approaches to Linguistic Code-switching, EMNLP 2023 [Website]

📑 Research Papers

Survey Paper

  • Winata, et al. (2023) The Decades Progress on Code-Switching Research in NLP: A Systematic Survey on Trends and Challenges. ACL Findings [Paper]
  • Doğruöz, et al. (2021) A Survey of Code-switching: Linguistic and Social Perspectives for Language Technologies. ACL [Paper]
  • Jose, et al. (2020) A Survey of Current Datasets for Code-Switching Research. International Conference on Advanced Computing and Communication Systems (ICACCS) [Paper]
  • Sitaram, et al. (2019) A Survey of Code-switched Speech and Language Processing. Arxiv [Paper]

Large Language Models

  • Yong, et al. (2023) Prompting Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages. Arxiv [Paper]

Language Identification and POS Tagging

  • Ostapenko, et al. (2022) Speaker Information Can Guide Models to Better Inductive Biases: A Case Study On Predicting Code-Switching. ACL [Paper]
  • Nguyen, et al. (2021) Automatic Language Identification in Code-Switched Hindi-English Social Media Text. Journal of Open Humanities Data [Paper]
  • Tarunesh, et al. (2021) From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text. ACL [Paper]
  • Gustavo Aguilar and Thamar Solorio. (2020) From English to Code-Switching: Transfer Learning with Strong Morphological Clues. ACL [Paper] [Code]
  • Mager, et al. (2019) Subword-Level Language Identification for Intra-Word Code-Switching. NAACL [Paper]
  • Zhang, et al. (2018) A Fast, Compact, Accurate Model for Language Identification of Codemixed Text. EMNLP [Paper]
  • Kelsey Ball and Dan Garrette. (2018) Part-of-Speech Tagging for Code-Switched, Transliterated Texts without Explicit Language Identification. EMNLP [Paper]
  • Zeynep Yirmibesoglu and Gulsen Eryigit. (2018) Detecting Code-Switching between Turkish-English Language Pair. Workshop W-NUT, EMNLP [Paper]
  • Mave, et al. (2018) Language Identification and Analysis of Code-Switched Social Media Text. 3rd Workshop of Computational Approaches to Linguistic Code-switching, ACL [Paper]
  • Victor Soto and Julia Hirschberg. (2018) Joint Part-of-Speech and Language ID Tagging for Code-Switched Data. 3rd Workshop of Computational Approaches to Linguistic Code-switching, ACL [Paper]
  • Bullock, et al. (2018) Predicting the presence of a Matrix Language in code-switching. 3rd Workshop of Computational Approaches to Linguistic Code-switching, ACL [Paper]
  • Soto, et al. (2018) The Role of Cognate Words, POS Tags, and Entrainment in Code-Switching. Interspeech [Paper]
  • Barman, et al. (2016) Part-of-speech Tagging of Code-mixed Social Media Content: Pipeline, Stacking and Joint Modelling. 2nd Workshop on Computational Approaches to Code-Switching, EMNLP [Paper]
  • Vyas, et al. (2014) POS Tagging of English-Hindi Code-Mixed Social Media Content. EMNLP [Paper]
  • Heba Elfardy and Mona Diab. (2012) Token Level Identification of Linguistic Code Switching. COLING [Paper]
  • Thamar Solorio and Yang Liu. (2008) Learning to Predict Code-Switching Points. EMNLP [Paper]
  • Dau-Cheng Lyu and Ren-Yuan Lyu. (2008) Language Identification on Code-Switching Utterances Using Multiple Cues. Interspeech [Paper]

Corpus

  • Whitehouse, et al. (2022) EntityCS: Improving Zero-Shot Cross-lingual Transfer with Entity-Centric Code Switching. EMNLP [Paper] [Code]
  • Lovenia, et al. (2022) ASCEND: A Spontaneous Chinese-English Dataset for Code-switching in Multi-turn Conversation. LREC [Paper] [Dataset]
  • Nguyen, et al. (2020) CanVEC-the Canberra Vietnamese-English Code-switching Natural Speech Corpus. LREC [Paper]
  • Umapathy, et al. (2020) Investigating Modelling Techniques for Natural Language Inference on Code-Switched Dialogues in Bollywood Movies. First Workshop on Speech Technologies for Code-switching in Multilingual Communities, Interspeech 2020 [Dataset]
  • Xiang, et al. (2020) Sina Mandarin Alphabetical Words: A Web-driven Code-mixing Lexical Resource. AACL-IJCNLP [TBC]
  • Chakravarthi, et al. (2020) Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text. SLTU (Spoken Language Technologies for Under-resourced Languages) and CCURL (Collaboration and Computing for Under-Resourced Languages) Workshop, LREC [Paper]
  • Khanuja, et al. (2020) A New Dataset for Natural Language Inference from Code-mixed Conversations. 4th Workshop of Computational Approaches to Linguistic Code-switching, LREC [Paper]
  • Barik, et al. (2019) Normalization of Indonesian-English Code-Mixed Twitter Data. W-NUT, EMNLP [Paper] [Dataset]
  • Singh, et al. (2018) A Twitter Corpus for Hindi-English Code Mixed POS Tagging. Sixth International Workshop on Natural Language Processing for Social Media, ACL [Paper]
  • Li, et al. (2012) A Mandarin-English Code-Switching Corpus. LREC [Paper]
  • Lyu, et al. (2010) SEAME: A Mandarin-English Code-Switching Speech Corpus in South-East Asia. Interspeech [Paper]
  • Lyu, et al. (2010) An Analysis of a Mandarin-English Code-switching Speech Corpus: SEAME. Age [Paper]

Language Modeling and Speech Recognition

  • Kumar, et al. (2020) Machine Learning based Language Modelling of Code Switched Data. International Conference on Electronics and Sustainable Communication Systems (ICESC) [Paper]
  • Madhumani, et al. (2020) Learning not to Discriminate: Task Agnostic Learning for Improving Monolingual and Code-switched Speech Recognition. Arxiv [Paper]
  • Shah, et al. (2020) Learning to Recognize Code-switched Speech Without Forgetting Monolingual Speech Recognition. Arxiv [Paper]
  • Winata, et al. (2020) Meta-Transfer Learning for Code-Switched Speech Recognition. ACL [Paper] [Code]
  • Chandu, et al. (2020) Style Variation as a Vantage Point for Code-Switching. Arxiv [Paper]
  • Ganji Sreeram and Rohit Sinha (2020) Exploration of End-to-End Framework for Code-Switching Speech Recognition Task: Challenges and Enhancements. IEEE Access [Paper]
  • Winata, et al. (2019) Code-Switched Language Models Using Neural Based Synthetic Data from Parallel Sentences. CoNLL [Paper]
  • Hila Gonen and Yoav Goldberg (2019) Language Modeling for Code-Switching: Evaluation, Integration of Monolingual Data, and Discriminative Training. EMNLP [Paper]
  • Lee, et al. (2019) Linguistically Motivated Parallel Data Augmentation for Code-switch Language Modeling. Interspeech [Paper]
  • Victor Soto and Julia Hirschberg (2019) Improving Code-Switched Language Modeling Performance Using Cognate Features. Interspeech [Paper]
  • Chang, et al. (2019) Code-switching Sentence Generation by Generative Adversarial Networks and its Application to Data Augmentation. Interspeech [Paper]
  • Zeng, et al. (2019) On the End-to-End Solution to Mandarin-English Code-switching Speech Recognition. Interspeech [Paper]
  • Taneja, et al. (2019) Exploiting Monolingual Speech Corpora for Code-mixed Speech Recognition. Interspeech [Paper]
  • Shan, et al. (2019) Investigating End-to-end Speech Recognition for Mandarin-english Code-switching. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) [Paper]
  • Grandee Lee and Haizhou Li. (2019) Word and Class Common Space Embedding for Code-switch Language Modelling. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) [Paper]
  • Hamed, et al. (2019) Code-Switching Language Modeling with Bilingual Word Embeddings: A Case Study for Egyptian Arabic-English. International Conference on Speech and Computer [Paper]
  • Winata, et al. (2018) Learn to Code-Switch: Data Augmentation using Copy Mechanism on Language Modeling. Arxiv [Paper]
  • Winata, et al. (2018) Towards End-to-end Automatic Code-Switching Speech Recognition. Arxiv [Paper]
  • Nakayama, et al. (2018) Speech Chain for Semi-Supervised Learning of Japanese-English Code-Switching ASR and TTS. IEEE Spoken Language Technology Workshop (SLT) [Paper]
  • Jesse Emond, Bhuwana Ramabhadran, Brian Roark, Pedro Moreno, and Min Ma. (2018) Transliteration Based Approaches to Improve Code-Switched Speech Recognition Performance. IEEE Spoken Language Technology Workshop (SLT) [Paper]
  • Ganji Sreeram and Rohit Sinha. (2018) Exploiting Parts-of-Speech for Improved Textual Modeling of Code-Switching Data. 2018 Twenty Fourth National Conference on Communications (NCC) [Paper]
  • Garg, et al. (2018) Code-switched Language Models Using Dual RNNs and Same-Source Pretraining. EMNLP [Paper]
  • Ewald van der Westhuizen and Thomas R. Niesler. (2018) Synthesised bigrams using word embeddings for code-switched ASR of four South African language pairs. Computer Speech and Language [Paper]
  • Biswal, et al. (2018) Multilingual Neural Network Acoustic Modelling for ASR of Under-Resourced English-isiZulu Code-Switched Speech. Interspeech [Paper]
  • Winata, et al. (2018) Code-Switching Language Modeling using Syntax-Aware Multi-Task Learning. 3rd Workshop of Computational Approaches to Linguistic Code-switching, ACL [Paper] [Code]
  • Chandu, et al. (2018) Language Informed Modeling of Code-Switched Text. 3rd Workshop of Computational Approaches to Linguistic Code-switching, ACL [Paper]
  • Pratapa, et al. (2018) Language Modeling for Code-Mixing: The Role of Linguistic Theory based Synthetic Data. ACL [Paper]
  • Sivasankaran, et al. (2018) Phone Merging For Code-Switched Speech Recognition. 3rd Workshop of Computational Approaches to Linguistic Code-switching, ACL [Paper]
  • Garg, et al. (2018) Dual Language Models for Code Switched Speech Recognition. Interspeech [Paper]
  • Baheti, et al. (2017) Curriculum Design for Code-switching: Experiments with Language Identification and Language Modeling with Deep Neural Networks. ICON [Paper]
  • Adel, et al. (2015) Syntactic and Semantic Features For Code-Switching Factored Language Models. IEEE Transactions on Audio, Speech, and Language Processing [Paper]
  • Ying Li and Pascale Fung. (2014) Code switch language modeling with Functional Head Constraint. ICASSP [Paper]
  • Ying Li and Pascale Fung. (2014) Language Modeling with Functional Head Constraint for Code Switching Speech Recognition. EMNLP [Paper]
  • Adel, et al. (2013) Combination of Recurrent Neural Networks and Factored Language Models for Code-Switching Language Modeling. ACL [Paper]
  • Adel, et al. (2013) Recurrent neural network language modeling for code switching conversational speech. ICASSP [Paper]
  • Vu, et al. (2012) A First Speech Recognition System for Mandarin-English Code-Switch Conversational Speech. ICASSP [Paper]
  • Ying Li and Pascale Fung. (2012) Code-switch Language Model with Inversion Constraints for Mixed Language Speech Recognition. COLING [Paper]
  • Li, et al. (2011) Asymmetric acoustic modeling of mixed language speech. ICASSP [Paper]

Discourse

  • Sravani, et al. (2021) Political Discourse Analysis: A Case Study of Code Mixing and Code Switching in Political Speeches. Proceedings of the 5th Workshop on Computational Approaches to Code Switching (CALCS), NAACL [Paper]

Generation

  • Gupta, et al. (2020) A Semi-supervised Approach to Generate the Code-Mixed Text using Pre-trained Encoder and Transfer Learning. Findings of EMNLP [Paper]
  • Bryan Gregorius and Takeshi Okadome (2022) Generating Code-Switched Text from Monolingual Text with Dependency Tree. The 20th Annual Workshop of the Australasian Language Technology Association [Paper] [Code]

Speech Synthesis

  • Sai Krishna Rallabandi and Alan W Black (2019) Variational Attention using Articulatory Priors for generating Code Mixed Speech using Monolingual Corpora. Interspeech [Paper]
  • Sai Krishna Rallabandi and Alan W Black (2017) On Building Mixed Lingual Speech Synthesis Systems. Interspeech [Paper]
  • Chandu, et al. (2017) Speech Synthesis for Mixed-Language Navigation Instructions. Interspeech [Paper]

Metric

  • Guzman, et al. (2017) Metrics for modeling code-switching across corpora. Interspeech [Paper]

Representation Learning

  • Prasad, et al. (2021) The Effectiveness of Intermediate-Task Training for Code-Switched Natural Language Understanding. Proceedings of the 1st Workshop on Multilingual Representation Learning, EMNLP [Paper]
  • Winata, et al. (2021) Are Multilingual Models Effective in Code-Switching? Proceedings of the 5th Workshop on Computational Approaches to Code Switching (CALCS), NAACL [Paper]
  • Rizal, et al. (2020) Evaluating Word Embeddings for Indonesian–English Code-Mixed Text Based on Synthetic Data. Proceedings of the 4th Workshop on Computational Approaches to Code Switching (CALCS), LREC [Paper]
  • Winata, et al. (2019) Hierarchical Meta-Embeddings for Code-Switching Named Entity Recognition. EMNLP [Paper] [Code]
  • Pratapa, et al. (2018) Word Embeddings for Code-Mixed Language Processing. EMNLP [Paper]

Machine Translation

  • Gaser, et al. (2023) Exploring Segmentation Approaches for Neural Machine Translation of Code-Switched Egyptian Arabic-English Text. EACL [Paper]
  • Vivek Srivastava and Mayank Singh (2020) PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation. W-NUT, EMNLP [Paper] [Dataset]
  • Thoudam Doren Singh and Thamar Solorio. (2017) Towards Translating Mixed-Code Comments from Social Media. CICLing [Paper]

NLU

  • Krishnan, et al. (2021) Multilingual Code-Switching for Zero-Shot Cross-Lingual Intent Prediction and Slot Filling. MRL, EMNLP [Paper]

Named Entity Recognition

  • Priyadharshini, et al. (2020) Named Entity Recognition for Code-Mixed Indian Corpus using Meta Embedding. 6th International Conference on Advanced Computing and Communication Systems (ICACCS) [Paper]
  • Winata, et al. (2019) Learning Multilingual Meta-Embeddings for Code-Switching Named Entity Recognition. RepL4NLP, ACL [Paper] [Code]
  • Aguilar, et al. (2018) Named Entity Recognition on Code-Switched Data: Overview of the CALCS 2018 Shared Task. 3rd Workshop of Computational Approaches to Linguistic Code-switching, ACL [Paper]
  • Wang, et al. (2018) Code-Switched Named Entity Recognition with Embedding Attention. 3rd Workshop of Computational Approaches to Linguistic Code-switching, ACL [Paper]
  • Winata, et al. (2018) Bilingual Character Representation for Efficiently Addressing Out-of-Vocabulary Words in Code-Switching Named Entity Recognition. 3rd Workshop of Computational Approaches to Linguistic Code-switching, ACL [Paper]
  • Aguilar, et al. (2017) A Multi-task Approach for Named Entity Recognition in Social Media Data. 3rd Workshop on Noisy User-generated Text, EMNLP [Paper]

Linguistics

  • Li Nguyen. (2018) Borrowing or Code-switching? Traces of community norms in Vietnamese-English speech. Australian Journal of Linguistics 38.4 (2018): 443-466. [Paper]
  • Fairchild, Sarah, and Janet G. Van Hell. (2017) Determiner-noun code-switching in Spanish heritage speakers. Bilingualism: Language and Cognition 20.1 (2017): 150-161. [Paper]
  • Bhatt, Rakesh M., and Agnes Bolonyai. (2011) Code-switching and the optimal grammar of bilingual language use. Bilingualism: Language and Cognition 14.4 (2011): 522-546. [Paper]
  • Lipski (2005) Code-switching or Borrowing? No sé so no puedo decir, you know. Second Workshop on Spanish Sociolinguistics [Paper]
  • Roberto R. Heredia and Jeanette Altarriba (2001) Bilingual Language Mixing: Why Do Bilinguals Code-Switch? SAGE Publications [Paper]
  • Belazi, et al. (1994) Code switching and X-bar theory: The functional head constraint. Linguistic Inquiry 25(2) [Paper]
  • Shana Poplack (1980) Sometimes I'll start a sentence in Spanish y termino en español: toward a typology of code-switching. Linguistics 18(7-8) [Paper]
  • Pfaff, Carol W. (1979) Constraints on language mixing: intrasentential code-switching and borrowing in Spanish/English. Language: 291-318. [Paper]
  • Shana Poplack (1978) Syntactic structure and social function of code-switching. Vol. 2. Centro de Estudios Puertorriqueños, City University of New York [Paper]
  • Gumperz, J. J., & Hernandez, E. (1969) Cognitive aspects of bilingual communication. Institute of International Studies, University of California [Paper]

Affective Computing

  • Chakravarthi, et al. (2021) DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text. Arxiv [Paper] [Code and Dataset]
  • Siddharth Yadav (2020) Unsupervised Sentiment Analysis for Code-mixed Data. Arxiv [Paper] [Code]
  • Wang, et al. (2017) Emotion Analysis in Code-Switching Text With Joint Factor Graph Model. IEEE/ACM Transactions on Audio, Speech, and Language Processing [Paper]
  • Wang, et al. (2016) A Bilingual Attention Network for Code-switched Emotion Prediction. COLING [Paper]
  • Sophia Lee and Zhongqing Wang (2015) Emotion in Code-switching Texts: Corpus Construction and Analysis. Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing [Paper]
  • Wang, et al. (2015) Emotion Detection in Code-switching Texts via Bilingual and Sentimental Information. ACL [Paper]

Dialog and Conversational System

  • Gupta, et al. (2018) Uncovering Code-Mixed Challenges: A Framework for Linguistically Driven Question Generation and Neural based Question Answering. CoNLL [Paper]

Syntax

  • Kodali, et al. (2022) SyMCoM - Syntactic Measure of Code Mixing: A Study of English-Hindi Code-Mixing. Findings of ACL [Paper]
  • Özlem Çetinoglu and Çagrı Çöltekin (2019) Challenges of Annotating a Code-Switching Treebank. SyntaxFest [Paper]

Adversarial Attack

  • Samson Tan and Shafiq Joty (2021) Code-Mixing on Sesame Street: Dawn of the Adversarial Polyglots. NAACL [Paper]

Social Linguistics

  • Bolock, et al. (2020) Who, When and Why: The 3 Ws of Code-Switching. International Conference on Practical Applications of Agents and Multi-Agent Systems [Paper]
  • Yoder, et al. (2017) Code-Switching as a Social Act: The Case of Arabic Wikipedia Talk Pages. Proceedings of the Second Workshop on Natural Language Processing and Computational Social Science, ACL [Paper]
  • Agarwal, et al. (2017) I may talk in English but gaali toh Hindi mein hi denge: A study of English-Hindi code-switching and swearing pattern on social networks. International Conference on Communication Systems and Networks (COMSNETS) [Paper]

Benchmark

  • Khanuja, et al. (2020) GLUECoS: An Evaluation Benchmark for Code-Switched NLP. ACL [Paper]
  • Aguilar, et al. (2020) LinCE: A Centralized Benchmark for Linguistic Code-switching Evaluation. LREC [Paper]

Social Media

  • Bali, et al. (2014) “I am borrowing ya mixing ?” An Analysis of English-Hindi Code Mixing in Facebook. Proceedings of The First Workshop on Computational Approaches to Code Switching [Paper]

Text Normalization

  • Dwija Parikh and Thamar Solorio (2021) Normalization and Back-Transliteration for Code-Switched Data. Proceedings of the 5th Workshop on Computational Approaches to Code Switching (CALCS), NAACL [Paper]

Toolkit

Synthetic Data Generation Toolkit

  • Jayanthi, et al. (2021) CodemixedNLP: An Extensible and Open NLP Toolkit for Code-Mixing. Proceedings of the 5th Workshop on Computational Approaches to Code Switching (CALCS), NAACL [Paper] [Code]
  • Rizvi, et al. (2021) GCM: A Toolkit for Generating Synthetic Code-mixed Text. EACL (System Demonstrations) [Paper] [Code]

Annotation Toolkit

  • Shah, et al. (2019) CoSSAT: Code-Switched Speech Annotation Tool. Proceedings of the First Workshop on Aggregating and Analysing Crowdsourced Annotations for NLP [Paper]

Summarization

  • Mehnaz, et al. (2021) GupShup: Summarizing Open-Domain Code-Switched Conversations. EMNLP

Question Answering

  • Gupta, et al. (2020) A Unified Framework for Multilingual and Code-Mixed Visual Question Answering. AACL-IJCNLP [TBA]

Dialog and Conversational System

  • Bawa, et al. (2020) Do Multilingual Users Prefer Chat-bots that Code-mix? Let's Nudge and Find Out! ACM on Human-Computer Interaction [Paper]
  • Banerjee, et al. (2018) A Dataset for Building Code-Mixed Goal Oriented Conversation Systems. COLING [Paper]

Position Paper

  • Nguyen, et al. (2022) Building Educational Technologies for Code-Switching: Current Practices, Difficulties and Future Directions. Languages [Paper]

Books

  • Torres Cacoullos and Travis (2018) Bilingualism in the Community. Cambridge University Press

Theses

  • Genta Indra Winata (2021) Multilingual Transfer Learning for Code-Switched Language and Speech Neural Modeling. [Thesis]
  • Gustavo Aguilar (2020) Neural Sequence Labeling on Social Media Text. [Thesis]
  • Victor Soto Martinez (2020) Identifying and Modeling Code-Switched Language. [Thesis]