
Converting Monolingual Text into Code-Switched Text with Dependency Tree


CSify

Generate code-switched texts from monolingual texts.

If you arrived here through a DOI citation in a paper, this may be a snapshot of the repository taken at the time of writing. The latest release can be found below.

GitHub version PyPI version DOI

This repository is an implementation of our paper "Generating Code-Switched Text from Monolingual Text with Dependency Tree," accepted for publication at ALTA 2022.

The demo for this code is available here.

In this documentation, the notation [X]-[Y] denotes a code-switched sentence with X as the base language and Y as the inserted language. Language names follow ISO 639-1 codes. For example, JA-KO means Japanese-Korean code-switched text generated from monolingual Japanese text.
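The [X]-[Y] notation can be handled programmatically when organizing generated data. The following is a minimal sketch; `parse_pair` is a hypothetical helper written for this illustration and is not part of the csify package.

```python
from typing import Tuple


def parse_pair(label: str) -> Tuple[str, str]:
    """Split a label such as "JA-KO" into (base, inserted) language codes.

    Hypothetical helper for illustration; csify itself does not provide it.
    """
    parts = label.split("-")
    if len(parts) != 2 or not all(len(p) == 2 and p.isalpha() for p in parts):
        raise ValueError(f"expected a label like 'JA-KO', got {label!r}")
    base, inserted = parts
    # ISO 639-1 codes are conventionally written in lowercase.
    return base.lower(), inserted.lower()


base, inserted = parse_pair("JA-KO")
print(base, inserted)  # ja ko
```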

Setup

This package is available on PyPI and can be installed with pip.

pip install csify

This package installs spaCy but does not include a machine translator.

The CSify Class

The CSify class generates code-switched text from a monolingual base sentence by translating parts of it into the inserted language via the translate function you supply. You need to bring your own machine translator. The following example generates EN-JA code-switched sentences using the DeepL API.

from csify import CSify
import deepl

# Initialize DeepL machine translator
translator = deepl.Translator("<deepl_apikey>")

EN_TO_ENJA = {
  "spacy_model": "en_core_web_sm",
  "translate_func": lambda base_sentence:
  translator.translate_text(base_sentence, target_lang="JA").text.strip("。"),
  "space": ' '
}

code_switcher = CSify(**EN_TO_ENJA)
print(code_switcher.generate("your last report was more than two weeks ago."))
print(code_switcher.generate("our lives are not our own, from womb to tomb, we're bound to others."))

Output:

your last report was 二週間以上前 .
私たちの人生は、自分だけのものではないのです、胎内から墓場まで , we 're bound to others . 

Upon initialization, the CSify class takes three arguments:

  • spacy_model: The spaCy trained pipeline for the base sentence's language (e.g. "en_core_web_sm" for English). See spaCy's list of available pipelines. Note that the pipeline MUST support dependency parsing. There is no need to download the spaCy pipeline beforehand; the CSify class will do it for you.
  • translate_func: A str -> str function. It takes text in the base sentence's language as input and returns its translation in the inserted language. Wrap the machine translator's translate function in a new function. It is recommended to strip punctuation of the inserted language in this function, because most translations are performed on subsentences, not complete sentences.
  • space: default=' '. Word separator of the base language. Some languages, such as Chinese and Japanese, don't use spaces. In that case, space should be an empty string.
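The recommendation to strip punctuation inside translate_func can be sketched as a small wrapper. This is only an illustration under stated assumptions: `make_translate_func`, `EDGE_PUNCTUATION`, and `fake_en_to_ja` are hypothetical names, and the stand-in translator replaces a real machine translation call so the snippet runs offline.

```python
from typing import Callable

# Punctuation trimmed from both ends of a translated fragment.
# This set is an assumption; extend it for your inserted language.
EDGE_PUNCTUATION = "。、.,!?！？"


def make_translate_func(
    translate: Callable[[str], str],
    strip_chars: str = EDGE_PUNCTUATION,
) -> Callable[[str], str]:
    """Wrap a raw translator into the str -> str shape CSify expects."""

    def translate_func(base_text: str) -> str:
        # Remove surrounding whitespace first, then edge punctuation.
        return translate(base_text).strip().strip(strip_chars)

    return translate_func


def fake_en_to_ja(text: str) -> str:
    """Stand-in translator for illustration; not a real MT system."""
    lookup = {"more than two weeks ago": "二週間以上前。"}
    return lookup.get(text, text)


translate_func = make_translate_func(fake_en_to_ja)
print(translate_func("more than two weeks ago"))  # 二週間以上前
```

The resulting function can be passed as the translate_func argument. Note that `str.strip` removes the listed characters from both ends of the fragment, not only the end.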

If you are using the DeepL or Google Cloud Translation API, pre-built argument sets for the CSify class are available at demo/deepl_args.py and demo/google_translate_args.py, respectively. For example, to generate EN-ZH with DeepL, the CSify arguments look like this:

EN_TO_ENZH = {
  "spacy_model": "en_core_web_sm",
  "translate_func": lambda base_sentence:
  translator.translate_text(base_sentence, target_lang="ZH").text,
  "space": ' '
}
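When several target languages are needed, the argument dictionaries can be produced by a small factory instead of being written out by hand. The sketch below is an assumption, not the contents of demo/deepl_args.py: `make_deepl_args` and `StubTranslator` are hypothetical, and the stub only mimics the `translate_text(...).text` shape used in the examples above so the code runs without an API key.

```python
from types import SimpleNamespace


def make_deepl_args(translator, target_lang, spacy_model="en_core_web_sm",
                    space=" ", strip_chars="。"):
    """Build a CSify argument dict for one target language (hypothetical helper)."""

    def translate_func(base_sentence):
        # A closure keeps target_lang fixed per dict, which avoids the
        # late-binding pitfall of lambdas created inside a loop.
        result = translator.translate_text(base_sentence, target_lang=target_lang)
        return result.text.strip(strip_chars)

    return {"spacy_model": spacy_model, "translate_func": translate_func, "space": space}


class StubTranslator:
    """Offline stand-in that mimics the translate_text(...).text call shape."""

    def translate_text(self, text, target_lang):
        return SimpleNamespace(text=f"[{target_lang}] {text}。")


all_args = {lang: make_deepl_args(StubTranslator(), lang) for lang in ("JA", "ZH")}
print(all_args["ZH"]["translate_func"]("hello"))  # [ZH] hello
```

With a real translator object in place of the stub, each dictionary could be unpacked into the class as in the examples above, e.g. `CSify(**all_args["ZH"])`.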

Adding More Language Pairs

Adding a language pair amounts to adding one more argument combination for the CSify class. Note that base sentences can only be in languages for which spaCy provides a trained pipeline with a dependency parser. You can also bring your own machine translator. The following template uses a custom machine translator to create DE-SV code-switched sentences.

from csify import CSify
from my_awesome_translator import german_to_swedish_translator

my_translator = german_to_swedish_translator()
my_code_switcher_args = {
  "spacy_model": "de_core_news_sm",
  "translate_func": lambda base_sentence:
  my_translator.my_translate_function(base_sentence),
  "space": ' '
}
code_switcher = CSify(**my_code_switcher_args)
print(code_switcher.generate("Mein Name ist Sam, obwohl er kurz für Samantha ist."))
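If your custom translator is slow or metered, one option is to memoize it before handing it to CSify. The sketch below assumes that identical text fragments may be requested more than once (for example, repeated sentences in a corpus); whether that happens depends on your data. `cached` and `slow_translate` are hypothetical names used only for this illustration.

```python
from functools import lru_cache


def cached(translate, maxsize=4096):
    """Return a memoized version of a str -> str translate function."""
    return lru_cache(maxsize=maxsize)(translate)


calls = []


def slow_translate(text):
    """Stand-in for an expensive translator call; records each real call."""
    calls.append(text)
    return text.upper()


translate_func = cached(slow_translate)
translate_func("guten tag")
translate_func("guten tag")  # served from the cache, no second call
print(len(calls))  # 1
```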

Setup - Demo

⚠️ WARNING
The JESC demo translates around 100,000 characters. Pay attention to your API character limit!
  • Clone this repository
git clone https://github.com/Selubi/CSify.git
  • Install library dependencies
pip install -r requirements.txt

Set up either the DeepL API or Google Cloud Translation AI, or both, as machine translators. Alternatively, you can bring your own machine translator. Refer to The CSify Class and Adding More Language Pairs for more details.

  • For DeepL, insert your API key in demo/constants.py.
deepl_apikey = "<insert deepl API key here>"
  • For Google Cloud Translation AI, follow this setup guide until "Create a service account key." You should get a JSON file. Save the JSON file and insert the path to it in demo/constants.py.
path_to_google_cloud_JSON_key = "<insert path to google cloud JSON key here>"
⚠️ WARNING
It is recommended to mark demo/constants.py as assume-unchanged in git to prevent API key leakage.
git update-index --assume-unchanged demo/constants.py
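Another way to keep keys out of version control is to read them from the environment. This is a sketch of an alternative, not how the demo is written: the variable name `DEEPL_API_KEY` and the helper `load_api_key` are assumptions made for this example.

```python
import os


def load_api_key(env_var="DEEPL_API_KEY", fallback=""):
    """Read an API key from the environment (variable name is an assumption)."""
    key = os.environ.get(env_var, fallback)
    if not key:
        raise RuntimeError(f"set the {env_var} environment variable first")
    return key


# Example use inside a constants module:
# deepl_apikey = load_api_key()
```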

DeepL is easier to set up but supports fewer languages than Google Cloud Translation AI.

Demo: Generating EN-JA and JA-EN from JESC Corpus

Refer to the following snippet of demo/main.py.

    """
    The demo function below is defined in ./demo.py
    It downloads and extracts the JESC split corpus, a parallel Japanese-English corpus.
    Of the extraction results located at ./data/split, we will take the test data (./data/split/test) that contains
    2000 lines and generate code-switched data from it.
    The result will be in 2 files:
    English sentences and code-switched sentences generated from it will be stored in ./data/CSified/EN-Code-Switched
    Japanese sentences and code-switched sentences generated from it will be stored in ./data/CSified/JA-Code-Switched
    This demo also features a progress bar that tracks how many sentences it has generated and its speed in 
    it/s (sentences per second).
    """
demo.generate_jesc_cs()
⚠️ WARNING
This demo translates around 100,000 characters. Pay attention to your API character limit!
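To guard against exceeding a character quota, the translate function can be wrapped with a simple budget check. This is a minimal sketch under assumptions: `with_budget` and `CharacterBudgetExceeded` are hypothetical names, and `len(text)` counts Unicode code points, which may differ from how your provider bills characters.

```python
class CharacterBudgetExceeded(RuntimeError):
    """Raised when the next request would exceed the configured budget."""


def with_budget(translate, max_chars):
    """Wrap a str -> str translator and stop once max_chars have been sent."""
    used = 0

    def translate_func(text):
        nonlocal used
        if used + len(text) > max_chars:
            raise CharacterBudgetExceeded(
                f"budget of {max_chars} characters would be exceeded"
            )
        used += len(text)
        return translate(text)

    return translate_func


# Stand-in translator (reverses the text) so the example runs offline.
guarded = with_budget(lambda s: s[::-1], max_chars=10)
print(guarded("hello"))  # olleh
```

Failing fast like this stops a run before further requests are sent, at the cost of leaving the remaining sentences unprocessed.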