LALM

Code and resources for the ACL 2021 paper 'Multi-Lingual Question Generation with Language Agnostic Language Model'.

This repo contains the source code for our ACL-IJCNLP 2021 Findings paper, Multi-Lingual Question Generation with Language Agnostic Language Model.

Pre-processing

1. Download and process the wikidumps

First of all, download the Wikipedia dumps from https://dumps.wikimedia.org/. We use 10 languages for pre-training in this paper:

Language         Short name   Size
Chinese          zh           1.4G
English          en           14G
Korean           ko           679M
French           fr           4.4G
Hindi            hi           430M
Burmese          bu           208M
German           de           5.8G
Vietnamese       vi           979M
Japanese         ja           2.8G
Chinese Minnan   mi           124M

Note that the number of pre-training languages can be larger than the number of fine-tuning languages.
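
If it helps, here is a minimal, unofficial Python sketch for fetching the dumps. The dump date and the mapping from the table's short names to Wikipedia project codes (e.g. Burmese -> my, Chinese Minnan -> zh-min-nan) are our assumptions; old dump dates are eventually removed from the server, so pick one that is currently listed.

# Unofficial download sketch; not part of the repo.
import urllib.request

DUMP_DATE = '20200420'  # pick a dump date that is currently listed on the server
# Wikipedia project codes for the 10 pre-training languages in the table above.
projects = ['zh', 'en', 'ko', 'fr', 'hi', 'my', 'de', 'vi', 'ja', 'zh-min-nan']

for project in projects:
    wiki = project.replace('-', '_') + 'wiki'
    url = 'https://dumps.wikimedia.org/{0}/{1}/{0}-{1}-pages-articles.xml.bz2'.format(wiki, DUMP_DATE)
    print('downloading', url)
    urllib.request.urlretrieve(url, '{}-{}-pages-articles.xml.bz2'.format(wiki, DUMP_DATE))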

You should download the original wikidump file containing the full articles, such as jawiki-20200420-pages-articles.xml.bz2, and then use WikiExtractor to extract the paragraphs. For example:

python3 WikiExtractor.py  -b 100m -o raw/ jawiki-20200420-pages-articles.xml.bz2

This extracts the text of the Japanese wikidump into the raw/ directory.
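
To process all languages in one go, a small helper like the following can wrap the command above. The dump file names and the data/<language>/raw/ output layout (which step 2 expects) are our assumptions, not the repo's.

# Unofficial helper; loops WikiExtractor over the downloaded dumps.
import subprocess

dumps = {
    'japanese': 'jawiki-20200420-pages-articles.xml.bz2',
    'english': 'enwiki-20200420-pages-articles.xml.bz2',
    # add the remaining languages and dump files here
}

for language, dump_file in dumps.items():
    # Writes the extracted text under data/<language>/raw/, which step 2 reads.
    subprocess.run(['python3', 'WikiExtractor.py', '-b', '100m',
                    '-o', 'data/{}/raw/'.format(language), dump_file],
                   check=True)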

2. Merge the text into a single file

Then we merge the wiki text files produced by WikiExtractor into a single file, which is then used to train the sentencepiece tokenizer in step 3. The code is provided in preprocess/process_wiki_data.py:

from tqdm import tqdm

# get_dir_files, get_file_info and write_lst_to_file are small I/O helpers
# defined elsewhere in this repo.

def check_valid_wiki_passage(txt):
    # Skip WikiExtractor's <doc ...> / </doc> markup lines and very short lines.
    if txt.startswith('<doc id'):
        return False
    if txt.startswith('</doc>'):
        return False
    if len(txt) < 10:
        return False
    return True


def get_all_wiki_data(language='english'):
    # Collect every valid passage under data/<language>/raw and merge them
    # into a single data/<language>/wiki.all.txt file.
    raw_file_paths = get_dir_files('data/{}/raw'.format(language))
    data = []
    for one_file in tqdm(raw_file_paths):
        for line in get_file_info(one_file):
            if check_valid_wiki_passage(line):
                data.append(line.strip())
    output_filename = 'data/{}/wiki.all.txt'.format(language)
    print('dump {} wiki total size is {}'.format(language, len(data)))
    write_lst_to_file(data, output_filename)
    print('{} done!'.format(language))

Here we keep the text between the XML <doc> tags and filter out lines shorter than 10 characters.
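
For reference, the merge step can be driven once per language; the directory names below match the paths used later in run_pre_train.sh, but how you invoke the function is up to you.

if __name__ == '__main__':
    # Merge the raw wiki text for every pre-training language.
    for language in ['chinese', 'english', 'korean', 'french', 'hindi',
                     'burmese', 'german', 'vietnam', 'japanese', 'minnan']:
        get_all_wiki_data(language)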

3. Train the sentencepiece tokenizer

For each language, we use Google's sentencepiece library to learn the tokenizer automatically, using the unigram language model. The code is also in preprocess/process_wiki_data.py:

import sentencepiece as spm

def train_vocab(language='english', vocab_size=30000):
    # NOTE: model_prefix below is an absolute path from the authors' environment;
    # change it to a location on your own machine.
    sp_path = 'data/{}/wiki.all.txt'.format(language)
    content = '--input=' + sp_path + ' ' \
              '--model_prefix=/search/odin/bingning/data/LALM/language_slot/vocab.my_size --vocab_size=my_size ' \
              '--character_coverage=0.9999 ' \
              '--num_sub_iterations=2 ' \
              '--max_sentencepiece_length=36 ' \
              '--model_type=unigram --num_threads=40 --max_sentence_length=15000 ' \
              '--input_sentence_size=2000000 '

    # 'my_size' and 'language_slot' are placeholders substituted at call time.
    content = content.replace('my_size', str(vocab_size))
    content = content.replace('language_slot', language)
    spm.SentencePieceTrainer.Train(content)

where the language parameter is the respective language processed in the previous step. The default vocabulary size is 30,000.
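
As a usage sketch (our own, not from the repo): train the tokenizer for every language and sanity-check one of the resulting models. Remember to adjust the model_prefix path above first, and place the resulting *.model files where the fine-tuning scripts expect them (e.g. data/vocab/english.30000.model, a path we assume from the fine-tuning example below).

import sentencepiece as spm

for language in ['chinese', 'english', 'korean', 'french', 'hindi',
                 'burmese', 'german', 'vietnam', 'japanese', 'minnan']:
    train_vocab(language, vocab_size=30000)

# Quick sanity check on the English tokenizer (output path assumed).
sp = spm.SentencePieceProcessor()
sp.load('data/vocab/english.30000.model')
print(sp.EncodeAsPieces('This is a test sentence.'))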

Pre-training

After processing the original wiki dumps and obtaining the tokenizers, we pre-train LALM.

There are two types of models we can pre-train:

  • LALM_shared, where we do not discriminate between languages and the low-level module's output is fed directly to the high-level module.
  • LALM, where we add a language discriminator and adopt adversarial training to make the high-level module more language-agnostic (see the sketch after this list).
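
For intuition, below is a minimal, illustrative PyTorch sketch of the adversarial idea: a language discriminator is trained on the representations passed between the two modules, and a gradient reversal layer pushes those representations to be language-agnostic. The module names, shapes, and pooling here are our assumptions; the actual implementation lives in train/pre_train_lalm.py.

# Illustrative sketch only; not the repo's actual architecture.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, reversed (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class LanguageDiscriminator(nn.Module):
    """Predicts which of the n_languages a hidden state comes from."""
    def __init__(self, n_hidden=768, n_languages=10):
        super().__init__()
        self.classifier = nn.Linear(n_hidden, n_languages)

    def forward(self, low_level_output, lambd=1.0):
        # low_level_output: (batch, seq_len, n_hidden). Reversing gradients here
        # means minimizing the discriminator's cross-entropy loss pushes the
        # encoder toward language-agnostic representations.
        reversed_states = GradReverse.apply(low_level_output, lambd)
        return self.classifier(reversed_states.mean(dim=1))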

Our pre-training code is provided in train/pre_train_lalm.py. You can run it via:

sh run_pre_train.sh

where run_pre_train.sh is:

#!/bin/bash

python3 -m torch.distributed.launch --nproc_per_node=4 pre_train_lalm.py \
--batch_size=24 \
--max_learning_rate=1e-4 \
--max_length=512 \
--n_embedding=128 \
--n_hidden=768 \
--n_layer=12 \
--n_head=12 \
--reload=False \
--epoch=-1 \
--type=shared \
--zh_path='data/chinese/wiki.all.txt' \
--en_path='data/english/wiki.all.txt' \
--ko_path='data/korean/wiki.all.txt' \
--fr_path='data/french/wiki.all.txt' \
--hi_path='data/hindi/wiki.all.txt' \
--bu_path='data/burmese/wiki.all.txt' \
--de_path='data/german/wiki.all.txt' \
--vi_path='data/vietnam/wiki.all.txt' \
--ja_path='data/japanese/wiki.all.txt' \
--mi_path='data/minnan/wiki.all.txt'

where the reload parameter controls whether to reload a previously pre-trained model from checkpoints, and epoch is the checkpoint number to load.

The type parameter can be either shared or lalm.
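
For illustration, resuming from a checkpoint could look roughly like the following; the checkpoint_<epoch>.pt naming and the state-dict layout are our assumptions, not the repo's actual format.

import os
import torch

def maybe_reload(model, optimizer, reload_flag, epoch, ckpt_dir='checkpoints'):
    # Resume from a hypothetical checkpoint_<epoch>.pt file when --reload=True.
    if not reload_flag or epoch < 0:
        return 0  # train from scratch
    path = os.path.join(ckpt_dir, 'checkpoint_{}.pt'.format(epoch))
    state = torch.load(path, map_location='cpu')
    model.load_state_dict(state['model'])
    optimizer.load_state_dict(state['optimizer'])
    return epoch + 1  # continue from the next epoch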

We also provide the training curve from our environment, where we use 4 NVIDIA V100 16GB GPUs.

(figure: pre-training learning curve)

Note that the final average loss is around 3.5 for LALM_shared.

  • You can download our pre-trained models here: base and large.

Fine-tuning

1. Preprocess the question generation training data

First of all, we pre-process the question answering datasets: the data is transformed into a list of items, where each item is [context, document_features, questions] and each question is [question, question_features]. The processing scripts are in the preprocess directory; an example for English SQuAD is shown below.

import json
import sentencepiece as spm

# multi_process and dump_file are helper utilities defined elsewhere in this repo.

sp = spm.SentencePieceProcessor()
sp.load('../data/vocab/english.30000.model')


def one_paragraphs(paragraphs):
    # Convert one SQuAD article entry into [context, doc_ids, questions] items.
    one_data = []
    for paragraph in paragraphs["paragraphs"]:
        context = paragraph['context']
        doc_ids = sp.EncodeAsIds(context)
        questions = []
        for qa in paragraph["qas"]:
            question_text = qa["question"]
            question_ids = sp.EncodeAsIds(question_text)
            questions.append([question_text, question_ids])
        one_data.append([context, doc_ids, questions])
    return one_data


def process(filename):
    with open(filename) as dataset_file:
        dataset_json = json.load(dataset_file)
        dataset = dataset_json['data']
    output = multi_process(one_paragraphs, dataset, num_cores=40)
    output = [y for x in output for y in x]
    if 'train' in filename:
        # For training data, flatten to [doc_ids, question_ids] pairs.
        output = [[x[1], y[1]] for x in output for y in x[2]]
    print('{} processed, {} samples'.format(filename, len(output)))
    return output


def get_squad():
    dev = process('../data/qg/english.dev.json')
    dump_file(dev, '../data/qg/dev.en.obj')
    train = process('../data/qg/english.train.json')
    dump_file(train, '../data/qg/train.en.obj')


if __name__ == '__main__':
    get_squad()

Since we do not hold the license to distribute the data, we list the sources below so you can obtain the data yourself.

Language   Training size   Dev size   URL / Info
English    87,599          2,067      https://rajpurkar.github.io/SQuAD-explorer/ (english.dev.json / english.train.json, V1.1)
Korean     60,407          964        https://github.com/graykode/KorQuAD-beginner/tree/master/config
French     20,731          768        https://fquad.illuin.tech/
Hindi      4,000           2,555      https://www.cse.iitb.ac.in/~ganesh/HiQuAD/clqg/clqg_data.tar.gz (decompress the tar file into the hindi directory)
Chinese    180,000         44,962
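
The same preprocessing pattern applies to the other languages in the table, provided their files follow the SQuAD-style JSON format. A hypothetical Korean variant, reusing process and dump_file from the example above (the file names and tokenizer path are our assumptions):

sp = spm.SentencePieceProcessor()
sp.load('../data/vocab/korean.30000.model')  # tokenizer trained in the pre-processing step

def get_korquad():
    # File names below are placeholders; use the actual KorQuAD files you download.
    dev = process('../data/qg/korean.dev.json')
    dump_file(dev, '../data/qg/dev.ko.obj')
    train = process('../data/qg/korean.train.json')
    dump_file(train, '../data/qg/train.ko.obj')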

2. Fine-tuning

todo

Reference

If you wish to use our data in your research, please cite:

@inproceedings{wangmulti,
    title = "Multi-Lingual Question Generation with Language Agnostic Language Model",
    author = "Wang, Bingning and Yao, Ting and Chen, Weipeng and Xu, Jingfang and Wang, Xiaochuan",
    booktitle = "Findings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)",
    month = aug,
    year = "2021",
    address = "Virtual",
}