Data and code for 'Ancient Chinese Machine Reading Comprehension Exception Question Dataset with a Non-trivial Model', PRICAI-2023
link: https://link.springer.com/chapter/10.1007/978-981-99-7022-3_14
The ACRC dataset is in '3_DATASET/ACRC'.
|                   | Train | Dev | Test | Total |
|-------------------|-------|-----|------|-------|
| Number of samples | 3286  | 407 | 407  | 4100  |
The format of the ACRC dataset is as follows (sample attributes):
```
{
  "version": "VGaokao-test",            // dataset version
  "data": [
    {
      "cid": 3,                         // context id (primary key; a unique tag for one sample)
      "context": "诸子之学,兴起于先秦,当时一...",  // passage text
      "dynasty-gpt": "唐朝",             // dynasty label obtained by calling ChatGPT
      "dynasty-glm": "唐朝",             // dynasty label obtained by calling ChatGLM
      "trans_context": "xxxxxxxxxxx",   // context translated from modern Chinese into ancient Chinese (using Microsoft Azure Translator)
      "qas": [
        {
          "qid": "6",                   // question id
          "question": "下列...不正确的一项是",  // question text
          "trans_question": "xxxxxxxxxx",     // question translated into ancient Chinese (using Microsoft Azure Translator)
          "options": [                  // the four options A-D
            "广义上的...",
            "“照着讲...",
            "“接着讲...",
            "不同于以..."
          ],
          "trans_options": [            // the four options translated into ancient Chinese
            "广义上的...",
            "“照着讲...",
            "“接着讲...",
            "不同于以..."
          ],
          "answer": "D",                // the answer letter
          "correctness": [              // whether each option's statement is correct, independent of the question (1 = correct, 0 = wrong)
            1,
            1,
            1,
            0
          ]
        }
      ]
    }
  ]
}
```
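Concretely, the dataset can be consumed as in the following minimal sketch (the filename test.json is an assumption; substitute the actual JSON file under 3_DATASET/ACRC):

```python
import json

# Minimal sketch of reading the dataset; the filename is hypothetical.
with open("3_DATASET/ACRC/test.json", encoding="utf-8") as f:
    dataset = json.load(f)

for sample in dataset["data"]:
    context = sample["context"]
    for qa in sample["qas"]:
        gold_idx = ord(qa["answer"]) - ord("A")  # answer letter -> 0-based index
        # For an exception question ("...不正确的一项是"), the gold option is
        # the one whose statement is wrong, i.e. whose correctness entry is 0.
        assert qa["correctness"][gold_idx] == 0
```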
EVERGREEN (EVidence-first bERt encodinG with entiRE-tExt coNvolution)
The versions of the important packages are listed below; the full environment can be installed from environment.txt.
- python = 3.8.13
- torch=1.11.0+cu113
- transformers=4.19.4
- openpyxl=3.1.1
- numpy=1.22.4
- pandas=1.4.3
Note: Experiments with this model run quickly, but many different parameter combinations need to be tested, so a batch-run mode is provided: an Excel file is used to configure the tasks, and each row of the sheet defines one experimental task (a task is the complete train -> validation -> test pipeline).
There are therefore two ways to run: (i) run main.py for a single task; (ii) run run_shell.py for a batch of tasks.
The Excel file location can be given either as a specific file or as a directory, in which case the most recently modified Excel file in that directory is used.
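A minimal sketch of that resolution step plus row iteration, using the openpyxl version pinned above (the helper name and the "EXCEL file" path are illustrative, not the repository's actual code):

```python
import glob
import os
from openpyxl import load_workbook

def resolve_excel(path):
    """If `path` is a directory, return the most recently modified .xlsx
    inside it; otherwise treat `path` as the Excel file itself."""
    if os.path.isdir(path):
        path = max(glob.glob(os.path.join(path, "*.xlsx")), key=os.path.getmtime)
    return path

# Each row of the sheet describes one experimental task
# (the full train -> validation -> test pipeline).
wb = load_workbook(resolve_excel("EXCEL file"), read_only=True)
for row in wb.active.iter_rows(min_row=2, values_only=True):  # row 1 = header
    print(row)  # hand the parameter tuple to the task runner
```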
1_RUN_EXP -- Entry point for running experiments
2_DATA_PREPROCESS -- Dataset processing
3_DATASET -- Dataset storage
4_RESULT -- Stores the results of the experiment
test -- Test files
utils -- Utilities
fewshot_learning -- Few-shot learning with ChatGPT/ChatGLM, implemented by calling their APIs
models -- EVERGREEN model variants covering the different input branches
models_cnn -- The EVERGREEN model converted to a four-branch input model using multi-kernel and dynamic convolution
models_compare -- Single models for comparison, as well as ensemble models
other Python files -- Code execution logic
ACRC -- ACRC dataset
evidence_extract -- Logic related to evidence extraction
-- top1_sim1.py -- The extraction logic ultimately used; the other files are earlier attempts (see the sketch after this list)
extract -- Logic for key-sentence extraction
statistics -- Logic for dataset statistics
translate -- Translation tools: Baidu's translation API and Microsoft Azure Translator
1_format_data_to_json.py -- Formats the raw data
2_translate.py -- Executes the translation logic
3_context_fwe.py -- Logic for deleting function words (listed in fwe.txt)
4_context_extract.py -- Logic for the key-sentence extraction described above
5_data_divide.py -- Dataset splitting: articles are divided into training, validation, and test sets according to their lengths
6_answer_rebalance.py -- Option balancing, i.e., making the counts of A/B/C/D answers in the training, validation, and test sets as uniform as possible (see the sketch after this list)
ACRC数据处理逻辑_1.jpg -- Flowchart of dataset processing
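The name top1_sim1.py suggests that, for each option, the single most similar context sentence is kept as evidence. A minimal sketch of that idea, using character-overlap similarity as a stand-in (the repository's actual similarity measure may differ):

```python
import re

def split_sentences(text):
    """Split a Chinese passage after sentence-final punctuation 。！？"""
    return [s for s in re.split(r"(?<=[。！？])", text) if s.strip()]

def char_overlap(a, b):
    """Jaccard similarity over character sets (assumed stand-in measure)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / (len(sa | sb) or 1)

def top1_evidence(context, option):
    """Return the context sentence most similar to the option."""
    return max(split_sentences(context), key=lambda s: char_overlap(s, option))

context = "诸子之学，兴起于先秦。当时百家争鸣。"
print(top1_evidence(context, "诸子之学兴起于先秦时期。"))  # -> 诸子之学，兴起于先秦。
```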
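For the option balancing in 6_answer_rebalance.py, one straightforward mechanism (a sketch, not necessarily the exact procedure used) is to swap the gold option into a target slot, keeping options, trans_options, and correctness aligned:

```python
def move_answer(qa, target_idx):
    """Move the gold option of one question to position `target_idx`,
    updating the answer letter and keeping all parallel lists aligned.
    (Hypothetical helper illustrating the rebalancing idea.)"""
    cur = ord(qa["answer"]) - ord("A")
    for key in ("options", "trans_options", "correctness"):
        qa[key][cur], qa[key][target_idx] = qa[key][target_idx], qa[key][cur]
    qa["answer"] = chr(ord("A") + target_idx)

# Assigning target positions round-robin (A, B, C, D, A, ...) over a split
# makes the four answer letters as evenly distributed as possible.
```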
other -- Other related files that may be referenced
other_dataset_format2ACRC -- Formats other datasets into the ACRC dataset format, to unify the code logic
ACRC -- ACRC dataset
202306 -- June 2023 version; the dataset version used for the experiments and submission
new_data_0504 -- Version without evidence extraction
top1_sim -- new_data_0504 with evidence extraction added
original -- Evidence extraction on the original text
fwe_context -- Evidence extraction on the text after function-word deletion
tran_context -- Evidence extraction on the translated text
tran_options -- Evidence extraction with the translated options
original_data -- The original dataset as collected
C3 -- C3 dataset (processed into ACRC format)
GCRC -- GCRC dataset (processed into ACRC format)
NCR -- NCR dataset (processed into ACRC format)
VGAOKAO -- VGAOKAO dataset (processed into ACRC format)
EXCEL file -- The Excel files used by run_shell.py
Other -- Stores the result files
Logging utilities, Excel read/write utilities, regular-expression utilities, time utilities, and other related helpers.
main.py (load runtime parameters) -> config.py (wrap and preprocess parameters) -> data_processor.py (convert text to word embeddings) -> data_loader_multiprocess.py (multi-process assembly of the model's input branches) -> train.py (training) -> predict.py (prediction)
Branch 1:   512(question + options + passage_1)
Branch 2_1: 512(question + options + passage_1) + 512(passage)
Branch 2_2: 512(question + options) + 512(passage_1)
Branch 3_1: 512(question + options) + 512(passage_1) + 512(passage_2)
Branch 3_2: 512(question + options + passage_1) + 512(question + options + passage_2) + 512(question + options + passage_3)
Branch 4:   512(question + option_1 + evidence_1) + 512(question + option_2 + evidence_2) + 512(question + option_3 + evidence_3) + 512(question + option_4 + evidence_4)
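As an illustration of how the branch-4 inputs can be assembled with the transformers version pinned above, a sketch follows; the checkpoint name and helper function are assumptions, not the repository's actual code:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint

def build_branch4(question, options, evidences, max_len=512):
    """Branch 4: one 512-token (question + option_i + evidence_i) sequence
    per option, i.e. a 4 x 512 input block."""
    enc = tokenizer(
        [question + opt for opt in options],  # segment A: question + option_i
        evidences,                            # segment B: evidence_i
        padding="max_length",
        truncation=True,
        max_length=max_len,
        return_tensors="pt",
    )
    return enc["input_ids"], enc["attention_mask"]  # shapes: (4, 512)
```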