- Given an input (q, r, s), where q is a long article, r is a short article responding to q, and s is the discussion relationship between r and q, which is either an agreement or a disagreement (agree or disagree).
- The output is a 2-tuple (q', r'), where q' and r' are subsequences of q and r, respectively.
- q' and r' provide the key information for judging q, and r' reflects the relation s.
- Our task is to predict q' and r' given an input q, r, and s.
- The first data processing step simply removes any leading and trailing whitespace in the q, r, q', and r' features. This step is important so that the model is fed clean data.
- We upsample the training data by 50% to provide more data to our model. We tried other augmentation strategies, such as replacing words in the existing data with synonyms, but 50% upsampling worked best for us. The upsampling is done by sampling 50% of the data with a fixed random state for reproducibility (a sketch of both steps is shown below).
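A minimal sketch of these two preprocessing steps, assuming the data lives in a pandas DataFrame with columns named q, r, q', and r' (the column names, the use of pandas, and the seed value are assumptions):

```python
import pandas as pd

def preprocess(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    # Strip leading and trailing whitespace from the text features.
    for col in ["q", "r", "q'", "r'"]:
        df[col] = df[col].str.strip()

    # 50% upsampling: sample half of the rows with a fixed random state
    # for reproducibility and append them back to the training data.
    extra = df.sample(frac=0.5, random_state=seed)
    return pd.concat([df, extra], ignore_index=True)
```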
- We treat this Interpretive Information Labeling Project as a question answering task. Hence, we need to design a system that performs the extraction in order to produce the target labels. There is a slight modification from traditional question answering, since there is the s feature, the discussion relationship between r and q. To incorporate it, we simply place the s feature in front of the r feature in the format "s:r", which becomes the new r feature.
- The next step is index extraction: for both q and r, we find the positions of q' and r', respectively (a sketch of this and the previous step is shown below).
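A sketch of both steps on a single example, assuming dictionary-style rows keyed by the feature names above; the exact "s:r" separator and the use of str.find to locate the subsequences are assumptions:

```python
def add_relation_and_spans(row: dict) -> dict:
    # Prepend the discussion relationship s to r (the exact separator
    # format is an assumption).
    row["r"] = f"{row['s']}:{row['r']}"

    # Character-level start and end indices of q' within q and r' within r.
    # str.find returns -1 if the subsequence is not present.
    row["q_start"] = row["q"].find(row["q'"])
    row["q_end"] = row["q_start"] + len(row["q'"])
    row["r_start"] = row["r"].find(row["r'"])
    row["r_end"] = row["r_start"] + len(row["r'"])
    return row
```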
- Our final features for training the model are q, r, q', r', q_start, r_start, q_end, and r_end. The *_start and *_end features are the start and end indices mentioned in the previous point.
- We split the final data 90:10 into training and validation data, respectively (see the sketch below).
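The split can be done, for example, with scikit-learn; the random_state value is an illustrative choice, and df stands for the preprocessed DataFrame from the earlier sketch:

```python
from sklearn.model_selection import train_test_split

# 90% training / 10% validation split.
train_df, valid_df = train_test_split(df, test_size=0.1, random_state=42)
```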
- The next step is tokenizing the data with the help of BertTokenizer. To add the tokenized versions of the start and end positions to the encoded features, we simply use the char_to_token function. If the q or r start position is missing, we simply set the q or r start and end positions to 0. Finally, we obtain encoded features with the keys input_ids, token_type_ids, attention_mask, q_start, r_start, q_end, and r_end. The final step is simply transforming the encoded features to tensors (a sketch is shown below).
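A sketch of the encoding step. char_to_token is provided by the fast tokenizer, so the sketch uses BertTokenizerFast; the paired-input layout (q as the first segment, the new r as the second), the max_length, and mapping the end index from the last character of the span are assumptions:

```python
import torch
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

def encode(example: dict, max_length: int = 512) -> dict:
    enc = tokenizer(
        example["q"],
        example["r"],
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )

    def to_token(char_idx: int, seq: int) -> int:
        # Map a character index to a token index; sequence_index=0 is q,
        # sequence_index=1 is r. Fall back to 0 when the position is
        # missing, not found, or truncated away.
        if char_idx is None or char_idx < 0:
            return 0
        return enc.char_to_token(char_idx, sequence_index=seq) or 0

    enc["q_start"] = to_token(example["q_start"], 0)
    enc["q_end"] = to_token(example["q_end"] - 1, 0)
    enc["r_start"] = to_token(example["r_start"], 1)
    enc["r_end"] = to_token(example["r_end"] - 1, 1)

    # Finally, transform the encoded features to tensors.
    return {key: torch.tensor(value) for key, value in enc.items()}
```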
- Our model architecture consists of BERT as the backbone, followed by a Bi-LSTM and linear layers as the classifier.
- BERT (bert-base-cased) works as the backbone of our model architecture, and two Bi-LSTM layers then process the encoded representations from BERT. We argue that the Bi-LSTM can better model the representations encoded by BERT. The Bi-LSTM output has a dimension of 256 and is fed to a dense layer with a dimension of 512. The final dense layer has 4 outputs: q_start, r_start, q_end, and r_end (a sketch is shown below).
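A sketch of this architecture in PyTorch. The Bi-LSTM hidden size (128 per direction, giving a 256-dimensional output) and the four heads producing one logit per token are interpretations of the description above, not the authors' exact code:

```python
import torch
import torch.nn as nn
from transformers import BertModel

class SpanExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        # BERT backbone.
        self.bert = BertModel.from_pretrained("bert-base-cased")
        # Two Bi-LSTM layers; 128 per direction gives a 256-dim output.
        self.lstm = nn.LSTM(
            input_size=self.bert.config.hidden_size,
            hidden_size=128,
            num_layers=2,
            bidirectional=True,
            batch_first=True,
        )
        # Dense layer of dimension 512, then 4 outputs per token:
        # q_start, r_start, q_end, r_end.
        self.dense = nn.Linear(256, 512)
        self.out = nn.Linear(512, 4)

    def forward(self, input_ids, attention_mask, token_type_ids):
        hidden = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        ).last_hidden_state                                   # (batch, seq, 768)
        lstm_out, _ = self.lstm(hidden)                       # (batch, seq, 256)
        logits = self.out(torch.relu(self.dense(lstm_out)))   # (batch, seq, 4)
        # One tensor of per-token scores for each span boundary.
        q_start, r_start, q_end, r_end = logits.unbind(dim=-1)
        return q_start, r_start, q_end, r_end
```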
- We use the Cross Entropy loss function with the AdamW optimizer. We fine-tune the model for 2 epochs with a learning rate of 3e-5; we stop at that number of epochs because the model starts to overfit beyond 2 epochs. We use a batch size of 8 for both the training and validation data (a training sketch is shown below).
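A hypothetical training loop with the stated hyperparameters (Cross Entropy loss, AdamW, learning rate 3e-5, 2 epochs, batch size 8). The train_dataset name, the device handling, and summing the four head losses are assumptions; SpanExtractor refers to the sketch in the previous block:

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SpanExtractor().to(device)
optimizer = AdamW(model.parameters(), lr=3e-5)
loss_fn = nn.CrossEntropyLoss()

# train_dataset is assumed to yield the encoded-feature dicts
# produced by the tokenization sketch above.
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

model.train()
for epoch in range(2):
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        q_start, r_start, q_end, r_end = model(
            batch["input_ids"], batch["attention_mask"], batch["token_type_ids"]
        )
        # Each head is a classification over token positions; the four
        # losses are combined by summation (an assumption).
        loss = (
            loss_fn(q_start, batch["q_start"])
            + loss_fn(r_start, batch["r_start"])
            + loss_fn(q_end, batch["q_end"])
            + loss_fn(r_end, batch["r_end"])
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```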
| Public Leaderboard | Private Leaderboard |
| --- | --- |
| 0.803938 | 0.854274 |