XNLI Arabic data read problem
sbmaruf commented
Following this and this script, I tried to load the Arabic data with the script below.
# data paths
MAIN_PATH=$PWD
OUTPATH=$PWD/data/xnli
XNLI_PATH=$PWD/data/xnli/XNLI-1.0
# tools paths
TOOLS_PATH=$PWD/tools
TOKENIZE=$TOOLS_PATH/tokenize.sh
LOWER_REMOVE_ACCENT=$TOOLS_PATH/lowercase_and_remove_accent.py
# install tools
./scripts/install-tools.sh
# create directories
mkdir -p $OUTPATH
# download data
if [ ! -d $OUTPATH/XNLI-MT-1.0 ]; then
  if [ ! -f $OUTPATH/XNLI-MT-1.0.zip ]; then
    wget -c https://dl.fbaipublicfiles.com/XNLI/XNLI-MT-1.0.zip -P $OUTPATH
  fi
  unzip $OUTPATH/XNLI-MT-1.0.zip -d $OUTPATH
fi
if [ ! -d $OUTPATH/XNLI-1.0 ]; then
  if [ ! -f $OUTPATH/XNLI-1.0.zip ]; then
    wget -c https://dl.fbaipublicfiles.com/XNLI/XNLI-1.0.zip -P $OUTPATH
  fi
  unzip $OUTPATH/XNLI-1.0.zip -d $OUTPATH
fi
# Arabic train set
for lg in ar; do
  echo "*** Preparing $lg train set ****"
  echo -e "premise\thypo\tlabel" > $XNLI_PATH/$lg.train
  # columns 1 and 2 (premise, hypothesis): lowercase and strip accents
  sed '1d' $OUTPATH/XNLI-MT-1.0/multinli/multinli.train.$lg.tsv | cut -f1 | python $LOWER_REMOVE_ACCENT > $XNLI_PATH/train.f1
  sed '1d' $OUTPATH/XNLI-MT-1.0/multinli/multinli.train.$lg.tsv | cut -f2 | python $LOWER_REMOVE_ACCENT > $XNLI_PATH/train.f2
  # column 3 (label): normalize "contradictory" to "contradiction"
  sed '1d' $OUTPATH/XNLI-MT-1.0/multinli/multinli.train.$lg.tsv | cut -f3 | sed 's/contradictory/contradiction/g' > $XNLI_PATH/train.f3
  paste $XNLI_PATH/train.f1 $XNLI_PATH/train.f2 $XNLI_PATH/train.f3 >> $XNLI_PATH/$lg.train
done
Now lines 390702-392701 of $XNLI_PATH/train.f2 are empty. So, under the header premise\thypo\tlabel, the hypo column is always empty for lines 390702-392701 of the ar.train file.
Is this the expected behavior?
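One way to narrow this down might be to check whether the second column is already empty in the raw TSV, before lowercase_and_remove_accent.py runs. The sketch below is only an illustration: empty_field_lines is a hypothetical helper name, not part of the repository, and it assumes the file is tab-separated with a single header line, as in the script above.

```shell
# Hypothetical helper (not part of the repository): print the line numbers,
# counted after dropping the header, whose tab-separated field is empty
# or contains only whitespace.
#   $1 = path to a TSV file with one header line
#   $2 = 1-based index of the field to inspect
empty_field_lines() {
  sed '1d' "$1" | awk -F'\t' -v col="$2" '$col ~ /^[[:space:]]*$/ { print NR }'
}

# Example call, assuming the paths from the script above:
#   empty_field_lines "$OUTPATH/XNLI-MT-1.0/multinli/multinli.train.ar.tsv" 2 | head
```

Because the header is dropped before counting, the reported numbers line up with the line numbers in train.f2. If the same range shows up here, the empty hypotheses are presumably already present in the downloaded data; if nothing is reported, the preprocessing step would be the next thing to look at.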