facebookresearch/XNLI

XNLI Arabic data read problem

Following this and this script, I tried to load the Arabic data with the script below.

# data paths
MAIN_PATH=$PWD
OUTPATH=$PWD/data/xnli
XNLI_PATH=$PWD/data/xnli/XNLI-1.0

# tools paths
TOOLS_PATH=$PWD/tools
TOKENIZE=$TOOLS_PATH/tokenize.sh
LOWER_REMOVE_ACCENT=$TOOLS_PATH/lowercase_and_remove_accent.py

# install tools
./scripts/install-tools.sh

# create directories
mkdir -p $OUTPATH

# download data
if [ ! -d $OUTPATH/XNLI-MT-1.0 ]; then
  if [ ! -f $OUTPATH/XNLI-MT-1.0.zip ]; then
    wget -c https://dl.fbaipublicfiles.com/XNLI/XNLI-MT-1.0.zip -P $OUTPATH
  fi
  unzip $OUTPATH/XNLI-MT-1.0.zip -d $OUTPATH
fi
if [ ! -d $OUTPATH/XNLI-1.0 ]; then
  if [ ! -f $OUTPATH/XNLI-1.0.zip ]; then
    wget -c https://dl.fbaipublicfiles.com/XNLI/XNLI-1.0.zip -P $OUTPATH
  fi
  unzip $OUTPATH/XNLI-1.0.zip -d $OUTPATH
fi


# Arabic train set
for lg in ar; do
  echo "*** Preparing $lg train set ****"
  echo -e "premise\thypo\tlabel" > $XNLI_PATH/$lg.train
  sed '1d'  $OUTPATH/XNLI-MT-1.0/multinli/multinli.train.$lg.tsv | cut -f1 | python $LOWER_REMOVE_ACCENT > $XNLI_PATH/train.f1
  sed '1d'  $OUTPATH/XNLI-MT-1.0/multinli/multinli.train.$lg.tsv | cut -f2 | python $LOWER_REMOVE_ACCENT > $XNLI_PATH/train.f2
  sed '1d'  $OUTPATH/XNLI-MT-1.0/multinli/multinli.train.$lg.tsv | cut -f3 | sed 's/contradictory/contradiction/g' > $XNLI_PATH/train.f3
  paste $XNLI_PATH/train.f1 $XNLI_PATH/train.f2 $XNLI_PATH/train.f3 >> $XNLI_PATH/$lg.train
done

After running it, lines 390702-392701 of $XNLI_PATH/train.f2 are empty. As a result, under the header premise\thypo\tlabel, the hypo column is empty for the corresponding rows of the ar.train file.
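
To narrow down where the empty values come from, one option is to check whether the second column is already empty in the downloaded TSV, before `cut` and the lowercasing script touch it. The following is only a sketch: `empty_field_rows` is a hypothetical helper name, not part of the XNLI or tools scripts, and it assumes the file is tab-separated with a single header line.

```shell
# Hypothetical helper (not part of the original scripts).
# Prints the 1-based data-row number (header excluded) of every row
# whose given tab-separated field is empty or missing.
#   $1 = TSV file with one header line
#   $2 = 1-based field index to inspect
empty_field_rows() {
  awk -F'\t' -v col="$2" 'NR > 1 && $col == "" { print NR - 1 }' "$1"
}

# Example (assumes the paths defined in the script above):
#   empty_field_rows "$OUTPATH/XNLI-MT-1.0/multinli/multinli.train.ar.tsv" 2 | sed -n '1p;$p'
```

Because the header is excluded from the count, the printed numbers line up with line numbers in train.f2. If the reported range shows up here, the empty hypotheses are already present in the source file rather than introduced by the preprocessing.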

Is this the expected behavior?

@aconneau