DASK

This is the official implementation of Domain-Adaptive Text Classification with Structured Knowledge from Unlabeled Data (IJCAI 2022 Long Oral).


Requirements

pip install -r requirements.txt

Prepare

  • Download the Bert-base-uncased pretrained weights from here, or see a list of Bert model weight download links here
  • Download the corresponding vocabulary here. Note that the downloaded tar also contains the tensorflow pretrained model weights, but we only need the file vocab.txt
  • Put the pretrained model file, the config json downloaded in the first step, and the vocabulary into the models/pytorch-bert-uncased directory
  • Download the imdb dataset here and put it in data/imdb
  • Download the bdek dataset (i.e., the amazon reviews dataset) here and put it in data/bdek
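The directory layout the steps above expect can be sketched as follows; the directory paths come from the instructions, while the exact file names placed inside them are assumptions:

```shell
# Create the expected directory layout (a sketch, not part of the repo's scripts)
mkdir -p models/pytorch-bert-uncased   # pretrained weights, config json, and vocab.txt go here
mkdir -p data/imdb                     # imdb dataset
mkdir -p data/bdek                     # bdek (amazon reviews) dataset
```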

Run

  • Run sh train_script.sh in a shell
    • Open this file and you'll see different commands for different tasks

To develop

  • The entry point of the program is train.py
  • Files such as trainers.py, evaluators.py, model.py, dataset.py, etc., define classes for the corresponding components of the program and are imported into train.py via the xx_factory at the bottom of each file
  • Developers should add new classes to these files to implement new features instead of editing the existing ones
  • There are several command-line args that affect which module is chosen from the factories; see the code for details
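The factory pattern described above can be sketched as follows. This is a minimal illustration, not the repo's actual code: the class names, the factory dict name, and the builder function are all hypothetical.

```python
# Hypothetical sketch of the xx_factory pattern: each component file defines
# classes and exposes a factory at the bottom that maps a command-line arg
# value to the class to instantiate.

class BaseTrainer:
    """Common interface for all trainers (illustrative)."""
    def train(self):
        raise NotImplementedError

class DefaultTrainer(BaseTrainer):
    def train(self):
        return "default training loop"

class PivotTrainer(BaseTrainer):
    def train(self):
        return "pivot-aware training loop"

# The factory at the bottom of the file; adding a new feature means adding
# a new class above and registering it here, not editing existing classes.
trainer_factory = {
    "default": DefaultTrainer,
    "pivot": PivotTrainer,
}

def build_trainer(name):
    # train.py would look the class up by the value of a CLI flag
    return trainer_factory[name]()
```

Adding a new trainer then only requires a new subclass and one new factory entry, which is why the existing classes never need to change.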