GeoRoBERTa

GeoRoBERTa is a Semantic Address Matching solution based on RoBERTa, a pretrained transformer-based language model. It leverages two types of geographical knowledge: (1) Address Tag Embedding and (2) Geohash Encoding.
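
To give a concrete sense of what a geohash is, the following minimal sketch implements the standard public geohash algorithm, which interleaves longitude and latitude bits and emits one base-32 character per 5 bits. The function name `geohash_encode` and its `precision` parameter are illustrative assumptions; this is not necessarily how GeoRoBERTa computes or consumes its geohash features.

```python
# Standard geohash base-32 alphabet (digits plus letters, without a, i, l, o).
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 8) -> str:
    """Encode a latitude/longitude pair as a geohash string.

    Illustrative helper (hypothetical name), not taken from the repository.
    """
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0          # accumulator for the current 5-bit group
    bit_count = 0
    use_lon = True    # geohash bits alternate, starting with longitude

    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1   # upper half of the interval: bit 1
            rng[0] = mid
        else:
            bits = bits << 1         # lower half of the interval: bit 0
            rng[1] = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


print(geohash_encode(57.64911, 10.40744, precision=5))  # → u4pru
```

Nearby points share a common geohash prefix, which is what makes the encoding a compact way to express location as a string.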

We provide a light version of GeoRoBERTa: RoBERTa was pretrained from scratch on 50,000 French addresses from the Sirene database.

Requirements

  • Python 3.7.7
  • PyTorch 1.9
  • HuggingFace Transformers 4.9.2

Install required packages

pip install -r requirements.txt

Dataset

We provide a sample of the French address dataset used in the experiments: (1) train.tsv_tag_label: training dataset, (2) dev_matched.tsv_tag_label: validation dataset, (3) test_matched.tsv_tag_label: test dataset. The addresses were collected from the LEI database.
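
Before launching a run, it can help to confirm that the three files are in place. The sketch below assumes they sit directly inside the data directory; that layout, and the helper name `missing_dataset_files`, are assumptions rather than something the repository documents.

```python
from pathlib import Path

# File names as listed in the Dataset section.
EXPECTED_FILES = (
    "train.tsv_tag_label",
    "dev_matched.tsv_tag_label",
    "test_matched.tsv_tag_label",
)


def missing_dataset_files(data_dir):
    """Return the expected dataset files that are absent from data_dir.

    Assumes the files live directly in data_dir (not in subfolders).
    """
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means all three files were found; otherwise the returned names indicate what still needs to be copied into the directory.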

To run GeoRoBERTa

For training, validation and prediction, use the command:

CUDA_VISIBLE_DEVICES=0 \
python geoRoBERTa_matcher.py \
--data_dir {data_dir} \
--task_name matching \
--train_batch_size 32 \
--max_seq_length 64 \
--RoBERTa_model {model_dir} \
--learning_rate 3e-5 \
--num_train_epochs 12 \
--do_train \
--do_eval \
--do_predict \
--output_dir {output_dir}

The meaning of the flags:

  • --data_dir : The path of the data directory
  • --task_name : The name of the task (matching)
  • --train_batch_size, --max_seq_length, --learning_rate, --num_train_epochs : The batch size, maximum sequence length, learning rate, and number of epochs
  • --RoBERTa_model : The path of the pretrained RoBERTa model
  • --do_train, --do_eval, --do_predict : Run training, validation, and prediction, respectively
  • --output_dir : The path of the output directory (matching results)
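
For scripted or repeated runs, the same command can be assembled from Python. This is a minimal sketch that reuses the flag names and values from the command shown above; the helpers `build_command` and `run` are hypothetical and not part of the repository.

```python
import os
import subprocess


def build_command(data_dir, model_dir, output_dir):
    """Return the argument list for a train + eval + predict run.

    Flag names and values mirror the example command in this README.
    """
    return [
        "python", "geoRoBERTa_matcher.py",
        "--data_dir", data_dir,
        "--task_name", "matching",
        "--train_batch_size", "32",
        "--max_seq_length", "64",
        "--RoBERTa_model", model_dir,
        "--learning_rate", "3e-5",
        "--num_train_epochs", "12",
        "--do_train", "--do_eval", "--do_predict",
        "--output_dir", output_dir,
    ]


def run(data_dir, model_dir, output_dir, gpu="0"):
    """Launch the run on the given GPU (equivalent to CUDA_VISIBLE_DEVICES=0)."""
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": gpu}
    subprocess.run(build_command(data_dir, model_dir, output_dir), env=env, check=True)
```

Passing the arguments as a list avoids shell quoting issues when the paths contain spaces.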