CGCL-codes/naturalcc

Trouble attempting to reproduce results in examples/code-backdoor


I'm attempting to run the commands shown under the CodeBERT section of the readme file in code-backdoor.
So far, I've been able to preprocess the training data, poison the dataset, generate test data for the backdoor attack, and finetune the RoBERTa model on the poisoned data. The files for the model I've created are shown below:

[screenshot: model_files]

I'm now trying to run the following commands:

lang=python
idx=0
model=pattern_number_50_train

nohup python run_classifier.py \
--model_type roberta \
--model_name_or_path microsoft/codebert-base \
--task_name codesearch \
--do_predict \
--max_seq_length 200 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 1e-5 \
--num_train_epochs 8 \
--output_dir models/$lang/$model \
--data_dir data/codesearch/backdoor_test/$lang/ \
--test_file number_batch_${idx}.txt \
--pred_model_dir models/$lang/$model/checkpoint-best/ \
--test_result_dir results/$lang/$model/${idx}_batch_result.txt > inference.log 2>&1 &

When I try to run these commands, however, I find that line 503 in run_classifier.py tries to load pytorch_model.bin from the checkpoint-last folder as the model whenever that folder exists. However, the model files generated by run_classifier.py during finetuning do not include this .bin file, or any other model .bin file, in the checkpoint-last folder, which causes an error when running the inference command above.
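For context, the failure looks roughly like the following pattern. This is only a simplified sketch of the checkpoint-resume logic as I understand it, not the exact code from run_classifier.py; the paths and variable names are illustrative:

import os
import torch

# Illustrative sketch: if a checkpoint-last directory exists, the script
# assumes a pytorch_model.bin inside it and tries to resume from that file.
output_dir = "models/python/pattern_number_50_train"  # example path
checkpoint_last = os.path.join(output_dir, "checkpoint-last")
if os.path.exists(checkpoint_last):
    model_path = os.path.join(checkpoint_last, "pytorch_model.bin")
    # My finetuning run never produced this .bin file, so the load here
    # fails with a missing-file error.
    state_dict = torch.load(model_path, map_location="cpu")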

Any help on this issue is much appreciated.

Hi, this issue may be caused by a different version of transformers. Adding a safe_serialization=False parameter to the save_pretrained calls at lines 154, 175 and 559 in run_classifier.py and retraining the model may help:
model_to_save.save_pretrained(args.output_dir, safe_serialization=False)
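For anyone hitting the same problem: recent transformers releases (roughly 4.35 and later) default save_pretrained to the safetensors format, so checkpoints are written as model.safetensors rather than pytorch_model.bin, and code that looks specifically for pytorch_model.bin no longer finds a weights file. Passing safe_serialization=False restores the legacy format. A minimal illustration (the output directory names here are just examples):

from transformers import RobertaModel

model = RobertaModel.from_pretrained("microsoft/codebert-base")

# Default in newer transformers: writes model.safetensors
model.save_pretrained("checkpoint-safetensors")

# Legacy PyTorch format: writes pytorch_model.bin, which is the file name
# the checkpoint-loading code in run_classifier.py expects
model.save_pretrained("checkpoint-bin", safe_serialization=False)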

Thanks! I'll try that.

It seems to be working.