Using HuggingFace Transformers to train a BERT model with dynamic masking for assembly language from scratch. We name it Inst2Vec
because it is designed to generate vectors for assembly instructions.
It is a part of the model introduced in the ICONIP 2021 paper A Hierarchical Graph-based Neural Network for Malware Classification.
The preprocessing procedure can be found in process_data.
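For context, dynamic masking in HuggingFace Transformers is commonly implemented with `DataCollatorForLanguageModeling`, which re-samples the masked positions every time a batch is collated. The sketch below illustrates the idea; the tokenizer path and masking probability are assumptions for illustration and are not necessarily what my_run_mlm_no_trainer.py uses:

```python
# Minimal sketch of MLM pre-training with dynamic masking; paths and
# hyperparameters are illustrative, not the exact values used by
# my_run_mlm_no_trainer.py.
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
)

tokenizer = BertTokenizerFast.from_pretrained("./assembly_tokenizer")  # placeholder path
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

# The collator re-draws the masked positions every time a batch is built,
# so each pass over the data sees different masks: the "dynamic mask".
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```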
You can simply run `python train_my_tokenizer.py`
to obtain an assembly tokenizer.
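For illustration, such a tokenizer can be trained with the HuggingFace tokenizers library. The following sketch uses a word-level model with whitespace pre-tokenization; the model type, special tokens, and corpus path are assumptions and may differ from what train_my_tokenizer.py does:

```python
# Illustrative sketch of training a word-level tokenizer on an assembly
# corpus; train_my_tokenizer.py may use a different model and settings.
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split instructions on whitespace
trainer = WordLevelTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder corpus file
tokenizer.save("assembly_tokenizer.json")
```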
The script I use to train the Inst2Vec
model is as follows:
```bash
python my_run_mlm_no_trainer.py \
    --per_device_train_batch_size 8192 \
    --per_device_eval_batch_size 16384 \
    --num_warmup_steps 4000 \
    --output_dir ./ \
    --seed 1234 \
    --preprocessing_num_workers 32 \
    --max_train_steps 150000 \
    --eval_every_steps 1000
```
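After training, an instruction vector can be extracted by running a tokenized instruction through the encoder and pooling the hidden states. A minimal sketch, assuming the checkpoint is loaded from the `--output_dir` above and using mean pooling (a choice made here for illustration, not necessarily the paper's pooling strategy):

```python
# Sketch of extracting an instruction embedding from the trained model.
# The checkpoint path and mean-pooling choice are assumptions.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("./")  # --output_dir from the command above
model = BertModel.from_pretrained("./")  # encoder only; the MLM head is discarded
model.eval()

instruction = "mov eax , dword ptr [ ebp + 8 ]"  # example assembly instruction
inputs = tokenizer(instruction, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
vector = hidden.mean(dim=1).squeeze(0)          # mean-pool into one instruction vector
```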