Magnum-NLC2CMD

Magnum-NLC2CMD is the winning solution for the NeurIPS 2020 NLC2CMD challenge.

Requirements

  • numpy
  • six
  • nltk
  • experiment-impact-tracker
  • scikit-learn
  • pandas
  • flake8==3.8.3
  • spacy==2.3.0
  • tb-nightly==2.3.0a20200621
  • tensorboard-plugin-wit==1.6.0.post3
  • torch==1.6.0
  • torchtext==0.4.0
  • torchvision==0.7.0
  • tqdm==4.46.1
  • OpenNMT-py==2.0.0rc2

How it works

Environment

  1. Create a virtual environment with Python 3.6 installed (e.g., using virtualenv).
  2. git clone --recursive https://github.com/magnumresearchgroup/Magnum-NLC2CMD.git
  3. Use pip3 install -r requirements.txt to install the two requirements files (a quick sanity check of the resulting environment is sketched below).
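
Before moving on, it can help to confirm that the pinned dependencies resolved inside the new environment. A minimal sanity-check sketch (not part of the original instructions):

    # Optional sanity check: run inside the virtualenv after installing requirements.
    import sys
    import torch
    import onmt  # installed by the OpenNMT-py==2.0.0rc2 requirement

    assert sys.version_info[:2] == (3, 6), "expected a Python 3.6 virtualenv"
    print("torch", torch.__version__)  # the pinned version is 1.6.0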

Data pre-processing

  1. Run python3 main.py --mode preprocess --data_dir src/data --data_file nl2bash-data.json and then cd src/model && onmt_build_vocab -config nl2cmd.yaml -n_sample 10347 --src_vocab_threshold 2 --tgt_vocab_threshold 2 to process the raw data (the expected input layout is sketched below).
  2. You can also download the original raw data here
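
For reference, nl2bash-data.json follows the NL2Bash layout: a JSON object keyed by example id, where each entry pairs a natural-language invocation with its Bash command. A minimal loading sketch, assuming that layout:

    import json

    # nl2bash-data.json looks like {"1": {"invocation": "...", "cmd": "..."}, ...}
    with open("src/data/nl2bash-data.json") as f:
        data = json.load(f)

    example = data["1"]
    print(example["invocation"])  # natural-language description
    print(example["cmd"])         # corresponding Bash command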

Train

  1. cd src/model && onmt_train -config nl2cmd.yaml
  2. Modify world_size in src/model/nl2cmd.yaml to the number of GPUs you are using and list their ids under gpu_ranks (a quick way to check these values is sketched below).
  3. You can also download one of our pre-trained models here
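
If you are unsure what to put in world_size and gpu_ranks, this small sketch (purely illustrative) prints the values for the GPUs PyTorch can see:

    import torch

    # world_size is the number of visible GPUs; gpu_ranks lists their ids,
    # e.g. world_size: 2 and gpu_ranks: [0, 1] on a two-GPU machine.
    n_gpus = torch.cuda.device_count()
    print("world_size:", n_gpus)
    print("gpu_ranks:", list(range(n_gpus)))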

Inference

  1. onmt_translate -model src/model/run/model_step_2000.pt -src src/data/invocations_proccess_test.txt -output pred_2000.txt -gpu 0 -verbose
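
The -output file contains the predicted commands, one per source line under the default n_best setting. A minimal sketch for pairing each test invocation with its prediction, assuming that default:

    # Pair each test invocation with the command predicted by onmt_translate.
    # Assumes the default n_best=1, i.e. one prediction per source line.
    with open("src/data/invocations_proccess_test.txt") as src, open("pred_2000.txt") as pred:
        for invocation, command in zip(src, pred):
            print(invocation.strip(), "->", command.strip())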

Evaluate

  1. python3 main.py --mode eval --annotation_filepath src/data/test_data.json --params_filepath src/configs/core/evaluation_params.json --output_folderpath src/logs --model_dir src/model/run --model_file model_step_2400.pt model_step_2500.pt

  2. For faster inference, you can change gpu=-1 to gpu=0 in src/model/predict.py and replace the corresponding code in that file with the following:

    # Tokenize the invocations and translate them in a single batch.
    invocations = [' '.join(tokenize_eng(i)) for i in invocations]
    translated = translator.translate(invocations, batch_size=n_batch)
    # translated[0] holds the model scores and translated[1] the predicted
    # commands; keep the top result_cnt candidates per invocation.
    commands = [t[:result_cnt] for t in translated[1]]
    # Turn scores into confidences and pin the top candidate's confidence to 1.0.
    confidences = [np.exp([x.item() for x in t[:result_cnt]]) / 2 for t in translated[0]]
    for i in range(len(confidences)):
        confidences[i][0] = 1.0
    

Metrics

Accuracy metric

$$
\mathrm{Score}(A(nlc)) =
\begin{cases}
\max_{p \in A(nlc)} S(p) & \text{if } \exists\, p \in A(nlc) \text{ such that } S(p) > 0, \\
\frac{1}{|A(nlc)|} \sum_{p \in A(nlc)} S(p) & \text{otherwise.}
\end{cases}
$$
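
A minimal sketch of this aggregation in Python, where A(nlc) is the set of predicted commands for an invocation and the per-prediction scores S(p) are assumed to be already computed:

    # Aggregate per-prediction scores S(p) over a candidate set A(nlc):
    # take the maximum if any prediction scores above zero, otherwise the mean.
    def score_candidates(scores):
        if any(s > 0 for s in scores):
            return max(scores)
        return sum(scores) / len(scores)

    print(score_candidates([-0.4, 0.7, 0.1]))  # -> 0.7 (best positive candidate)
    print(score_candidates([-0.4, -0.6]))      # -> -0.5 (mean, no positive candidate)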

Reproduce

  1. We used a machine with 2x Nvidia 2080 Ti GPUs and 64 GB of memory running Ubuntu 18.04 LTS.
  2. Change batch_size in nl2cmd.yaml to the largest value your GPU can support without an OOM error.
  3. Train multiple models by modifying seed in nl2cmd.yaml; also change save_model so you do not overwrite existing models (see the sketch after this list).
  4. Hand-pick the models that perform best on the local test set and put their paths in main.py.
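
One way to script step 3 is to generate a separate config per seed, each saving checkpoints under its own save_model prefix. A minimal sketch, assuming PyYAML is installed and nl2cmd.yaml sits in src/model (the seed values and output paths are illustrative):

    import yaml

    # Write one training config per seed so checkpoints do not overwrite each
    # other; then train each with: onmt_train -config nl2cmd_seed<seed>.yaml
    with open("src/model/nl2cmd.yaml") as f:
        base = yaml.safe_load(f)

    for seed in (1, 2, 3):
        cfg = dict(base)
        cfg["seed"] = seed
        cfg["save_model"] = f"run/model_seed{seed}"
        with open(f"src/model/nl2cmd_seed{seed}.yaml", "w") as f:
            yaml.safe_dump(cfg, f)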

Acknowledgment

This work was supported in part by NSF Award #1552836, At-scale analysis of issues in cyber-security and software engineering.

License

See the LICENSE file for license rights and limitations (MIT).