Our architecture

1. Feature Extraction

2. Preprocessing

Remove unnecessary characters/strings: double quotes
Remove unnecessary keys: file hashes (MD5, SHA1, SHA256), filename, …
Remove the VirusTotal report
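
The three preprocessing steps above can be sketched as follows. This is only an illustration, not the project's actual code: it assumes each extracted feature report is a JSON document, and the key names in `UNNECESSARY_KEYS` (including `virustotal`) are hypothetical placeholders that should be adapted to the real report schema.

```python
import json

# Hypothetical key names (assumption): the original list of keys is elided,
# so replace these with the keys that actually appear in the reports.
UNNECESSARY_KEYS = {"md5", "sha1", "sha256", "filename", "virustotal"}


def strip_keys(node, drop=UNNECESSARY_KEYS):
    """Recursively drop unwanted keys from a parsed JSON report."""
    if isinstance(node, dict):
        return {
            key: strip_keys(value, drop)
            for key, value in node.items()
            if key.lower() not in drop
        }
    if isinstance(node, list):
        return [strip_keys(item, drop) for item in node]
    return node


def preprocess(report_json: str) -> str:
    """Turn a raw JSON report into quote-free text for a text-based model."""
    cleaned = strip_keys(json.loads(report_json))
    text = json.dumps(cleaned, sort_keys=True)
    # Remove double quotes, as listed in the preprocessing steps.
    return text.replace('"', "")


raw = (
    '{"md5": "abc", "filename": "a.apk", '
    '"virustotal": {"positives": 3}, '
    '"permissions": ["android.permission.INTERNET"]}'
)
print(preprocess(raw))  # {permissions: [android.permission.INTERNET]}
```

Dropping hashes and filenames keeps the model from memorizing per-sample identifiers, and dropping the VirusTotal report avoids leaking the label into the input.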

3. Our model

3.1 Embedding layer

Reference: Bengio, Yoshua, Réjean Ducharme, and Pascal Vincent. "A Neural Probabilistic Language Model." Advances in Neural Information Processing Systems 13 (2000).

Evaluation

Environment

Name	Specification
Service	Google Colab (Pro)
GPU	T4
RAM	64 GB (recommended)

Accuracy and false-positive/false-negative rates (%)

Model\Metric	Accuracy	Precision	Recall	F1-Score	FPR	FNR
SimpleCNN-GRU	79.19	78.28	99.86	87.76	100.00	0.13
Standard CNN	73.86	77.36	94.18	84.94	99.50	5.81
CNN-BiLSTM	65.72	79.94	75.06	77.42	68.00	24.93
Our model	99.02	100.00	98.75	99.37	0.00	1.24
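
For reference, the six metrics in the table follow the standard confusion-matrix definitions. The sketch below shows how they can be computed. The function name and the example counts are hypothetical and are not taken from the experiments above.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the table's metrics (in percent) from confusion-matrix counts.

    tp/fn count malware samples, tn/fp count benign samples
    (assuming malware is the positive class).
    """
    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a class is empty.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    metrics = {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "fpr": ratio(fp, fp + tn),  # benign samples flagged as malware
        "fnr": ratio(fn, fn + tp),  # malware samples that were missed
    }
    return {name: round(100 * value, 2) for name, value in metrics.items()}


# Hypothetical counts, for illustration only.
print(classification_metrics(tp=80, fp=10, tn=90, fn=20))
```

Because FNR = 100 - Recall, the Recall and FNR columns of each row should sum to 100, which the table does up to rounding. An FPR near 100% means that almost every benign sample is classified as malware, even if accuracy still looks moderate on an imbalanced test set.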

Training time (min, sec) and model size (MB)

Model\Metric	Training Time	Model Size
SimpleCNN-GRU	15 min 28 sec	12.4
Standard CNN	14 min 46 sec	14.5
CNN-BiLSTM	24 min 59 sec	7.23
Our model	42 min 39 sec	552

Demo Section

See Android Detection Website (video + source code)

khangtictoc/Thesis.Text_base_Android_malware_classification.Model