/MalPacDetector-core

The source code of MalPacDetector for npm.

Primary LanguageTypeScript

MalPacDetector

This repository hosts dataset MalnpmDB and malicious package detector MalPacDetector involved in the paper MalPacDetector: An LLM-based Malicious npm Package Detector.

Requirements

Environment

  • Operating System: Ubuntu 22.04
  • Python: Python 3.10.12
  • node.js: node.js v18.16.0

Setup

$ python3 configure.py

Follow the tooltips to configure the project. You can configure:

  • datasets path: Where to find npm packages. (default: datasets/MalnpmDB)
  • models path: Where to save trained models. (default: models)
  • reports path: Where to save prediction result reports. (default: reports)
  • features: Where to save extracted features. (default: features)
  • feature-positions: Where to save code line position information of extracted features. (default: feature-positions)

And, then use the following command to setup the project.

$ ./setup.sh

Once you setup the project, you will see the following folders:

  • conf: containing configuration and settings files.
  • datasets: containing MalnpmDB dataset.
  • feature-extract: containing feature extraction code files.
  • training: containing training and prediction code files.

If you using default configuration, you will see the following folders as well:

  • models: containing trained machine learning models.
  • reports: containing npm packages prediction reports.
  • features: containing npm packages' features extracted by feature extractor.
  • feature-positions: containing feature position information .

Usage

At first, you should activate python virtual environment:

$ source env/bin/activate

And there is a main python script file:

  • cli.py: for training a machine learning model and predicting npm packages. By specifying different paramaters, users can training different models or predicting different packages.

The paramaters available for performing a training or predicting task, which are listed below:

Options Description
-h Show all help information.
extract Extract features.
-h Show help information about extracting features.
-d npm dataset name.
train Train model.
-h Show help information about training models.
-m Malicious npm dataset name.
-b Benign npm dataset name.
-o Model used to train. ("NB", "MLP", "RF", "SVM")
-p Preprocess method. ("none", "standardlize", "min-max-scale")
-a Trainging or saving model. (training, save)
-hs smoothing of NB to save.
-hr Learning rate of MLP to save.
-hl Number of layers of MLP to save.
-hi Number of iterations of MLP to save.
-ho Optimization algorithm of MLP to save.
-ha Activation funtion of MLP to save.
-he Number of decision trees of RF to save.
-hd Maxium depth of RF to save.
-hg Gamma of SVM to save.
-hc C of SVM to save.
predict Predict npm packages.
-h Show help information about predicting npm pacakges.
-o Model used to predict.
-d npm dataset which stored gzip formatted npm packages.
-p npm package directory path.

For convenience, use the following command to show help information.

# Show all help information.
$ python3 cli.py -h

# Show help information about extracting features.
$ python3 cli.py extract -h

# Show help information about training models.
$ python3 cli.py train -h

# Show help information about predicting npm dataset.
$ python3 cli.py predict -h

Step 1: Extract features from npm dataset

The paramater related to model settings are presented in above table's field extract. The npm dataset should obey the following structure:

dataset_name
|__ <package_name-package_version1>.tar.gz
|__ <package_name-package_version2>.tar.gz
|__ ...
|__ <package_name-package_versionn>.tar.gz

The compressed package should have the following structure which is the formal npm structure:

package_name-package_version
|__ package
   |__ package.json
   |__ ...

Use the following command to extract features from npm dataset.

$ python3 cli.py extract -d <dataset_name>

Step 2: Train a classifier

The paramater related to model settings are stored in conf/settings.json, and are presented in above table's field train. This allows user to conveniently train different models or use different datasets.

Use the following command to train a classifier.

$ python3 cli.py train -a training -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name>

Step 3: Save the classifier

The paramater related to model settings are stored in conf/settings.json, and are presented in above table's field train.

Use the following command to train a classifier.

# NB
$ python3 cli.py train -a save -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name> -hs <smoothing>

# MLP
$ python3 cli.py train -a save -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name> -hr <learning_rate> -hl <number_of_layers> -hi <number_of_iterations> -ho <optimization_algorithm> -ha <activation_function>

# RF
$ python3 cli.py train -a save -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name> -he <number_of_decision_trees> -hd <maxium_depth>

# SVM
$ python3 cli.py train -a save -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name> -hg <Gamma> -hc <C>

Step 4: Predict npm packages

The paramater related to model settings are presented in above table's field predict.

Use the following command to predict packages.

$ python3 cli.py predict -o <model_name> -d <dataset_name>

For convenience, you can just use one command to pass above steps to predict a single package.

$ python3 cli.py predict -o <model_name> -p <package_path>

Hyperparameters

Hyperparameter values of the 4 classifiers, where boldface means the best hyperparameter value of the model.

Model Hyperparameter
NB Smoothing terms: (1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4)
MLP Learning rate: 5 values randomly selected from a uniform distribution with the interval [0.01, 0.2] (0.0505)
Number of hidden units: (16, 32, 100, 150)
Number of iterations: (400, 600)
Optimization algorithm: (lbfgs, adam)
RF Number of decision trees: (16, 32, 64, 100, 128, 256, 512)
Maximum depth: (3, 5, 7, 11, 15)
SVM Gamma: (scale, auto, 3 values randomly selected from a normal distribution with mean 0.2 and standard deviation 0.075) (scale)
C: 3 values randomly selected from a uniform distribution with the [0.5, 2.0] (1.0704)

Dataset and Results

  • Dataset: Containing malicious dataset mal and benign dataset ben in datasets/MalnpmDB which has 3258 and 4051 packages respectively.
  • Training and Validation Results: Model training and validation results are stored in trainging/result directory, which named ***_validation.csv, where *** represents model name.

Contact

Since the paper not having been published, and for security reasons, we can't place the malicious package dataset here. If you need the dataset, please send a request to hust_jianw@hust.edu.cn.

Any bug report or improvement suggestions will be appreciated. Please kindly cite our paper if you use the code or data in your work.

Thanks!