
Artifact accompanying our ICSE '22 paper "Practical Automated Detection of Malicious npm Packages"


Artifact: Practical Automated Detection of Malicious npm Packages

This is the artifact for our ICSE '22 paper "Practical Automated Detection of Malicious npm Packages", which presents an approach to automatically detecting malicious npm packages based on a combination of three components: machine-learning classifiers trained on known samples of malicious and benign npm packages; a reproducer for identifying packages that can be rebuilt from source and hence are unlikely to be malicious; and a clone detector for finding copies of known malicious packages.

We would like to claim an Artifact Available badge, and hence make this data publicly available at https://github.com/githubnext/amalfi-artifact. No specific technology skills are required to use this data. There are no external dependencies, and no setup is required.

The artifact contains the code for training the classifiers, reproducing packages from source, and detecting clones; a description of the samples used for initial training; and input data and results for the two experiments reported in the paper: classifying and retraining on newly published packages over the course of one week (Section 4.1), and classifying manually labeled packages (Section 4.2). The sections below explain where to find this data in the repository.

The artifact does not contain the feature-extraction code, the contents and features of the training samples, the trained classifiers, or the contents and features of the samples considered in our experiments. The sections below explain why these could not be included.

What is included

Code for training classifiers

This is implemented as a Python script, code/training/train_classifier.py. Invoking the script with the --help option prints an explanation of the supported command-line flags. Note that this code is for reference purposes only and cannot be used to replicate our results: it requires as input the features of the samples in the training set, which are not included in the artifact, as explained below.

Code for reproducing packages

The reproducer is implemented as a shell script, code/reproducer/reproduce-package.sh, that, given a package name and a version, uses an auxiliary script, code/reproducer/build-package.sh, to rebuild the package from source and then compares the result to the published package.
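The comparison step can be illustrated with a short Python sketch. This is not the artifact's shell implementation: the function names file_digests and compare_trees are hypothetical, and the assumption that a rebuilt package and a published package are both available as unpacked directories is ours.

```python
import hashlib
from pathlib import Path


def file_digests(root):
    """Map each file's path (relative to root) to a SHA-256 digest of its bytes."""
    base = Path(root)
    return {
        p.relative_to(base).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in base.rglob("*")
        if p.is_file()
    }


def compare_trees(rebuilt, published):
    """Report files that exist on one side only or whose contents differ."""
    a, b = file_digests(rebuilt), file_digests(published)
    return {
        "only_rebuilt": sorted(a.keys() - b.keys()),
        "only_published": sorted(b.keys() - a.keys()),
        "changed": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }


def is_reproducible(rebuilt, published):
    """Assumption: a package counts as reproduced only if the trees match exactly."""
    return not any(compare_trees(rebuilt, published).values())
```

A real reproducer may need to tolerate benign differences, such as timestamps embedded in build output, which this exact-match sketch does not attempt.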

Code for detecting clones

The clone detector is implemented as a Python script code/clone-detector/hash_package.py which computes an MD5 hash for a package.
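As a rough illustration of content hashing for clone detection, the following sketch computes a single MD5 digest over an unpacked package directory. It is not the code in hash_package.py: the function name hash_package_dir, the choice to hash relative paths together with file contents, and the sorted traversal order are all assumptions made for this example. The hash actually used is described in Section 3.4 of the paper.

```python
import hashlib
from pathlib import Path


def hash_package_dir(root):
    """Return one MD5 hex digest covering every file under root.

    Assumption: files are visited in sorted relative-path order, and both
    the relative path and the file contents are fed into the digest.
    """
    base = Path(root)
    files = sorted(
        (p for p in base.rglob("*") if p.is_file()),
        key=lambda p: p.relative_to(base).as_posix(),
    )
    digest = hashlib.md5()
    for path in files:
        digest.update(path.relative_to(base).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between path and contents
        digest.update(path.read_bytes())
        digest.update(b"\0")  # separator between files
    return digest.hexdigest()
```

Two directories with identical file layouts and contents yield the same digest, so a lookup of the digest in a set of known-malicious hashes is enough to flag an exact copy.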

Description of basic corpus

The CSV file data/basic-corpus.csv lists information about the samples constituting the basic corpus our classifiers were trained on (Section 3.3). For each sample, it contains the package name and version number of the npm package it corresponds to, the hash of the sample (computed as described in Section 3.4), and an analysis label indicating whether the sample is malicious or benign.
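A file of this shape can be summarized with Python's standard csv module. The sketch below is only illustrative: the header names (name, version, hash, label), the label values, and the three sample rows are placeholders invented for the example, not data from the corpus. Check the actual header of data/basic-corpus.csv and adjust accordingly.

```python
import csv
import io
from collections import Counter

# Placeholder rows; header names and values are assumptions for illustration.
SAMPLE = """name,version,hash,label
example-a,1.0.0,0123456789abcdef0123456789abcdef,benign
example-b,0.0.1,fedcba9876543210fedcba9876543210,malicious
example-c,2.3.4,00112233445566778899aabbccddeeff,benign
"""


def count_labels(csv_file, label_column="label"):
    """Count how many samples carry each analysis label."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


if __name__ == "__main__":
    print(count_labels(io.StringIO(SAMPLE)))
```

To run this on the real file, pass an open file handle instead of the in-memory sample, for example open("data/basic-corpus.csv", newline="").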

Input data for experiments

The CSV files data/july-29.csv to data/august-4.csv list information about the samples considered in Experiment 1. Each file covers all new package versions published to the npm registry on the corresponding day, excluding private packages. The format is the same as for the basic corpus, but samples that were not manually reviewed are labeled as "not triaged".

Taken together, these files total about 8MB of data.

Results of experiments

The directory results/slide-window contains the results of Experiment 1, again in a series of CSV files named july-29.csv to august-4.csv. The file for each day lists every sample that was labeled as malicious by at least one classifier or by the clone detector. For each sample, we again list the package name, version, and hash as above; whether the sample was reproducible from source by the reproducer; whether manual analysis found the sample to be malicious; and whether each of the classifiers (decision-tree, naive-bayes, svm, hash) labeled it as malicious.
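Because each of these files lists only samples flagged by at least one detector, it supports computing per-classifier precision among flagged samples, but not recall. The following is a minimal sketch of that computation. The column names (manual, decision-tree, naive-bayes, svm), the "true"/"false" and "malicious"/"benign" encodings, and the sample rows are assumptions for illustration; the real files may use different headers and values.

```python
import csv
import io

# Assumed classifier column names; adjust to the real header.
CLASSIFIERS = ["decision-tree", "naive-bayes", "svm"]

# Placeholder rows, not taken from the artifact.
SAMPLE = """name,version,manual,decision-tree,naive-bayes,svm
pkg-a,1.0.0,malicious,true,true,false
pkg-b,2.0.0,benign,true,false,false
pkg-c,0.1.0,malicious,false,true,true
"""


def precision_per_classifier(csv_file):
    """Return, per classifier, the fraction of its flagged samples that
    manual analysis confirmed as malicious (None if it flagged nothing)."""
    flagged = {c: 0 for c in CLASSIFIERS}
    confirmed = {c: 0 for c in CLASSIFIERS}
    for row in csv.DictReader(csv_file):
        for c in CLASSIFIERS:
            if row[c] == "true":
                flagged[c] += 1
                if row["manual"] == "malicious":
                    confirmed[c] += 1
    return {
        c: (confirmed[c] / flagged[c] if flagged[c] else None)
        for c in CLASSIFIERS
    }


if __name__ == "__main__":
    print(precision_per_classifier(io.StringIO(SAMPLE)))
```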

The directory results/cross-validation contains the results of the 10-fold cross-validation on our basic corpus performed as part of Experiment 2, with one subdirectory per fold. For each fold, there are three TSV files, one per classifier, each with three columns: package name, package version, and the label assigned by the classifier.
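One way to evaluate a fold is to join a classifier's TSV output with the ground-truth labels from the basic corpus on (package name, version) and compute precision and recall. The sketch below assumes that the TSV files have no header row and that labels are the strings "malicious" and "benign"; neither assumption is confirmed by this document, and the function names are hypothetical.

```python
import csv
import io


def read_predictions(tsv_file):
    """Read a three-column TSV (name, version, label) into a dict.

    Assumption: the file has no header row.
    """
    reader = csv.reader(tsv_file, delimiter="\t")
    return {(name, version): label for name, version, label in reader}


def precision_recall(predicted, truth, positive="malicious"):
    """Compute precision and recall of the positive label.

    Samples missing from truth are treated as not positive.
    """
    tp = fp = fn = 0
    for key, label in predicted.items():
        actual = truth.get(key)
        if label == positive and actual == positive:
            tp += 1
        elif label == positive:
            fp += 1
        elif actual == positive:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


if __name__ == "__main__":
    # Placeholder predictions and ground truth, invented for illustration.
    tsv = "\n".join([
        "pkg-a\t1.0.0\tmalicious",
        "pkg-b\t1.0.0\tmalicious",
        "pkg-c\t1.0.0\tbenign",
        "pkg-d\t1.0.0\tbenign",
    ])
    truth = {
        ("pkg-a", "1.0.0"): "malicious",
        ("pkg-b", "1.0.0"): "benign",
        ("pkg-c", "1.0.0"): "malicious",
        ("pkg-d", "1.0.0"): "benign",
    }
    print(precision_recall(read_predictions(io.StringIO(tsv)), truth))
```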

Finally, the directory results/maloss contains the results from running our classifiers on the MalOSS dataset from Duan et al.'s paper "Towards Measuring Supply Chain Attacks on Package Managers for Interpreted Languages". As for the cross-validation experiment, there is one TSV file per classifier, with the same three columns as above.

Taken together, these files total less than 1MB of data.

Performance measurements

The directory results/timing contains logs of the time taken by the different stages of Experiment 1. The files results/timing/extract_features_time.csv and results/timing/extract_diffs_time.csv list the timings for extracting the features and the difference of features between versions, respectively, for ~500 random packages. The subdirectories contain the times for training (directory training) and prediction (directory prediction) for each classifier.

The files amount to about 6MB.

What is not included

Contents of samples

We were not able to include the contents of the samples in our basic corpus or the samples considered in Experiment 1, since some of them contain malicious and harmful code.

Features of samples

We were not able to include the features extracted from the samples either. Our approach might be deployed in production at some future date, and we do not want to give a prospective attacker any support in reverse-engineering our technique so as to avoid detection.

Code for extracting features

For the same reason, we were not able to include the feature-extraction code.

Trained classifiers

Finally, the classifiers trained on the basic corpus and as part of Experiment 1 unfortunately cannot be made public either, again due to concerns about abuse by malicious parties.