MegaVul - The largest, high-quality, extensible, continuously updated, C/C++/Java vulnerability dataset

MegaVul 📦 (Paper)

The largest, high-quality, extensible, continuously updated, C/C++/Java function-level vulnerability dataset.

Note: MegaVul now also provides vulnerability data for Java.

With over 17,000 identified vulnerable functions and 320,000 non-vulnerable functions extracted from 9,000 vulnerability fix commits, MegaVul provides multi-dimensional data to help you train state-of-the-art sequence-based or graph-based vulnerability detectors.


Getting Started

We offer three versions of the pre-crawled MegaVul dataset, along with Joern graphs extracted from its functions.

The differences between the three versions are as follows:

  • cve_with_graph_abstract_commit.json is the raw dataset with the complete hierarchical structure. It includes information such as CVE, Commit, File, and Functions.
  • megavul.json is a flattened version of cve_with_graph_abstract_commit.json for easier use. It keeps all fields but loses the hierarchical structure.
  • megavul_simple.json is a simplified version of megavul.json that gives a more concise representation of the dataset. It retains essential fields such as Functions and CVE IDs while omitting details such as function parameter lists and commit messages.

megavul_graph.zip provides Joern graphs, including node and edge information, for the functions in MegaVul (mostly used for graph-based vulnerability detection neural networks). It is provided separately to save bandwidth and storage space (unzipping requires around 20 GB of free space). Only about 87% of functions successfully generate graphs.
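As a quick sanity check after downloading, a minimal sketch along the following lines could tally labels and graph coverage in the flattened records. It assumes each record carries the is_vul and func_graph_path_before fields used in the Simple UseCase example in this README, and that a missing graph is stored as null. The helper name summarize and the sample records are purely illustrative, and the specification may list further graph fields that this sketch ignores.

```python
from typing import Iterable


def summarize(records: Iterable[dict]) -> dict:
    """Count vulnerable / non-vulnerable records and graph availability."""
    total = vul = with_graph = 0
    for rec in records:
        total += 1
        if rec.get("is_vul"):
            vul += 1
        # Assumption: a function without an exported graph has None here.
        if rec.get("func_graph_path_before") is not None:
            with_graph += 1
    return {
        "total": total,
        "vulnerable": vul,
        "non_vulnerable": total - vul,
        "graph_coverage": with_graph / total if total else 0.0,
    }


# Illustrative records, not real dataset entries.
sample = [
    {"is_vul": True, "func_graph_path_before": "a.json"},
    {"is_vul": False, "func_graph_path_before": None},
]
stats = summarize(sample)
print(stats)
```

With the real file, the list returned by json.load on megavul_simple.json could be passed to summarize directly.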

🔗 Download Dataset

Download from Cloud Drive

  1. cve_with_graph_abstract_commit.json
  2. megavul.json
  3. megavul_simple.json
  4. megavul_graph.zip (on Linux, extract it with: unzip megavul_graph.zip -d /path/to/graph)

โฉ Simple UseCase

Refer to the specification for more information about the fields in the dataset.

More code examples can be found in the examples folder.

The following code reads megavul_simple.json and, when available, the Joern graph of one function:

import json
from pathlib import Path
graph_dir = Path('../megavul/storage/result/c_cpp/graph')

with Path("../megavul/storage/result/c_cpp/megavul_simple.json").open(mode='r') as f:
    megavul = json.load(f)
    item = megavul[9]
    cve_id = item['cve_id'] # CVE-2022-24786
    cvss_vector = item['cvss_vector']   # AV:N/AC:L/Au:N/C:P/I:P/A:P
    is_vul = item['is_vul'] # True
    if is_vul:
        func_before = item['func_before']  # vulnerable function

    func_after = item['func']   # function after the fix (i.e., clean function)
    abstract_func_after = item['abstract_func']

    diff_line_info = item['diff_line_info'] # {'deleted_lines': ['pjmedia_rtcp_comm .... ] , 'added_lines': [ .... ] }
    git_url = item['git_url']   # https://github.com/pjsip/pjproject/commit/11559e49e65bdf00922ad5ae28913ec6a198d508

    if item['func_graph_path_before'] is not None: # graphs of some functions cannot be exported successfully
        graph_file_path = graph_dir / item['func_graph_path_before']
        with graph_file_path.open(mode='r') as gf:  # close the graph file once it is parsed
            graph_file = json.load(gf)
        nodes, edges = graph_file['nodes'], graph_file['edges']
        print(nodes)    # [{'version': '0.1', 'language': 'NEWC', '_label': 'META_DATA', 'overlays': ....
        print(edges)    # [{'innode': 196, 'outnode': 2, 'etype': 'AST', 'variable': None}, ...]
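Building on the fields read above, the two small helpers below sketch how the records might be prepared for training. to_labeled_samples turns records into (code, label) pairs for a sequence-based detector, and edges_by_type groups graph edges by their etype for a graph-based one. Both helper names, the labeling convention (1 = vulnerable, 0 = clean), and the sample data are assumptions made for illustration; they are not part of MegaVul.

```python
from collections import defaultdict


def to_labeled_samples(records):
    """Return (source_code, label) pairs; 1 = vulnerable, 0 = clean."""
    samples = []
    for rec in records:
        if rec.get("is_vul"):
            # The pre-fix version is the vulnerable sample.
            samples.append((rec["func_before"], 1))
        # Assumption: the post-fix (or untouched) version counts as clean.
        samples.append((rec["func"], 0))
    return samples


def edges_by_type(edges):
    """Group (innode, outnode) pairs by edge type, e.g. 'AST'."""
    grouped = defaultdict(list)
    for edge in edges:
        grouped[edge["etype"]].append((edge["innode"], edge["outnode"]))
    return dict(grouped)


# Illustrative inputs shaped like the fields shown above.
records = [
    {"is_vul": True, "func_before": "int f(){gets(b);}",
     "func": "int f(){fgets(b,8,stdin);}"},
    {"is_vul": False, "func": "int g(){return 0;}"},
]
samples = to_labeled_samples(records)

edges = [
    {"innode": 196, "outnode": 2, "etype": "AST", "variable": None},
    {"innode": 2, "outnode": 3, "etype": "CFG", "variable": None},
]
grouped = edges_by_type(edges)
```

Keeping one edge list per type makes it straightforward to feed only selected relations (for example AST only) into a graph model.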

๐Ÿ› Crawling From Scratch

💡 Prerequisites

Install dependencies manually

Install Python environment

You can choose one of the following three methods to install the Python dependencies.

  • Install from conda (Recommended)
conda env create -f environment.yml
  • Direct installation into existing environments
pip install -r requirements.txt
  • Install from python venv
python -m venv .megavul-env
source .megavul-env/bin/activate
pip install -r requirements.txt

Install other dependencies

# install node.js and tree-sitter
sudo apt install -y curl 
curl -sL https://deb.nodesource.com/setup_19.x | sudo -E bash -
sudo apt install -y nodejs 
npm i tree-sitter-cli
which node && which tree-sitter 

# install java,scala and sbt
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk version
sdk install java 17.0.6-amzn
sdk install scala 3.2.2
sdk install sbt 1.8.2
which java && which scala && which sbt 
 
# install ruby, ruby-gem, linguist
sudo apt-get install build-essential cmake pkg-config libicu-dev zlib1g-dev libcurl4-openssl-dev libssl-dev ruby-dev
gem install github-linguist
which github-linguist
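The which checks above can also be run in one go from Python. The sketch below uses the standard library's shutil.which to report which of the tools are not on PATH. The tool list mirrors the commands above, and the helper name missing_tools is illustrative.

```python
import shutil

# Executables installed by the commands above.
REQUIRED_TOOLS = ["node", "tree-sitter", "java", "scala", "sbt", "github-linguist"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

Note that a tool installed only inside a project directory (for example through a local npm install) will be reported as missing unless its directory is on PATH.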

Docker image

We provide an out-of-the-box Docker image; pull it and run MegaVul straight away!

docker pull icyrockton/megavul
docker run -it icyrockton/megavul
(megavul) root@8345683f69d9:/MegaVul#: vim megavul/config.yaml
(megavul) root@8345683f69d9:/MegaVul#: vim megavul/github_token.txt
(megavul) root@8345683f69d9:/MegaVul#: python megavul/main.py

Config file preparations

Fill in the configuration items in megavul/config.yaml and megavul/github_token.txt.

Generate a GitHub REST API token

  1. Open https://github.com/settings/tokens
  2. Choose Generate new token (classic)
  3. Fill in a name; no scopes need to be checked
  4. Generate the token; it starts with ghp_xxxx or gho_xxxx.

A sample config.yaml file is as follows:

proxy:
  enable: false
  http_url: http://127.0.0.1:7890
  https_url:  http://127.0.0.1:7890

dependencies:
  java: /home/tom/.sdkman/candidates/java/current/bin
  scala: /home/tom/.sdkman/candidates/scala/current/bin
  sbt: /home/tom/.sdkman/candidates/sbt/current/bin
  node: /usr/local/node/bin
  tree-sitter: /usr/local/tree-sitter
  github-linguist: /usr/local/bin/github-linguist

crawling_language:
  c_cpp  # [c_cpp, java]

log_level:
  INFO # [DEBUG, INFO, WARNING, ERROR]
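Before starting a long crawl it can help to check the configuration for obvious mistakes. The sketch below validates a configuration that has already been parsed into a Python dict (for example with a YAML library, which is left out here to keep the block self-contained). The allowed values come from the comments in the sample above; the function name config_problems and the checks themselves are assumptions, not MegaVul's own validation.

```python
ALLOWED_LANGUAGES = {"c_cpp", "java"}
ALLOWED_LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}
EXPECTED_DEPENDENCIES = {
    "java", "scala", "sbt", "node", "tree-sitter", "github-linguist",
}


def config_problems(cfg: dict) -> list:
    """Return human-readable problems found in a parsed config."""
    problems = []
    if cfg.get("crawling_language") not in ALLOWED_LANGUAGES:
        problems.append("crawling_language must be one of c_cpp, java")
    if cfg.get("log_level") not in ALLOWED_LOG_LEVELS:
        problems.append("log_level must be DEBUG, INFO, WARNING or ERROR")
    missing = EXPECTED_DEPENDENCIES - set(cfg.get("dependencies") or {})
    if missing:
        problems.append("missing dependency paths: " + ", ".join(sorted(missing)))
    proxy = cfg.get("proxy") or {}
    if proxy.get("enable") and not (proxy.get("http_url") and proxy.get("https_url")):
        problems.append("proxy is enabled but http_url/https_url is empty")
    return problems


# Dict equivalent of the sample config.yaml above.
sample_cfg = {
    "proxy": {"enable": False, "http_url": "http://127.0.0.1:7890",
              "https_url": "http://127.0.0.1:7890"},
    "dependencies": {
        "java": "/home/tom/.sdkman/candidates/java/current/bin",
        "scala": "/home/tom/.sdkman/candidates/scala/current/bin",
        "sbt": "/home/tom/.sdkman/candidates/sbt/current/bin",
        "node": "/usr/local/node/bin",
        "tree-sitter": "/usr/local/tree-sitter",
        "github-linguist": "/usr/local/bin/github-linguist",
    },
    "crawling_language": "c_cpp",
    "log_level": "INFO",
}
problems = config_problems(sample_cfg)
```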

Create an empty file named github_token.txt and fill it with all your GitHub tokens (one token per line).

Sample file:

ghp_xxxx11111
ghp_xxxx22222
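To illustrate the file format, the sketch below parses token text with one token per line, rejects lines that do not start with ghp_ or gho_, and cycles through the tokens round-robin. The rotation is only an assumption about how several tokens could be used to spread requests; it is not taken from MegaVul's code, and parse_tokens is a hypothetical helper.

```python
from itertools import cycle


def parse_tokens(text: str) -> list:
    """Parse one token per line, ignoring blank lines."""
    tokens = [line.strip() for line in text.splitlines() if line.strip()]
    bad = [t for t in tokens if not t.startswith(("ghp_", "gho_"))]
    if bad:
        # Report only the count so tokens never end up in logs.
        raise ValueError(f"{len(bad)} line(s) do not look like GitHub tokens")
    return tokens


# Same content as the sample file above.
tokens = parse_tokens("ghp_xxxx11111\nghp_xxxx22222\n")

# Round-robin rotation: each request would take the next token.
rotation = cycle(tokens)
first_three = [next(rotation) for _ in range(3)]
```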

🚀 Run the pipelines

Install megavul as a python module

pip install -e .

Run the dataset collection pipelines for MegaVul

python megavul/main.py

☕ Have a cup of coffee and wait for the dataset collection to complete (8+ hours).

Intermediate JSON results are stored in ./megavul/storage/result and ./megavul/storage/cache.


🛠️ Extend More

Add more data sources

If you find that the reference links of some CVEs point to websites that contain potential commits, you can add the parsed commit URLs to mining_commit_urls_from_reference_urls.
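As an illustration of what parsing a commit URL can look like, the sketch below extracts the owner, repository and commit hash from GitHub commit URLs of the form shown in the git_url field earlier in this README. The regular expression and the helper name extract_commit are assumptions for illustration; how the result is wired into mining_commit_urls_from_reference_urls depends on code not shown here.

```python
import re

# Matches https://github.com/<owner>/<repo>/commit/<sha>
COMMIT_URL = re.compile(
    r"https?://github\.com/([\w.-]+)/([\w.-]+)/commit/([0-9a-fA-F]{7,40})"
)


def extract_commit(url: str):
    """Return owner/repo/sha for a GitHub commit URL, or None."""
    match = COMMIT_URL.search(url)
    if match is None:
        return None
    owner, repo, sha = match.groups()
    return {"owner": owner, "repo": repo, "sha": sha.lower()}


commit = extract_commit(
    "https://github.com/pjsip/pjproject/commit/"
    "11559e49e65bdf00922ad5ae28913ec6a198d508"
)
not_a_commit = extract_commit("https://example.com/advisory/123")
```

Other hosting platforms use different URL layouts, so each would need its own pattern.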

Add more Git platforms

All Git platforms inherit from GitPlatformBase; to add a new Git platform, implement all of its methods.

Tree-sitter enhancements

We have extended the tree-sitter grammar to recognize more C/C++ macros (e.g. asmlinkage, UNUSED) from other projects such as Linux.

The modified tree-sitter grammar file can be found here: grammar.js.

Add more languages

Our function separator is built on tree-sitter; ParserC is one example.

If you want to extend the function separator to more languages, such as Java, you can use tree-sitter-java to extend ParserBase.

📚 Appendix

MegaVul Statistics

| | MegaVul (C/CPP) | MegaVul (Java) | Big-Vul |
|---|---|---|---|
| Number of Repositories | 1062 | 362 | 310 |
| Number of CVE IDs | 8476 | 775 | 3539 |
| Number of CWE IDs | 176 | 115 | 92 |
| Number of Commits | 9288 | 902 | 4058 |
| Number of Vul/Non-Vul Functions | 17975/335898 | 2433/39516 | 10900/177736 |
| Success Rate of Graph Generation | 87% | 100% | None |

Updated: 2024/04
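The function counts in the table show that vulnerable functions are a small minority in each dataset, which matters when choosing class weights or sampling strategies. The short sketch below computes the vulnerable share from the table's numbers; the variable and function names are illustrative.

```python
# (vulnerable, non-vulnerable) function counts from the statistics table.
COUNTS = {
    "MegaVul (C/CPP)": (17975, 335898),
    "MegaVul (Java)": (2433, 39516),
    "Big-Vul": (10900, 177736),
}


def vulnerable_share(vul: int, non_vul: int) -> float:
    """Fraction of all functions that are labeled vulnerable."""
    return vul / (vul + non_vul)


shares = {name: vulnerable_share(*counts) for name, counts in COUNTS.items()}
for name, share in shares.items():
    print(f"{name}: {share:.2%} vulnerable")
```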

Specification

For dataset specification and graph specification, please refer to SPECIFICATION.md

Citation

If you use MegaVul for your research, please cite our MSR 2024 paper:

@InProceedings{
}

License

MegaVul is licensed under the GPL 3.0, as found in the LICENSE file.