Utilities for constructing a large dataset of LLVM IR

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

LLVM-IR Dataset Utilities

This repository contains utilities to construct large LLVM IR datasets from multiple sources.

Getting Started

To get started with the dataset construction utilities, we suggest using the packaged pipenv or poetry configuration to isolate the Python environment from your system installation and from other environments.

Pipenv

To get started with pipenv, run

pipenv install

or, if you want to use the packaged lockfile,

pipenv sync

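After that you are ready to activate the environment and install the dataset construction utilities into it. Because pipenv shell starts a new subshell, run the two commands one after the other rather than chaining them with &&:

pipenv shell
pip install .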

If you want to develop the package, install it in editable mode inside the activated environment instead:

pipenv shell
pip install -e .

Poetry

To get started with poetry, run

poetry install

which draws the exact dependency versions from the packaged lockfile and installs an editable version of the dataset construction utilities into the environment. To install only the dependencies, run

poetry install --no-root

To then develop inside poetry's virtual environment, launch a shell with

poetry shell
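
Whichever of the two tools you use, it can be useful to confirm that the interpreter is running inside a virtual environment and that the package was installed into it. The following is only a minimal sketch: the helper name describe_environment is hypothetical, and the import name llvm_ir_dataset_utils is an assumption taken from the directory used in the commands of this README.

```python
import importlib.util
import sys


def describe_environment(package="llvm_ir_dataset_utils"):
    """Report whether a virtual environment is active and `package` is importable.

    The default package name is an assumption based on the repository layout.
    """
    # Inside a virtual environment, sys.prefix differs from sys.base_prefix.
    in_virtualenv = sys.prefix != sys.base_prefix
    # find_spec returns None for a top-level module that cannot be found.
    importable = importlib.util.find_spec(package) is not None
    return {"in_virtualenv": in_virtualenv, "package_importable": importable}


if __name__ == "__main__":
    print(describe_environment())
```

If package_importable is reported as False after installation, the install most likely went into a different interpreter than the one running the check.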

Creating First Data

To create your first small batch of IR data, run the following from the root directory of the package:

python3 ./llvm_ir_dataset_utils/tools/corpus_from_description.py \
  --source_dir=/path/to/store/dataset/to/source \
  --corpus_dir=/path/to/store/dataset/to/corpus \
  --build_dir=/path/to/store/dataset/to/build \
  --corpus_description=./corpus_descriptions_test/manual_tree.json

Beware! You'll need llvm-objcopy on your $PATH. If it is missing, an easy way to obtain it is to install an LLVM release from your preferred package channel, such as apt, dnf or pacman. Alternatively, build LLVM from source; only the LLVM project itself needs to be enabled during the build, i.e. -DLLVM_ENABLE_PROJECTS="llvm".
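
To check up front whether the tool can be found, a small lookup on the $PATH is enough. This is a minimal sketch using only the Python standard library; the helper name find_tool is hypothetical and not part of the package.

```python
import shutil


def find_tool(name="llvm-objcopy"):
    """Return the full path of `name` if it is on $PATH, otherwise None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path is None:
        print("llvm-objcopy was not found on $PATH; install LLVM first.")
    else:
        print(f"Using llvm-objcopy at {path}")
```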

You will then find a set of .bc files in /path/to/store/dataset/to/corpus/tree, which you can convert into textual LLVM IR with llvm-dis, i.e. from inside that folder

llvm-dis *.bc
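
If you would rather drive the conversion from Python, for example to write each .ll file explicitly next to its .bc file, the same step can be sketched as below. The function names are hypothetical and not part of the package; the sketch assumes llvm-dis is on your $PATH and uses its -o option to name the output file.

```python
import subprocess
from pathlib import Path


def disassemble_commands(corpus_dir, llvm_dis="llvm-dis"):
    """Build one llvm-dis command per .bc file found directly in corpus_dir."""
    commands = []
    for bitcode in sorted(Path(corpus_dir).glob("*.bc")):
        # Write the textual IR next to the bitcode file, e.g. foo.bc -> foo.ll.
        output = bitcode.with_suffix(".ll")
        commands.append([llvm_dis, str(bitcode), "-o", str(output)])
    return commands


def disassemble_all(corpus_dir, llvm_dis="llvm-dis"):
    """Run llvm-dis on every .bc file and stop at the first failure."""
    for command in disassemble_commands(corpus_dir, llvm_dis):
        subprocess.run(command, check=True)
```

Calling disassemble_all("/path/to/store/dataset/to/corpus/tree") would then process the directory from the example above. Building the commands separately from running them makes it easy to inspect what would be executed first.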

Last steps into the dataloader to be described here.

Corpus Description

Basics of the corpus description to be outlined here to easily enable someone to point the package at a new source.

IR Sources

The package contains a number of builders that target LLVM-based languages and common build systems and extract IR from them:

  • Individual projects (C/C++)
  • Rust crates
  • Spack packages
  • Autoconf
  • CMake
  • Julia packages
  • Swift packages