/distributed-kge-poplar

The application is an end-user training and evaluation system for standard knowledge graph embedding models. It was developed to optimise models for the WikiKG90Mv2 dataset.


Distributed KGE (C++)

IPU implementation of a sharded knowledge graph embedding (KGE) model, implemented in Poplar for execution using DRAM on an IPU-POD16.

Note that this is a low-level implementation for advanced IPU usage.

See also: PyTorch KGE demo notebook.

Usage

First-time setup

  1. Ensure clang++ and ninja are installed.
  2. Clone this repository with --recurse-submodules.
  3. Install Poplar SDK and activate with source $POPLAR_SDK_DIR/enable.
  4. Create and activate a Python virtual environment.
  5. Install Python requirements with pip install -r requirements-dev.txt.
  6. Check everything is working by running ./dev (see also ./dev --help).

For example:

sudo apt-get install clang ninja-build
git clone --recurse-submodules REPO
source $POPLAR_SDK_DIR/enable
virtualenv -p python3 .venv
source .venv/bin/activate
pip install -r requirements-dev.txt
./dev --help
./dev

Training

Our standard training script is scripts/run_training.py. To build the core C++ code, add it to the path and run training:

./dev train

This trains a TransE model with embedding size 256.
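For readers unfamiliar with TransE, the sketch below shows the standard scoring function from the literature, score(h, r, t) = -||h + r - t||, in plain Python. It is illustrative only: the function name transe_score, the default L1 norm and the 4-dimensional example vectors are assumptions made for this example, not details of this repository's Poplar implementation.

```python
"""Minimal TransE scoring sketch (illustrative; not this repository's code)."""
from typing import Sequence


def transe_score(head: Sequence[float], relation: Sequence[float],
                 tail: Sequence[float], p: int = 1) -> float:
    """Return -||head + relation - tail||_p; a higher score is more plausible."""
    if not (len(head) == len(relation) == len(tail)):
        raise ValueError("head, relation and tail must have the same size")
    # Element-wise distance between the translated head and the tail.
    diffs = [abs(h + r - t) for h, r, t in zip(head, relation, tail)]
    norm = sum(d ** p for d in diffs) ** (1.0 / p)
    return -norm


# Hypothetical 4-dimensional embeddings (the ./dev train default is size 256).
head = [1.0, 0.0, 2.0, -1.0]
relation = [0.5, 1.0, -1.0, 0.0]
good_tail = [1.5, 1.0, 1.0, -1.0]  # exactly head + relation
bad_tail = [0.0, 0.0, 0.0, 0.0]

good_score = transe_score(head, relation, good_tail)
bad_score = transe_score(head, relation, bad_tail)
print("good tail ranks above bad tail:", good_score > bad_score)
```

A tail that equals head + relation receives the maximum score of zero, and every other tail scores lower in proportion to its distance.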

Note:

  • Build and development automation is provided by the ./dev script, which generates a ninja build file (build/build.ninja).
  • You may wish to change C++ compiler, e.g. env CXX=g++ ./dev ...
  • The training script expects the OGB WikiKG90Mv2 dataset to be downloaded to $OGBWIKIKG_PATH; see the OGB WikiKG90Mv2 page for instructions.
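Since the training script depends on $OGBWIKIKG_PATH, a quick pre-flight check can save a failed launch. The snippet below is a minimal sketch using only the Python standard library; the helper name resolve_dataset_path is hypothetical and is not part of this repository, and it only checks that the variable points at an existing directory, not that the dataset inside it is complete.

```python
"""Pre-flight check for $OGBWIKIKG_PATH (hypothetical helper, not repo code)."""
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_dataset_path(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the dataset directory named by OGBWIKIKG_PATH, or raise."""
    env = os.environ if env is None else env
    value = env.get("OGBWIKIKG_PATH")
    if not value:
        raise RuntimeError("OGBWIKIKG_PATH is not set")
    path = Path(value).expanduser()
    if not path.is_dir():
        raise RuntimeError(f"OGBWIKIKG_PATH is not a directory: {path}")
    return path


if __name__ == "__main__":
    try:
        print("Dataset directory:", resolve_dataset_path())
    except RuntimeError as error:
        print("Dataset check failed:", error)
```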

About

The application is a self-contained research platform for KGE models, using Poplar/PopLibs directly for execution on IPU, PyTorch for data loading and numpy for batching and interchange. Since model checkpoints would be very large, all training, evaluation and prediction tasks are run in a single job via run_training.py.

The main components are:

See also doc/design.md for a more detailed description of the design of the application.

Poplar remote buffers

We rely on Poplar's access to streaming memory in this code (see IPU memory architecture), which enables sparse access to a much larger memory store. This is accessed via the remote memory buffers API.

One implementation detail of interest is that we stack all remote embedding state (consisting of entity features, embeddings and optimiser state) into a single remote buffer, which helps to minimise memory overhead due to padding.
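To illustrate why stacking can reduce padding, the sketch below compares the padded per-entity footprint of three separate buffers against one stacked buffer. All numbers are hypothetical: the alignment granularity ALIGN_BYTES and the per-entity byte counts in ROW_PARTS are chosen for illustration and are not taken from Poplar or from this repository, and the helper names are made up for this example.

```python
"""Padding overhead: separate buffers vs. one stacked buffer (illustrative)."""
from typing import Dict, Tuple


def padded_size(n_bytes: int, align: int) -> int:
    """Round n_bytes up to the next multiple of align."""
    return -(-n_bytes // align) * align


def row_offsets(parts: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Return the (start, end) byte range of each part within a stacked row."""
    offsets, start = {}, 0
    for name, size in parts.items():
        offsets[name] = (start, start + size)
        start += size
    return offsets


ALIGN_BYTES = 64  # hypothetical row alignment, an assumption for this sketch
ROW_PARTS = {     # hypothetical bytes per entity
    "features": 24,
    "embedding": 100,
    "optimiser_state": 100,
}

# Separate buffers: every part is padded to the alignment on its own.
separate = sum(padded_size(n, ALIGN_BYTES) for n in ROW_PARTS.values())
# Stacked buffer: the parts are concatenated first, then padded once.
stacked = padded_size(sum(ROW_PARTS.values()), ALIGN_BYTES)

print("bytes per entity, separate buffers:", separate)
print("bytes per entity, stacked buffer:  ", stacked)
print("stacked row layout:", row_offsets(ROW_PARTS))
```

Because rounding up is applied once to the sum rather than once per part, the stacked layout can never need more padding than the separate layout under this model.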

References & license

The included code is released under an MIT license (see LICENSE).

Copyright (c) 2022 Graphcore Ltd. Licensed under the MIT License

Our dependencies are:

Component | Type             | About                                                                              | License
pybind11  | submodule        | C++/Python interop library (github)                                                | BSD 3-Clause
Catch2    | submodule        | C++ unit testing framework (github)                                                | Boost
OGB       | requirements.txt | Open Graph Benchmark dataset and task definition (paper, website)                  | MIT
PyTorch   | requirements.txt | Machine learning framework (website)                                               | BSD 3-Clause
WandB     | requirements.txt | Weights and Biases client library (website), for optional logging to wandb servers | MIT

We also use ninja (website) with clang++ from LLVM (website) to build C++ code, and additional Python dependencies for development/testing (see requirements-dev.txt).

The OGB WikiKG90Mv2 dataset is licensed under CC-0.