
Provides Kokkos performance portable parallel programming in Python.

PyKokkos is a framework for writing performance portable kernels in Python. At a high level, PyKokkos translates type-annotated Python code into C++ Kokkos and automatically generates bindings for the translated C++ code. PyKokkos also makes use of Python bindings for constructing Kokkos Views.


Clone pykokkos-base and create a conda environment:

git clone https://github.com/kokkos/pykokkos-base.git
cd pykokkos-base/
conda create --name pyk --file requirements.txt
conda activate pyk

Once the necessary packages have been downloaded and installed, install pykokkos-base with CUDA and OpenMP enabled:


Other pykokkos-base configuration and installation options can be found in that project's README. Note that this step compiles a large number of bindings, which can take a while to complete. Please open an issue if you run into any problems with pykokkos-base.

Once pykokkos-base has been installed, clone pykokkos and install its requirements:

cd ..
git clone https://github.com/kokkos/pykokkos.git
cd pykokkos/
conda install -c conda-forge pybind11 cupy patchelf
pip install --user -e .

Note that cupy is only required if CUDA is enabled in pykokkos-base.

To verify that pykokkos has been installed correctly, install pytest and run the tests:

conda install pytest
python runtests.py
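
Before running the full test suite, a quick check that the packages named above can be located may save time. The following is a minimal sketch and not part of PyKokkos. It assumes the import names match the package names used in the install steps (pykokkos, pybind11, pytest, cupy), and the helper name find_missing is hypothetical. It only looks the modules up and does not import them.

```python
import importlib.util

# Assumption: import names match the package names from the install steps.
REQUIRED = ["pykokkos", "pybind11", "pytest"]
OPTIONAL = ["cupy"]  # only needed when CUDA is enabled in pykokkos-base


def find_missing(names):
    """Return the names that cannot be located on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing_required = find_missing(REQUIRED)
missing_optional = find_missing(OPTIONAL)

if missing_required:
    print("Missing required packages:", ", ".join(missing_required))
else:
    print("All required packages were found.")
if missing_optional:
    print("Missing optional packages:", ", ".join(missing_optional))
```

An empty "missing required" list does not replace running the tests; it only rules out the most common setup mistake of using the wrong environment.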

Please open an issue for help with installation.


Hello World

PyKokkos provides decorators for marking kernel code. The following code snippet shows a kernel that prints hello world from each thread:

import pykokkos as pk

@pk.workunit
def hello(i: int):
    pk.printf("Hello, World! from i = %d\n", i)

Kernel definitions are marked with the @pk.workunit decorator. Each workunit requires an integer argument that represents the thread ID. This argument must be type annotated with int.

This workunit can be called using the parallel_for function:

pk.parallel_for(10, hello)

PyKokkos will translate the workunit to C++ and Kokkos and compile it the first time it is called. Calling the same workunit again will skip the translation and compilation steps.
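
The compile-once behavior can be pictured as a cache keyed by workunit. The following is a minimal pure-Python sketch of that idea, not PyKokkos's actual implementation. Here translate_and_compile is a hypothetical stand-in for the real translation step, launch is a hypothetical serial replacement for parallel_for, and using the function's qualified name as the cache key is an assumption made for illustration.

```python
from typing import Callable, Dict

# Hypothetical cache: maps a workunit's qualified name to its "compiled" form.
_cache: Dict[str, Callable[[int], None]] = {}
compile_calls = 0  # counts how often the expensive step runs


def translate_and_compile(workunit: Callable[[int], None]) -> Callable[[int], None]:
    """Stand-in for the expensive step; here it returns the function unchanged."""
    global compile_calls
    compile_calls += 1
    return workunit


def launch(n: int, workunit: Callable[[int], None]) -> None:
    """Run workunit for each i in [0, n), compiling only on the first launch."""
    key = workunit.__qualname__
    if key not in _cache:
        _cache[key] = translate_and_compile(workunit)
    kernel = _cache[key]
    for i in range(n):  # serial loop; a real backend runs iterations in parallel
        kernel(i)


seen = []


def record(i: int) -> None:
    seen.append(i)


launch(3, record)  # first launch: compiles, then runs
launch(3, record)  # second launch: cache hit, runs only
print("compile calls:", compile_calls)
```

The sketch makes the cost model visible: the first launch pays for translation and compilation, and later launches of the same workunit pay only for execution.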


PyKokkos uses Views as its main n-dimensional array data structure. In Python, Views behave like NumPy arrays. The following snippet shows how to create a View and some of the basic operations it supports:

v = pk.View([10], int) # create a 1D integer view of size 10
v.fill(0) # initialize v with zeros
v[0] = 10
print(v) # prints the contents of the view

Views and primitive types can be passed to workunits as arguments. The following code snippet shows a workunit that adds a scalar to all elements of a view:

import pykokkos as pk

@pk.workunit
def add(i: int, v: pk.View1D[int], x: int):
    v[i] += x

if __name__ == "__main__":
    n = 10
    v = pk.View([n], int)

    pk.parallel_for(n, add, v=v, x=1)

As with the thread ID, arguments must be type annotated. They can then be passed via parallel_for using keyword arguments.
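
When debugging a workunit, it can help to compare its result against a serial reference. The sketch below is plain Python rather than PyKokkos. The helper serial_for is hypothetical: it mimics the calling convention of parallel_for (iteration count, workunit, keyword arguments), rejects parameters that lack a type annotation, and is used here with a list standing in for a View.

```python
import inspect
from typing import Callable, List


def serial_for(n: int, workunit: Callable[..., None], **kwargs) -> None:
    """Call workunit(i, **kwargs) for each i in [0, n), one at a time."""
    for name, param in inspect.signature(workunit).parameters.items():
        if param.annotation is inspect.Parameter.empty:
            raise TypeError(f"parameter '{name}' needs a type annotation")
    for i in range(n):
        workunit(i, **kwargs)


def add(i: int, v: List[int], x: int) -> None:
    # Same body as the workunit, operating on a plain list.
    v[i] += x


n = 10
v = [0] * n
serial_for(n, add, v=v, x=1)
print(v)
```

Because each iteration touches only v[i], the serial and parallel results should agree regardless of the order in which iterations run.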


Workunits can also be defined as methods inside a functor. Functors are Python classes that contain one or more workunits as methods. The following code snippet shows an example of a functor:

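import pykokkos as pk

@pk.functor
class Functor:
    def __init__(self, v, x):
        # Member variables must be type annotated in the constructor
        self.v: pk.View1D[int] = v
        self.x: int = x

    @pk.workunit
    def add(self, i: int):
        self.v[i] += self.x

    @pk.workunit
    def print(self, i: int):
        pk.printf("v[%d] = %d\n", i, self.v[i])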

if __name__ == "__main__":
    n = 10
    v = pk.View([n], int)

    f = Functor(v, 1)
    pk.parallel_for(n, f.add)
    pk.parallel_for(n, f.print)

Workunits defined in functors only include the thread ID argument in their definition. Instead of taking arguments, they access Views and primitive values as member variables. These member variables must be defined in the constructor __init__ with type annotations. This avoids repeating the same type annotations across multiple non-method workunits.

To call these workunits, the functor class must first be instantiated. Individual workunits are called using parallel_for by passing in the workunit method as an argument. The member variables will hold the values the functor instance contains at the time parallel_for is called.
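
The point about member variables can be illustrated without PyKokkos. The following is a plain-Python model, under the assumption that a launch simply reads the instance's attributes when it runs. AddFunctor and serial_for are hypothetical names, and a list stands in for the View.

```python
from typing import Callable, List


class AddFunctor:
    def __init__(self, v: List[int], x: int):
        self.v: List[int] = v
        self.x: int = x

    def add(self, i: int) -> None:
        self.v[i] += self.x


def serial_for(n: int, workunit: Callable[[int], None]) -> None:
    """Serial stand-in for a parallel launch over [0, n)."""
    for i in range(n):
        workunit(i)


n = 4
v = [0] * n
f = AddFunctor(v, 1)

serial_for(n, f.add)  # this launch sees x = 1
first = list(v)

f.x = 5               # change the member before the next launch
serial_for(n, f.add)  # this launch sees x = 5
second = list(v)

print(first, second)
```

In this model, updating a member between launches changes what the next launch computes, which matches the description above of values being taken at the time of the call.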

Other Examples

The following table shows a list of other PyKokkos examples, as well as their corresponding C++ Kokkos implementations:

Example             PyKokkos version    Kokkos version
parallel_reduce     PyKokkos            Kokkos
Cuda                PyKokkos            Kokkos
team_policy         PyKokkos            Kokkos
team_vector_loop    PyKokkos            Kokkos
subview             PyKokkos            Kokkos
mdrange             PyKokkos            Kokkos
nstream             PyKokkos            Kokkos
stencil             PyKokkos            Kokkos
transpose           PyKokkos            Kokkos
ExaMiniMD           PyKokkos            Kokkos


PyKokkos has only been tested on Ubuntu with GCC 7.5.0 and NVCC 10.2. Support for other platforms and compilers is currently experimental. For help with setup and installation, please open a GitHub issue.


Nader Al Awar (nader.alawar@utexas.edu)

Steven Zhu (stevenzhu@utexas.edu)


If you have used PyKokkos in a research project, please cite this research paper:

  author = {Al Awar, Nader and Zhu, Steven and Biros, George and Gligoric, Milos},
  title = {A Performance Portability Framework for Python},
  booktitle = {International Conference on Supercomputing},
  pages = {467-478},
  year = {2021},


This project is partially funded by the U.S. Department of Energy, National Nuclear Security Administration under Award Number DE-NA0003969 (PSAAP III).