

CusADi

Parallelizing symbolic expressions from CasADi on the GPU

parallel_MPC.mp4

Evaluating MPC in parallel for thousands of instances of the MIT Humanoid. CusADi efficiently parallelizes functions for applications such as reinforcement learning, parallel simulation, and sensitivity analysis.

Overview and more videos: https://www.youtube.com/watch?v=NxeujmgcEL4

ArXiv: https://arxiv.org/abs/2408.09662


If you use this work, please use the following citation:

@misc{jeon2024cusadigpuparallelizationframework,
      title={CusADi: A GPU Parallelization Framework for Symbolic Expressions and Optimal Control}, 
      author={Se Hwan Jeon and Seungwoo Hong and Ho Jae Lee and Charles Khazoom and Sangbae Kim},
      year={2024},
      eprint={2408.09662},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2408.09662}, 
}

Details of our work can be found in the paper (arXiv:2408.09662).

Table of Contents
  1. About
  2. Getting Started
  3. Usage
  4. Extensions
  5. License
  6. Contact

About The Project

cusadi is a framework for parallelizing arbitrary symbolic functions on the GPU. It extends the graph expression structure of casadi by generating and compiling CUDA kernels for efficient evaluation of thousands of function instances. casadi is particularly well-suited for this because it exploits sparsity when building its expression graphs, which keeps the computation efficient.


The CasADi expression graph is evaluated by iterating through instructions and performing each operation on scalar values. CusADi can exploit this same structure and evaluate the function in parallel with the GPU by vectorizing each operation to act on tensors of data instead of scalars.
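
As a rough illustration of this idea, the sketch below evaluates a toy instruction list once per instance with scalars, and then once for the whole batch with arrays. The instruction format and the names `INSTRUCTIONS`, `OPS`, and `evaluate` are hypothetical and are not CasADi's or cusadi's internals; NumPy arrays on the CPU stand in for GPU tensors.

```python
import numpy as np

# Hypothetical toy instruction list for f(theta, g, l) = -g * sin(theta) / l.
# Each entry is (operation, output register, input indices).
# This is a stand-in, not CasADi's internal instruction format.
INSTRUCTIONS = [
    ("input", 0, (0,)),    # w[0] = theta
    ("input", 1, (1,)),    # w[1] = g
    ("input", 2, (2,)),    # w[2] = l
    ("sin",   3, (0,)),    # w[3] = sin(w[0])
    ("mul",   4, (1, 3)),  # w[4] = w[1] * w[3]
    ("div",   5, (4, 2)),  # w[5] = w[4] / w[2]
    ("neg",   6, (5,)),    # w[6] = -w[5]
]

# NumPy functions accept scalars and arrays alike, so the same interpreter
# serves both the serial and the batched case.
OPS = {
    "sin": np.sin,
    "mul": np.multiply,
    "div": np.divide,
    "neg": np.negative,
}


def evaluate(instructions, inputs):
    """Run the instruction list on scalars or on equal-length arrays."""
    n_registers = 1 + max(out for _, out, _ in instructions)
    work = [None] * n_registers
    for op, out, args in instructions:
        if op == "input":
            work[out] = inputs[args[0]]
        else:
            work[out] = OPS[op](*(work[i] for i in args))
    # The last instruction writes the function output.
    return work[instructions[-1][1]]


batch = 1000
rng = np.random.default_rng(0)
theta = rng.uniform(-np.pi, np.pi, batch)
g = np.full(batch, 9.81)
l = rng.uniform(0.5, 2.0, batch)

# Serial: one pass over the instructions per instance.
serial = np.array(
    [evaluate(INSTRUCTIONS, (theta[i], g[i], l[i])) for i in range(batch)]
)

# Batched: a single pass, each operation acting on the whole batch at once.
batched = evaluate(INSTRUCTIONS, (theta, g, l))

max_abs_error = float(np.max(np.abs(serial - batched)))
```

The instruction list is traversed once in the batched case regardless of batch size, which is the property that makes the structure attractive for GPU evaluation.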

We demonstrate cusadi on several robotics applications, including parallelizing MPC, augmenting RL with parallelized dynamics algorithms, creating parallel simulations, and running parameter sweeps. Benchmarks show that cusadi offers significant speedups, especially for reinforcement learning applications where significant overhead is incurred from CPU-GPU data transfer.


(Left): Speedups from CusADi relative to serial CPU evaluation of five functions, each increasing in complexity.
(Right): Speedups from CusADi relative to serial CPU evaluation, including overhead time from CPU-GPU data transfer.


Getting Started

Dependencies

cusadi was built on the following. Other versions may work, but are untested.

Installation

  1. Clone this repository (standalone, or into a larger project)
    git clone https://github.com/se-hwan/cusadi
    
  2. (Optional) Set up a virtual environment for required Python dependencies
    python -m venv .cusadi_venv         # Python virtual environment
    source .cusadi_venv/bin/activate
    
  3. Install cusadi. From the root of the cloned repository, run:
    pip install -e .
    
  4. Compile the test function for parallelization.
    python run_codegen.py --fn=test
    
  5. Evaluate the parallelized function for accuracy. The error should be ~1e-10 or smaller. If successful, then cusadi is ready for use!
    python run_cusadi_function_test.py --fn=test
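
The internals of the test script are not described here. As a minimal sketch of what an accuracy check of this kind can look like, the snippet below compares a reference result against a candidate result using the largest elementwise absolute difference and the ~1e-10 threshold mentioned in step 5. The helper `max_abs_error` and the synthetic data are assumptions for illustration only.

```python
import numpy as np


def max_abs_error(reference, candidate):
    """Largest elementwise absolute difference between two result arrays."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    if reference.shape != candidate.shape:
        raise ValueError(
            f"shape mismatch: {reference.shape} vs {candidate.shape}"
        )
    return float(np.max(np.abs(reference - candidate)))


# Stand-in data: a "reference" result and the same result with tiny
# floating-point noise, mimicking serial CPU output versus GPU output.
rng = np.random.default_rng(0)
reference = rng.standard_normal((1000, 3))
candidate = reference + 1e-13 * rng.standard_normal(reference.shape)

error = max_abs_error(reference, candidate)
passed = error < 1e-10   # threshold taken from step 5 above
print(f"max abs error: {error:.3e}, passed: {passed}")
```

For PyTorch tensors, the same quantity can be computed as `(reference - candidate).abs().max()`.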
    


Usage

  1. Define a symbolic casadi function to parallelize. This function could be the dynamics of a system, a value iteration update, a controller, etc., and does not need to be limited to optimal control applications. There are many examples and tutorials available online:

    • https://web.casadi.org/docs/
    • https://folk.ntnu.no/vladimim/#/6
    • https://www.syscop.de/files/2022ws/numopt/ex1.pdf
    • In our case, we'll parallelize the dynamics of a pendulum as a trivial example. casadi is available in C++, MATLAB, and Python, but we'll do this example in MATLAB:
      % Add casadi to MATLAB path. Do this for wherever the casadi folder is downloaded from https://web.casadi.org/get/
      addpath(genpath('[CASADI_FOLDER_LOCATION]'));
      import casadi.*
      
      % Symbolic expressions
      x_pend = casadi.SX.sym('x_pend', 2, 1);             % pendulum state
      g = casadi.SX.sym('g', 1, 1);                       % pendulum parameters, gravity and length
      l = casadi.SX.sym('l', 1, 1);
      dt = casadi.SX.sym('dt', 1, 1);                     % simulation timestep
      
      f_pend = [x_pend(2); -g*sin(x_pend(1))/l];          % pendulum dynamics
      J_pend = jacobian(f_pend, x_pend);                  % Jacobian of pendulum dynamics w.r.t the state
      omega_next = x_pend(2) - (g*sin(x_pend(1))/l)*dt;   % Semi-implicit Euler integration of dynamics
      theta_next = x_pend(1) + omega_next*dt;
      x_next_pend = [theta_next; omega_next];
      
      % Export and save as casadi functions
      % [casadi_expr] = casadi.Function('[fn_name]', {[input1, input2, ...]}, {[output1, output2, ...]})
      fn_dynamics = casadi.Function('fn_dynamics', {x_pend, g, l}, {f_pend});
      fn_jacobian = casadi.Function('fn_jacobian', {x_pend, g, l}, {J_pend});
      fn_sim_step = casadi.Function('fn_sim_step', {x_pend, g, l, dt}, {x_next_pend});
      
      fn_dynamics.save('fn_dynamics.casadi')
      fn_sim_step.save('fn_sim_step.casadi')
      fn_jacobian.save('fn_jacobian.casadi')
      
  2. Move the saved functions to src/casadi_functions of the cusadi directory.

  3. Compile the functions for parallelization. From the root directory of cusadi:

    python run_codegen.py --fn=fn_dynamics
    python run_codegen.py --fn=fn_sim_step
    python run_codegen.py --fn=fn_jacobian
    
  4. Evaluate the parallelized functions with cusadi in PyTorch

    import os
    import torch
    from cusadi import *
    from casadi import *
    
    BATCH_SIZE = 10000
    
    x0 = torch.rand((BATCH_SIZE, 2), device='cuda', dtype=torch.double)                 # Random initial states
    g = 9.81 * torch.ones((BATCH_SIZE, 1), device='cuda', dtype=torch.double)           # Gravity for each env.
    l = torch.rand((BATCH_SIZE, 1), device='cuda', dtype=torch.double)                  # Random lengths for each env.
    dt = torch.linspace(0.001, 0.1, BATCH_SIZE, device='cuda', dtype=torch.double)      # Varying timestep for each env.
    
    fn_casadi_sim_step = casadi.Function.load(os.path.join(CUSADI_FUNCTION_DIR, "fn_sim_step.casadi"))
    fn_cusadi_sim_step = CusadiFunction(fn_casadi_sim_step, BATCH_SIZE)
    fn_cusadi_sim_step.evaluate(x0, g, l, dt)           # Evaluate fn. with CUDA kernel 
    x_next = fn_cusadi_sim_step.outputs_sparse[0]       # Access results.
    
  5. Placing the evaluate() call from this example in a for loop creates a parallel simulator. You can also quickly sweep the effect of the parameters (l and g) or the timestep (dt) on the system.
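
To make the loop in step 5 concrete, here is a minimal CPU reference in NumPy that applies the same semi-implicit Euler update as fn_sim_step to a batch of pendulums while sweeping the length across the batch. It does not call cusadi; the names `sim_step` and `rollout`, the array shapes, and the parameter ranges are assumptions chosen for illustration. A reference like this can also be handy for cross-checking a GPU rollout.

```python
import numpy as np


def sim_step(x, g, l, dt):
    """Semi-implicit Euler step for a batch of pendulums.

    x has shape (batch, 2) holding [theta, omega]; g, l, and dt have
    shape (batch,). Mirrors the update used to build fn_sim_step.
    """
    theta, omega = x[:, 0], x[:, 1]
    omega_next = omega - (g * np.sin(theta) / l) * dt  # velocity first
    theta_next = theta + omega_next * dt               # then position
    return np.stack([theta_next, omega_next], axis=1)


def rollout(x0, g, l, dt, n_steps):
    """Apply sim_step n_steps times.

    Returns the trajectory with shape (n_steps + 1, batch, 2).
    """
    traj = np.empty((n_steps + 1,) + x0.shape)
    traj[0] = x0
    for k in range(n_steps):
        traj[k + 1] = sim_step(traj[k], g, l, dt)
    return traj


batch = 1000
rng = np.random.default_rng(0)
x0 = np.column_stack([rng.uniform(-0.5, 0.5, batch), np.zeros(batch)])
g = np.full(batch, 9.81)
l = np.linspace(0.5, 2.0, batch)   # sweep pendulum length across the batch
dt = np.full(batch, 0.001)

traj = rollout(x0, g, l, dt, n_steps=2000)
```

With cusadi, the body of the loop would instead be the evaluate() call from step 4, with each result fed back in as the next state.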


Extensions

  • Release 1.0
  • Support for JAX
  • Interface with cuBLAS, cuSPARSE
  • Explore CPU parallelism opportunities
  • Streamline exporting, saving, and compilation flow
  • IsaacGym/Orbit/IsaacLab examples
  • Public examples for parallelized MPC, other optimal controllers (coming soon!)


License

Distributed under the MIT License. See LICENSE.txt for more information.


Contact

Please feel free to reach out with any questions!

Se Hwan Jeon - sehwan@mit.edu
