Hashed Lookup Table based Matrix Multiplication (halutmatmul) built on MADDness/bolt

Primary language: Python. License: MIT.

Halutmatmul

Algorithmic CI

GPU Tests (Vast.ai), PyTest, Linting, MyPy, C++ build

Hardware CI

HW Synth + PAR OpenROAD, RTL Linting, HW Design Verification

General Information

This repo is used for algorithmic exploration. I will try to update it with as much hardware information as I am allowed to publish.

Install

# install conda environment & activate
conda env create -f environment_gpu.yml
conda activate halutmatmul

# IIS prefixed env
conda env create -f environment_gpu.yml --prefix /scratch/janniss/conda/halutmatmul_gpu

# install CLI
./scripts/install-cli.sh

# now use CLI with
halut --help

# or without install
./halut --help

Hacker News mention (comments only) and discussion

Hardware OpenROAD flow results

| All Designs | ASAP7 | NanGate45 |
|---|---|---|
| All Report | All | All |
| History | History | History |

Total Circuit (M=2)

| halut_matmul | ASAP7 | NanGate45 |
|---|---|---|
| Area [μm^2] | 9643.6787 | 140647.7656 |
| Freq [MHz] | 666.7 | 333.3 |
| GE | 110.238 kGE | 176.25 kGE |
| Std Cell [#] | 68186 | 68994 |
| Voltage [V] | 0.77 | 1.1 |
| Util [%] | 45.0 | 59.2 |
| TNS | -1086.59 | -0.31 |
| Clock Net | Clock_net | Clock_net |
| Gallery | Gallery Viewer | Gallery Viewer |
| Metrics | Metrics Viewer | Metrics Viewer |
| Report | Report Viewer | Report Viewer |
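
The GE row is not defined in this README. Assuming it stands for gate equivalents (cell area divided by the area of one reference gate), the reference-gate area implied by the table can be recovered by dividing the Area row by the GE row. The sketch below does only that arithmetic on the Total Circuit numbers; `implied_area_per_ge` is a hypothetical helper name, not part of this repo.

```python
# Values copied from the "Total Circuit (M=2)" table.
designs = {
    "ASAP7": {"area_um2": 9643.6787, "kge": 110.238},
    "NanGate45": {"area_um2": 140647.7656, "kge": 176.25},
}


def implied_area_per_ge(area_um2, kge):
    """Area of one gate equivalent in um^2, assuming GE = area / reference-gate area."""
    return area_um2 / (kge * 1000.0)


for name, d in designs.items():
    # Prints the per-GE area each technology's row implies.
    print(name, implied_area_per_ge(**d))
```

Under the same assumption, the encoder and decoder tables should imply the same per-GE area for each technology, which is a quick consistency check on the reported numbers.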

Encoder

| halut_encoder_4 | ASAP7 | NanGate45 |
|---|---|---|
| Area [μm^2] | 4844.5405 | 69711.9531 |
| Freq [MHz] | 666.7 | 333.3 |
| GE | 55.378 kGE | 87.358 kGE |
| Std Cell [#] | 34334 | 33746 |
| Voltage [V] | 0.77 | 1.1 |
| Util [%] | 45.0 | 58.7 |
| TNS | 0.0 | 0.0 |
| Clock Net | Clock_net | Clock_net |
| Gallery | Gallery Viewer | Gallery Viewer |
| Metrics | Metrics Viewer | Metrics Viewer |
| Report | Report Viewer | Report Viewer |

Decoder

| halut_decoder | ASAP7 | NanGate45 |
|---|---|---|
| Area [μm^2] | 4749.8286 | 68923.7891 |
| Freq [MHz] | 666.7 | 333.3 |
| GE | 54.296 kGE | 86.37 kGE |
| Std Cell [#] | 33709 | 34395 |
| Voltage [V] | 0.77 | 1.1 |
| Util [%] | 44.4 | 58.9 |
| TNS | -11340.5098 | -0.66 |
| Clock Net | Clock_net | Clock_net |
| Gallery | Gallery Viewer | Gallery Viewer |
| Metrics | Metrics Viewer | Metrics Viewer |
| Report | Report Viewer | Report Viewer |

Progress Slides

Slides preview

CUDA kernels

I am aware that a lot could still be optimized here (warp-level optimizations etc.), but the kernels were only developed for fast analysis.

Results

Caveat: no retraining or fine-tuning has been done yet!

Single Layer replacement with C=32 and K=16
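
C and K are not defined in this README. Assuming they follow the MADDness convention (C codebooks per input row, K prototypes per codebook), the sketch below shows what C=32 and K=16 imply for the code width and the lookup-table size. The output width `n_out=64` and one byte per table entry are illustrative assumptions, not values from this repo, and `lut_footprint` is a hypothetical helper.

```python
def lut_footprint(C, K, n_out, bytes_per_entry=1):
    """Rough size figures for a lookup-table layer (assumed MADDness-style layout)."""
    bits_per_code = (K - 1).bit_length()  # ceil(log2(K)); 4 bits for K=16
    entries = C * K * n_out               # one table row per (codebook, prototype)
    return {
        "code_bits_per_input_row": C * bits_per_code,
        "lut_entries": entries,
        "lut_bytes": entries * bytes_per_entry,
    }


# n_out=64 is an assumed output width, chosen only for illustration.
print(lut_footprint(C=32, K=16, n_out=64))
```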

LeViT (Source)

SOTA Vision Transformer on ImageNet 1K (figure: LeViT Results)

ResNet-50 (only interesting layers in analysis)

Legacy Classifier on ImageNet 1K (figure: ResNet-50 Results)

Depthwise separable CNN

on Google Speech v2 (figure: DS-CNN Results)

C, K and encoding_algorithm parameter sweep for ResNet-50

Data visualizer

Offline learning convergence on ResNet-50

The goal was to find out how much offline training data is needed to get the maximum accuracy.

ResNet-50 Convergence Results

Formalism

Some definitions about the forward path.

Encode kernel
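
As a rough illustration of the encode step, here is a minimal NumPy sketch. It assumes the MADDness-style encoding that this project builds on: each codebook owns a small balanced binary decision tree, each tree level compares one input column against a per-node threshold, and the leaf index is the code. The array layouts (`split_dims`, heap-ordered `thresholds`) and the function name `encode_tree` are assumptions made for this sketch; the actual kernels in this repo may store and traverse the tree differently, and other encoding algorithms may be supported.

```python
import numpy as np


def encode_tree(A, split_dims, thresholds):
    """Encode rows of A into one code per codebook.

    A:          (N, D) input matrix
    split_dims: (C, L) column index compared at each tree level (assumed layout)
    thresholds: (C, 2**L - 1) per-node thresholds in heap order (assumed layout)
    returns:    (N, C) integer codes in [0, 2**L)
    """
    N = A.shape[0]
    C, L = split_dims.shape
    codes = np.zeros((N, C), dtype=np.int64)
    for c in range(C):
        node = np.zeros(N, dtype=np.int64)  # position within the current level
        for level in range(L):
            offset = 2**level - 1           # first node of this level in heap order
            vals = A[:, split_dims[c, level]]
            go_right = vals > thresholds[c, offset + node]
            node = 2 * node + go_right.astype(np.int64)
        codes[:, c] = node
    return codes


# Tiny example: one codebook, a depth-2 tree (K = 4 leaves).
A = np.array([[0.0, 5.0], [2.0, -1.0]])
split_dims = np.array([[0, 1]])
thresholds = np.array([[1.0, 0.0, 3.0]])
print(encode_tree(A, split_dims, thresholds))
```

With K=16 the tree would have depth 4, so each input row needs four comparisons per codebook and no multiplications.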

Read and accumulate LUTs kernel
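
The read-and-accumulate step can be sketched in NumPy as follows. It assumes a product-quantization-style layout: the table holds, for every codebook and prototype, the precomputed dot products of that prototype with the matching rows of the weight matrix, and the output row is the sum of one table row per codebook. The names `build_lut` and `read_acc_lut` and the `(C, K, M)` table shape are assumptions for this sketch, not this repo's API.

```python
import numpy as np


def build_lut(prototypes, B):
    """Precompute the table offline.

    prototypes: (C, K, d_sub) prototypes per codebook
    B:          (C * d_sub, M) weight matrix, rows grouped by codebook
    returns:    (C, K, M) table of prototype-times-weights dot products
    """
    C, K, d_sub = prototypes.shape
    B_sub = B.reshape(C, d_sub, B.shape[1])
    return np.einsum("ckd,cdm->ckm", prototypes, B_sub)


def read_acc_lut(codes, lut):
    """Look up one table row per codebook and accumulate.

    codes: (N, C) integer codes in [0, K)
    lut:   (C, K, M)
    returns (N, M) approximation of A @ B
    """
    N, C = codes.shape
    out = np.zeros((N, lut.shape[2]), dtype=lut.dtype)
    for c in range(C):
        out += lut[c, codes[:, c], :]
    return out


# If every input row is built exactly from prototypes, the result equals A @ B.
rng = np.random.default_rng(0)
C, K, d_sub, M, N = 3, 4, 2, 5, 6
prototypes = rng.normal(size=(C, K, d_sub))
B = rng.normal(size=(C * d_sub, M))
codes = rng.integers(0, K, size=(N, C))
A = prototypes[np.arange(C), codes].reshape(N, C * d_sub)
print(np.allclose(read_acc_lut(codes, build_lut(prototypes, B)), A @ B))
```

For real inputs the rows of A only lie near their assigned prototypes, so the accumulated result approximates `A @ B` rather than matching it exactly.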

Links