Halutmatmul

Algorithmic CI

Hardware CI

General Information

Based on MADDness/Bolt.
More information about the base project is here
arXiv paper link

This repo is used for algorithmic exploration. I will try to update it with as much hardware information as I am allowed to publish.

Install

# install conda environment & activate
conda env create -f environment_gpu.yml
conda activate halutmatmul

# IIS prefixed env
conda env create -f environment_gpu.yml --prefix /scratch/janniss/conda/halutmatmul_gpu

# install CLI
./scripts/install-cli.sh

# now use CLI with
halut --help

# or without install
./halut --help

Hacker News mention (comments only) and discussion

HN: Bolt: Faster matrix and vector operations that run on compressed data

Hardware OpenROAD flow results

All Designs	ASAP7	NanGate45
All Report	All	All
History	History	History

Total Circuit (M=2)

halut_matmul	ASAP7	NanGate45
Area [μm^2]	9643.6787	140647.7656
Freq [MHz]	666.7	333.3
GE	110.238 kGE	176.25 kGE
Std Cell [#]	68186	68994
Voltage [V]	0.77	1.1
Util [%]	45.0	59.2
TNS	-1086.59	-0.31
Clock Net
Gallery	Gallery Viewer	Gallery Viewer
Metrics	Metrics Viewer	Metrics Viewer
Report	Report Viewer	Report Viewer
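
The GE (gate equivalent) rows in these tables appear consistent with dividing the cell area by the area of one NAND2 gate in the respective PDK. The sketch below checks that reading for the `halut_matmul` areas. The NAND2 reference areas are an assumption (they are not stated in this README) and were chosen because they reproduce the ratios in the table.

```python
# Assumed NAND2 reference areas in um^2; NOT taken from this README.
NAND2_AREA_UM2 = {"ASAP7": 0.08748, "NanGate45": 0.798}


def gate_equivalents(area_um2: float, pdk: str) -> float:
    """Return the gate-equivalent count: cell area divided by one NAND2 area."""
    return area_um2 / NAND2_AREA_UM2[pdk]


# Areas copied from the halut_matmul table above.
for pdk, area in [("ASAP7", 9643.6787), ("NanGate45", 140647.7656)]:
    print(f"{pdk}: {gate_equivalents(area, pdk) / 1e3:.2f} kGE")
```

Expressing area in GE makes the two technology nodes roughly comparable, since the raw areas differ by more than an order of magnitude.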

Encoder

halut_encoder_4	ASAP7	NanGate45
Area [μm^2]	4844.5405	69711.9531
Freq [MHz]	666.7	333.3
GE	55.378 kGE	87.358 kGE
Std Cell [#]	34334	33746
Voltage [V]	0.77	1.1
Util [%]	45.0	58.7
TNS	0.0	0.0
Clock Net
Gallery	Gallery Viewer	Gallery Viewer
Metrics	Metrics Viewer	Metrics Viewer
Report	Report Viewer	Report Viewer

Decoder

halut_decoder	ASAP7	NanGate45
Area [μm^2]	4749.8286	68923.7891
Freq [MHz]	666.7	333.3
GE	54.296 kGE	86.37 kGE
Std Cell [#]	33709	34395
Voltage [V]	0.77	1.1
Util [%]	44.4	58.9
TNS	-11340.5098	-0.66
Clock Net
Gallery	Gallery Viewer	Gallery Viewer
Metrics	Metrics Viewer	Metrics Viewer
Report	Report Viewer	Report Viewer

Progress Slides

`CUDA` kernels

I am aware that there is still a lot that could be optimized here (warps etc.), but these kernels were developed only for fast analysis.

Results

Caveat: no retraining or fine-tuning has been done yet!

Single Layer replacement with `C=32` and `K=16`
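
To make the two parameters concrete, here is a minimal sketch, assuming `C` is the number of codebooks (subspaces) and `K` the number of prototypes per codebook, as in MADDness. The helper `halut_layer_costs` and the layer shape `D=512`, `M=256` are hypothetical and only serve the illustration; they are not taken from the evaluated models.

```python
def halut_layer_costs(D: int, M: int, C: int = 32, K: int = 16) -> dict:
    """Rough per-layer sizes for a D x M weight matrix (hypothetical helper)."""
    if D % C != 0:
        raise ValueError("this sketch assumes D is divisible by C")
    bits_per_code = (K - 1).bit_length()  # K=16 -> 4-bit codes
    return {
        "subspace_width": D // C,                  # input dims per codebook
        "encoded_bits_per_row": C * bits_per_code,  # size of one encoded input row
        "lut_entries": C * K * M,                  # precomputed partial dot products
        "lookup_adds_per_row": C * M,              # table reads + additions per row
        "dense_macs_per_row": D * M,               # multiply-adds of the exact matmul
    }


# Hypothetical layer shape, chosen only for illustration.
print(halut_layer_costs(D=512, M=256))
```

With these assumed shapes, each input row is compressed to 128 bits and the multiply-accumulates of the dense layer are replaced by a smaller number of table lookups and additions.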

LeViT (Source)

SOTA Vision Transformer on ImageNet 1K

ResNet-50 (only interesting layers in analysis)

Legacy Classifier on ImageNet 1K

Depthwise separable CNN

on Google Speech v2

`C`, `K` and `encoding_algorithm` parameter sweep for ResNet-50

Data visualizer: be sure to select the ResNet-50 layers `layerX.X.convX`.

Offline learning convergence on ResNet-50

The goal was to find out how much offline training data is needed to reach the maximum accuracy.

Formalism

Some definitions about the forward path.
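
As a rough illustration of the forward path, the sketch below approximates one row of `A @ B` in two steps: encode the input row into one prototype index per codebook, then read the matching lookup-table (LUT) rows and accumulate them. It is a plain-Python toy under stated assumptions, not this repo's implementation. In particular, it encodes by nearest-prototype search for brevity, whereas MADDness uses a learned hash function; all function names and the tiny prototype and weight values are hypothetical.

```python
def encode(row, prototypes):
    """Return one prototype index per codebook (nearest prototype, for brevity)."""
    C = len(prototypes)
    w = len(row) // C  # width of each subspace
    codes = []
    for c, protos in enumerate(prototypes):
        sub = row[c * w:(c + 1) * w]
        dists = [sum((x - p) ** 2 for x, p in zip(sub, proto)) for proto in protos]
        codes.append(dists.index(min(dists)))
    return codes


def build_luts(prototypes, B):
    """Offline step: luts[c][k][m] = dot(prototypes[c][k], B[rows of subspace c][m])."""
    C, M = len(prototypes), len(B[0])
    w = len(B) // C
    luts = []
    for c, protos in enumerate(prototypes):
        rows = B[c * w:(c + 1) * w]
        luts.append([[sum(p[i] * rows[i][m] for i in range(w)) for m in range(M)]
                     for p in protos])
    return luts


def read_and_accumulate(codes, luts):
    """Sum the selected LUT rows over all codebooks to get one output row."""
    out = [0.0] * len(luts[0][0])
    for c, k in enumerate(codes):
        for m, value in enumerate(luts[c][k]):
            out[m] += value
    return out


# Toy values (hypothetical): D=4 inputs, C=2 codebooks, K=2 prototypes, M=2 outputs.
prototypes = [[[0, 0], [1, 1]], [[0, 1], [1, 0]]]
B = [[1, 0], [0, 1], [2, 0], [0, 2]]
luts = build_luts(prototypes, B)
codes = encode([0.9, 1.2, 0.1, 0.8], prototypes)
print(codes, read_and_accumulate(codes, luts))
```

The LUTs depend only on the prototypes and the weights, so they can be precomputed offline; at inference time only the encoding and the accumulation remain.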

Encode kernel

Read and accumulate LUTs kernel

Links
