Fast-slic is an implementation of a SLIC variant that aims for very low runtime on the CPU. It runs 7-20 times faster than existing SLIC implementations and can process a 1280x720 image stream at 60 fps.
It started as part of a hobby project of mine that demanded true real-time capability in video stream processing. One of its pipelines was a postprocessing stage that smoothed the output image with SLIC superpixels and CRF. Unfortunately, no existing library met the real-time (>30 fps) goal. gSLICr was the most promising candidate, but I couldn't use it because of limited hardware and the inflexible license of CUDA. Therefore, I wrote a very fast variant of SLIC that uses only the CPU.
pip install fast_slic
import numpy as np
from fast_slic import Slic
from PIL import Image
with Image.open("fish.jpg") as f:
    image = np.array(f)
# import cv2; image = cv2.cvtColor(image, cv2.COLOR_RGB2LAB) # You can convert the image to CIELAB space if you need.
slic = Slic(num_components=1600, compactness=10)
assignment = slic.iterate(image) # Cluster Map
print(assignment)
print(slic.slic_model.clusters) # The cluster information of superpixels.
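
As a sketch of how the cluster map could feed the smoothing step mentioned above, the snippet below averages the colors inside each superpixel and paints them back. It assumes the cluster map is a 2D array of non-negative integer labels with the same height and width as the image. `mean_color_per_label` is a hypothetical helper, not part of fast_slic, and the tiny arrays stand in for a real image and the output of `slic.iterate(image)`.

```python
import numpy as np

def mean_color_per_label(image, labels):
    """Average the pixels of `image` (H x W x C) within each label of `labels` (H x W)."""
    flat_labels = labels.ravel()
    num_labels = int(flat_labels.max()) + 1
    counts = np.bincount(flat_labels, minlength=num_labels)
    means = np.zeros((num_labels, image.shape[2]), dtype=np.float64)
    for c in range(image.shape[2]):
        sums = np.bincount(flat_labels,
                           weights=image[..., c].ravel(),
                           minlength=num_labels)
        # Labels that own no pixels keep a mean of 0 instead of dividing by zero.
        means[:, c] = sums / np.maximum(counts, 1)
    return means

# Tiny synthetic stand-in for a real image and its cluster map (assumption).
image = np.array([[[10, 0, 0], [30, 0, 0]],
                  [[0, 8, 0], [0, 8, 4]]], dtype=np.uint8)
labels = np.array([[0, 0],
                   [1, 1]])

means = mean_color_per_label(image, labels)
# Paint every pixel with the mean color of the superpixel it belongs to.
smoothed = means[labels].astype(np.uint8)
```

With real data, `image` and `assignment` from the example above would take the place of the synthetic arrays.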
If your machine supports the AVX2 instruction set, you can make it three times faster by using the fast_slic.avx2.SlicAvx2 class instead of fast_slic.Slic. Haswell and newer Intel CPUs, Excavator, and Ryzen support this.
import numpy as np
# Much faster than the standard class
from fast_slic.avx2 import SlicAvx2
from PIL import Image
with Image.open("fish.jpg") as f:
    image = np.array(f)
# import cv2; image = cv2.cvtColor(image, cv2.COLOR_RGB2LAB) # You can convert the image to CIELAB space if you need.
slic = SlicAvx2(num_components=1600, compactness=10)
assignment = slic.iterate(image) # Cluster Map
print(assignment)
print(slic.slic_model.clusters) # The cluster information of superpixels.
If your machine is an ARM device with the NEON instruction set, which is commonly supported by recent mobile devices and even the Raspberry Pi, you can make it twice as fast by using the fast_slic.neon.SlicNeon class instead of the original one.
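
If the same script has to run on different machines, one possible way to choose between the three classes is to inspect the CPU feature flags. The sketch below parses text in the format of Linux's `/proc/cpuinfo`. It is Linux-only, and `cpu_flags` and `pick_variant` are hypothetical helpers, not part of fast_slic. Whether the library performs such a check itself is not stated here.

```python
def cpu_flags(cpuinfo_text):
    """Collect the feature flags listed in Linux /proc/cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # x86 kernels list features under "flags", ARM kernels under "Features".
        if key.strip().lower() in ("flags", "features"):
            flags.update(value.split())
    return flags

def pick_variant(flags):
    """Map CPU flags to the (module, class) names used in this README."""
    if "avx2" in flags:
        return ("fast_slic.avx2", "SlicAvx2")
    # 64-bit ARM kernels report NEON as "asimd".
    if "neon" in flags or "asimd" in flags:
        return ("fast_slic.neon", "SlicNeon")
    return ("fast_slic", "Slic")

# Sample text standing in for the contents of /proc/cpuinfo (assumption).
sample = "processor\t: 0\nflags\t\t: fpu sse sse2 avx avx2\n"
module_name, class_name = pick_variant(cpu_flags(sample))
```

On a real machine, the text would come from `open("/proc/cpuinfo").read()`, and the class could then be loaded with `getattr(importlib.import_module(module_name), class_name)`.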
With the maximum number of iterations set to 10, the run times of SLIC implementations on a 640x480 image are as follows:
Implementation | Run time (ms) |
---|---|
skimage.segmentation.slic | 216 |
cv2.ximgproc.createSuperpixelSLIC.iterate | 142 |
fast_slic.Slic (single-core build) | 20 |
fast_slic.avx2.SlicAvx2 (single-core build w/ AVX2 support) | 12 |
fast_slic.Slic (w/ OpenMP support) | 8.8 |
fast_slic.avx2.SlicAvx2 (w/ OpenMP and AVX2 support) | 5.6 |
(RGB-to-CIELAB conversion time is not included. Tested on a Ryzen 2600X, 6C12T, overclocked to 4.0 GHz.)
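
The benchmark script behind these numbers is not shown here. As a rough sketch of how a comparable measurement could be taken, the snippet below times a callable several times and keeps the fastest run, then compares it with the per-frame budget of a target frame rate. `best_time_ms` and `frame_budget_ms` are hypothetical helpers, and the lambda is a placeholder workload.

```python
import time

def best_time_ms(fn, repeats=5):
    """Run `fn` `repeats` times and return the fastest wall-clock time in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        best = min(best, elapsed_ms)
    return best

def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at the given frame rate."""
    return 1000.0 / fps

# Placeholder workload; swap in `lambda: slic.iterate(image)` to time fast_slic.
elapsed = best_time_ms(lambda: sum(range(10000)))
fits_60fps = elapsed <= frame_budget_ms(60)
```

At 60 fps the budget is about 16.7 ms per frame, so a segmentation step has to finish well inside that window to leave time for the rest of the pipeline.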
- The Windows build is considerably slower than the Linux and macOS builds, possibly because of OpenMP overhead.
- Small isolated areas of pixels are removed automatically, at a noticeable (but not huge) cost. You can skip this denoising step by setting min_size_factor to 0, e.g. Slic(num_components=1600, compactness=10, min_size_factor=0). This setting makes it 20-40% faster.
- To push performance to the limit, compile with the FAST_SLIC_AVX2_FASTER flag. (The gain was small on my PC, though.)
- Publish as a research paper
- Remove or merge small blobs
- Include simple CRF utilities
- Add tests
- Windows build
- More scalable parallel loop in cluster assignment. I suspect there is a false sharing problem in the loop.
- It would be great to optimize the loop further, perhaps with SIMD.