Fast-slic is an implementation of a SLIC variant that aims for very low runtime on the CPU. It runs 7-20 times faster than existing SLIC implementations and can process a 1280x720 image stream at 60fps.
It started as part of a hobby project of mine that demanded true "real time" capability in video stream processing. One of its pipelines was a postprocessing step that smooths the image result with SLIC superpixels and a CRF. Unfortunately, no existing library was satisfactory for the real-time (>30fps) goal. gSLICr was the most promising candidate, but I couldn't use it because of my limited hardware and the inflexible license of CUDA. Therefore, I made a blazingly fast variant of SLIC that uses only the CPU.
pip install fast_slic
import numpy as np
from fast_slic import Slic
from PIL import Image
with Image.open("fish.jpg") as f:
    image = np.array(f)
# import cv2; image = cv2.cvtColor(image, cv2.COLOR_RGB2LAB) # You can convert the image to CIELAB space if you need.
slic = Slic(num_components=1600, compactness=10)
assignment = slic.iterate(image) # Cluster Map
print(assignment)
print(slic.slic_model.clusters) # The cluster information of superpixels.
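For intuition about what num_components=1600 means in practice, the short sketch below estimates how large each superpixel is. It relies on the standard SLIC convention of seeding K centers on a regular grid with step S = sqrt(N / K), where N is the pixel count. Whether Fast-slic seeds its clusters exactly this way is an assumption, and grid_step is a hypothetical helper, not part of the library.

```python
import math


def grid_step(width: int, height: int, num_components: int) -> float:
    """Approximate spacing in pixels between initial cluster centers.

    Uses the standard SLIC rule S = sqrt(N / K). That Fast-slic follows
    this rule exactly is an assumption made for illustration.
    """
    if num_components <= 0:
        raise ValueError("num_components must be positive")
    return math.sqrt(width * height / num_components)


# Compare the benchmark resolution with the 720p stream resolution.
for width, height in [(640, 480), (1280, 720)]:
    step = grid_step(width, height, num_components=1600)
    print(f"{width}x{height}: ~{step:.1f} px between centers, "
          f"~{step * step:.0f} px per superpixel")
```

Raising num_components shrinks the step and yields finer superpixels.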
If your machine supports the AVX2 instruction set, you can make it roughly three times faster by using the fast_slic.avx2.SlicAvx2 class instead of fast_slic.Slic. Haswell and newer Intel CPUs, AMD Excavator, and Ryzen support this.
import numpy as np
# Much faster than the standard class
from fast_slic.avx2 import SlicAvx2
from PIL import Image
with Image.open("fish.jpg") as f:
    image = np.array(f)
# import cv2; image = cv2.cvtColor(image, cv2.COLOR_RGB2LAB) # You can convert the image to CIELAB space if you need.
slic = SlicAvx2(num_components=1600, compactness=10)
assignment = slic.iterate(image) # Cluster Map
print(assignment)
print(slic.slic_model.clusters) # The cluster information of superpixels.
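If the same script has to run on machines with and without AVX2, one option is to pick the class at runtime. The sketch below is a best-effort approach that only works on Linux, where /proc/cpuinfo lists CPU flags. The helpers cpuinfo_has_avx2, machine_has_avx2 and pick_slic_class are hypothetical names and not part of fast_slic. How fast_slic.avx2 behaves on a CPU without AVX2 is not documented here, so the ImportError fallback is only a precaution.

```python
def cpuinfo_has_avx2(cpuinfo_text: str) -> bool:
    """Return True if a Linux /proc/cpuinfo dump lists the avx2 flag."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "avx2" in value.split():
            return True
    return False


def machine_has_avx2() -> bool:
    """Best-effort check; returns False where /proc/cpuinfo is unavailable."""
    try:
        with open("/proc/cpuinfo", encoding="utf-8") as f:
            return cpuinfo_has_avx2(f.read())
    except OSError:
        return False


def pick_slic_class():
    """Prefer SlicAvx2 when the CPU reports AVX2, otherwise use Slic."""
    if machine_has_avx2():
        try:
            from fast_slic.avx2 import SlicAvx2
            return SlicAvx2
        except ImportError:
            pass  # assumption: a build may lack the AVX2 extension
    from fast_slic import Slic
    return Slic
```

Usage would then be slic = pick_slic_class()(num_components=1600, compactness=10), followed by slic.iterate(image) as before.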
With the maximum number of iterations set to 10, the run times of SLIC implementations on a 640x480 image are as follows:
Implementation | Run time (ms) |
---|---|
skimage.segment.slic | 216 |
cv2.ximgproc.createSuperpixelSLIC.iterate | 142 |
fast_slic.Slic (single core build) | 52 |
fast_slic.avx2.SlicAvx2 (single core build w/ AVX2 support) | 22 |
fast_slic.Slic (w/ OpenMP support) | 15 |
fast_slic.avx2.SlicAvx2 (w/ OpenMP and AVX2 support) | 11 |
(RGB-to-CIELAB conversion time is not included. Tested on a Ryzen 2600X, 6C12T, overclocked to 4.0GHz.)
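To relate these numbers to the speedup and frame-rate claims, the following sketch computes the speedup of each implementation over skimage and checks which ones fit a per-frame time budget. The run times are copied from the table above. The helper names are hypothetical, and the budget check covers segmentation time only at 640x480, so it says nothing direct about 1280x720 streams.

```python
# Run times in milliseconds from the benchmark table (640x480, 10 iterations).
RUN_TIMES_MS = {
    "skimage.segment.slic": 216,
    "cv2.ximgproc.createSuperpixelSLIC.iterate": 142,
    "fast_slic.Slic (single core)": 52,
    "fast_slic.avx2.SlicAvx2 (single core)": 22,
    "fast_slic.Slic (OpenMP)": 15,
    "fast_slic.avx2.SlicAvx2 (OpenMP)": 11,
}


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at the given frame rate."""
    return 1000.0 / fps


def fits_budget(fps: float) -> list:
    """Implementations whose run time fits within one frame at this rate."""
    budget = frame_budget_ms(fps)
    return [name for name, ms in RUN_TIMES_MS.items() if ms <= budget]


baseline = RUN_TIMES_MS["skimage.segment.slic"]
for name, ms in RUN_TIMES_MS.items():
    print(f"{name}: {speedup(baseline, ms):.1f}x vs skimage")

for fps in (30, 60):
    print(f"{fps} fps ({frame_budget_ms(fps):.1f} ms/frame): {fits_budget(fps)}")
```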
- The Windows build is considerably slower than the Linux and macOS builds. Maybe this is due to OpenMP overhead?
- It automatically removes small isolated areas of pixels at the cost of significant (but not huge) overhead. You can skip this denoising step by setting min_size_factor to 0 (e.g. Slic(num_components=1600, compactness=10, min_size_factor=0)). This setting makes it 20-40% faster.
- To push it to the limit, compile it with the FAST_SLIC_AVX2_FASTER flag for a further performance gain (though the margin was small on my PC).
- Remove or merge small blobs
- Include simple CRF utilities
- Add tests
- Windows build
- More scalable parallel loop in cluster assignment. I suspect there is a false sharing problem in the loop.
- It would be great if I could optimize the loop further. SIMD?