fastdup is a tool for gaining insights from a large image/video collection. It can find anomalies, duplicate and near-duplicate images/videos, and clusters of similar items, and it can learn the normal behavior and temporal interactions between images/videos. It can be used for smart subsampling to build a higher-quality dataset, for outlier removal, and for novelty detection of new information to be sent for tagging.
fastdup is:
- Unsupervised: fits any dataset
- Scalable: handles 400M images on a single machine
- Efficient: works on CPU only
- Low Cost: can process 12M images on a $1 cloud machine budget
From the authors of GraphLab and Turi Create.
Blog: Large Image Datasets Today are a Mess | Video: Processing LAION400m
- Python 3.7, 3.8, 3.9, 3.10
- Supported OS: Ubuntu 20.04, Ubuntu 18.04, Debian 10, Mac OSX M1, Mac OSX Intel, Amazon Linux 2, CentOS 7, RedHat 4.8, Windows 10 Server.
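Before installing, it can help to confirm that the interpreter is one of the Python versions listed above. The following is a minimal sketch; the helper name `is_supported_python` is hypothetical and not part of fastdup.

```python
import sys

# Python versions listed in the requirements above.
SUPPORTED_VERSIONS = {(3, 7), (3, 8), (3, 9), (3, 10)}

def is_supported_python(version_info=sys.version_info):
    """Return True if the (major, minor) pair is in the supported list."""
    return (version_info[0], version_info[1]) in SUPPORTED_VERSIONS

if __name__ == "__main__":
    if is_supported_python():
        print("Python %d.%d is supported" % sys.version_info[:2])
    else:
        print("Python %d.%d is not in the supported list" % sys.version_info[:2])
```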
# upgrade pip to its latest version
python3.XX -m pip install -U pip
# install fastdup
python3.XX -m pip install fastdup
Where XX is your Python version. For Windows, CentOS 7.X, RedHat 4.8, Amazon Linux 2 and other older Linux distributions, see our installation instructions.
import fastdup
fastdup.run(input_dir="/path/to/your/folder", work_dir='out', nearest_neighbors_k=5, turi_param='ccthreshold=0.96') #main running function.
fastdup.create_duplicates_gallery('out/similarity.csv', save_path='.') #create a visual gallery of found duplicates
fastdup.create_outliers_gallery('out/outliers.csv', save_path='.') #create a visual gallery of anomalies
fastdup.create_components_gallery('out', save_path='.') #create visualization of connected components
fastdup.create_stats_gallery('out', save_path='.', metric='blur') #create visualization of image statistics (for example blur)
fastdup.create_similarity_gallery('out', save_path='.',get_label_func=lambda x: x.split('/')[-2]) #create visualization of top_k similar images assuming data have labels which are in the folder name
fastdup.create_aspect_ratio_gallery('out', save_path='.') #create aspect ratio gallery
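The `get_label_func` argument in the similarity gallery call receives a file path and returns a label. The lambda used above takes the name of the parent folder, which can be checked on its own without running fastdup. The example path below is made up for illustration.

```python
# Same lambda as the one passed to create_similarity_gallery above.
get_label = lambda path: path.split('/')[-2]

# Hypothetical path layout: <root>/<label>/<image file>
example_path = "/data/food-101/apple_pie/1005649.jpg"
print(get_label(example_path))  # apple_pie
```

This only works when paths use forward slashes. On Windows, where paths may contain backslashes, `pathlib.Path(path).parent.name` is a more portable way to get the same folder name.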
Working on the Food-101 dataset: detecting identical pairs, similar pairs (search) and outliers (non-food images).
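The distinction between identical and similar pairs can be sketched as a simple threshold on a pairwise score. The code below is a sketch under assumptions, not fastdup's own logic: it assumes the similarity file has columns named `from`, `to` and `distance`, and that `distance` is a similarity score where 1.0 means identical. The 0.96 cut-off echoes the `ccthreshold` value in the example above purely for illustration, and the sample rows are made up.

```python
import csv
import io

def split_pairs(csv_file, identical_threshold=1.0, similar_threshold=0.96):
    """Split rows into identical and near-duplicate pairs by score.

    Assumes columns 'from', 'to', 'distance' (higher score = more similar).
    """
    identical, similar = [], []
    for row in csv.DictReader(csv_file):
        score = float(row["distance"])
        pair = (row["from"], row["to"], score)
        if score >= identical_threshold:
            identical.append(pair)
        elif score >= similar_threshold:
            similar.append(pair)
    return identical, similar

# Made-up sample data standing in for a real similarity file.
sample = io.StringIO(
    "from,to,distance\n"
    "a.jpg,b.jpg,1.0\n"
    "a.jpg,c.jpg,0.97\n"
    "d.jpg,e.jpg,0.80\n"
)
identical, similar = split_pairs(sample)
print(len(identical), len(similar))  # 1 1
```

To try it on real output, pass an open file handle instead of the sample, for example one opened on `out/similarity.csv`, after checking that the column names match.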
- 🔥 Finding duplicates, outliers and connected components in the Food-101 dataset, including Tensorboard Projector visualization - Google Colab
- 🔥🔥 Visualizing and understanding a new dataset, looking at data outliers and label outliers, training a baseline KNN classifier and reaching an accuracy of 0.99 by removing confusing labels
- Finding wrong labels via image similarity
- Computing image statistics
- Using your own onnx model for extraction
- Getting started on a Kaggle dataset
- Deduplication of videos - Google Colab
- Analyzing video of the MEVA dataset - Google Colab
- Working with multiple labels per image
- Detailed instructions, install from stable release and installation issues
- Detailed running instructions
Stroke AIS Data | Tire Data | Butterfly Mimics | Drugs and Vitamins | Plastic Bottles | Micro Organisms | PCB Boards | ZebraFish | Whats the difference
Usage Tracking
We have added experimental crash report collection using sentry.io. It does not collect user data other than anonymized IP address data, and it only logs the fastdup library's own actions. We do NOT collect folder names, user names, image names, or image content; we collect only aggregate performance statistics such as the total number of images, average runtime per image, total free memory, total free disk space, and number of cores. Collecting fastdup crashes will help us improve stability.
The code for the data collection is found here. On Mac we use Google Crashpad.
It is always possible to opt out of the experimental crash report collection via either of the following two options:
- Define an environment variable called SENTRY_OPT_OUT
- or run() with turi_param='run_sentry=0'
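Both opt-out options can be combined in a short script. This is a sketch under assumptions: the text above does not say what value `SENTRY_OPT_OUT` should hold, so `"1"` is an arbitrary choice, and `run_without_crash_reports` is a hypothetical wrapper, not part of fastdup. Setting the variable before the import is a cautious choice rather than a documented requirement.

```python
import os

# Option 1: define the environment variable.
# Assumption: only its presence matters; the value "1" is arbitrary.
os.environ["SENTRY_OPT_OUT"] = "1"

def run_without_crash_reports(input_dir, work_dir="out"):
    """Hypothetical wrapper: run fastdup with crash reporting switched off."""
    import fastdup  # imported here so the variable above is already set
    # Option 2: pass run_sentry=0 through turi_param.
    return fastdup.run(input_dir=input_dir, work_dir=work_dir,
                       turi_param='run_sentry=0')
```

Either option alone should be enough according to the text above; using both is simply belt and braces.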