
This is model fitting and inference code for CLIP aesthetic regressions trained on Simulacra Aesthetic Captions. These remarkably simple models emulate human aesthetic judgment. They can be used in tasks such as dataset filtering, to remove obviously poor-quality images from a corpus before training. The following grids, one sorted by John David Pressman and one sorted by the machine, give some idea of the models' capabilities:

Manually Sorted Grid

A human-sorted grid of 20 images from worst to best, starting with the worst image in the top left and the best in the bottom right

Model Sorted Grid

A machine-sorted grid of 20 images from worst to best, starting with the worst image in the top left and the best in the bottom right


Git clone this repository:

git clone

Install PyTorch if you don't already have it:

pip3 install torch==1.10.1+cu113 torchvision==0.11.2+cu113 torchaudio==0.10.1+cu113 -f

Then pip install our other dependencies:

pip3 install tqdm pillow torchvision scikit-learn numpy

If you don't already have it installed, you'll need to install CLIP:

git clone
pip3 install .
cd ..


The models are largely meant to be used as a library, i.e. you'll need to write specific code for your use case. To get you started, we've provided a sample script that finds all the .jpg or .png images in a directory tree and ranks the top N (default 50) with the aesthetic model:

python3 demo_images/
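
If you'd rather write your own version of that loop, the sketch below shows the general shape: walk a directory tree, collect the .jpg and .png files, score each one, and keep the top N. It is not the sample script itself. The names `find_images`, `rank_top_n` and `score_fn` are hypothetical, and `score_fn` stands in for whatever code you write to embed an image with CLIP and run it through the aesthetic model. Matching file extensions case-insensitively is also an assumption on our part.

```python
import heapq
from pathlib import Path
from typing import Callable, List, Tuple

# Extensions to pick up; compared in lowercase (an assumption, see above).
IMAGE_SUFFIXES = {".jpg", ".png"}


def find_images(root: str) -> List[Path]:
    """Recursively collect .jpg and .png files under `root`."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def rank_top_n(
    root: str,
    score_fn: Callable[[Path], float],
    n: int = 50,
) -> List[Tuple[float, str]]:
    """Return the `n` highest-scoring images as (score, path) pairs, best first.

    `score_fn` is a placeholder: it should take an image path and return
    a single number, where higher means "more aesthetic".
    """
    scored = ((score_fn(path), str(path)) for path in find_images(root))
    # nlargest keeps only n items in memory and returns them in descending order.
    return heapq.nlargest(n, scored)
```

For dataset filtering you would typically keep everything above a score threshold rather than a fixed top N, but the directory walk and the scoring call stay the same.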