/copycatch

Computer vision powered service that quickly filters almost 600GB of images to find a copy of an uploaded image.

Primary LanguageHTML

CopyCatch - efficient copy finder for large databases

Insight Data Engineering Project.
The UI is temporarily displayed on my personal website:
nataliest.com (opens in the same tab, sorry about that)

CopyCatch aims to solve:

  • copyright infringement problem
  • file duplication problem

After filtering almost two million images by tags in under a second, the app performs MSE-error calculation and structural similarity comparison on the filtered images and the incoming image in parallel using Spark.
The image metadata is stored in Redis.

The results displayed on the website were obtained using 6 worker nodes and 18 executors 2 cores each.

pipeline_pp

Dataset: Open Images Dataset: ~600GB of images with tags.
The dataset csv metadata parsing scripts are not included.

├── docs
│   └── CopyCatch_pipeline.png
├── flaskapp
│   ├── assets
│   │   └── aboutme.JPG
│   ├── flaskapp.py
│   ├── flaskapp.wsgi
│   ├── static
│   │   └── js
│   │       └── dist
│   │           ├── components
│   └── templates
│       ├── about.html
│       ├── aboutme.JPG
│       ├── detailed.html
│       ├── index.html
│       ├── stats.html
│       └── tech.html
├── src
│   ├── aws_s3_utils.py
│   ├── copycatch_class.py
│   ├── db_utils.py
│   ├── image_compare.py
│   └── main_spark_submit.py
└── tools
    └── redis_db
       ├── create_label_db4.py
       ├── create_tag_db.py
       ├── create_taglevel_db.py
       ├── create_valid_id_tag_db.py