Dataset-Augmentation

Introduction

With the files in this repo you can download Creative Commons images and clean the resulting dataset of false images. The idea is that you download images of celebrities. To avoid legal trouble, you use only Creative Commons licensed images. However, those are not always the best and can include wrong faces. With the files here, you can identify false images and remove them.

There are two methods. The first compares the images within each folder (one celebrity per folder) and finds the outliers. The second compares all images in a folder to a ground truth built from images that are not Creative Commons licensed.
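The first method can be sketched as follows. This is a minimal illustration, not the code in 1c-proof_images_V1.py: it assumes each image in a folder has already been turned into an embedding vector, uses Euclidean distance and a z-score cut-off as stand-ins for whatever metric the script actually applies, and the names `mean_errors`, `find_outliers` and `z_threshold` are hypothetical.

```python
import math
from statistics import mean, stdev


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_errors(embeddings):
    """Map each file name to its mean distance to all other embeddings."""
    errors = {}
    for name, vec in embeddings.items():
        others = [euclidean(vec, other)
                  for other_name, other in embeddings.items()
                  if other_name != name]
        errors[name] = mean(others)
    return errors


def find_outliers(embeddings, z_threshold=1.5):
    """Return file names whose mean error lies far above the folder average.

    z_threshold is an assumed cut-off, not a value taken from the repo.
    """
    if len(embeddings) < 3:
        return []  # too few images to tell what "typical" looks like
    errors = mean_errors(embeddings)
    mu = mean(errors.values())
    sigma = stdev(errors.values())
    if sigma == 0:
        return []  # all images are equally far apart
    return sorted(name for name, err in errors.items()
                  if (err - mu) / sigma > z_threshold)


# Toy folder: four similar faces and one that does not belong.
folder = {
    "a.jpg": (0.0, 0.0),
    "b.jpg": (0.1, 0.0),
    "c.jpg": (0.0, 0.1),
    "d.jpg": (0.1, 0.1),
    "x.jpg": (5.0, 5.0),
}
outliers = find_outliers(folder)
```

With small folders the largest possible z-score is limited, so a cut-off that works for hundreds of images may never trigger for a handful; the threshold has to be tuned to the folder size.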

Using the Second Method

If you want to compare to the ground truth, first download images using 1a-Image-Crawler.py without the filter = (commercial, reuse), then use the file 2 -Create embeddings database.py to create so-called embeddings. Those embeddings are then the gold standard.
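The comparison against the gold standard can be pictured like this. Again, this is only a sketch under assumptions: it does not reproduce 1e-proof_images_V2.py, the embeddings are plain tuples rather than the output of the real model, and `filter_against_reference` and `max_distance` are hypothetical names and values chosen for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filter_against_reference(candidates, reference, max_distance=1.0):
    """Split candidate images into (kept, rejected) file-name lists.

    An image is kept when its embedding lies within max_distance of at
    least one gold-standard embedding. max_distance is an assumed value.
    """
    kept, rejected = [], []
    for name, vec in candidates.items():
        nearest = min(euclidean(vec, ref) for ref in reference)
        if nearest <= max_distance:
            kept.append(name)
        else:
            rejected.append(name)
    return sorted(kept), sorted(rejected)


# Gold-standard embeddings for one celebrity (toy values).
gold = [(0.0, 0.0), (0.2, 0.1)]

# Embeddings of the Creative Commons images in that celebrity's folder.
folder = {
    "ok.jpg": (0.1, 0.1),
    "wrong.jpg": (3.0, 4.0),
}
kept, rejected = filter_against_reference(folder, gold)
```

Unlike the first method, this check does not depend on how many wrong faces a folder contains, because every image is judged against the reference set rather than against its neighbours.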

Detailed description can be found here

Prerequisites

You can download the needed model directly here:

wget ftp://ftp.phytec.de/pub/Software/Linux/Applications/demo-celebrity-face-match-data-1.0.tar.gz
tar -xzf demo-celebrity-face-match-data-1.0.tar.gz

If the above (ARM) does not work, download this x86 wheel file for tflite_runtime and install it via pip install path_to_file.

The Files

  • 1a-Image-Crawler.py crawls images with Bing. The filter is set to commercial, reuse
  • 1b-get_faces_and_crop.py extracts the face and rescales the image to 224x224
  • 1d-proof_images_plotting.py plots the mean error so you can see the outliers
  • 1c-proof_images_V1.py determines and deletes the outliers based on internal analysis
  • 1e-proof_images_V2.py determines and deletes the outliers based on comparison to the ground truth
  • 2 -Create embeddings database.py can be used to create the ground truth embeddings file from images that are not Creative Commons licensed

License

This project is licensed under the Apache License Version 2.0. See the file LICENSE for detailed information.