
This repo contains my Final Year Major Project: a Generative Adversarial Network that takes a short 6-second audio clip of a person speaking as input and produces an image of the speaker's face.


Speech2Face

Aim of the project:

  • Establish the strong connection between speech and appearance, part of which is a direct result of the mechanics of speech production: age, gender, mouth shape, facial bone structure, and thin or fuller lips all leave traces in the voice.
  • Capture voice-appearance correlations that arise from the way a person talks: language, accent, speed, and pronunciation. Such properties of speech are often shared among nationalities and cultures, which can in turn translate to common physical features.
  • Goal -> Design and develop a deep-learning model that infers how a person looks from a short segment of speech in which they are talking.

Objectives:

  • The objective of the project is to investigate how much information about a person's identity can be inferred directly from the way they speak.

Proposed Methodology:

Our Speech2Face pipeline consists of two main components:

  1. A voice encoder, which takes a complex spectrogram of speech as input and predicts a low-dimensional facial feature that corresponds to the associated face.

  2. A face decoder, which takes the facial feature as input and produces an image of the face in a canonical form (i.e. frontal-facing). During training, the face decoder is kept fixed and only the voice encoder that predicts the facial feature is trained; the voice encoder is the model we designed and trained. A minimal sketch of this two-stage pipeline is given below.
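As a rough illustration of this design, the sketch below pairs a convolutional voice encoder with a frozen face decoder; the module names, layer sizes, and the 4096-dimensional face feature are illustrative assumptions rather than the exact architecture used in this project.

```python
# Minimal sketch of the two-stage pipeline (illustrative assumptions only;
# not the exact architecture used in this project).
import torch.nn as nn

class VoiceEncoder(nn.Module):
    """Maps a complex spectrogram (real/imaginary as 2 channels) to a face feature."""
    def __init__(self, feature_dim=4096):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(256, feature_dim)

    def forward(self, spectrogram):            # (B, 2, freq, time)
        x = self.conv(spectrogram).flatten(1)  # (B, 256)
        return self.fc(x)                      # (B, feature_dim) face feature

class FaceDecoder(nn.Module):
    """Decodes a face feature into a canonical (frontal-facing) face image.
    Kept frozen during training; only the voice encoder is updated."""
    def __init__(self, feature_dim=4096):
        super().__init__()
        self.fc = nn.Linear(feature_dim, 256 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, face_feature):
        x = self.fc(face_feature).view(-1, 256, 4, 4)
        return self.deconv(x)                  # (B, 3, 32, 32) face image

# Training idea: freeze the decoder and fit only the encoder, so that the
# feature predicted from speech can be decoded into the speaker's face.
encoder, decoder = VoiceEncoder(), FaceDecoder()
for p in decoder.parameters():
    p.requires_grad = False
```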

[Figure: voice2face pipeline]
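The complex spectrogram fed to the voice encoder can be computed from the 6-second clip with a short-time Fourier transform; below is a minimal sketch using librosa, where the sample rate, FFT size, and hop length are assumptions rather than the project's exact settings.

```python
# Sketch: turn a 6-second audio clip into a complex spectrogram (real and
# imaginary parts as two channels) for the voice encoder. The sample rate,
# FFT size, and hop length are assumptions, not the project's exact settings.
import numpy as np
import librosa

def audio_to_complex_spectrogram(path, sr=16000, n_fft=512, hop_length=160):
    waveform, _ = librosa.load(path, sr=sr, duration=6.0)   # 6-second clip
    stft = librosa.stft(waveform, n_fft=n_fft, hop_length=hop_length)
    # Stack real and imaginary parts as two channels: (2, freq, time).
    return np.stack([stft.real, stft.imag], axis=0).astype(np.float32)

# Example (hypothetical file name):
# spec = audio_to_complex_spectrogram("speaker_clip.wav")
# spec.shape -> (2, 257, ~601) for a 6-second clip at 16 kHz
```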

Generative Adversarial Networks:

  • The basic idea behind speech-to-face conversion is to generate a realistic image of a person's face from their speech. This requires the model to capture the complex relationships between voice and facial appearance, which is difficult with traditional machine learning approaches.
  • GANs, however, are designed to learn complex, nonlinear relationships between input and output data, and can therefore be effective at generating realistic face images that match a given audio input (a rough sketch of this adversarial setup is given below).
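To make the adversarial setup concrete, here is a minimal conditional-GAN sketch in which the generator is conditioned on a voice embedding and the discriminator scores real versus generated faces; the dimensions, module names, and fully connected layers are simplified assumptions, not the networks used in this project.

```python
# Illustrative conditional-GAN sketch: the generator maps a voice embedding
# (plus noise) to a face image, the discriminator scores real vs. fake faces.
# Dimensions and names are assumptions, not this project's exact networks.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, voice_dim=512, noise_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(voice_dim + noise_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 64 * 64 * 3), nn.Tanh(),
        )

    def forward(self, voice_embedding, noise):
        x = torch.cat([voice_embedding, noise], dim=1)
        return self.net(x).view(-1, 3, 64, 64)    # generated face image

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(64 * 64 * 3, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1), nn.Sigmoid(),
        )

    def forward(self, image):
        return self.net(image.flatten(1))          # probability the face is real

# One adversarial step (sketch): D learns to separate real faces from
# generated ones, while G learns to fool D given the voice embedding.
G, D = Generator(), Discriminator()
bce = nn.BCELoss()
voice = torch.randn(8, 512)                        # placeholder voice embeddings
noise = torch.randn(8, 100)
real_faces = torch.rand(8, 3, 64, 64)              # placeholder real face batch

fake_faces = G(voice, noise)
d_loss = bce(D(real_faces), torch.ones(8, 1)) + bce(D(fake_faces.detach()), torch.zeros(8, 1))
g_loss = bce(D(fake_faces), torch.ones(8, 1))
```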


FRAMES GENERATED FROM A 6 SEC VIDEO CLIP - TRAINING DATASET

[Sample frames extracted from a 6-second training video clip]
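Frames like the ones above can be pulled from each 6-second clip with OpenCV; the sketch below samples roughly one frame per second, and the paths and sampling interval are placeholder assumptions rather than the project's actual preprocessing settings.

```python
# Sketch: extract frames from a 6-second training clip with OpenCV.
# The paths and one-frame-per-second sampling are placeholder assumptions.
import cv2
import os

def extract_frames(video_path, out_dir, every_n_seconds=1.0):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0       # fall back if FPS is unknown
    step = max(1, int(round(fps * every_n_seconds)))
    index, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:                                 # end of clip
            break
        if index % step == 0:                      # keep one frame per interval
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:03d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# Example: a 6-second clip sampled once per second yields ~6 frames.
# extract_frames("speaker_clip.mp4", "frames/speaker_clip")
```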

GENERATED IMAGES FROM GAN MODEL OBTAINED AS FINAL OUTPUT:

[Face images generated by the GAN model]

Results:

  • Obtained up to 91 percent accuracy when a short 6-second audio clip of a person talking is fed into the system.
  • Memory Utilisation: 28 GB
  • RAM Utilisation: 12 GB
  • Clearer and visibly better images could be obtained by opting for higher-end processing systems.

Future Work:

  • Utilise higher-end processing systems to obtain better picture quality and clearer facial features.