/dlf2020

3D Sign Language Production for ASL

Primary LanguageJupyter Notebook

dlf2020

The goal of the project is to use deep neural network to translate spoken language to American SignLanguage (ASL) in the form of continuous 3D skeletal poses. For a given English word, we will be outputting a stream of skeletal poses (x, y, and z coordinates) representing ASL signs for the upper body which includes the torso, arms, hands and facial key points

Setup

Use virtual env (used python 3.8.6). Install dependencies with

pip3 install -r requirements.txt

Add virtual env to jupyter notebook

python -m ipykernel install --user --name=your_virtual_env
jupyter notebook --port=your_port_number

Remember to activate environment to install libraries

pynev shell your_virtual_env

Notebook setup

Leverage the following code snippet to use the src directory in notebooks

import sys
# This should navigate to the repository root
sys.path.append('../')
%reload_ext autoreload
%autoreload 2

Project Structure

Most idea pulled from Cookiecutter Data Science

app.py

Main command line interface to run each step of the ML pipeline

  • Create and save word labels and embeddings by running
    python app.py data
    This will create embeddings.pkl and words.pkl in the data/interim folder which contains 2000 entries of the WLASL words. See data/preprocessor.py for more details

data

Where data lives

notebook

Any jupyter notebook should be placed here

metrics

Any output of metrics (MSE..etc) should be saved under here as a csv

src

All code. Here's the breakdown per folder within src.

conf

Config file. Currently has logging set up

data

All code related to data preprocessing to create a trainable data

train

All code related to modeling and traning

validate

All code related to validating the model

Pipeline

Data Preprocessing

Utilize the code by "Words are Our Glosses" paper.

Their code includes (3DposeEstimator/demo.py and wacv2020/pipeline_demo*.py):

  • openpose implementation of converting video to 2d pose estimations
  • correction process for misplaced joints correction
  • z-estimation to convert from 2d to 3d skeletons
  • normalization

Some of these codes already live under the google drive due to the need for Colab.

Generator

Following "Progressive Transformer" paper as it describes the Generator step in detail.

The Progressive Transformer is an extension of classic Transformer made popular by the "Attention is All you need" paper This article and linked youtube was very good explaining how Transformer works.

Utilize one of these implemented transformer in pytorch and tweak it to meet our needs:

Transformer Tweaks - Progressive Transformer

Here's a breakdown of what needs to be tweaked from traditional transformer.

Simplify Encoder Step

Because we aren't using sentences, the encoder step can be much simplified. There's no need for MHA and positional encoder since these layers try to understand the relationship of words within the sentences.

  • Instead of words -> MHA -> Linear Normalization -> Feed Forward (Linear + ReLU + Linear) -> Linear Normalization
  • Implement word -> Feed Forward -> Linear Normalization
Add Counter Embedding

Similar to a "period" to mark the end. The Progressive Transformer has a Counter that is also learned as part of the output. No mention of how this loss is computed but perhaps should be a straightforward distance calculation as well.

Input tweaks - Embeddings

To deal with out-of-vocabulary word, we can use word or character embedding. There are pretrained modesl we can find

Word embedding
Character embedding

While one of the pretrained word embedding encompass a lot of words, there can still be OOV problems. Character embedding can be used to get rid of oov problems commpletely. Perhaps we should do a combination of word and character embedding

Computing Loss for Generator

Compute MSE after applying DTW. Utilize the code written by "Words are Our Glosses". See wacv2020/modeling.py

Discriminator

Conditioanl GAN from the "Adversarial Training for Multi-Channel SLP" paper

Back Translation

Sign Language Transformers: Joint End-to-end Sign Language. Recognition and Translation paper code

Data

Trainable data available: https://drive.google.com/file/d/1-6b7_Rsum_fHTN4kKEUeR5Y3bJ0z76jd/view