
CSLR2

Large-Vocabulary Continuous Sign Language Recognition
from Spoken Language Supervision

Charles Raude · Prajwal KR · Liliane Momeni · Hannah Bull · Samuel Albanie · Andrew Zisserman · Gül Varol


Description

Official PyTorch implementation of the paper:

A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision.

Please visit our webpage for more details.

Bibtex

If you find this code useful in your research, please cite:

@article{raude2024,
    title={A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision},
    author={Raude, Charles and Prajwal, K R and Momeni, Liliane and Bull, Hannah and Albanie, Samuel and Zisserman, Andrew and Varol, G{\"u}l},
    journal={arXiv},
    year={2024}
}

Installation 👷

Create environment

Create a conda environment for this project by running the following commands:

conda create -n cslr2 python=3.9.16
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge
conda install anaconda::pandas=1.5.3
conda install conda-forge::einops=0.6.0
conda install conda-forge::humanize=4.6.0
conda install conda-forge::tqdm=4.65.0
pip install hydra-core==1.3.2
pip install matplotlib==3.7.1
pip install plotly==5.14.1
pip install nltk==3.8.1
pip install seaborn==0.12.2
pip install sentence-transformers==2.2.2
pip install wandb==0.14.0
pip install lmdb
pip install tabulate
pip install opencv-python==4.7.0.72

You can also create the environment from the provided .yaml file using conda (this might not always work, depending on the machine and the installed conda version; if it fails, try updating conda).

conda env create --file=environment.yaml

After installing these packages, you will have to download a few nltk resources manually in Python.

import nltk
nltk.download("wordnet")
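
Optionally, you can verify the environment with a short check (a minimal sketch that only assumes the packages pinned above):

# Sanity check: core dependencies import and the GPU is visible.
import torch
from nltk.corpus import wordnet

print(torch.__version__)            # expected: 1.12.1
print(torch.cuda.is_available())    # True if the CUDA 11.6 build is set up correctly
print(wordnet.synsets("sign")[:3])  # confirms the wordnet download succeeded
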
Set up the BOBSL data
  • Make sure you have permission to use the BOBSL dataset. You can request access by following the instructions on the official BOBSL webpage.
  • With the username/password obtained, you can download the two required files via the following:
    # Download the pre-extracted video features [262G]
    wget --user ${BOBSL_USERNAME} --password ${BOBSL_PASSWORD} \
        https://thor.robots.ox.ac.uk/~vgg/data/bobsl/features/lmdb/feats_vswin_t-bs256_float16/data.mdb
    # Download the raw video frames [1.5T] (you can skip this if purely training/testing with features, and not visualizing)
    wget --user ${BOBSL_USERNAME} --password ${BOBSL_PASSWORD} \
        https://thor.robots.ox.ac.uk/~vgg/data/bobsl/videos/lmdb/rgb_anon-public_1962/data.mdb
  • Download bobsl.zip (1.9G) for the rest of the files (including annotations and metadata). Note that the folder grows to 15G when decompressed. Make sure the extracted files correspond to the paths defined in config/paths/public.yaml.
  • Download t5_checkpoint.zip (1.4G) for the T5 pretrained model weights, whose path is also defined in config/paths/public.yaml.
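
After downloading, you can quickly check that an LMDB file is readable using the lmdb package installed earlier (a minimal sketch; the directory path is an assumption and should point at the folder containing the downloaded data.mdb):

# Open the feature LMDB read-only and report how many entries it holds.
# NOTE: the path below is an example; use the directory holding data.mdb.
import lmdb

env = lmdb.open("feats_vswin_t-bs256_float16", readonly=True, lock=False)
print(env.stat()["entries"])  # number of stored keys
env.close()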

Training 🚀

export HYDRA_FULL_ERROR=1  # to get better error messages if job crashes
python main.py run_name=cslr2_train

trains the CSLR2 model with the best set of hyperparameters from the paper. Using 4 x V100-32GB GPUs, training for 20 epochs should take less than 20 hours.

To change training parameters, edit the configuration files in the config/ folder.

To manually synchronise offline wandb jobs, run wandb sync --sync-all in the experiment folder (remember to export WANDB_MODE=offline before launching the run).

Training saves one checkpoint per epoch as $EXP_NAME/models/model_$EPOCH_NB.pth. The model that achieves the best T2V retrieval performance on the validation set is also saved as $EXP_NAME/models/model_best.pth.
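
A saved checkpoint can be inspected on CPU before evaluation (a minimal sketch; the run name and the exact contents of the .pth file are assumptions and may differ in practice):

# Load a checkpoint on CPU and list what it stores.
import torch

ckpt = torch.load("cslr2_train/models/model_best.pth", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # e.g. model weights, optimizer state, epoch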

Test 📊

You can download a pretrained model from here.

1. Retrieval on 25K manually aligned test set

To test any model for the retrieval task on the 25K manually aligned test set, one should run the following command:

python main.py run_name=cslr2_retrieval_25k checkpoint=$PATH_TO_CHECKPOINT test=True
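
For intuition, retrieval evaluation boils down to ranking the embeddings of one modality against those of the other; the sketch below shows recall@k with cosine similarity (illustrative only, not the script's exact implementation):

# Illustrative recall@k for text-to-video retrieval with matching rows.
import torch

def recall_at_k(text_emb, video_emb, k=1):
    """text_emb, video_emb: (N, D) L2-normalised embeddings; row i of each is a pair."""
    sims = text_emb @ video_emb.T                 # (N, N) cosine similarities
    ranks = sims.argsort(dim=1, descending=True)  # best-matching video per text first
    hits = (ranks[:, :k] == torch.arange(len(sims)).unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()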

2. CSLR evaluation

CSLR evaluation is done in two steps: first extract frame-level predictions, then evaluate them.

2.1 Feature Extraction

python extract_for_eval.py checkpoint=$PATH_TO_CHECKPOINT

extracts frame-level predictions (linear-layer classification and nearest-neighbour classification) for both heuristically aligned and manually aligned subtitles.

2.2 Evaluation

python frame_level_evaluation.py prediction_pickle_files=$PRED_FILES gt_csv_root=$GT_CSV_ROOT

Note that if gt_csv_root is not provided, it defaults to ${paths.heuristic_aligned_csv_root}.
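
For reference, frame-level evaluation compares predicted and ground-truth gloss labels frame by frame; a simplified accuracy sketch is below (the metrics reported by frame_level_evaluation.py may be more involved):

# Simplified per-frame accuracy between two equal-length label sequences.
def frame_accuracy(pred_labels, gt_labels):
    assert len(pred_labels) == len(gt_labels)
    correct = sum(p == g for p, g in zip(pred_labels, gt_labels))
    return correct / max(len(gt_labels), 1)

print(frame_accuracy([3, 3, 7, 7], [3, 3, 7, 9]))  # 0.75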

Pre-processing of gloss annotations 💻

You do not need to run this pre-processing, but we release the scripts used to convert the raw gloss annotations (available from the official BOBSL webpage) into the format used for our evaluation. Four steps are required to fully pre-process the gloss annotations, which are stored in json files.

1. Assign each annotation to its closest subtitle
python misc/process_cslr_json/preprocess_raw_json_annotations.py --output_dir OUTPUT_DIR --input_dir INPUT_DIR --subs_dir SUBS_DIR --subset2episode SUBSET2EPISODE

where INPUT_DIR is the directory containing the json files, OUTPUT_DIR is the directory where the assigned annotations are saved, SUBS_DIR is the directory containing the manually aligned subtitles (the subtitles/manually-aligned files from the public release), and SUBSET2EPISODE is the path to the json file describing splits and episodes (the subset2episode.json file from the public release).
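
For intuition, assigning an annotation to its closest subtitle amounts to picking the subtitle with the smallest temporal distance (an illustrative sketch only; preprocess_raw_json_annotations.py handles the full json/csv formats and edge cases):

# Assign a gloss annotation (a single timestamp) to the nearest subtitle span.
def closest_subtitle(annot_time, subtitles):
    """subtitles: list of (start, end, text) tuples, times in seconds."""
    def distance(sub):
        start, end, _ = sub
        if start <= annot_time <= end:
            return 0.0  # annotation falls inside the subtitle
        return min(abs(annot_time - start), abs(annot_time - end))
    return min(subtitles, key=distance)

subs = [(0.0, 2.5, "hello"), (2.5, 5.0, "nice to meet you")]
print(closest_subtitle(3.1, subs))  # (2.5, 5.0, 'nice to meet you')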

2. Fix boundaries of subtitles.

During assignment, certain annotations may overlap with subtitle boundaries, or even fall entirely outside the boundaries of their associated subtitle. Since at evaluation time we load all features corresponding to subtitle timestamps, we need to extend the boundaries of certain subtitles.
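
Conceptually, the fix widens each subtitle's time span just enough to cover all annotations assigned to it (a minimal sketch of the idea; fix_boundaries.py operates on the csv files produced in step 1 and may differ in detail):

# Extend a subtitle span so it covers every assigned annotation timestamp.
def extend_boundaries(sub_start, sub_end, annot_times):
    new_start = min([sub_start] + list(annot_times))
    new_end = max([sub_end] + list(annot_times))
    return new_start, new_end

print(extend_boundaries(10.0, 14.0, [9.4, 11.2, 14.6]))  # (9.4, 14.6)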

python misc/process_cslr_json/fix_boundaries.py --csv_file OUTPUT_DIR
3. Fix alignment of subtitles.

Subtitles have been manually aligned. However, since gloss annotations are timed much more precisely, certain gloss annotations may better match surrounding subtitles. To fix this, we propose an automatic re-alignment algorithm.

python misc/process_cslr_json/fix_alignment.py --csv_file OUTPUT_DIR2
python misc/process_cslr_json/preprocess_raw_json_annotations.py --output_dir OUTPUT_DIR3 --input_dir INPUT_DIR --subs_dir OUTPUT_DIR2 --misalignment_fix

where OUTPUT_DIR2 = OUTPUT_DIR[:-8] + "extended_boundaries_" + OUTPUT_DIR[-8:] and OUTPUT_DIR3 = OUTPUT_DIR2[:-8] + "fix_alignment_" + OUTPUT_DIR2[-8:]. Here we assume that OUTPUT_DIR ends with a date in the format DD.MM.YY.
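
In Python, that naming convention works out as follows (a sketch; the date suffix 01.02.24 is only an example):

# Derive the step-2 and step-3 output directory names from OUTPUT_DIR,
# assuming it ends with a date in DD.MM.YY format (8 characters).
OUTPUT_DIR = "cslr_annotations_01.02.24"  # hypothetical example
OUTPUT_DIR2 = OUTPUT_DIR[:-8] + "extended_boundaries_" + OUTPUT_DIR[-8:]
OUTPUT_DIR3 = OUTPUT_DIR2[:-8] + "fix_alignment_" + OUTPUT_DIR2[-8:]
print(OUTPUT_DIR2)  # cslr_annotations_extended_boundaries_01.02.24
print(OUTPUT_DIR3)  # cslr_annotations_extended_boundaries_fix_alignment_01.02.24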

4. Only keep lexical annotations.

We only evaluate against lexical annotations: i.e., annotations that are associated with a word.

python misc/process_cslr_json/remove_star_annots_from_csvs.py --csv_root OUTPUT_DIR2  # only boundary extension fix
python misc/process_cslr_json/remove_star_annots_from_csvs.py --csv_root OUTPUT_DIR3  # with total alignment fix
Do all the steps with one command.

Alternatively, you can run the full pipeline in a single command:

python misc/process_cslr_json/run_pipeline.py --input_dir INPUT_DIR --output_dir OUTPUT_DIR --subs_dir SUBS_DIR --subset2episode SUBSET2EPISODE

License 📚

This code was developed by Charles Raude, may not be maintained, and is distributed under the MIT License.

Note that the code depends on other libraries, including PyTorch, T5, and Hydra, and uses the BOBSL dataset; each of these has its own license that must also be followed.

The license for the BOBSL-CSLR data can be found at https://imagine.enpc.fr/~varolg/cslr2/license.txt.