
Repository to train artificial infants on audiobook corpora



Please follow the instructions at https://github.com/bootphon/zerospeech2021_baseline. In the following, we assume that experiments are performed on the Jean Zay cluster.

Clone the CPC_torch repository


All files generated by the experiments (training and evaluation) should be stored under /gpfsscratch/rech/cfs/commun/experiments. Files should be organized as follows:

└─── 50h
     └─── 00
            └─── cpc_<small|big>
                 | kmeans_50 
                 | <lstm|bert_large>
                 | evaluation
└─── same structure

The trainers folder contains scripts to train the different models (CPC, k-means, BERT, LSTM). The experiments folder contains scripts to generate the different experiments (training set, model, model parameters). These scripts should write the configuration of each experiment to a .txt file, each line of which is submitted as one task of a Slurm job array.


module load sox
conda env create -f environment.yml && conda activate inftrain
git clone https://github.com/MarvinLvn/CPC_torch.git
git clone https://github.com/bootphon/zerospeech2021_baseline

git clone https://github.com/facebookresearch/WavAugment && cd WavAugment && python setup.py develop

To train models, you must install the following dependencies:

Please refer to this git repo for instructions about how to train the model

To evaluate models, you must install the ZeroSpeech 2021 repo

How to connect to the account?

From flores:

ssh uow84uh@jean-zay.idris.fr

# Load right project (to have access to inftrain conda env)
cd utils 
source cfs_proj.sh

Running experiments

All experiments will be run on Marvin's Jean Zay account. This git repo can be found under /gpfsscratch/rech/cfs/uow84uh/InfTrain with pre-installed dependencies. To run experiments, first type:

cd experiments

This will create experiment files in the experiment_txt folder. There's one experiment file for each model, and each line of an experiment file contains the path to the training set. The information of which model needs to be trained is automatically deduced from the training set path.
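As a rough illustration of how information can be read off a training set path, here is a minimal sketch that splits a path such as /gpfsscratch/rech/cfs/commun/families/EN/50h/00 into its last three components. The names parse_training_set and TrainingSet are hypothetical and are not part of this repository, and the rule that actually maps a training set to a model is not reproduced here.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class TrainingSet(NamedTuple):
    """Hypothetical container for what the path encodes."""
    language: str   # e.g. "EN"
    hours: int      # e.g. 50 for "50h"
    share: str      # e.g. "00"


def parse_training_set(path: str) -> TrainingSet:
    """Split <...>/<language>/<N>h/<share> into its three trailing parts.

    Assumption: the last three path components are always the language,
    the duration (digits followed by 'h') and the share index.
    """
    parts = PurePosixPath(path).parts
    if len(parts) < 3:
        raise ValueError(f"path too short: {path!r}")
    language, duration, share = parts[-3:]
    if not (duration.endswith("h") and duration[:-1].isdigit()):
        raise ValueError(f"unexpected duration component: {duration!r}")
    return TrainingSet(language, int(duration[:-1]), share)


if __name__ == "__main__":
    print(parse_training_set("/gpfsscratch/rech/cfs/commun/families/EN/50h/00"))
```

Parsing the path once into named fields keeps any later decision (which model, which output folder) in one place instead of scattering string slicing across scripts.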

Once you have generated the experiment files, you can check their content and then run:

# To submit CPC small models
sbatch submit_cpc_small.sh

# To submit CPC big models
sbatch submit_cpc_big.sh

The job indices to run can be controlled with the --array parameter:

sbatch --array=0-5 submit_cpc_small.sh

will only run the array tasks with indices 0 to 5, i.e. the first 6 jobs (corresponding to the first 6 lines of cpc_experiments.txt).
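Slurm array ranges are inclusive at both ends, which is easy to get wrong when counting jobs. The sketch below expands an --array specification into task indices and picks the matching lines of an experiment file. It is a simplified helper written for illustration: the function names are hypothetical, only the basic forms of the specification (lists, ranges, an optional :step and an optional %limit) are handled, and the zero-based mapping from task index to line is an assumption about the submission scripts.

```python
from typing import List


def expand_array_spec(spec: str) -> List[int]:
    """Expand a Slurm --array value such as '0-5', '1,3,5' or '0-15:4'.

    A trailing '%N' (maximum number of simultaneous tasks) does not change
    which indices run, so it is simply dropped.
    """
    spec = spec.split("%")[0]
    indices: List[int] = []
    for chunk in spec.split(","):
        bounds, _, step = chunk.partition(":")
        first, _, last = bounds.partition("-")
        last = last or first  # a single index such as '3'
        # Both bounds are inclusive, hence the '+ 1'.
        indices.extend(range(int(first), int(last) + 1, int(step or 1)))
    return indices


def lines_for_spec(lines: List[str], spec: str) -> List[str]:
    """Return the experiment lines selected by an --array value.

    Assumption: array task i processes the (i + 1)-th line of the file.
    """
    return [lines[i] for i in expand_array_spec(spec)]


if __name__ == "__main__":
    print(expand_array_spec("0-5"))
```

Inside a running array task, Slurm exposes the current index through the SLURM_ARRAY_TASK_ID environment variable, which is what a submission script would typically use to select its line.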

WARNING: None of the scripts to train k-means and language models currently work. They should be finished and thoroughly checked before running anything.

Submit training individually:

CPC small:

sbatch -o my_log_cpc_small_srun.txt trainers/train_cpc_small.sh /gpfsscratch/rech/cfs/commun/families/EN/50h/00

CPC big:

sbatch -o my_log_cpc_big.txt trainers/train_cpc_big.sh /gpfsscratch/rech/cfs/commun/families/EN/3200h/00

Supervising jobs

You can supervise the progress of the study with the job_status.ipynb jupyter notebook.

To launch a Jupyter notebook on Jean Zay, see the "Accessing Jupyter notebook with Jean-Zay" note here: https://wiki.cognitive-ml.fr/howto.html#use-the-jean-zay-cluster.

How does it work?

Each time a model is trained, let's say in EN/50h/00/cpc_small, a file running.state is created. If the model reaches its planned number of epochs, this file will be replaced by done.state. The generate_study.sh script will look for the presence of either one file or the other to decide if a given model needs to be (re-)trained.

However, be careful: this system has not been thoroughly checked. It is not clear to me what happens when a job fails, for instance because of a memory issue. Will the running.state file be removed? If not, we will have to remove these files manually so that generate_study.sh knows which models need to be retrained.
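The state-file convention can be sketched as follows. This is not the logic of generate_study.sh, which is not reproduced here. It is a hypothetical helper that assumes done.state takes precedence over running.state and that a model with neither file still has to be trained. The second function lists every running.state under a directory, which may help spot leftovers from crashed jobs before deleting them by hand.

```python
from pathlib import Path
from typing import List


def model_status(model_dir) -> str:
    """Classify a model folder (e.g. EN/50h/00/cpc_small) from its state files.

    Assumed precedence: done.state wins over running.state.
    """
    model_dir = Path(model_dir)
    if (model_dir / "done.state").exists():
        return "done"
    if (model_dir / "running.state").exists():
        return "running"
    return "todo"


def find_running_state_files(root) -> List[Path]:
    """List all running.state files below root, sorted by path.

    A file listed here belongs either to a job that is really running or
    to one that died without cleaning up; telling them apart still
    requires checking the Slurm queue.
    """
    return sorted(Path(root).rglob("running.state"))


if __name__ == "__main__":
    for state_file in find_running_state_files("."):
        print(state_file)
```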

What needs to be done?

  • Submit CPC big models
  • Check that CPC models are running / converging (plot validation loss for different training durations)
  • Prepare submission scripts for all the metrics: ABX, sSIMI, sBLIMP, sWUGGY (see with Nick)
  • Finish submission scripts to train k-means and language models
  • Create submission scripts to extract discrete representations