lambdal/deeplearning-benchmark

Entire repository is broken and unmaintained

Opened this issue · 4 comments

If you're looking at this repository and wondering if it's the right thing for you, the short answer is "no". The TL;DR is that unless you're an employee of Lambda Labs, have your machine and directory structure set up identically to theirs, and have already downloaded all the datasets, it will not work.

Longer explanation:
Amazingly, the creators of this tool have managed to recreate the "works on my machine" problem using Docker containers and volumes, which rather runs counter to the whole point of containers, but is impressive in its own way.

Among the issues found in the tutorial and their scripts:

  1. It creates and mounts a directory to reuse the large datasets. This is reasonable, but it does so at ~/data. If you already have that directory, hold onto your hat, because the scripts will start writing files into it as root.
  2. It will also create assorted other directories and files in your home directory as root, because Docker.
  3. There's essentially no error checking, but there are huge walls of status output. This means that when something fails, the actual error immediately scrolls past the start of your scrollback buffer. (Adding set -ex to several of the scripts helps with this problem; see the sketch after this list.)
  4. The various prepare and run scripts assume that you have already created multiple destination directories on your machine, but they do not attempt to create them, check for them first, or fail gracefully if they are missing. E.g.,
benchmark/Translation/GNMT/scripts
benchmark/LanguageModeling/Transformer-XL
benchmark/SpeechSynthesis/Tacotron2/scripts
benchmark/LanguageModeling/BERT/data/squad
benchmark/LanguageModeling/BERT/scripts
benchmark/Recommendation/NCF
  5. When it does try to copy these scripts around, it doesn't actually copy all of them to the right place. This means that several data downloads will fail because they depend on scripts which are not present. At least not on your system. I'm sure they're on the Lambda Labs systems.
  6. Even when you do copy the scripts to the right place, they don't always work. E.g., the very first download attempts to use download_dataset.sh, but that script is not compatible with that dataset. You'll get:
Unsupported dataset name: /data/object_detection
  7. Even when you do manage to start downloading datasets, somewhere in a sea of bash aliases and options it uses --progress=dot for wget, meaning you'll get literally thousands of lines of:
 20950K .......... .......... .......... .......... .......... 98% 44.4M 0s
 21000K .......... .......... .......... .......... .......... 98% 45.1M 0s
 21050K .......... .......... .......... .......... .......... 98% 37.9M 0s
 21100K .......... .......... .......... .......... .......... 98% 38.0M 0s
 21150K .......... .......... .......... .......... .......... 98% 33.9M 0s

in your terminal. See item (3) above: this immediately buries any earlier error messages.
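For what it's worth, the root-owned-files problem (items 1 and 2) is avoidable: Docker can run the container process as your own UID/GID, and the data mount can live anywhere you choose. A minimal sketch, with an illustrative image name and paths rather than the repo's actual invocation:

    # Sketch: run the container as the invoking user instead of root,
    # and mount a dedicated data directory rather than ~/data.
    # NOTE: the image name and mount paths are illustrative placeholders.
    mkdir -p "$HOME/benchmark-data"
    docker run --rm \
      --user "$(id -u):$(id -g)" \
      -v "$HOME/benchmark-data:/data" \
      nvcr.io/nvidia/pytorch:22.10-py3 \
      bash -c 'touch /data/example && ls -l /data'

Files written to /data then come out owned by you on the host, not by root.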
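And for items 3, 4, and 7, a few lines at the top of each prepare/download script would go a long way. A sketch of the kind of guards I mean (the directory list and URL below are placeholders, not the repo's actual values):

    #!/bin/bash
    # Fail fast and echo every command, so the real error is the last
    # thing printed instead of being buried in the scrollback.
    set -euxo pipefail

    # Create destination directories instead of assuming they exist.
    for dir in \
        benchmark/Translation/GNMT/scripts \
        benchmark/LanguageModeling/BERT/data/squad; do
      mkdir -p "$dir"
    done

    # Quieter wget progress: dot:giga emits far fewer lines than the
    # default dot style, so earlier output stays visible.
    wget --progress=dot:giga -O /tmp/dataset.tgz "https://example.com/dataset.tgz"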

Given that no one from Lambda Labs has attempted to address any of the bugs or PRs raised here in the last three years, your best bet is to move on and find a different benchmark.

I would be more than happy to be proven wrong, since this honestly looks pretty amazing, but the track record with the rest of the repo means I'm not holding out much hope.

(P.S. Please prove me wrong.)

Have you been able to get any results, period? I've finally been able to get results, and the benchmarks are working beautifully after some minor config changes and making sure all the paths are correct.

@tfgjustin

We just released a properly dockerized training benchmark for CV models if you're interested: https://github.com/tensorpix/benchmarking-cv-models

You just pull the repo and run the container... it shouldn't take more than 5 minutes to figure everything out.
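The general shape is just clone, build, run (a sketch only; see the README for the actual image name and flags):

    # Hypothetical invocation; check the repo's README for the real commands.
    git clone https://github.com/tensorpix/benchmarking-cv-models
    cd benchmarking-cv-models
    docker build -t benchmarking-cv-models .
    docker run --rm --gpus all benchmarking-cv-models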

Thank you! I'll check it out tomorrow!