BUG: Predict throws `KeyError`
avanikop opened this issue · 2 comments
Facing problems with the vak predict step. Tried setting save_net_output = true as well as = false and got the same error.
I don't know if this belongs on vocalpy or here, but I have been having problems during the vak predict predict.toml step. It always gives this error:
Traceback (most recent call last):
  File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/bin/vak", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/lib/python3.12/site-packages/vak/__main__.py", line 48, in main
    cli.cli(command=args.command, config_file=args.configfile)
  File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/lib/python3.12/site-packages/vak/cli/cli.py", line 49, in cli
    COMMAND_FUNCTION_MAP[command](toml_path=config_file)
  File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/lib/python3.12/site-packages/vak/cli/cli.py", line 18, in predict
    predict(toml_path=toml_path)
  File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/lib/python3.12/site-packages/vak/cli/predict.py", line 51, in predict
    core.predict(
  File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/lib/python3.12/site-packages/vak/core/predict.py", line 231, in predict
    y_pred = pred_dict[spect_path]
             ~~~~~~~~~^^^^^^^^^^^^
KeyError: '/gpfs01/veit/data/thomas/avanitesting2/audio/tweetynet_testdata/spectrograms_generated_240415_152302/bu01bk01_240118_43355031_120323.wav.spect.npz'
0%| | 0/16 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/bin/vak", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/lib/python3.12/site-packages/vak/__main__.py", line 48, in main
    cli.cli(command=args.command, config_file=args.configfile)
  File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/lib/python3.12/site-packages/vak/cli/cli.py", line 49, in cli
    COMMAND_FUNCTION_MAP[command](toml_path=config_file)
  File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/lib/python3.12/site-packages/vak/cli/cli.py", line 18, in predict
    predict(toml_path=toml_path)
  File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/lib/python3.12/site-packages/vak/cli/predict.py", line 51, in predict
    core.predict(
  File "/gpfs01/veit/user/akoparkar/miniconda3/envs/vakenv/lib/python3.12/site-packages/vak/core/predict.py", line 231, in predict
    y_pred = pred_dict[spect_path]
             ~~~~~~~~~^^^^^^^^^^^^
KeyError: '/gpfs01/veit/data/thomas/avanitesting2/audio/tweetynet_testdata/spectrograms_generated_240415_152302/bu01bk01_240118_43310843_120220.wav.spect.npz'
It is not file-specific: I changed the dataset and got the same result.
Working on a cluster with multiple GPUs.
A possible solution was already suggested:
$ CUDA_VISIBLE_DEVICES=0 vak predict my_config
This seems to work for now.
Thank you @avanikop for catching this and helping me track down the source of the issue.
As you pointed out by email:
One (unrelated) thing I noticed is that it creates two "result_datestamp_timestamp" folders during the same run, and the folder with the later timestamp has no max val checkpoint file in the checkpoints folder.
I think you were talking about what happens when you run vak train, but you made me realize that the same issue I'm seeing in #742 might also be causing this error with predict: it's because lightning is defaulting to a "distributed" strategy.
I can verify that lightning running in distributed mode is indeed the source of the bug 😩
If I run on a machine with multiple GPUs, I reproduce your bug with predict.
A workaround for now is to do the following before you run vak:
export CUDA_VISIBLE_DEVICES=0
Basically you force lightning to not run in distributed mode by making it see there's only one GPU.
If I do this then vak predict runs without this error, and the same workaround applies for vak learncurve, and presumably vak train.
Thank you for pointing out you were seeing an extra folder get generated for train -- I thought that was only happening with learncurve. You got me to the root of the problem.
My guess for what's going on is that something about how lightning runs in distributed mode causes us to end up with some keys missing from the dictionary returned by the predict method.
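To make that failure mode concrete, here is a toy sketch (hypothetical names and fake data, not vak's actual predict code) of why a dataloader that gets sharded across ranks would leave keys missing from the dict of per-file outputs:
# Toy illustration only -- not vak's actual code.
# Pretend the dataset has 4 spectrogram files and a distributed strategy
# splits them across 2 ranks, so this process only sees every other file.
spect_paths = ["a.wav.spect.npz", "b.wav.spect.npz",
               "c.wav.spect.npz", "d.wav.spect.npz"]
this_rank_shard = spect_paths[0::2]

# The dict of network outputs only contains the files this rank processed.
pred_dict = {path: "fake_net_output" for path in this_rank_shard}

# But the predict code loops over *all* spectrogram paths in the dataset,
# so lookups for files handled by the other rank raise the KeyError above.
for spect_path in spect_paths:
    y_pred = pred_dict[spect_path]   # KeyError on "b.wav.spect.npz"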
Just so it's clear what I did:
you can do this in two lines:
$ export CUDA_VISIBLE_DEVICES=0
$ vak predict your_config.toml
or in one line (in a way that doesn't "export" the variable to the environment):
$ CUDA_VISIBLE_DEVICES=0 vak predict your_config.toml
Since we're seeing it in train + predict, this means I need to fix this bug sooner rather than later.
I've been needing to do this for my own experiments anyway.
The fix will be something like adding a gpus option that gets passed directly to lightning.Trainer, and then defaulting to a single device.
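As a rough sketch of what I mean (not the actual fix, and the get_trainer helper and its option names are hypothetical; accelerator and devices are lightning.Trainer's real arguments):
import lightning

# Hypothetical sketch: take a device count from the config and pass it
# straight through to lightning.Trainer, defaulting to a single device so
# lightning doesn't pick a distributed strategy on multi-GPU machines.
def get_trainer(accelerator="gpu", devices=1):
    return lightning.Trainer(accelerator=accelerator, devices=devices)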
I will raise a separate issue with a fix.
Fixed by #752