
Abrupt exit when training a model

Hi all,

Thanks for putting this package up! I really love the idea behind it and can't wait to integrate it more tightly with my workflow!

I'm trying to integrate Caliban with one of my smaller projects I'm working on here, but I'm having some trouble getting things to run. I added the requirements.txt file as instructed, but when I run the training script, I don't see any visible error and the process exits abruptly.

I'm using a Mac, and my data is stored at /Users/dilip.thiagarajan/data. Here's exactly what I did:

  • In that repository, I first tried running:
caliban run --nogpu --docker_run_args "--volume /Users/dilip.thiagarajan/data:/data" -- --model_name resnet18 --projection_dim 64 --fast_dev_run True --download --data_dir /data

When I run this from the terminal, I see the following output:

while when I output to log by doing

caliban run --nogpu --docker_run_args "--volume /Users/dilip.thiagarajan/data:/data" -- --model_name resnet18 --projection_dim 64 --fast_dev_run True --download --data_dir /data &> caliban_run.log &

I see the following in my trace:

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/Users/dilip.thiagarajan/.pyenv/versions/3.7.3/lib/python3.7/logging/", line 2039, in shutdown
  File "/Users/dilip.thiagarajan/.pyenv/versions/3.7.3/lib/python3.7/site-packages/absl/logging/", line 864, in close
AttributeError: 'TqdmFile' object has no attribute 'close'

Is this a problem with some interaction with logging and tqdm? Or is it something I'm doing that's incorrect when I'm mounting my data directory?

The following works properly for me locally:
python3 --model_name resnet18 --projection_dim 64 --fast_dev_run True --data_dir ~/data --download

Thanks for your help!

Thanks for the report, @dthiagarajan! It looks like there are two issues occurring here.

The first, simple one is that you've discovered a bug in the way we were handling the stdout dance when we shell out to build the docker container. I wasn't explicitly flushing stdout, which caused the build steps to appear at the end of your caliban_run.log. I also, as you can see, needed to implement a close() method on TqdmFile. That's all covered in #30 , and we should have a new release out today.

But that's not what's causing the problem in your training job. Looking around a bit it seems that "return code 137" is Docker's way of signaling that it's run out of memory (moby/moby#21083, as an example).
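As a rough illustration of why 137 points at a kill rather than a crash in the script: by the usual shell convention, an exit code above 128 means the process was terminated by signal number (code - 128), and 137 - 128 = 9 is SIGKILL, which is what the kernel sends when a container exceeds its memory limit. The helper name `describe_exit_code` below is hypothetical and is not part of Caliban or Docker; this is only a sketch of the arithmetic.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a process or container exit code (hypothetical helper)."""
    if code > 128:
        # Convention: exit code = 128 + number of the signal that killed the process.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return "exited normally" if code == 0 else f"exited with error {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```

A SIGKILL does not always mean out-of-memory (a manual `docker kill` produces the same code), so treat 137 as a strong hint rather than proof.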

I think this may be a Mac-only problem, and solvable this way:

"Repeating what's said above, this happens on OSX because of Docker 4 Mac's hard memory cap. You can increase your memory limit in Docker App > Preferences > Advanced."

On a Mac, you can click the "Docker Desktop" menu in the menu bar, click "Preferences" and increase the available memory in the "Resources" tab:


I think this is going to be the cleanest solution. I'll poke around and see if there is some setting we can enable by default that will allow Docker to access more memory, or at least catch this error and make it clearer to the user what's going on.

Please let me know if this helps and gets you unblocked! Thanks again for the report, @dthiagarajan , and for testing out Caliban.

This was a world-class bug report, by the way! Thanks for the care it took to write.

Ah, I hadn't noticed that error code - that seems to fix the memory issue, thanks!

On another note (and more nitpicky), I'm seeing something like the following with the tqdm progress bar updating when running with caliban:

Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/2 [00:00<?, ?it/s]
Epoch 1:   0%|          | 0/2 [00:00<?, ?it/s]
Epoch 1:  50%|█████     | 1/2 [00:03<00:03,  3.29s/it]
Epoch 1:  50%|█████     | 1/2 [00:03<00:03,  3.29s/it, loss=3.435, v_num=5]
Epoch 1: 100%|██████████| 2/2 [00:04<00:00,  2.22s/it, loss=3.435, v_num=5]
Epoch 1: 100%|██████████| 2/2 [00:04<00:00,  2.27s/it, loss=3.435, v_num=5]
Executing:   0%|                                                                                 | 0/1 [00:11<?, ?experiment/s]

whereas when I run locally, I see the following:

Epoch 1: 100%|██████████████████████████████████████████████████████████████| 2/2 [00:20<00:00, 10.12s/it, loss=3.306, v_num=5]

Do I need to specify something when I'm logging in my script? I'm wondering 1) why the progress bar is much longer in the latter than in the former, and 2) why it's logging duplicates.

@dthiagarajan , I knew that this was a problem and I'd tried to fix it before and failed... but you've successfully motivated me to tackle the issue. Progress bars are too awesome to have to give up inside Caliban jobs. (Especially when I'm using tqdm myself to show how many jobs you've completed!)

I've solved this problem in #31. Once I get this merged today, I'll release 0.2.6 and let you know here on this ticket.

Incidentally it makes our tutorial much prettier!

Thanks again for the nudge.

@dthiagarajan The details here are:

  • tqdm uses carriage returns, like \r, to rewrite the current line. When you're running another Python job in a subprocess, Python doesn't pass those through without some work.
  • Python buffers its output, which is a mess here, because tqdm uses both stdout and stderr to write its outputs.
  • Docker doesn't have a COLUMNS or LINES variable internally when you run a container in non-interactive mode!
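
The three points above can be sketched in a few lines of Python. This is not the code from #31, only a minimal illustration under stated assumptions: the function name `stream_child` and the default terminal size of 80x24 are made up for the example. It merges the child's stderr into stdout so both streams stay in order, reads the pipe unbuffered one byte at a time so that \r reaches the terminal as soon as it is written, and sets COLUMNS and LINES for the child.

```python
import codecs
import os
import subprocess
import sys


def stream_child(cmd, columns=80, lines=24):
    """Run cmd, echoing its output live; return (exit code, raw output bytes)."""
    env = dict(
        os.environ,
        PYTHONUNBUFFERED="1",  # ask a Python child not to buffer its output
        COLUMNS=str(columns),  # give progress bars a terminal width to target
        LINES=str(lines),
    )
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # tqdm writes to stderr by default; merge it
        env=env,
        bufsize=0,  # unbuffered pipe on the parent side
    )
    # Decode incrementally so multi-byte characters (e.g. the bar glyphs)
    # are not split when reading one byte at a time.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    captured = bytearray()
    while True:
        chunk = proc.stdout.read(1)
        if not chunk:  # empty read means the child closed its output
            break
        captured += chunk
        sys.stdout.write(decoder.decode(chunk))
        sys.stdout.flush()  # flush right away so \r redraws the line live
    return proc.wait(), bytes(captured)


if __name__ == "__main__":
    # Demo child: rewrites a single line three times using carriage returns.
    child = [
        sys.executable,
        "-c",
        "import sys, time\n"
        "for i in range(3):\n"
        "    sys.stderr.write(f'\\rstep {i + 1}/3')\n"
        "    time.sleep(0.1)\n"
        "sys.stderr.write('\\n')\n",
    ]
    stream_child(child)
```

For a container, the same environment variables can be supplied with `docker run -e COLUMNS=... -e LINES=...`; whether that matches what Caliban does internally is an assumption here.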

#31 tackles each of these. It's not perfect — I suspect if you nest progress bars, you may run into trouble, but maybe not. If you have a tqdm process and write a bunch of output inside the loop, that might trigger a newline as well.

But this solves most of the issues we'd seen, and I think you'll be happier with the result for sure.

Okay @dthiagarajan, I've just cut the 0.2.6 release with these changes.

The build should finish shortly and deploy this to pypi. Upgrade with:

pip install -U caliban

and please let us know if this fixes the issue. I'm going to go ahead and close this now, but feel free to re-open if you run into trouble. Thank you!

Awesome work @sritchie, thank you so much!