smallcloudai/refact

Finetune of deepseek-coder fails

Closed this issue · 7 comments

When I try to train deepseek-coder/5.7b/mqa-base, I get: DataLoader worker (pid(s) 10222) exited unexpectedly
I have tried this several times, always with the same result: filtering eventually completes, but finetuning then fails before the first iteration. Finetune settings are at their defaults. No models were being served at the time.

Another person on the Discord channel experiences the same problem.
I have previously successfully tuned Refact/1.6B with basically the same source files.
I am using the current docker image (with the 'latest' tag) on Ubuntu 22.04 with an NVIDIA GeForce RTX 3090 with 24 GB VRAM.

Log files: refact_logs.zip
(I redacted 3 filenames in the attached logs and deleted some repetitive lines.)
The log also contains some errors like this a few seconds before the bus error; I am not sure if they are relevant:
Token indices sequence length is longer than the specified maximum sequence length for this model (28538 > 16384). Running this sequence through the model will result in indexing errors
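
For context on the warning above: it typically appears when a whole file is tokenized in one pass and the result is longer than the model's maximum sequence length. Nothing in this thread ties it to the bus error. A minimal sketch of the usual remedy, splitting a long token sequence into windows of at most 16384 tokens, is shown below. The function name `split_into_windows` is hypothetical, and whether the finetune pipeline already does something equivalent is not stated here.

```python
from typing import Iterator, List


def split_into_windows(token_ids: List[int], max_len: int = 16384) -> Iterator[List[int]]:
    """Yield consecutive slices of token_ids, each at most max_len tokens long."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    for start in range(0, len(token_ids), max_len):
        yield token_ids[start:start + max_len]


# 28538 is the sequence length reported in the warning above;
# the token ids themselves are placeholders.
windows = list(split_into_windows(list(range(28538))))
print([len(w) for w in windows])  # [16384, 12154]
```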

Thanks for reporting! Sergey @JegernOUTT can you please look if we can fix this quickly?

I see this error in the logs

Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit

Probably we have to increase shm limits in our docker container
@olegklimov
I'll try to reproduce it though, to check possible fixes
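
To see how much shared memory the container actually has, one option is to inspect `/dev/shm` from inside it. The sketch below uses only the Python standard library; the helper name `shm_report` is made up for illustration, and it assumes a Linux container where shared memory is mounted at `/dev/shm`. Docker's default for that mount is 64 MB unless `--shm-size` is passed.

```python
import os
import shutil


def shm_report(path: str = "/dev/shm") -> dict:
    """Return total, used and free space of the given mount, in MiB."""
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return {
        "total_mib": usage.total // mib,
        "used_mib": usage.used // mib,
        "free_mib": usage.free // mib,
    }


if __name__ == "__main__":
    if os.path.isdir("/dev/shm"):
        print(shm_report())
    else:
        print("/dev/shm not found (probably not a Linux host)")
```

Running this inside the container before and after changing the shm limit shows whether the new limit took effect.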

FWIW: I ran into the DataLoader worker (pid(s) xxx) exited unexpectedly last night too. This was my first time trying to finetune on a larger number of files (about 8k files, a mix of Ruby and TS). I was able to resolve it by adding --shm-size=16384m to my docker run command. I did not do any testing to see what value would resolve the DataLoader issue, so 16G might be way more than is needed in my case.
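
As a rough illustration of the workaround above, the sketch below validates a `--shm-size` value and assembles a `docker run` argument list. Docker accepts sizes of the form `<number><unit>` with the unit `b`, `k`, `m` or `g`. The helper names `parse_size` and `docker_run_args` are hypothetical, and the `--gpus all` option in the usage lines is an assumption; the project's full documented `docker run` command (ports, volumes) is not quoted in this thread and is left out.

```python
import re
from typing import List, Sequence

# Multipliers for the unit suffixes Docker accepts in --shm-size.
_UNITS = {"": 1, "b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text: str) -> int:
    """Convert a size such as '256m' or '16384m' to bytes."""
    match = re.fullmatch(r"(\d+)([bkmg]?)", text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def docker_run_args(image: str, shm_size: str, extra: Sequence[str] = ()) -> List[str]:
    """Build a 'docker run' argument list with the given shared memory size."""
    parse_size(shm_size)  # raises ValueError on a malformed size
    return ["docker", "run", f"--shm-size={shm_size}", *extra, image]


print(parse_size("16384m") // 1024 ** 3)  # 16 (GiB)
# "--gpus all" is an assumed extra option, not taken from this thread.
print(" ".join(docker_run_args("smallcloud/refact_self_hosting:latest", "16384m", ["--gpus", "all"])))
```

The list form can be passed directly to `subprocess.run` without shell quoting concerns.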

@matthusby yeah, 16384m might be too much
We're figuring out the correct smallest amount and then will add it to instructions
Thank you for testing!

@ryancu7 @matthusby I've tried different shm sizes, looks like 256m was enough for me.
We are going to add that value to the docker run instructions
If you have time, you can check whether it's enough for your systems

Checked smallcloud/refact_self_hosting:nightly
sha256:caf0d0b8cbe153b9e6e5ceef5b974b222c44c56c1103f936ca1fc081ccd753f0 - OK.

Awesome! I am in the middle of some tuning now, but I will check with 256m when finished. I will update this thread if I run into any problems.