princeton-nlp/LLM-Shearing

How much compute will this take?

Closed this issue · 7 comments

Hi,
If I want to make a 1B/3B model for Mistral, do you know approximately how many dollars I'll have to spend in compute, and whether I can do it on a consumer GPU? Thanks!

We used approximately 1,845 A100 GPU hours to get the 1.3B model and 3,310 A100 GPU hours for the 2.7B model. However, the actual wall-clock time also heavily depends on your setup and cluster speed.

Also, are you planning to release a sheared Mistral version?

We use an in-house cluster at Princeton! I think an A100 should be more expensive than $0.5 per hour, though.
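To turn the GPU-hour figures above into a dollar estimate, you can just multiply by an hourly rental rate. A minimal sketch; the $1–$4/hour bounds are my own assumption about typical cloud A100 pricing, not figures from this thread:

```python
# GPU-hour figures quoted earlier in this thread.
GPU_HOURS = {"1.3B": 1845, "2.7B": 3310}

def estimate_cost(model: str, usd_per_a100_hour: float) -> float:
    """Estimated dollar cost = A100 GPU hours * assumed hourly rate."""
    return GPU_HOURS[model] * usd_per_a100_hour

# Assumed price range of $1-$4 per A100 hour (varies by provider).
for model in GPU_HOURS:
    low, high = estimate_cost(model, 1.0), estimate_cost(model, 4.0)
    print(f"{model}: ${low:,.0f} - ${high:,.0f}")
```

So at those assumed rates, the 1.3B run lands somewhere in the low thousands of dollars, well beyond a single consumer GPU in any case, since the runs used multi-node A100s.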

> Also, are you planning to release a sheared Mistral version?

We intend to add support for the Mistral and Pythia models in the upcoming weeks. We are short on compute, though, so I am not sure whether we will end up delivering these models before the next, stronger 7B model comes out.

Hi, when you have full control over all of the finetuning data, does it make the most sense to first shear the base model and then finetune on top? Or is it better to finetune in advance (or a mixture of both)? Completely disregarding cost; I'm asking purely about performance and overfitting.

Hi! Yeah, I think it makes the most sense to prune the base model first and then finetune, as it's widely believed that the abilities of language models are acquired during pre-training. This is the cleanest way to execute.

However, I am not too sure what the performance will be like when mixing pre-training and fine-tuning data for pruning -- it might help the pruning process find a submodel that better follows instructions.
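The two orderings being compared can be written out schematically. This is an illustrative sketch only; `prune`, `finetune`, and `prune_with_mixed_data` are hypothetical stand-ins, not entry points from this repo, and the point is purely the sequencing:

```python
def run_pipeline(stages):
    """Compose hypothetical stage names left-to-right over a base model tag."""
    model = "base"
    for stage in stages:
        model = f"{stage}({model})"
    return model

# Recommended above: prune the pretrained base first, then finetune on top.
recommended = run_pipeline(["prune", "finetune"])
print(recommended)  # finetune(prune(base))

# Alternative raised above: fold fine-tuning data into the pruning step itself.
alternative = run_pipeline(["prune_with_mixed_data"])
print(alternative)  # prune_with_mixed_data(base)
```

The first ordering keeps pruning aligned with the pre-training distribution; the second trades that for a submodel that may already be biased toward instruction data.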

Alright, tysm!