huggingface/local-gemma

"auto" shouldn't use "exact" not "memory" if there is enough memory to load the entire model

dsingal0 opened this issue · 8 comments

"auto" mode shouldn't go to "memory" for on a GPU with 24GB VRAM for 9B if "exact" is faster according to published benchmarks.

Hey @dsingal0! Thanks for reporting 🤗 I double-checked this using an NVIDIA TITAN RTX with 24GB of VRAM:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN RTX    On   | 00000000:19:00.0 Off |                  N/A |
|  0%   31C    P8    37W / 280W |  18245MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Using the "auto" preset, Local Gemma correctly identifies that the "exact" config is appropriate for the hardware (18GB of VRAM of the total of 24):

(lg) sanchit@vorace:~/local-gemma$ local-gemma

Loading model with the following characteristics:
- Model name: google/gemma-2-9b-it
- Device: cuda
- Data type: torch.bfloat16
- Optimization preset: exact
- Generation arguments: {'do_sample': True, 'temperature': 0.7}
- Base prompt: None

I'd love to work with you to figure out why this is not being activated on your machine! It's possible we've missed something in the "auto" config presets, or that we're reading the max GPU memory incorrectly.

Could you confirm what result you get by running:

import torch

total_memory = torch.cuda.get_device_properties("cuda:0").total_memory
print(total_memory)

And also the output of nvidia-smi? These two values should match if our "auto" logic is sound (noting that the first is reported in bytes, while nvidia-smi reports in MiB). Thanks for your help!
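
If it helps, here's a small conversion sketch for comparing the two numbers side by side (a minimal sketch; the only assumption is that the GPU of interest is cuda:0):

import torch

# Total GPU memory reported by PyTorch, in bytes
total_memory = torch.cuda.get_device_properties("cuda:0").total_memory

# nvidia-smi reports memory in MiB, so convert for a like-for-like comparison
print(f"{total_memory} bytes = {total_memory / 1024**2:.0f} MiB = {total_memory / 1024**3:.2f} GiB")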

I was able to repro on a 24GB A10G. Logs below.

23609475072
[screenshot]
[Coldboost]
[Coldboost] Loading model with the following characteristics:
[Coldboost] - Model name: google/gemma-2-9b-it
[Coldboost] - Device: cuda
[Coldboost] - Data type: torch.bfloat16
[Coldboost] - Optimization preset: memory
[Coldboost] - Generation arguments: {'do_sample': True, 'temperature': 0.7}
[Coldboost] - Base prompt: None
[Coldboost]
[Coldboost] The capital of France is Paris.
[Coldboost]
[Coldboost]
[Coldboost] LOADING MODEL
[Coldboost] Detected device cuda and defaulting to memory preset.
[Coldboost] Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.14it/s]
[Coldboost] MODEL LOADED
[Coldboost] Completed model.load() execution in 64997 ms

AFTER MODEL IS LOADED:
[screenshot]

Hey @dsingal0 - thanks for the logs, and could you also confirm what the output is of:

import torch

total_memory = torch.cuda.get_device_properties("cuda:0").total_memory
print(total_memory)

Thanks!

@sanchit-gandhi the output of that is
23609475072

Thanks @dsingal0 - that's the value we would expect! Since I'm not able to reproduce this, could you help me by printing some intermediate values to see what the auto-memory function is returning? We can start by doing an editable installation of local-gemma:

git clone https://github.com/huggingface/local-gemma.git
cd local-gemma
pip install -e ."[cuda]"

Then in infer_memory_requirements, we can add a print statement:

    for preset in DTYPE_MODIFIER.keys():
        dtype_total_size = total_size / DTYPE_MODIFIER[preset]
        inference_requirements = 1.2 * dtype_total_size

+       print(preset, total_size, DTYPE_MODIFIER[preset], inference_requirements, total_memory)
        if inference_requirements < total_memory:
            return preset

This should tell us whether we're computing the inference requirements correctly when you run local-gemma from the CLI.
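
For reference, the selection boils down to something like the standalone sketch below. The modifier values and the fallback here are illustrative placeholders, not the library's actual DTYPE_MODIFIER table:

# Standalone sketch of the preset selection, for illustration only.
# total_size: model weight footprint in bytes; total_memory: GPU memory in bytes.
DTYPE_MODIFIER = {"exact": 1, "memory": 2, "memory_extreme": 4}  # hypothetical values

def infer_memory_requirements_sketch(total_size: int, total_memory: int) -> str:
    for preset, modifier in DTYPE_MODIFIER.items():
        dtype_total_size = total_size / modifier
        inference_requirements = 1.2 * dtype_total_size  # 20% margin on top of the weights
        if inference_requirements < total_memory:
            return preset
    # Assumed fallback when nothing fits; the real function may behave differently
    return "memory_extreme"

In other words, a preset is only picked if 1.2x its weight footprint fits within the GPU's total memory, which is why the exact total_memory value reported on your machine matters so much here.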

@dsingal0 Thank you for raising this issue :D

@sanchit-gandhi the output of that is 23609475072

That is ~22GiB, as opposed to the nominal 24GB. It's a known issue with the A10G :( I'm opening a PR to allow the A10G to default to "exact" for the 9B model.
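
For reference, that figure is just a unit conversion (nothing local-gemma specific):

total_memory = 23609475072             # bytes, as reported by torch on the A10G
print(total_memory / 1024**3)          # ~21.99 GiB, noticeably less than the nominal 24GB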

@dsingal0 try installing from main (pipx install git+https://github.com/huggingface/local-gemma.git), and let us know if it is fixed 🤗 A10G should load exact by default now

@gante confirmed working, thanks!

Thu Jul  4 20:16:46 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   39C    P0              75W / 300W |  17888MiB / 23028MiB |     30%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+