IBM/aihwkit

Analog conversion for Large models

WickedStereo opened this issue · 7 comments

Description

I am currently attempting to convert a Large Language Model (Llama-2 7B) to analog using the convert_to_analog function provided by the aihwkit library. However, I am running into a challenge due to the size of the model, which is approximately 14 GB. When I attempt the conversion on a single GPU, the GPU cannot hold the entire model. I also tried an A100 GPU with 80 GB of memory, but I still ran into a CUDA out-of-memory error.

I am seeking guidance on potential solutions to this issue. Specifically, I would like to know whether the conversion of such large models can be performed in a distributed manner, or whether there are other recommended approaches to overcome the memory constraints during the analog conversion process.

How to reproduce

import os
from transformers import AutoModelForCausalLM

from aihwkit.simulator.configs import TorchInferenceRPUConfig
from aihwkit.nn.conversion import convert_to_analog

"""    
# Set the new cache directory
os.environ["TRANSFORMERS_CACHE"] = ""
os.environ["HF_HOME"] = ""
os.environ["HF_DATASETS_CACHE"] = ""
"""

# Model from Hugging Face hub, LLAMA-2 7B in this example
base_model = "meta-llama/Llama-2-7b-chat-hf"

"""
# Enter your access token if required by the model
access_token = ""
"""

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    #token=access_token,
    device_map={"": 0}
)
model.config.use_cache = False
model.config.pretraining_tp = 1

model = convert_to_analog(model, TorchInferenceRPUConfig())

model.remap_analog_weights()

print("analog conversion success")

Expected behavior

Successful analog conversion of large models.

Other information

  • Transformers version: 4.31.0
  • Package version: 0.8.0
  • Python version: 3.9.18

Thank you for opening this. I can reproduce it. I reworked the MWE to remove the dependency on Llama.

from aihwkit.simulator.configs import TorchInferenceRPUConfig
from aihwkit.nn.conversion import convert_to_analog

import torch

model = torch.nn.Sequential(torch.nn.ModuleList([
    torch.nn.Linear(8192, 8192) for _ in range(100)
]))

count = 0
for p in model.parameters():
    count += p.numel()
print(f"Number of parameters is {count:,}")

model = convert_to_analog(model, TorchInferenceRPUConfig())
print("analog conversion success")

My model uses roughly 25 GB of memory (since it is in float32), but I need 128 GB of RAM to convert it. Surprisingly, 64 GB of RAM is not enough.

OK, so with the following changes I can convert it using 64 GB of RAM:

rpu_config = TorchInferenceRPUConfig()
rpu_config.mapping.max_input_size = 0
rpu_config.mapping.max_output_size = 0
model = convert_to_analog(model, rpu_config, inplace=True)

Basically, this creates only one tile per linear layer instead of fragmenting it into many small ones.
In your example, you load the model directly into GPU memory. Try loading it into CPU RAM instead, converting the model to analog with inplace=True, and only then moving it to the GPU, as in the sketch below.
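A minimal sketch of that workflow (my assumptions: a single CUDA device is available, and loading without a device_map keeps the weights in CPU RAM):

from transformers import AutoModelForCausalLM
from aihwkit.nn.conversion import convert_to_analog
from aihwkit.simulator.configs import TorchInferenceRPUConfig

rpu_config = TorchInferenceRPUConfig()
rpu_config.mapping.max_input_size = 0   # one tile per linear layer
rpu_config.mapping.max_output_size = 0

# Load into CPU RAM (no device_map), convert in place, and only then move to the GPU.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = convert_to_analog(model, rpu_config, inplace=True)
model = model.to("cuda")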

@maljoras why do you think the memory requirements get so high with a 512x512 tile shape in this example? Fragmentation?
If we assume one extra state vector of size 512 per tile, we have 16*16*100 tiles in total, each with 512 32-bit floats of 4 bytes, which gives 16*16*100*512*4/1e9 ≈ 0.05 GB of extra memory.
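Written out as a quick back-of-the-envelope check (plain arithmetic, not aihwkit internals):

layers = 100
in_size = out_size = 8192
tile = 512

tiles_per_layer = (in_size // tile) * (out_size // tile)  # 16 * 16 = 256
total_tiles = tiles_per_layer * layers                    # 25,600 tiles
extra_bytes = total_tiles * tile * 4                      # one float32 vector of size 512 per tile
print(f"{total_tiles:,} tiles, ~{extra_bytes / 1e9:.2f} GB of extra state")  # ~0.05 GB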

@maljoras but I think the goal should be linear memory usage during conversion: if the analog version of the model fits in RAM, then converting from digital to analog should also work with that same amount of RAM.

This code shows that we can indeed construct the model using 32GB of RAM:

from aihwkit.simulator.configs import TorchInferenceRPUConfig
import torch
from aihwkit.nn import AnalogSequential, AnalogLinear

rpu_config = TorchInferenceRPUConfig()
rpu_config.mapping.max_input_size = 0
rpu_config.mapping.max_output_size = 0

model = AnalogSequential(torch.nn.ModuleList([
    AnalogLinear(8192, 8192, rpu_config=rpu_config) for _ in range(100)
]))

I agree, you should move the model to CPU before converting, or enable in-place conversion. Otherwise memory usage will at least double, since the old model is still kept in memory.

Right, memory fragmentation also plays a role, since torch caches GPU memory, which reduces the overall usable space.

@jubueche mapped layers are currently using separate smaller weight matrices, which might indeed lead to fragmentation. One could think of using one big tensor and only accessing views of it for the mapped layers. However, that would only be possible with Torch tiles and might lead to complicated access and handling. One could try to implement a TorchArray (replacing the simulator.tiles.array that is used by the mapped layers) that works with views; maybe that would solve the issue.
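A generic PyTorch illustration of the view idea (hypothetical, not the aihwkit tile API): all sub-tiles of one mapped layer are views into a single allocation, so no per-tile copies are made.

import torch

in_size, out_size, tile = 8192, 8192, 512
big_weight = torch.zeros(out_size, in_size)  # one allocation for the whole layer

# Each "tile" is just a view into big_weight; no additional memory is allocated.
tile_views = [
    big_weight[r:r + tile, c:c + tile]
    for r in range(0, out_size, tile)
    for c in range(0, in_size, tile)
]

tile_views[0].fill_(1.0)               # writes through the view...
assert big_weight[0, 0].item() == 1.0  # ...land directly in the big tensor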

However, @WickedStereo, you should make sure that no other CUDA tensor is on the GPU before conversion, and in any case do the analog conversion on CPU; only then move the analog model to the GPU. In that case cache fragmentation should not be an issue. I suspect that some of the CUDA tensors in the state dict get copied and thus take additional memory. Doing everything on CPU and only moving the finished analog model to the GPU is the cleanest way, as in the sketch below.
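A minimal sketch of that order of operations (assuming a CUDA device; a small stand-in model is used here in place of the real one):

import torch
from aihwkit.nn.conversion import convert_to_analog
from aihwkit.simulator.configs import TorchInferenceRPUConfig

# Nothing else should occupy the GPU before the conversion.
assert torch.cuda.memory_allocated() == 0, "other CUDA tensors are still on the GPU"

model = torch.nn.Sequential(torch.nn.Linear(8192, 8192))  # stand-in model, kept on CPU
analog_model = convert_to_analog(model, TorchInferenceRPUConfig(), inplace=True)
analog_model = analog_model.to("cuda")                    # only the analog model goes to the GPU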

Thank you, @jubueche and @maljoras, for resolving the issue. Sorry for the delay in my response. I was able to successfully convert the model.