Training with half-precision doesn't work for the torch tile or CUDA bindings
coreylammie opened this issue · 10 comments
Description
Training with half-precision doesn't work for the torch tile or CUDA bindings, e.g., when with torch.autocast(device_type="cuda", dtype=torch.bfloat16): is used in conjunction with rpu_config.runtime.data_type = RPUDataType.HALF.
How to reproduce
Convert a model with either InferenceRPUConfig() or TorchInferenceRPUConfig(), specify rpu_config.runtime.data_type = RPUDataType.HALF, and use with torch.autocast(device_type="cuda", dtype=torch.bfloat16): in the training loop.
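For reference, a condensed sketch of the failing combination (the full MWE is further down in this thread; the tiny placeholder network and tensor shapes here are mine, only the config/autocast combination matters):
import torch
import torch.nn as nn
from aihwkit.simulator.configs import TorchInferenceRPUConfig  # or InferenceRPUConfig
from aihwkit.nn.conversion import convert_to_analog
from aihwkit.simulator.parameters.enums import RPUDataType

# Placeholder model; any converted network shows the same behaviour.
rpu_config = TorchInferenceRPUConfig()
rpu_config.runtime.data_type = RPUDataType.HALF
analog_model = convert_to_analog(nn.Sequential(nn.Linear(8, 8)), rpu_config).cuda()

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # or torch.float16
    out = analog_model(torch.randn(4, 8, device="cuda"))
out.sum().backward()  # the failure shows up on the backward pass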
Expected behavior
https://aihwkit.readthedocs.io/en/latest/api/aihwkit.simulator.parameters.enums.html#aihwkit.simulator.parameters.enums.RPUDataType implies that this is supported.
I see only fp16 in the docs that you linked, but you are doing bf16. Does that explain it? And what exactly fails? Can you give an MWE?
@jubueche no, neither works. The error is different for bfloat16 and float16, and for the torch tile and CUDA bindings.
MWE:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from aihwkit.simulator.configs import InferenceRPUConfig, TorchInferenceRPUConfig
from aihwkit.nn.conversion import convert_to_analog
from aihwkit.optim import AnalogSGD
from aihwkit.simulator.parameters.enums import RPUDataType

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output


if __name__ == "__main__":
    model = Net()
    rpu_config = InferenceRPUConfig()  # or TorchInferenceRPUConfig().
    rpu_config.runtime.data_type = RPUDataType.HALF
    model = convert_to_analog(model, rpu_config)
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    dataset = datasets.MNIST('data', train=True, download=True, transform=transform)
    train_loader = torch.utils.data.DataLoader(dataset, batch_size=32)
    optimizer = AnalogSGD(model.parameters(), lr=0.1)
    model.to(device)
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        with torch.autocast(device_type="cuda", dtype=torch.float16):  # or bfloat16
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
As far as I am aware, we don't have any examples for half precision, so I'm not particularly surprised it doesn't work.
I see that there are some compile options:
option(RPU_USE_FP16 "EXPERIMENTAL: Build FP16 support (only available with CUDA)" OFF)
option(RPU_USE_DOUBLE "EXPERIMENTAL: Build DOUBLE support" OFF)
option(RPU_PARAM_FP16 "EXPERIMENTAL: Use FP16 for (4 + 2) CUDA params" OFF)
option(RPU_BFLOAT_AS_FP16 "EXPERIMENTAL: Use bfloat instead of half for FP16 (only supported for A100+, CUDA 12)" OFF)
@maljoras maybe you know how to enable that?
I can confirm that this currently does not work with the torch tile. I will look into it.
@coreylammie on which GPUs did you try this?
An A100 (80GB). Once we do figure this out, it would be great to add an example for it. I intend to add an example for MobileBERT/SQuAD anyway, so perhaps we can add a single example using half-precision support for that network/task.
@coreylammie Note that your MWE was not even training in FP32. I have changed it to the below:
import tqdm
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from aihwkit.simulator.configs import InferenceRPUConfig, TorchInferenceRPUConfig
from aihwkit.nn.conversion import convert_to_analog
from aihwkit.optim import AnalogSGD
from aihwkit.simulator.parameters.enums import RPUDataType

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output


if __name__ == "__main__":
    model = Net()
    rpu_config = TorchInferenceRPUConfig()
    model = convert_to_analog(model, rpu_config)
    nll_loss = torch.nn.NLLLoss()
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    dataset = datasets.MNIST('data', train=True, download=True, transform=transform)
    train_loader = torch.utils.data.DataLoader(dataset, batch_size=32)
    model = model.to(device=device, dtype=torch.bfloat16)
    optimizer = AnalogSGD(model.parameters(), lr=0.1)
    model = model.train()
    pbar = tqdm.tqdm(enumerate(train_loader))
    for batch_idx, (data, target) in pbar:
        data, target = data.to(device=device, dtype=torch.bfloat16), target.to(device=device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output.float(), target)
        loss.backward()
        optimizer.step()
        pbar.set_description(f"Loss {loss:.4f}")
Ideally, this should train. The autocast is unfortunately only supported for CUDA.
@coreylammie could you use the branch above and see whether all tests pass on GPU and whether you can run the example above? Also, feel free to add the autocast back in; just remove the .float() cast on the output before you feed it into the NLL loss.
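Something along these lines (a sketch only; with autocast, the explicit bfloat16 casts of the model and data would typically be dropped, and the rest of the script above stays the same):
for batch_idx, (data, target) in pbar:
    data, target = data.to(device=device), target.to(device=device)
    optimizer.zero_grad()
    # Autocast re-enabled; no .float() cast before the NLL loss.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        output = model(data)
        loss = F.nll_loss(output, target)
    loss.backward()
    optimizer.step()
    pbar.set_description(f"Loss {loss:.4f}")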
@jubueche first, the MWE was not intended to train. It was intended to reproduce the error, which is raised when loss.backward() is called. Second, the documentation at https://pytorch.org/docs/stable/amp.html#cpu-op-specific-behavior seems to indicate that CPU is also supported by autocast. It lists the following example code:
# Creates model and optimizer in default precision
model = Net()
optimizer = optim.SGD(model.parameters(), ...)

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass with autocasting.
        with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
            output = model(input)
            loss = loss_fn(output, target)

        loss.backward()
        optimizer.step()
Are you sure this is not supported?
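Applied to the analog model, the CPU variant would follow the same pattern (a sketch only; Net, data, and the optimizer setup are placeholders as in the docs snippet, and whether the torch tile actually supports this is exactly the question):
import torch
import torch.nn.functional as F
from aihwkit.simulator.configs import TorchInferenceRPUConfig
from aihwkit.nn.conversion import convert_to_analog
from aihwkit.optim import AnalogSGD

model = convert_to_analog(Net(), TorchInferenceRPUConfig())  # model stays FP32 on the CPU
optimizer = AnalogSGD(model.parameters(), lr=0.1)
for input, target in data:
    optimizer.zero_grad()
    # Forward pass and loss under CPU autocast with bfloat16.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        output = model(input)
        loss = F.nll_loss(output, target)
    loss.backward()
    optimizer.step()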
I see. Maybe I forgot to set something. I will check soon. In the meantime, can you check if it runs on GPU in my PR?
Need to document and add an example.