tianyic/only_train_once

oto.compress failed with "xs.append(param.data.view(cc.num_groups, -1))" in graphy.py

Closed this issue · 47 comments

songkq commented

@tianyic Hi, when I tried OTO with the following case, oto.compress failed. Could you please give some advice?

import torch
import torch.nn as nn
from only_train_once import OTO


class DemoNet(nn.Module):

    def __init__(self) -> None:
        super().__init__()

        
        self.fc = nn.Sequential(
            nn.Linear(1024, 512),
            nn.Linear(512, 256)
        )

    def forward(self, x):

        # x: [1, 512, 2, 81]
        x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
        x = x.squeeze(-1)
        return self.fc(x)

if __name__ == "__main__":
    
    model = DemoNet()
    model.eval()
    fake_input = torch.randn((1, 512, 2, 81))
    print(f"{model(fake_input).shape}")
    oto = OTO(model=model, dummy_input=fake_input)
    oto.compress()

Thanks for reaching out. I have taken a quick look. It seems that the lines below (which seem a bit unnecessary to me)

x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
x = x.squeeze(-1)

change the construction of the torch trace graph, in particular the stem vertex type, from linear or gemm to matmul. The matmul operator is not yet included in the supported operators, which causes the failure.

See the dependency graph below under a normal input for the linear layers, fake_input = torch.randn((1, 1024))

[dependency graph image]

def forward(self, x):
      return self.fc(x)

fake_input = torch.randn((1, 1024))
oto = OTO(model=model, dummy_input=fake_input)

versus the dependency graph under the example's numerous preprocessing operators

[dependency graph image]

def forward(self, x):
      x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
      x = x.squeeze(-1)
      return self.fc(x)

fake_input = torch.randn((1, 512, 2, 81))
oto = OTO(model=model, dummy_input=fake_input)
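
If those preprocessing steps are genuinely needed, one possible workaround (a sketch I have not run through OTO, so treat it as an assumption) is to flatten to 2-D right before the fc block, since a 2-D input typically keeps nn.Linear traced as gemm rather than matmul:

def forward(self, x):
    # x: [1, 512, 2, 81]
    x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
    x = x.squeeze(-1)                  # [1, 81, 1024]
    b, t, c = x.shape
    x = self.fc(x.reshape(b * t, c))   # 2-D input -> Linear traced as gemm
    return x.reshape(b, t, -1)         # [1, 81, 256]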

Please see the comments below regarding how to utilize OTO more properly.

  • Set ZIGs to zero before compression. To get a compressed model promptly, besides training the model via DHSPG to yield a highly group-sparse solution in terms of ZIGs, we can also randomly set a subset of ZIGs to zero and then call the compress API. For example,
import torch
import torch.nn as nn
from only_train_once import OTO

class DemoNet(nn.Module):

    def __init__(self) -> None:
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(1024, 512),
            nn.Linear(512, 256)
        )

    def forward(self, x):
        return self.fc(x)

if __name__ == "__main__":
    
    model = DemoNet()
    model.eval()
    fake_input = torch.randn((1, 1024))
    print(f"{model(fake_input).shape}")
    oto = OTO(model=model, dummy_input=fake_input)
    oto.visualize_zigs(view=False)
    oto.random_set_zero_groups() # Randomly set a subset of ZIGs to be zero.
    oto.compress()
  • Leverage Dependency Graph Visualization to visualize the dependency graphs, which usually reveals the root cause of potential failures promptly. See oto.visualize_zigs(view=False) above, which generates a $model_name.pdf.

  • Check whether the operators displayed in the dependency graph are supported by OTOv2.

Hope the above helps. Meanwhile, we are working on the next generation of the library and will keep adding more tutorials and documentation. Thanks for using our tool! Feel free to leave any other feedback.

songkq commented

@tianyic Thanks.
It seems that conv1d is a workable alternative. By the way, I'm wondering whether we can configure OTO with a blacklist, so that unsupported operators are automatically ignored and kept intact during pruning.
Also, I think adding functionality to round the number of pruned channels to an expected multiple (32, 16, or 8, for example) would be useful for deployment on edge devices such as NPUs.

[dependency graph image]

[dependency graph image]

class DemoNet(nn.Module):

    def __init__(self) -> None:
        super().__init__()

        self.conv1d = nn.Sequential(
            nn.Conv1d(1024, 512, 1, 1, 0, bias=True),
            nn.Conv1d(512, 256, 1, 1, 0, bias=True)
        )

    def forward(self, x):

        # x: [1, 512, 2, 81]
        x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
        x = x.squeeze(-1).permute(0, 2, 1)
        return self.conv1d(x)


if __name__ == "__main__":

    model = DemoNet()
    model.eval()
    fake_input = torch.randn((1, 512, 2, 81))
    print(f"{model(fake_input).shape}")
    oto = OTO(model=model, dummy_input=fake_input)
    oto.visualize_zigs(view=False)
    oto.random_set_zero_groups() # Randomly set a subset of ZIGs to be zero.
    oto.compress()
songkq commented

@tianyic How can I configure the parameters of oto.dhspg when using the AdamW optimizer?

Glad that you found alternative operators to make the library work. The blacklist is a good idea; we will consider it as bandwidth permits.

An official tutorial covering applications with Adam and AdamW will be provided in about 2-3 weeks. As a hotfix for your question, please try the optimizer setting below.

optimizer = oto.dhspg(
        variant="adamw",
        lr=1e-3, # set same as the baseline training
        weight_decay=1e-2,  # set same as the baseline training
        first_momentum=0.9, # set same as the baseline training
        second_momentum=0.999, # set same as the baseline training
        dampening=0.0, # set same as the baseline training
        target_group_sparsity=0.8,  # choose based on how much you want to compress
        start_pruning_steps=X * len(trainloader), # start pruning after X epochs; starting at about 1/5 of the total epochs is typically fine
        lmbda=1e-2, # larger values promote group sparsity more effectively
        lmbda_amplify=20, # larger values promote group sparsity more effectively
        hat_lmbda_coeff=1e3, # larger values promote group sparsity more effectively
        epsilon=0.0  # enlarge it if group sparsity does not meet target_group_sparsity
    )
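
As a rough sketch of how this optimizer slots into a standard training loop (model, trainloader, criterion, and num_epochs are placeholders from your side, not part of OTO):

for epoch in range(num_epochs):
    for inputs, targets in trainloader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()  # DHSPG handles both the baseline update and the pruning projection internally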
songkq commented

@tianyic Thanks.
When I execute step 2 of the pipeline, will the group sparsity learned in step 1 be reset from scratch?
1. oto training -> save model & optimizer checkpoint -> stop training
2. load checkpoint -> resume oto training -> oto.compress

Also, I'm wondering whether I can export the pruned ONNX model through the pipeline:
1. oto training -> save model & optimizer checkpoint -> stop training
2. load checkpoint -> oto.compress

Before reaching start_pruning_steps, what are the differences between using the oto.dhspg optimizer and the original torch AdamW optimizer? How does start_pruning_steps affect the accuracy of the pruned model?
Which parameter of oto.dhspg is dominant for the accuracy of the pruned model?

Both pipelines are supported. For the first pipeline, to preserve the learned group sparsity, you need to set the dhspg optimizer argument fixed_zero_groups=True and then resume OTO training.
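
A rough sketch of the first pipeline (the file names are placeholders; fixed_zero_groups=True is the OTO-specific piece):

# Run 1: DHSPG training, then save before stopping.
torch.save(model.state_dict(), "model_ckpt.pt")

# Run 2: reload the model, rebuild OTO, and resume with the zero groups preserved.
model.load_state_dict(torch.load("model_ckpt.pt"))
oto = OTO(model=model, dummy_input=fake_input)
optimizer = oto.dhspg(
        variant="adamw",
        lr=1e-3,
        target_group_sparsity=0.7,
        fixed_zero_groups=True,  # keep the groups already driven to zero in run 1
    )
# ... resume training, then oto.compress()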

One more note, in case you run into it: during pruning, while the group sparsity is increasing, the loss may regress a bit depending on the application. If so, don't worry; once the group sparsity reaches the target, the loss will decrease again until final convergence.

This is a good question regarding start_pruning_steps, for which we will provide detailed explanations of DHSPG, maybe a video tutorial.

In short, DHSPG is a hybrid optimizer. It applies the baseline optimizer over all variables before pruning starts, and over the variables considered potentially important during pruning. For the variables considered possibly redundant, a step called the Half-Space step is performed to project them onto zero. Once the group sparsity reaches the target, the optimizer performs as the baseline optimizer until final convergence.
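
A highly simplified conceptual sketch of one DHSPG step (this is not the library's implementation; partition_groups and the two update helpers are purely illustrative):

def dhspg_step(params, step, start_pruning_steps, target_reached):
    # Before pruning starts, or after the target group sparsity is reached,
    # DHSPG behaves like the baseline optimizer (e.g., AdamW).
    if step < start_pruning_steps or target_reached:
        baseline_update(params)
    else:
        # During pruning: split trainable groups (ZIGs) into important vs. maybe-redundant.
        important, redundant = partition_groups(params)
        baseline_update(important)       # important groups keep the baseline update
        half_space_update(redundant)     # Half-Space step projects redundant groups toward zero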

The final accuracy typically depends on 1. what the baseline model can achieve, 2. whether enough warm-up steps are given, and 3. whether sufficiently many steps are given after reaching the target group sparsity.

More documentation and tutorials will be provided with more detailed instructions.

songkq commented

@tianyic Thanks. Looking forward to the tutorials.
It seems that the DHSPG optimizer is slower than the torch AdamW optimizer. Could you please give some advice on speeding up the optimizer?

A good question.

The DHSPG optimizer is a hybrid optimizer which indeed has some computational overhead during pruning (while the group sparsity is increasing). The overhead varies by model and dataset. For the majority of models it is negligible, but for some it is not (the worst case I have met doubled the cost). Note, however, that the overhead is temporary and disappears once the group sparsity reaches the target value (afterwards DHSPG performs the same as the baseline optimizer).

Therefore, to speed up, I would suggest shortening the pruning procedure, i.e., making the group sparsity increase faster to reach the target value, which can typically be achieved by fine-tuning the hyperparameters related to group sparsity exploration. In fact, in most of the experiments I conducted, the pruning stage could be shrunk into just a few epochs, which largely mitigates the overhead.
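
For instance (assuming the default knobs are too conservative for your model and data), a shorter pruning stage usually comes from starting pruning early and raising the sparsity-exploration hyperparameters already mentioned above:

optimizer = oto.dhspg(
        variant="adamw",
        lr=1e-3,
        target_group_sparsity=0.7,
        start_pruning_steps=1 * len(trainloader),  # start pruning after the first epoch
        lmbda=1e-2,
        lmbda_amplify=20,
        hat_lmbda_coeff=1e3,  # larger values drive group sparsity to the target faster
    )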
Meanwhile, there might be some engineering tricks in the official torch version that could be leveraged to further speed up DHSPG.

Hope the above helps.

songkq commented

@tianyic Hi, when I apply OTO to the C2f module used in YOLOv8, it fails with an error about the slice and concat operations. How can I solve the problem?

Traceback (most recent call last):
  File "test_oto_c2f.py", line 117, in <module>
    oto = OTO(model=model, dummy_input=fake_input)
  File "/root/miniconda3/lib/python3.8/site-packages/only_train_once/__init__.py", line 17, in __init__
    self.partition_zigs()
  File "/root/miniconda3/lib/python3.8/site-packages/only_train_once/__init__.py", line 28, in partition_zigs
    self._graph = automated_partition_zigs(self._graph)
  File "/root/miniconda3/lib/python3.8/site-packages/only_train_once/zig/zig.py", line 125, in automated_partition_zigs
    graph.set_zigs(opt)
  File "/root/miniconda3/lib/python3.8/site-packages/only_train_once/graph/graph.py", line 417, in set_zigs
    dfs_helper(self, auxilary_cc, auxilary_cc.dependent_stem_ccs)
  File "/root/miniconda3/lib/python3.8/site-packages/only_train_once/graph/graph.py", line 410, in dfs_helper
    node_in = graph.nodes[node_in_id]
KeyError: 'out-28'
[debug] concat_node.inputs = ['out-28', 'out-29', 'out-35']
[debug] graph.nodes = dict_keys(['out-25', 'out-26', 'out-27', 'out-28-29', 'out-30', 'out-31', 'out-32', 'out-33', 'out-34', 'out-35', 'out-36', 'out-37', 'out-38', 'out-39'])
from typing import Callable
import torch
import torch.nn as nn
from functools import partial
from only_train_once import OTO

def autopad(k, p=None, d=1):  # kernel, padding, dilation
    # Pad to 'same' shape outputs
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p

class Conv(nn.Module):
    # Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)
    default_act = nn.LeakyReLU(inplace=True, negative_slope=0.1)  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        return self.act(self.conv(x))

class Bottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, k=(3, 3), e=0.5):  # ch_in, ch_out, shortcut, groups, kernels, expand
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, k[0], 1)
        self.cv2 = Conv(c_, c2, k[1], 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))


class C2f(nn.Module):
    # CSP Bottleneck with 2 convolutions
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        self.c = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, 2 * self.c, 1, 1)
        self.cv2 = Conv((2 + n) * self.c, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))

    def forward(self, x):
        # slice
        y = list(self.cv1(x).chunk(2, 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

    def forward_split(self, x):
        y = list(self.cv1(x).split((self.c, self.c), 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

class C2fModule(nn.Module):

    def __init__(self, c1=512, c2=256):
        super().__init__()
        self.c2f = C2f(c1, c2, n=1, shortcut=False, g=1, e=0.5)

    def forward(self, x):
        return self.c2f(x)



if __name__ == "__main__":

    model = C2fModule()
    model.eval()
    fake_input = torch.randn((1, 512, 4, 80))
    print(f"{model(fake_input).shape}")
    oto = OTO(model=model, dummy_input=fake_input)
    oto.visualize_zigs(view=False)
    oto.random_set_zero_groups() # Randomly set a subset of ZIGs to be zero.
    oto.compress()

Thanks for the above example @songkq. Will take a look during the week and provide guidance later.

Thanks for the example @songkq. I have taken a quick look. We will support the slice operator better in a future release.

For a hotfix, please see the alternative below that avoids slice, where I decompose the conv whose output is sliced into two separate convs.

import torch
import torch.nn as nn
from only_train_once import OTO

def autopad(k, p=None, d=1):  # kernel, padding, dilation
    # Pad to 'same' shape outputs
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p

class Conv(nn.Module):
    # Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)
    default_act = nn.LeakyReLU(inplace=True, negative_slope=0.1)  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=True)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        return self.act(self.conv(x))

class Bottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, k=(3, 3), e=0.5):  # ch_in, ch_out, shortcut, groups, kernels, expand
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, k[0], 1)
        self.cv2 = Conv(c_, c2, k[1], 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))


class C2f(nn.Module):
    # CSP Bottleneck with 2 convolutions
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        self.c = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, self.c, 1, 1)
        self.cv2 = Conv(c1, self.c, 1, 1)
        self.cv3 = Conv((2 + n) * self.c, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))

    def forward(self, x):
        y = [self.cv1(x), self.cv2(x)]
        y.extend(m(y[-1]) for m in self.m)
        return self.cv3(torch.cat(y, 1))

    def forward_split(self, x):
        y = list(self.cv1(x).split((self.c, self.c), 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

class C2fModule(nn.Module):
    def __init__(self, c1=512, c2=256):
        super().__init__()
        self.c2f = C2f(c1, c2, n=1, shortcut=False, g=1, e=0.5)

    def forward(self, x):
        return self.c2f(x)

if __name__ == "__main__":

    model = C2fModule()
    model.eval()
    fake_input = torch.randn((1, 512, 4, 80))
    print(f"{model(fake_input).shape}")
    oto = OTO(model=model, dummy_input=fake_input)
    oto.visualize_zigs(view=False)
    oto.random_set_zero_groups() # Randomly set a subset of ZIGs to be zero.
    oto.compress()

    import onnxruntime as ort
    full_ort_sess = ort.InferenceSession(oto.full_model_path)
    compress_ort_sess = ort.InferenceSession(oto.compressed_model_path)
    
    full_output = full_ort_sess.run(None, {'input.1': fake_input.numpy()})[0]
    compress_output = compress_ort_sess.run(None, {'input.1': fake_input.numpy()})[0]
    print("Output difference:")
    print(full_output - compress_output)

The full and compressed models yield the same outputs. Hope the above helps.

songkq commented

@tianyic Thanks. I will try it out.
I met another problem: with group_sparsity, omega = optimizer.compute_group_sparsity_omega(), the returned group_sparsity stays zero throughout training, even after reaching the configured start_pruning_steps. I set up the oto.dhspg optimizer as follows. I'm confused about why OTO didn't take effect.

target_group_sparsity: 0.1
start_pruning_steps: 1000
hat_lmbda_coeff: 10.0
lmbda: 0.001
lmbda_amplify: 2.0

optimizer = oto.dhspg(
        variant="adamw",
        lr=1e-3,
        weight_decay=1e-2,  
        first_momentum=0.9, 
        second_momentum=0.999, 
        dampening=0.0,
        target_group_sparsity=0.1,  
        start_pruning_steps=1000, 
        lmbda=1e-3, 
        lmbda_amplify=2.0, 
        hat_lmbda_coeff=10,
        epsilon=0.95
    )

A good question @songkq. It is largely due to the hyperparameter settings. AdamW and SGD typically require different settings for the lambda-related (group sparsity exploration) hyperparameters because of their different gradient estimation mechanisms. Please give the setting below a try. We will cover it in the coming tutorials.

Meanwhile, we have an ongoing plan to further optimize and simplify the hyperparameter list to bring more convenience for users, including ourselves (since we are actively applying OTO to a lot of DNN application-track research and products).

optimizer = oto.dhspg(
        variant="adamw",
        lr=1e-3, 
        weight_decay=1e-2,  
        first_momentum=0.9, 
        second_momentum=0.999,
        dampening=0.0, 
        target_group_sparsity=0.1,  
        start_pruning_steps=1000, 
        lmbda=1e-2, # larger values promote group sparsity more effectively
        lmbda_amplify=20, # larger values promote group sparsity more effectively
        hat_lmbda_coeff=1e3, # larger values promote group sparsity more effectively
        epsilon=0.95  # larger values promote group sparsity more effectively
    )

I updated the repo to auto-select hyperparameters for different variants. You can now just set up the optimizer as

optimizer = oto.dhspg(
        variant="adamw",
        lr=1e-3, 
        target_group_sparsity=0.1,  
        start_pruning_steps=1000, 
    )

which should work for the majority of experiments.
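
To verify that pruning is actually progressing, the group sparsity can be monitored during training, e.g. (trainloader, model, and criterion are placeholders):

for step, (inputs, targets) in enumerate(trainloader):
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        # group_sparsity should start rising after start_pruning_steps
        group_sparsity, omega = optimizer.compute_group_sparsity_omega()
        print(f"step {step}: loss={loss.item():.4f}, group_sparsity={group_sparsity:.4f}")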

songkq commented

@tianyic Good job. Thanks!

songkq commented

@tianyic Hi, I have attempted to prune my network with a target_group_sparsity of 0.1/0.35/0.5. However, I found that only the last two layers, self.conv1d, were pruned, while the cnn_backbone was not pruned at all. I'm confused about why OTO cannot globally prune the network.

class DemoNet(nn.Module):

    def __init__(self) -> None:
        super().__init__()

        self.cnn_backbone = ...
        self.conv1d = nn.Sequential(
            nn.Conv1d(1024, 512, 1, 1, 0, bias=True),
            nn.Conv1d(512, 256, 1, 1, 0, bias=True)
        )

    def forward(self, x):

        x = self.cnn_backbone(x)
        # x: [1, 512, 2, 81]
        x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
        x = x.squeeze(-1).permute(0, 2, 1)
        return self.conv1d(x)

@songkq Could you please share the dependency graph with me? I can then take a quick look.

You are right that OTO globally prunes the whole network. My gut feeling is that your issue can typically be resolved via minor adjustments either to the network arch or to the operator list.

Just in case the dependency graph is confidential, you could send it via email to Tianyi.Chen@microsoft.com

Meanwhile, I would recommend proceeding with a sanity check before engaging in DHSPG training @songkq. The sanity check randomly sets a set of ZIGs to zero, and a compressed model is generated afterwards. If the compressed model looks normal and returns the exact same output as the full model given the same random input, the sanity check passes. Afterwards, DHSPG is triggered to train and identify redundant groups from the view of optimization rather than random selection.

oto.random_set_zero_groups() # Randomly set a subset of ZIGs to be zero.
oto.compress()

import onnxruntime as ort
full_ort_sess = ort.InferenceSession(oto.full_model_path)
compress_ort_sess = ort.InferenceSession(oto.compressed_model_path)
    
full_output = full_ort_sess.run(None, {'input.1': fake_input.numpy()})[0]
compress_output = compress_ort_sess.run(None, {'input.1': fake_input.numpy()})[0]
print("Output difference:")
print(full_output - compress_output) # Should be essentially all zeros.

@songkq Please take a look at this newly raised issue, which I suspect might be a similar situation to yours. If so, please let me know whether your onnx version is also 1.14. Thanks.

#13

songkq commented

@tianyic Thanks.
I have done the sanity check. It shows exactly that only the last two layers were pruned by oto.random_set_zero_groups() and oto.compress(). The maximum difference between full_output and compress_output is about 4.4703484e-08. I'm wondering whether the reshape and transpose operations cause the problem.

x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
x = x.squeeze(-1).permute(0, 2, 1)

testcase:

def autopad(k, p=None, d=1):  # kernel, padding, dilation
    # Pad to 'same' shape outputs
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p

class Conv(nn.Module):
    # Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)
    default_act = nn.LeakyReLU(inplace=True, negative_slope=0.1)  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        return self.act(self.conv(x))

class Bottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, k=(3, 3), e=0.5):  # ch_in, ch_out, shortcut, groups, kernels, expand
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, k[0], 1)
        self.cv2 = Conv(c_, c2, k[1], 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))


class C2f_rep(nn.Module):
    # CSP Bottleneck with 2 convolutions
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()

        self.kwargs = {"c1": c1, "c2": c2, "n": n, "shortcut": shortcut, "g": g, "e": e}

        self.c = int(c2 * e)  # hidden channels
        self.cv0 = Conv(c1, self.c, 1, 1)
        self.cv1 = Conv(c1, self.c, 1, 1)
        self.cv2 = Conv((2 + n) * self.c, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))

    def forward(self, x):
        # slice
        # y = list(self.cv1(x).chunk(2, 1))
        y = [self.cv0(x), self.cv1(x)]
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

    def forward_split(self, x):
        y = list(self.cv1(x).split((self.c, self.c), 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

class rC2fModule(nn.Module):

    def __init__(self, c1=512, c2=256):
        super().__init__()

        self.c2f = C2f_rep(c1, c2, n=1, shortcut=False, g=1, e=0.5)

    def forward(self, x):

        return self.c2f(x)

class DemoC2fNet(nn.Module):

    def __init__(self) -> None:
        super().__init__()
        
        self.c2f = rC2fModule(c1=512, c2=512)
        self.conv1d = nn.Sequential(
            nn.Conv1d(1024, 512, 1, 1, 0, bias=True),
            nn.Conv1d(512, 256, 1, 1, 0, bias=True)
        )

    def forward(self, x):

        x = self.c2f(x)
        # x: [1, 512, 2, 81]
        x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
        x = x.squeeze(-1).permute(0, 2, 1)
        return self.conv1d(x)


if __name__ == "__main__":

    model = DemoC2fNet()
    model.eval()
    fake_input = torch.randn((1, 512, 2, 81))

    oto = OTO(model=model, dummy_input=fake_input)
    # oto.visualize_zigs(view=False)
    oto.random_set_zero_groups() # Randomly set a subset of ZIGs to be zero.
    oto.compress()

    exit()

My envs:

torch == 1.8.1
onnx == 1.10.1

Thanks for sharing @songkq. Will take a look this week; quite occupied early in the week.

@songkq Thanks for the example. I took a quick look at it. There exist some tensor alignment issues due to discrepancies among different dependency versions. For your case, could you please try setting bias=True? We will proceed with more rigorous improvements to make the tensor alignment more robust against varying dependencies. For more reliable use of OTO, I suggest setting bias=True for layers, and setting affine=True for normalization layers such as BN.

self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=True)

Then run the sanity check again

def autopad(k, p=None, d=1):  # kernel, padding, dilation
    # Pad to 'same' shape outputs
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p

class Conv(nn.Module):
    # Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)
    default_act = nn.LeakyReLU(inplace=True, negative_slope=0.1)  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=True)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        return self.act(self.conv(x))

class Bottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, k=(3, 3), e=0.5):  # ch_in, ch_out, shortcut, groups, kernels, expand
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, k[0], 1)
        self.cv2 = Conv(c_, c2, k[1], 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))


class C2f_rep(nn.Module):
    # CSP Bottleneck with 2 convolutions
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()

        self.kwargs = {"c1": c1, "c2": c2, "n": n, "shortcut": shortcut, "g": g, "e": e}

        self.c = int(c2 * e)  # hidden channels
        self.cv0 = Conv(c1, self.c, 1, 1)
        self.cv1 = Conv(c1, self.c, 1, 1)
        self.cv2 = Conv((2 + n) * self.c, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))

    def forward(self, x):
        # slice
        # y = list(self.cv1(x).chunk(2, 1))
        y = [self.cv0(x), self.cv1(x)]
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

    def forward_split(self, x):
        y = list(self.cv1(x).split((self.c, self.c), 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

class rC2fModule(nn.Module):

    def __init__(self, c1=512, c2=256):
        super().__init__()

        self.c2f = C2f_rep(c1, c2, n=1, shortcut=False, g=1, e=0.5)

    def forward(self, x):

        return self.c2f(x)

class DemoC2fNet(nn.Module):

    def __init__(self) -> None:
        super().__init__()
        
        self.c2f = rC2fModule(c1=512, c2=512)
        self.conv1d = nn.Sequential(
            nn.Conv1d(1024, 512, 1, 1, 0, bias=True),
            nn.Conv1d(512, 256, 1, 1, 0, bias=True)
        )

    def forward(self, x):

        x = self.c2f(x)

        # # x: [1, 512, 2, 81]
        x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
        x = x.squeeze(-1).permute(0, 2, 1)
        return self.conv1d(x)


if __name__ == "__main__":

    model = DemoC2fNet()
    # model = rC2fModule()
    model.eval()
    fake_input = torch.randn((1, 512, 2, 81))

    oto = OTO(model=model, dummy_input=fake_input)
    oto.visualize_zigs(view=False)
    oto.random_set_zero_groups() # Randomly set a subset of ZIGs to be zero.
    oto.compress()
    import onnxruntime as ort
    full_ort_sess = ort.InferenceSession(oto.full_model_path)
    compress_ort_sess = ort.InferenceSession(oto.compressed_model_path)
    
    full_output = full_ort_sess.run(None, {'input.1': fake_input.numpy()})[0]
    compress_output = compress_ort_sess.run(None, {'input.1': fake_input.numpy()})[0]
    print(full_output - compress_output)

It passed on my end, where the maximum difference between full and compressed models is 1e-7.

@songkq I attached the full and compressed models during one sanity check at Baidu Pan.

Link: https://pan.baidu.com/s/15i-8p_8Ko2R6YGzeT5FGdw Extraction code: np46

My experiment setting is torch 1.13, onnx=1.12.

songkq commented

@tianyic Thanks.
Since normalization layers such as BN are used in the model, the bias of nn.Conv2d is always set to False. I'm wondering whether I can set the bias to True while freezing it at zero during training, for compatibility with OTO. Then, during inference, the bias of Conv2d can be merged with BN as usual.
Would this affect model accuracy when using the oto.dhspg optimizer?

@songkq

This is a great question. In the short term, setting bias to True should not deliver a worse result. Please see the explanations below.

From the view of optimization, if bias = 0 is indeed optimal, then making it trainable means it should converge to zero eventually during training. In other words, there would be no big difference between bias=False and bias=True if bias = 0 is optimal. But if the optimal bias does not equal 0, making it trainable could help achieve a more optimal solution.

On the other hand, Conv-BN fusion works for non-zero bias as well. Thus, it does not matter much whether the bias is True/False or fixed at zero during training. We have applied OTO to a lot of low-level and high-level vision models. The majority of them achieve performance competitive with the full models with significant FLOPs reduction, and a few of them even outperform the full versions.
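
For reference, here is a standard Conv-BN fusion sketch (independent of OTO) showing that a non-zero conv bias folds into the fused layer cleanly:

import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # W_fused = W * gamma / sqrt(var + eps); b_fused = (b - mean) * gamma / sqrt(var + eps) + beta
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias.data
    return fused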

The root cause of this issue is some tensor misalignment between the onnx file and the torch model, for which we will make a rigorous fix. Therefore, for the long term, please wait for our fix.

songkq commented

@tianyic Thanks. I will try it out with the bias = True in my case.
By the way, is there a fast way to estimate the maximum Params and FLOPs reduction (i.e., the maximum global group sparsity) for a model with a negligible accuracy drop?

@songkq

Great! For sure, please use the below commands.

full_flops = oto.compute_flops()
compressed_flops = oto.compute_flops(compressed=True) # call after compression, otherwise may raise error
full_num_params = oto.compute_num_params()
compressed_num_params = oto.compute_num_params(compressed=True) # call after compression, otherwise may raise error

print("Full FLOPs (M): {f_flops:.2f}. Compressed FLOPs (M): {c_flops:.2f}. Reduction Ratio: {f_ratio:.4f}"\
      .format(f_flops=full_flops, c_flops=compressed_flops, f_ratio=1 - compressed_flops/full_flops))
print("Full # Params: {f_params}. Compressed # Params: {c_params}. Reduction Ratio: {f_ratio:.4f}"\
      .format(f_params=full_num_params, c_params=compressed_num_params, f_ratio=1 - compressed_num_params/full_num_params))
songkq commented

@tianyic Thanks.
Actually, I meant: is there a deterministic way to quickly evaluate the maximum pruning ratio that can be set while keeping the pruned model's accuracy almost unaffected compared with before pruning?

songkq commented

@tianyic Hi, I found that bias=False is not the root cause of this issue. Maybe the torch version (torch=1.8.1) or the default opset version causes the problem. When I try bias=False with torch=1.11.0+cu113 and onnx=1.10.1, everything is OK.

I still suspect the transpose and reshape operations under different opset versions cause the problem. If possible, the opset version could be exposed as an optional configuration of OTO.

x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
x = x.squeeze(-1).permute(0, 2, 1)

torch = 1.11.0 with bias=False
[dependency graph image]

torch = 1.8.1 with bias=False
[dependency graph image]

However, when I check the _export_onnx_opset_version used in _optimize_trace, torch 1.11.0 and torch 1.8.1 have the same _export_onnx_opset_version. I'm confused about this ...

def _optimize_trace(graph, operator_export_type):
    from torch.onnx import utils
    return utils._optimize_graph(graph, operator_export_type)

# utils._optimize_graph
from torch.onnx.symbolic_helper import _onnx_shape_inference, _export_onnx_opset_version
torch._C._jit_pass_onnx_scalar_type_analysis(graph, True, _export_onnx_opset_version)
torch._C._jit_pass_onnx_peephole(graph, _export_onnx_opset_version, fixed_batch_size)
if _onnx_shape_inference:
        torch._C._jit_pass_onnx_graph_shape_type_inference(graph, params_dict, _export_onnx_opset_version)
# torch1.11.0
_default_onnx_opset_version = 9
_onnx_main_opset = 15
_onnx_stable_opsets = [7, 8, 9, 10, 11, 12, 13, 14]
_export_onnx_opset_version = _default_onnx_opset_version
_constant_folding_opset_versions = list(range(9, _onnx_main_opset + 1))
# torch1.8.1
_default_onnx_opset_version = 9
_onnx_main_opset = 13
_onnx_stable_opsets = [7, 8, 9, 10, 11, 12]
_export_onnx_opset_version = _default_onnx_opset_version
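
A standalone way to compare what the two torch versions emit is exporting directly with a pinned opset (a quick check one could run outside OTO; model and fake_input are from the test case above):

import torch
# Export the same model with explicitly pinned opsets to compare the traced graphs
torch.onnx.export(model, fake_input, "demo_opset9.onnx", opset_version=9)
torch.onnx.export(model, fake_input, "demo_opset11.onnx", opset_version=11)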

@songkq Thanks for your deep dive.

The root cause is the tensor misalignment, which seems to be caused by varying torch and onnx versions; I have pushed a quick fix to make the library more robust against varying versions. Since OTO touches a brand new autoML area, some necessary public APIs are lacking in torch and onnx, and I made up for them based on some logic, so corner cases may exist. But I believe it will become more reliable and robust as the whole community develops :)

Please try again after git pull with bias=False.

I will add opt_version in the next release.

BTW, the current version requires the end user to provide the target group sparsity level. We usually start with 70%, then go up to 90% or down to 50% depending on the performance that 70% group sparsity reaches. How to automatically select the target group sparsity level without sacrificing performance is left as future work.

songkq commented

@tianyic Thanks for the fix. However, it doesn't work with torch=1.8.1 and onnx=1.10.1. Maybe it's a bug in torch 1.8.1.

Despite the bug in torch 1.8.1, I've verified the effectiveness of OTO in my case with target_group_sparsity=0.1, where the pruned model has a negligible accuracy drop. Good job~
I will try to enlarge the target_group_sparsity with oto=2.0.10 and torch=1.11.0 later.

Great that it works for your case. I will update the readme regarding the torch dependencies.

songkq commented

@tianyic Hi, do you have a plan to introduce functionality to round the number of pruned channels to an expected multiple (32, 16, or 8; refer to https://github.com/VainF/Torch-Pruning)? If so, it would be very useful for speeding up inference of the pruned model on edge devices such as NPUs.
One more thing: you said "we usually start with 70%, then up and down to 90% or 50% depending on the performance that 70% group sparsity could reach." Does the 70% group sparsity mean target_group_sparsity=0.7 or 0.3?

@songkq You could set group_divisible in the dhspg optimizer to 8, 16, or 32 if you want the number of remaining channels per ZIG to be divisible by 8, 16, or 32.
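
For example (reusing the AdamW settings from earlier in the thread; group_divisible is the only new argument):

optimizer = oto.dhspg(
        variant="adamw",
        lr=1e-3,
        target_group_sparsity=0.7,
        start_pruning_steps=1000,
        group_divisible=8,  # keep the number of remaining channels per ZIG divisible by 8
    )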

Yes, I know and appreciate Torch-Pruning. You might notice that both frameworks currently have pros and cons. In short, Torch-Pruning can generate the pruned model in torch format, but it is still a multi-stage procedure. OTO generates a product-ready pruned model in onnx format from scratch in a one-shot manner. OTO is more like an end-to-end, automatic, general DNN training and compression framework.

Yes, I was referring to starting with a target group sparsity of 0.7.

songkq commented

@tianyic Thanks. I'll try group_divisible. Yeah, OTO is much more user-friendly than some other pruning tools. In my case, it seems that OTO performs better than Torch-Pruning in terms of the pruned model's accuracy drop.
Could you please provide a benchmark of the trade-off between target group sparsity and pruned model accuracy for various downstream tasks, such as the YOLO series? This work reaches a global maximum pruning ratio of 30%~40% for downstream models with negligible performance drop (https://github.com/HankYe/PAGCP). It seems that target_group_sparsity=0.7 is amazing for real-world tasks.

@songkq Thanks for the kind words. The accuracy preservation of OTO is due to our mathematical background, especially expertise in sparse optimization, which is the fundamental problem behind pruning tasks. One-shot methods will eventually become the main trend since, besides user-friendliness, they have more advantages and possibilities in mathematics that cannot easily be brought by multi-stage methods.

For benchmarking downstream tasks, our current bandwidth is limited, especially as we are focusing on the development of the next generation of OTO. Currently, maintaining the OTOv2 open-source library has reached our workload limit. Therefore, we may not be able to do it ourselves, perhaps not before the end of this year, but we are open to contributions from the community.

songkq commented

@tianyic Will the next generation of OTO support Transformer structure compression? Looking forward to it.

@songkq Thanks.

The next generation of OTO will be on another vertical. Vanilla support for transformers could be considered an extension within the current OTOv2, and a PR for it is actually ongoing. The key is to support the matmul operator. But we have not merged this PR yet since it has not rigorously considered the bias stored in the add operator.

Another reason we are not urgently pushing transformer support is that standard structured pruning more easily causes regression on transformers compared with CNNs. You might notice that some recent pruning works claim negligible performance regression on transformers, but they are typically unstructured pruning and therefore not useful in practice. We believe low-rank analysis should be leveraged in transformer pruning, hence we postpone transformer support, or more precisely matmul and add-bias support, until we have sufficient bandwidth to fundamentally solve that problem.

songkq commented

@tianyic Hi, I'm wondering whether I can recover the activation of ZIGs that have already been pruned. For example, target_group_sparsity was set to 0.7 during the first training, and now I want to relax the model's target_group_sparsity to 0.5 during fine-tuning.

songkq commented

@tianyic A RuntimeError occurred in optimizer.step() after optimizer.load_state_dict from a checkpoint for resuming. Could you please give some advice?

"only_train_once/optimizer/dhspg.py", line 102, in get_first_momentum_grad
buf.mul_(momentum).add_(grad, alpha=(1.0-dampening))
RuntimeError: The size of tensor a (32) must match the size of tensor b (3) at non-singleton dimension 3

@songkq That is a great point. Though we do not currently have that feature, it is definitely doable by modifying the optimizer.

Regarding the error when loading the optimizer's state_dict: this function is rarely used on our end, since OTO can be resumed via

oto = OTO(model=latest_model, dummy_input=dummy_input)

We will spend time testing and fixing it.

@tianyic Hi, thanks for the wonderful work. I just noticed that nn.Upsample is listed in the supported operator list, but when I try to use it, an Unknown op: resize message occurs, and then I hit an IndexError in graph.py, line 333 (pruned_onnx_param = numpy_param[:, incoming_cc.non_zero_group_idxes, …]) when calling oto.compress()

Do you have any idea about this problem, or any advice on which upsampling operation can be used? Thanks!

@fordevoted Thanks for reaching out. We have tried OTOv2 on a few UNets and super-resolution models whose architectures have upsamplers, and OTOv2 worked pretty well. My gut feeling is that something else caused the errors.

If possible, please share the model script and dummy input with me. I will take a look as bandwidth permits. The issue can typically be resolved by slightly changing the model architecture. If anything is confidential, please send it to my email address: tiachen@microsoft.com.

@tianyic Thanks for the information. I did further debugging and found the problem is similar to the above. After the modification, the issue is resolved. Thanks!