tianyic/only_train_once

oto.compress failed with "xs.append(param.data.view(cc.num_groups, -1))" in graphy.py

Closed this issue · 47 comments

songkq commented

@tianyic Hi, when I tried OTO with the following case, oto.compress failed. Could you please give some advice?

import torch
import torch.nn as nn
from only_train_once import OTO


class DemoNet(nn.Module):

    def __init__(self) -> None:
        super().__init__()

        
        self.fc = nn.Sequential(
            nn.Linear(1024, 512),
            nn.Linear(512, 256)
        )

    def forward(self, x):

        # x: [1, 512, 2, 81]
        x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
        x = x.squeeze(-1)
        return self.fc(x)

if __name__ == "__main__":
    
    model = DemoNet()
    model.eval()
    fake_input = torch.randn((1, 512, 2, 81))
    print(f"{model(fake_input).shape}")
    oto = OTO(model=model, dummy_input=fake_input)
    oto.compress()

Thanks for reaching out. I have taken a quick look. It seems that the lines below (which seem a bit unnecessary to me)

x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
x = x.squeeze(-1)

change the construction of the torch trace graph, in particular the stem vertex type, from linear or gemm to matmul. The matmul operator is not yet included in the supported operators, which causes the failure.

See the dependency graph below under a normal input for the linear layers, fake_input = torch.randn((1, 1024))

[dependency graph image]

def forward(self, x):
      return self.fc(x)

fake_input = torch.randn((1, 1024))
oto = OTO(model=model, dummy_input=fake_input)

versus the dependency graph under the example's numerous preprocessing operators

[dependency graph image]

def forward(self, x):
      x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
      x = x.squeeze(-1)
      return self.fc(x)

fake_input = torch.randn((1, 512, 2, 81))
oto = OTO(model=model, dummy_input=fake_input)
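
If those preprocessing steps are genuinely needed, one possible workaround (a sketch I have not run through OTO, so treat it as an assumption) is to flatten to 2-D right before the fc block, since a 2-D input typically keeps nn.Linear traced as gemm rather than matmul:

def forward(self, x):
    # x: [1, 512, 2, 81]
    x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
    x = x.squeeze(-1)                  # [1, 81, 1024]
    b, t, c = x.shape
    x = self.fc(x.reshape(b * t, c))   # 2-D input -> Linear traced as gemm
    return x.reshape(b, t, -1)         # [1, 81, 256]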

Please see the comments below regarding how to utilize OTO more properly.

  • Set ZIGs to zero before compression. To get a compressed model promptly, besides training the model via DHSPG to yield a highly group-sparse solution in terms of ZIGs, we can also randomly set a subset of ZIGs to zero and then call the compress API. For example,
import torch
import torch.nn as nn
from only_train_once import OTO

class DemoNet(nn.Module):

    def __init__(self) -> None:
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(1024, 512),
            nn.Linear(512, 256)
        )

    def forward(self, x):
        return self.fc(x)

if __name__ == "__main__":
    
    model = DemoNet()
    model.eval()
    fake_input = torch.randn((1, 1024))
    print(f"{model(fake_input).shape}")
    oto = OTO(model=model, dummy_input=fake_input)
    oto.visualize_zigs(view=False)
    oto.random_set_zero_groups() # Randomly set a subset of ZIGs to be zero.
    oto.compress()
  • Leverage Dependency Graph Visualization to visualize the dependency graphs, which usually reveals the root cause of potential failures promptly. See oto.visualize_zigs(view=False) above, which generates a $model_name.pdf.

  • Check whether the operators displayed in the dependency graph are supported by OTOv2.

Hope the above helps. Meanwhile, we are working on the next generation of the library and will keep adding more tutorials and documentation. Thanks for using our tool! Feel free to leave any other feedback.

songkq commented

@tianyic Thanks.
It seems that conv1d is a workable alternative. By the way, I'm wondering whether we can configure OTO with a blacklist, so that unsupported operators are automatically ignored and kept intact during pruning.
Also, I think adding functionality to round the number of pruned channels to an expected multiple (32, 16, or 8, for example) would be useful for deployment on edge devices such as NPUs.

[dependency graph image]

[dependency graph image]

class DemoNet(nn.Module):

    def __init__(self) -> None:
        super().__init__()

        self.conv1d = nn.Sequential(
            nn.Conv1d(1024, 512, 1, 1, 0, bias=True),
            nn.Conv1d(512, 256, 1, 1, 0, bias=True)
        )

    def forward(self, x):

        # x: [1, 512, 2, 81]
        x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
        x = x.squeeze(-1).permute(0, 2, 1)
        return self.conv1d(x)


if __name__ == "__main__":

    model = DemoNet()
    model.eval()
    fake_input = torch.randn((1, 512, 2, 81))
    print(f"{model(fake_input).shape}")
    oto = OTO(model=model, dummy_input=fake_input)
    oto.visualize_zigs(view=False)
    oto.random_set_zero_groups() # Randomly set a subset of ZIGs to be zero.
    oto.compress()
songkq commented

@tianyic How can I configure the parameters of oto.dhspg when using the AdamW optimizer?

Glad that you found alternative operators to make the library work. The blacklist is a good idea; we will consider it as bandwidth permits.

An official tutorial covering applications with Adam and AdamW will be provided in about 2-3 weeks. As a hotfix for your question, please try the optimizer setting below.

optimizer = oto.dhspg(
        variant="adamw",
        lr=1e-3, # set same as the baseline training
        weight_decay=1e-2,  # set same as the baseline training
        first_momentum=0.9, # set same as the baseline training
        second_momentum=0.999, # set same as the baseline training
        dampening=0.0, # set same as the baseline training
        target_group_sparsity=0.8,  # choose based on how much you want to compress
        start_pruning_steps=X * len(trainloader), # start pruning after X epochs; starting at about 1/5 of the total epochs is typically fine
        lmbda=1e-2, # larger values promote group sparsity more effectively
        lmbda_amplify=20, # larger values promote group sparsity more effectively
        hat_lmbda_coeff=1e3, # larger values promote group sparsity more effectively
        epsilon=0.0  # enlarge it if group sparsity does not meet target_group_sparsity
    )
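
As a rough sketch of how this optimizer slots into a standard training loop (model, trainloader, criterion, and num_epochs are placeholders from your side, not part of OTO):

for epoch in range(num_epochs):
    for inputs, targets in trainloader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()  # DHSPG handles both the baseline update and the pruning projection internally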
songkq commented

@tianyic Thanks.
When I execute step 2 of the pipeline, will the group sparsity learned in step 1 be reset from scratch?
1. oto training -> save model & optimizer checkpoint -> stop training
2. load checkpoint -> resume oto training -> oto.compress

Also, I'm wondering whether I can export the pruned ONNX model through the pipeline:
1. oto training -> save model & optimizer checkpoint -> stop training
2. load checkpoint -> oto.compress

Before reaching start_pruning_steps, what are the differences between using the oto.dhspg optimizer and the original torch AdamW optimizer? How does start_pruning_steps affect the accuracy of the pruned model?
Which parameter of oto.dhspg is dominant for the accuracy of the pruned model?

Both pipelines are supported. For the first pipeline, to preserve the learned group sparsity, you need to set the dhspg optimizer argument fixed_zero_groups=True and then resume OTO training.
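
A rough sketch of the first pipeline (the file names are placeholders; fixed_zero_groups=True is the OTO-specific piece):

# Run 1: DHSPG training, then save before stopping.
torch.save(model.state_dict(), "model_ckpt.pt")

# Run 2: reload the model, rebuild OTO, and resume with the zero groups preserved.
model.load_state_dict(torch.load("model_ckpt.pt"))
oto = OTO(model=model, dummy_input=fake_input)
optimizer = oto.dhspg(
        variant="adamw",
        lr=1e-3,
        target_group_sparsity=0.7,
        fixed_zero_groups=True,  # keep the groups already driven to zero in run 1
    )
# ... resume training, then oto.compress()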

One more note, in case you run into it: during pruning, while the group sparsity is increasing, the loss may regress a bit depending on the application. If so, don't worry; once the group sparsity reaches the target, the loss will decrease again until final convergence.

This is a good question regarding start_pruning_steps, for which we will provide detailed explanations of DHSPG, maybe a video tutorial.

In short, DHSPG is a hybrid optimizer. It applies the baseline optimizer over all variables before pruning starts, and over the variables considered potentially important during pruning. For the variables considered possibly redundant, a step called the Half-Space step is performed to project them onto zero. Once the group sparsity reaches the target, the optimizer performs as the baseline optimizer until final convergence.
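
A highly simplified conceptual sketch of one DHSPG step (this is not the library's implementation; partition_groups and the two update helpers are purely illustrative):

def dhspg_step(params, step, start_pruning_steps, target_reached):
    # Before pruning starts, or after the target group sparsity is reached,
    # DHSPG behaves like the baseline optimizer (e.g., AdamW).
    if step < start_pruning_steps or target_reached:
        baseline_update(params)
    else:
        # During pruning: split trainable groups (ZIGs) into important vs. maybe-redundant.
        important, redundant = partition_groups(params)
        baseline_update(important)       # important groups keep the baseline update
        half_space_update(redundant)     # Half-Space step projects redundant groups toward zero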

The final accuracy typically depends on 1. what the baseline model can achieve, 2. whether enough warm-up steps are given, and 3. whether sufficiently many steps are given after reaching the target group sparsity.

More documentation and tutorials will be provided with more detailed instructions.

songkq commented

@tianyic Thanks. Looking forward to the tutorials.
It seems that the DHSPG optimizer is slower than the torch AdamW optimizer. Could you please give some advice on speeding up the optimizer?

A good question.

The DHSPG optimizer is a hybrid optimizer which indeed has some computational overhead during pruning (while the group sparsity is increasing). The overhead varies by model and dataset. For the majority of models it is negligible, but for some it is not (the worst case I have met doubled the cost). Note, however, that the overhead is temporary and disappears once the group sparsity reaches the target value (afterwards DHSPG performs the same as the baseline optimizer).

Therefore, to speed up, I would suggest shortening the pruning procedure, i.e., making the group sparsity increase faster to reach the target value, which can typically be achieved by fine-tuning the hyperparameters related to group sparsity exploration. In fact, in most of the experiments I conducted, the pruning stage could be shrunk into just a few epochs, which largely mitigates the overhead.
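
For instance (assuming the default knobs are too conservative for your model and data), a shorter pruning stage usually comes from starting pruning early and raising the sparsity-exploration hyperparameters already mentioned above:

optimizer = oto.dhspg(
        variant="adamw",
        lr=1e-3,
        target_group_sparsity=0.7,
        start_pruning_steps=1 * len(trainloader),  # start pruning after the first epoch
        lmbda=1e-2,
        lmbda_amplify=20,
        hat_lmbda_coeff=1e3,  # larger values drive group sparsity to the target faster
    )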
Meanwhile, there might be some engineering tricks in the official torch version that could be leveraged to further speed up DHSPG.

Hope the above helps.

songkq commented

@tianyic Hi, when I apply OTO to the C2f module used in YOLOv8, it fails with an error about the slice and concat operations. How can I solve the problem?

Traceback (most recent call last):
  File "test_oto_c2f.py", line 117, in <module>
    oto = OTO(model=model, dummy_input=fake_input)
  File "/root/miniconda3/lib/python3.8/site-packages/only_train_once/__init__.py", line 17, in __init__
    self.partition_zigs()
  File "/root/miniconda3/lib/python3.8/site-packages/only_train_once/__init__.py", line 28, in partition_zigs
    self._graph = automated_partition_zigs(self._graph)
  File "/root/miniconda3/lib/python3.8/site-packages/only_train_once/zig/zig.py", line 125, in automated_partition_zigs
    graph.set_zigs(opt)
  File "/root/miniconda3/lib/python3.8/site-packages/only_train_once/graph/graph.py", line 417, in set_zigs
    dfs_helper(self, auxilary_cc, auxilary_cc.dependent_stem_ccs)
  File "/root/miniconda3/lib/python3.8/site-packages/only_train_once/graph/graph.py", line 410, in dfs_helper
    node_in = graph.nodes[node_in_id]
KeyError: 'out-28'
[debug] concat_node.inputs = ['out-28', 'out-29', 'out-35']
[debug] graph.nodes = dict_keys(['out-25', 'out-26', 'out-27', 'out-28-29', 'out-30', 'out-31', 'out-32', 'out-33', 'out-34', 'out-35', 'out-36', 'out-37', 'out-38', 'out-39'])
from typing import Callable
import torch
import torch.nn as nn
from functools import partial
from only_train_once import OTO

def autopad(k, p=None, d=1):  # kernel, padding, dilation
    # Pad to 'same' shape outputs
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p

class Conv(nn.Module):
    # Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)
    default_act = nn.LeakyReLU(inplace=True, negative_slope=0.1)  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        return self.act(self.conv(x))

class Bottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, k=(3, 3), e=0.5):  # ch_in, ch_out, shortcut, groups, kernels, expand
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, k[0], 1)
        self.cv2 = Conv(c_, c2, k[1], 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))


class C2f(nn.Module):
    # CSP Bottleneck with 2 convolutions
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        self.c = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, 2 * self.c, 1, 1)
        self.cv2 = Conv((2 + n) * self.c, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))

    def forward(self, x):
        # slice
        y = list(self.cv1(x).chunk(2, 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

    def forward_split(self, x):
        y = list(self.cv1(x).split((self.c, self.c), 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

class C2fModule(nn.Module):

    def __init__(self, c1=512, c2=256):
        super().__init__()
        self.c2f = C2f(c1, c2, n=1, shortcut=False, g=1, e=0.5)

    def forward(self, x):
        return self.c2f(x)



if __name__ == "__main__":

    model = C2fModule()
    model.eval()
    fake_input = torch.randn((1, 512, 4, 80))
    print(f"{model(fake_input).shape}")
    oto = OTO(model=model, dummy_input=fake_input)
    oto.visualize_zigs(view=False)
    oto.random_set_zero_groups() # Randomly set a subset of ZIGs to be zero.
    oto.compress()

Thanks for the above example @songkq. Will take a look during the week and provide guidance later.

Thanks for the example @songkq. I have taken a quick look. We will support the slice operator better in a future release.

For a hotfix, please see the alternative below that avoids slice, where I decompose the conv whose output is sliced into two separate convs.

import torch
import torch.nn as nn
from only_train_once import OTO

def autopad(k, p=None, d=1):  # kernel, padding, dilation
    # Pad to 'same' shape outputs
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p

class Conv(nn.Module):
    # Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)
    default_act = nn.LeakyReLU(inplace=True, negative_slope=0.1)  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=True)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        return self.act(self.conv(x))

class Bottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, k=(3, 3), e=0.5):  # ch_in, ch_out, shortcut, groups, kernels, expand
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, k[0], 1)
        self.cv2 = Conv(c_, c2, k[1], 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))


class C2f(nn.Module):
    # CSP Bottleneck with 2 convolutions
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        self.c = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, self.c, 1, 1)
        self.cv2 = Conv(c1, self.c, 1, 1)
        self.cv3 = Conv((2 + n) * self.c, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))

    def forward(self, x):
        y = [self.cv1(x), self.cv2(x)]
        y.extend(m(y[-1]) for m in self.m)
        return self.cv3(torch.cat(y, 1))

    def forward_split(self, x):
        y = list(self.cv1(x).split((self.c, self.c), 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

class C2fModule(nn.Module):
    def __init__(self, c1=512, c2=256):
        super().__init__()
        self.c2f = C2f(c1, c2, n=1, shortcut=False, g=1, e=0.5)

    def forward(self, x):
        return self.c2f(x)

if __name__ == "__main__":

    model = C2fModule()
    model.eval()
    fake_input = torch.randn((1, 512, 4, 80))
    print(f"{model(fake_input).shape}")
    oto = OTO(model=model, dummy_input=fake_input)
    oto.visualize_zigs(view=False)
    oto.random_set_zero_groups() # Randomly set a subset of ZIGs to be zero.
    oto.compress()

    import onnxruntime as ort
    full_ort_sess = ort.InferenceSession(oto.full_model_path)
    compress_ort_sess = ort.InferenceSession(oto.compressed_model_path)
    
    full_output = full_ort_sess.run(None, {'input.1': fake_input.numpy()})[0]
    compress_output = compress_ort_sess.run(None, {'input.1': fake_input.numpy()})[0]
    print("Output difference:")
    print(full_output - compress_output)

The full and compressed models yield the same outputs. Hope the above helps.

songkq commented

@tianyic Thanks. I will try it out.
I met another problem: with group_sparsity, omega = optimizer.compute_group_sparsity_omega(), the returned group_sparsity stays zero throughout training, even after reaching the configured start_pruning_steps. I set up the oto.dhspg optimizer as follows. I'm confused about why OTO didn't take effect.

target_group_sparsity: 0.1
start_pruning_steps: 1000
hat_lmbda_coeff: 10.0
lmbda: 0.001
lmbda_amplify: 2.0

optimizer = oto.dhspg(
        variant="adamw",
        lr=1e-3,
        weight_decay=1e-2,  
        first_momentum=0.9, 
        second_momentum=0.999, 
        dampening=0.0,
        target_group_sparsity=0.1,  
        start_pruning_steps=1000, 
        lmbda=1e-3, 
        lmbda_amplify=2.0, 
        hat_lmbda_coeff=10,
        epsilon=0.95
    )

A good question @songkq. It is largely due to the hyperparameter settings. AdamW and SGD typically require different settings for the lambda-related (group sparsity exploration) hyperparameters because of their different gradient estimation mechanisms. Please give the setting below a try. We will cover it in the coming tutorials.

Meanwhile, we have an ongoing plan to further optimize and simplify the hyperparameter list to bring more convenience for users, including ourselves (since we are actively applying OTO to a lot of DNN application-track research and products).

optimizer = oto.dhspg(
        variant="adamw",
        lr=1e-3, 
        weight_decay=1e-2,  
        first_momentum=0.9, 
        second_momentum=0.999,
        dampening=0.0, 
        target_group_sparsity=0.1,  
        start_pruning_steps=1000, 
        lmbda=1e-2, # larger values promote group sparsity more effectively
        lmbda_amplify=20, # larger values promote group sparsity more effectively
        hat_lmbda_coeff=1e3, # larger values promote group sparsity more effectively
        epsilon=0.95  # larger values promote group sparsity more effectively
    )

I updated the repo to auto-select hyperparameters for different variants. You can now just set up the optimizer as

optimizer = oto.dhspg(
        variant="adamw",
        lr=1e-3, 
        target_group_sparsity=0.1,  
        start_pruning_steps=1000, 
    )

which should work for the majority of experiments.
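
To verify that pruning is actually progressing, the group sparsity can be monitored during training, e.g. (trainloader, model, and criterion are placeholders):

for step, (inputs, targets) in enumerate(trainloader):
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        # group_sparsity should start rising after start_pruning_steps
        group_sparsity, omega = optimizer.compute_group_sparsity_omega()
        print(f"step {step}: loss={loss.item():.4f}, group_sparsity={group_sparsity:.4f}")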

songkq commented

@tianyic Good job. Thanks!

songkq commented

@tianyic Hi, I have attempted to prune my network with a target_group_sparsity of 0.1/0.35/0.5. However, I found that only the last two layers, self.conv1d, were pruned, while the cnn_backbone was not pruned at all. I'm confused about why OTO cannot globally prune the network.

class DemoNet(nn.Module):

    def __init__(self) -> None:
        super().__init__()

        self.cnn_backbone = ...
        self.conv1d = nn.Sequential(
            nn.Conv1d(1024, 512, 1, 1, 0, bias=True),
            nn.Conv1d(512, 256, 1, 1, 0, bias=True)
        )

    def forward(self, x):

        x = self.cnn_backbone(x)
        # x: [1, 512, 2, 81]
        x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
        x = x.squeeze(-1).permute(0, 2, 1)
        return self.conv1d(x)

@songkq Could you please share the dependency graph with me? I can then take a quick look.

You are right that OTO globally prunes the whole network. My gut feeling is that your issue can typically be resolved via minor adjustments either to the network arch or to the operator list.

Just in case the dependency graph is confidential, you could send it via email to Tianyi.Chen@microsoft.com

Meanwhile, I would recommend proceeding with a sanity check before engaging in DHSPG training @songkq. The sanity check randomly sets a set of ZIGs to zero, and a compressed model is generated afterwards. If the compressed model looks normal and returns the exact same output as the full model given the same random input, the sanity check passes. Afterwards, DHSPG is triggered to train and identify redundant groups from the view of optimization rather than random selection.

oto.random_set_zero_groups() # Randomly set a subset of ZIGs to be zero.
oto.compress()

import onnxruntime as ort
full_ort_sess = ort.InferenceSession(oto.full_model_path)
compress_ort_sess = ort.InferenceSession(oto.compressed_model_path)
    
full_output = full_ort_sess.run(None, {'input.1': fake_input.numpy()})[0]
compress_output = compress_ort_sess.run(None, {'input.1': fake_input.numpy()})[0]
print("Output difference:")
print(full_output - compress_output) # Should be essentially all zeros.

@songkq Please take a look at this newly raised issue, which I suspect might be a similar situation to yours. If so, please let me know whether your onnx version is also 1.14. Thanks.

#13

songkq commented

@tianyic Thanks.
I have done the sanity check. It shows exactly that only the last two layers were pruned by oto.random_set_zero_groups() and oto.compress(). The maximum difference between full_output and compress_output is about 4.4703484e-08. I'm wondering whether the reshape and transpose operations cause the problem.

x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
x = x.squeeze(-1).permute(0, 2, 1)

testcase:

def autopad(k, p=None, d=1):  # kernel, padding, dilation
    # Pad to 'same' shape outputs
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p

class Conv(nn.Module):
    # Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)
    default_act = nn.LeakyReLU(inplace=True, negative_slope=0.1)  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        return self.act(self.conv(x))

class Bottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, k=(3, 3), e=0.5):  # ch_in, ch_out, shortcut, groups, kernels, expand
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, k[0], 1)
        self.cv2 = Conv(c_, c2, k[1], 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))


class C2f_rep(nn.Module):
    # CSP Bottleneck with 2 convolutions
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()

        self.kwargs = {"c1": c1, "c2": c2, "n": n, "shortcut": shortcut, "g": g, "e": e}

        self.c = int(c2 * e)  # hidden channels
        self.cv0 = Conv(c1, self.c, 1, 1)
        self.cv1 = Conv(c1, self.c, 1, 1)
        self.cv2 = Conv((2 + n) * self.c, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))

    def forward(self, x):
        # slice
        # y = list(self.cv1(x).chunk(2, 1))
        y = [self.cv0(x), self.cv1(x)]
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

    def forward_split(self, x):
        y = list(self.cv1(x).split((self.c, self.c), 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

class rC2fModule(nn.Module):

    def __init__(self, c1=512, c2=256):
        super().__init__()

        self.c2f = C2f_rep(c1, c2, n=1, shortcut=False, g=1, e=0.5)

    def forward(self, x):

        return self.c2f(x)

class DemoC2fNet(nn.Module):

    def __init__(self) -> None:
        super().__init__()
        
        self.c2f = rC2fModule(c1=512, c2=512)
        self.conv1d = nn.Sequential(
            nn.Conv1d(1024, 512, 1, 1, 0, bias=True),
            nn.Conv1d(512, 256, 1, 1, 0, bias=True)
        )

    def forward(self, x):

        x = self.c2f(x)
        # x: [1, 512, 2, 81]
        x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
        x = x.squeeze(-1).permute(0, 2, 1)
        return self.conv1d(x)


if __name__ == "__main__":

    model = DemoC2fNet()
    model.eval()
    fake_input = torch.randn((1, 512, 2, 81))

    oto = OTO(model=model, dummy_input=fake_input)
    # oto.visualize_zigs(view=False)
    oto.random_set_zero_groups() # Randomly set a subset of ZIGs to be zero.
    oto.compress()

    exit()

My envs:

torch == 1.8.1
onnx == 1.10.1

Thanks for sharing @songkq. Will take a look this week; quite occupied early in the week.

@songkq Thanks for the example. I took a quick look at it. There exist some tensor alignment issues due to discrepancies among different dependency versions. For your case, could you please try setting bias=True? We will proceed with more rigorous improvements to make the tensor alignment more robust against varying dependencies. For more reliable use of OTO, I suggest setting bias=True for layers, and setting affine=True for normalization layers such as BN.

self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=True)

Then run the sanity check again

def autopad(k, p=None, d=1):  # kernel, padding, dilation
    # Pad to 'same' shape outputs
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p

class Conv(nn.Module):
    # Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)
    default_act = nn.LeakyReLU(inplace=True, negative_slope=0.1)  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=True)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        return self.act(self.conv(x))

class Bottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, k=(3, 3), e=0.5):  # ch_in, ch_out, shortcut, groups, kernels, expand
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, k[0], 1)
        self.cv2 = Conv(c_, c2, k[1], 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))


class C2f_rep(nn.Module):
    # CSP Bottleneck with 2 convolutions
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()

        self.kwargs = {"c1": c1, "c2": c2, "n": n, "shortcut": shortcut, "g": g, "e": e}

        self.c = int(c2 * e)  # hidden channels
        self.cv0 = Conv(c1, self.c, 1, 1)
        self.cv1 = Conv(c1, self.c, 1, 1)
        self.cv2 = Conv((2 + n) * self.c, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))

    def forward(self, x):
        # slice
        # y = list(self.cv1(x).chunk(2, 1))
        y = [self.cv0(x), self.cv1(x)]
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

    def forward_split(self, x):
        y = list(self.cv1(x).split((self.c, self.c), 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

class rC2fModule(nn.Module):

    def __init__(self, c1=512, c2=256):
        super().__init__()

        self.c2f = C2f_rep(c1, c2, n=1, shortcut=False, g=1, e=0.5)

    def forward(self, x):

        return self.c2f(x)

class DemoC2fNet(nn.Module):

    def __init__(self) -> None:
        super().__init__()
        
        self.c2f = rC2fModule(c1=512, c2=512)
        self.conv1d = nn.Sequential(
            nn.Conv1d(1024, 512, 1, 1, 0, bias=True),
            nn.Conv1d(512, 256, 1, 1, 0, bias=True)
        )

    def forward(self, x):

        x = self.c2f(x)

        # # x: [1, 512, 2, 81]
        x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
        x = x.squeeze(-1).permute(0, 2, 1)
        return self.conv1d(x)


if __name__ == "__main__":

    model = DemoC2fNet()
    # model = rC2fModule()
    model.eval()
    fake_input = torch.randn((1, 512, 2, 81))

    oto = OTO(model=model, dummy_input=fake_input)
    oto.visualize_zigs(view=False)
    oto.random_set_zero_groups() # Randomly set a subset of ZIGs to be zero.
    oto.compress()
    import onnxruntime as ort
    full_ort_sess = ort.InferenceSession(oto.full_model_path)
    compress_ort_sess = ort.InferenceSession(oto.compressed_model_path)
    
    full_output = full_ort_sess.run(None, {'input.1': fake_input.numpy()})[0]
    compress_output = compress_ort_sess.run(None, {'input.1': fake_input.numpy()})[0]
    print(full_output - compress_output)

It passed on my end, where the maximum difference between full and compressed models is 1e-7.

@songkq I attached the full and compressed models during one sanity check at Baidu Pan.

Link: https://pan.baidu.com/s/15i-8p_8Ko2R6YGzeT5FGdw Extraction code: np46

My experiment setting is torch 1.13, onnx=1.12.

songkq commented

@tianyic Thanks.
Since normalization layers such as BN are used in the model, the bias of nn.Conv2d is always set to False. I'm wondering whether I can set the bias to True while freezing it at zero during training, for compatibility with OTO. Then, during inference, the bias of Conv2d can be merged with BN as usual.
Would this affect model accuracy when using the oto.dhspg optimizer?

@songkq

This is a great question. In the short term, setting bias to True should not deliver a worse result. Please see the explanations below.

From the view of optimization, if bias = 0 is indeed optimal, then making it trainable means it should converge to zero eventually during training. In other words, there would be no big difference between bias=False and bias=True if bias = 0 is optimal. But if the optimal bias does not equal 0, making it trainable could help achieve a more optimal solution.

On the other hand, Conv-BN fusion works for non-zero bias as well. Thus, it does not matter much whether the bias is True/False or fixed at zero during training. We have applied OTO to a lot of low-level and high-level vision models. The majority of them achieve performance competitive with the full models with significant FLOPs reduction, and a few of them even outperform the full versions.
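
For reference, here is a standard Conv-BN fusion sketch (independent of OTO) showing that a non-zero conv bias folds into the fused layer cleanly:

import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # W_fused = W * gamma / sqrt(var + eps); b_fused = (b - mean) * gamma / sqrt(var + eps) + beta
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias.data
    return fused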

The root cause of this issue is some tensor misalignment between the onnx file and the torch model, for which we will make a rigorous fix. Therefore, for the long term, please wait for our fix.

songkq commented

@tianyic Thanks. I will try it out with the bias = True in my case.
By the way, is there a fast way to estimate the maximum Params and FLOPs reduction (i.e., the maximum global group sparsity) for a model with a negligible accuracy drop?

@songkq

Great! For sure, please use the below commands.

full_flops = oto.compute_flops()
compressed_flops = oto.compute_flops(compressed=True) # call after compression, otherwise may raise error
full_num_params = oto.compute_num_params()
compressed_num_params = oto.compute_num_params(compressed=True) # call after compression, otherwise may raise error

print("Full FLOPs (M): {f_flops:.2f}. Compressed FLOPs (M): {c_flops:.2f}. Reduction Ratio: {f_ratio:.4f}"\
      .format(f_flops=full_flops, c_flops=compressed_flops, f_ratio=1 - compressed_flops/full_flops))
print("Full # Params: {f_params}. Compressed # Params: {c_params}. Reduction Ratio: {f_ratio:.4f}"\
      .format(f_params=full_num_params, c_params=compressed_num_params, f_ratio=1 - compressed_num_params/full_num_params))
songkq commented

@tianyic Thanks.
Actually, I meant: is there a deterministic way to quickly evaluate the maximum pruning ratio that can be set while keeping the pruned model's accuracy almost unaffected compared with before pruning?

songkq commented

@tianyic Hi, I found that bias=False is not the root cause of this issue. Maybe the torch version (torch=1.8.1) or the default opset version causes the problem. When I try bias=False with torch=1.11.0+cu113 and onnx=1.10.1, everything is OK.

I still suspect the transpose and reshape operations under different opset versions cause the problem. If possible, the opset version could be exposed as an optional configuration of OTO.

x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
x = x.squeeze(-1).permute(0, 2, 1)

torch = 1.11.0 with bias=False
[dependency graph image]

torch = 1.8.1 with bias=False
[dependency graph image]

However, when I check the _export_onnx_opset_version used in _optimize_trace, torch 1.11.0 and torch 1.8.1 have the same _export_onnx_opset_version. I'm confused about this ...

def _optimize_trace(graph, operator_export_type):
    from torch.onnx import utils
    return utils._optimize_graph(graph, operator_export_type)

# utils._optimize_graph
from torch.onnx.symbolic_helper import _onnx_shape_inference, _export_onnx_opset_version
torch._C._jit_pass_onnx_scalar_type_analysis(graph, True, _export_onnx_opset_version)
torch._C._jit_pass_onnx_peephole(graph, _export_onnx_opset_version, fixed_batch_size)
if _onnx_shape_inference:
        torch._C._jit_pass_onnx_graph_shape_type_inference(graph, params_dict, _export_onnx_opset_version)
# torch1.11.0
_default_onnx_opset_version = 9
_onnx_main_opset = 15
_onnx_stable_opsets = [7, 8, 9, 10, 11, 12, 13, 14]
_export_onnx_opset_version = _default_onnx_opset_version
_constant_folding_opset_versions = list(range(9, _onnx_main_opset + 1))
# torch1.8.1
_default_onnx_opset_version = 9
_onnx_main_opset = 13
_onnx_stable_opsets = [7, 8, 9, 10, 11, 12]
_export_onnx_opset_version = _default_onnx_opset_version
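
A standalone way to compare what the two torch versions emit is exporting directly with a pinned opset (a quick check one could run outside OTO; model and fake_input are from the test case above):

import torch
# Export the same model with explicitly pinned opsets to compare the traced graphs
torch.onnx.export(model, fake_input, "demo_opset9.onnx", opset_version=9)
torch.onnx.export(model, fake_input, "demo_opset11.onnx", opset_version=11)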

@songkq Thanks for your deep dive.

The root cause is the tensor misalignment, which seems to be caused by varying torch and onnx versions; I have pushed a quick fix to make the library more robust against varying versions. Since OTO touches a brand new autoML area, some necessary public APIs are lacking in torch and onnx, and I made up for them based on some logic, so corner cases may exist. But I believe it will become more reliable and robust as the whole community develops :)

Please try again after git pull with bias=False.

I will add opt_version in the next release.

BTW, the current version requires the end user to provide the target group sparsity level. We usually start with 70%, then go up to 90% or down to 50% depending on the performance that 70% group sparsity reaches. How to automatically select the target group sparsity level without sacrificing performance is left as future work.

songkq commented

@tianyic Thanks for the fix. However, it doesn't work with torch=1.8.1 and onnx=1.10.1. Maybe it's a bug in torch 1.8.1.

Despite the bug in torch 1.8.1, I've verified the effectiveness of OTO in my case with target_group_sparsity=0.1, where the pruned model has a negligible accuracy drop. Good job~
I will try to enlarge the target_group_sparsity with oto=2.0.10 and torch=1.11.0 later.

Great that it works for your case. I will update the readme regarding the torch dependencies.

songkq commented

@tianyic Hi, do you have a plan to introduce functionality to round the number of pruned channels to an expected multiple (32, 16, or 8; refer to https://github.com/VainF/Torch-Pruning)? If so, it would be very useful for speeding up inference of the pruned model on edge devices such as NPUs.
One more thing: you said "we usually start with 70%, then up and down to 90% or 50% depending on the performance that 70% group sparsity could reach." Does the 70% group sparsity mean target_group_sparsity=0.7 or 0.3?

@songkq You could set group_divisible in the dhspg optimizer to 8, 16, or 32 if you want the number of remaining channels per ZIG to be divisible by 8, 16, or 32.
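
For example (reusing the AdamW settings from earlier in the thread; group_divisible is the only new argument):

optimizer = oto.dhspg(
        variant="adamw",
        lr=1e-3,
        target_group_sparsity=0.7,
        start_pruning_steps=1000,
        group_divisible=8,  # keep the number of remaining channels per ZIG divisible by 8
    )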

Yes, I know and appreciate Torch-Pruning. You might notice that both frameworks currently have pros and cons. In short, Torch-Pruning can generate the pruned model in torch format, but it is still a multi-stage procedure. OTO generates a product-ready pruned model in onnx format from scratch in a one-shot manner. OTO is more like an end-to-end, automatic, general DNN training and compression framework.

Yes, I was referring to starting with a target group sparsity of 0.7.

songkq commented

@tianyic Thanks. I'll try group_divisible. Yeah, OTO is much more user-friendly than some other pruning tools. In my case, it seems that OTO performs better than Torch-Pruning in terms of the pruned model's accuracy drop.
Could you please provide a benchmark of the trade-off between target group sparsity and pruned model accuracy for various downstream tasks, such as the YOLO series? This work reaches a global maximum pruning ratio of 30%~40% for downstream models with negligible performance drop (https://github.com/HankYe/PAGCP). It seems that target_group_sparsity=0.7 is amazing for real-world tasks.

@songkq Thanks for the kind words. The accuracy preservation of OTO is due to our mathematical background, especially expertise in sparse optimization, which is the fundamental problem behind pruning tasks. One-shot methods will eventually become the main trend since, besides user-friendliness, they have more advantages and possibilities in mathematics that cannot easily be brought by multi-stage methods.

For benchmarking downstream tasks, our current bandwidth is limited, especially as we are focusing on the development of the next generation of OTO. Currently, maintaining the OTOv2 open-source library has reached our workload limit. Therefore, we may not be able to do it ourselves, perhaps not before the end of this year, but we are open to contributions from the community.

songkq commented

@tianyic Will the next generation of OTO support Transformer structure compression? Looking forward to it.

@songkq Thanks.

The next generation of OTO will be on another vertical. Vanilla support for transformers could be considered an extension within the current OTOv2, and a PR for it is actually ongoing. The key is to support the matmul operator. But we have not merged this PR yet since it has not rigorously considered the bias stored in the add operator.

Another reason we are not urgently pushing transformer support is that standard structured pruning more easily causes regression on transformers compared with CNNs. You might notice that some recent pruning works claim negligible performance regression on transformers, but they are typically unstructured pruning and therefore not useful in practice. We believe low-rank analysis should be leveraged in transformer pruning, hence we postpone transformer support, or more precisely matmul and add-bias support, until we have sufficient bandwidth to fundamentally solve that problem.

songkq commented

@tianyic Hi, I'm wondering whether I can recover the activation of ZIGs that have already been pruned. For example, target_group_sparsity was set to 0.7 during the first training, and now I want to relax the model's target_group_sparsity to 0.5 during fine-tuning.

songkq commented

@tianyic A RuntimeError occurred in optimizer.step() after optimizer.load_state_dict from a checkpoint for resuming. Could you please give some advice?

"only_train_once/optimizer/dhspg.py", line 102, in get_first_momentum_grad
buf.mul_(momentum).add_(grad, alpha=(1.0-dampening))
RuntimeError: The size of tensor a (32) must match the size of tensor b (3) at non-singleton dimension 3

@songkq That is a great point. Though we do not currently have that feature, it is definitely doable by modifying the optimizer.

Regarding the error when loading the optimizer's state_dict: this function is rarely used on our end, since OTO can be resumed via

oto = OTO(model=latest_model, dummy_input=dummy_input)

We will spend time testing and fixing it.

@tianyic Hi, thanks for the wonderful work. I just noticed that nn.Upsample is listed in the supported operator list, but when I try to use it, an Unknown op: resize message occurs, and then I hit an IndexError in graph.py, line 333 (pruned_onnx_param = numpy_param[:, incoming_cc.non_zero_group_idxes, …]) when calling oto.compress()

Do you have any idea about this problem, or any advice on which upsampling operation can be used? Thanks!

@fordevoted Thanks for reaching out. We have tried OTOv2 on a few UNets and super-resolution models whose architectures have upsamplers, and OTOv2 worked pretty well. My gut feeling is that something else caused the errors.

If possible, please share the model script and dummy input with me. I will take a look as bandwidth permits. The issue can typically be resolved by slightly changing the model architecture. If anything is confidential, please send it to my email address: tiachen@microsoft.com.

@tianyic Thanks for the information. I did further debugging and found the problem is similar to the above. After the modification, the issue is resolved. Thanks!