SeanChenxy/Mimic3D

A question regarding the implementation of 3D-aware convolution

Hi! Thanks for your inspiring work.
I have a small question related to the implementation of 3D-aware convolution.

Given your implementation in networks_stylegan2.py, i.e.

def aware3d(x):
    # x is either a list of three plane features [x_xy, x_yz, x_zx]
    # or a single tensor with the three planes stacked along the batch dimension.
    if isinstance(x, list):
        x_xy, x_yz, x_zx = x
        B, _, H, W = x_xy.shape
        B *= 3
    else:
        x_ = x.view(-1, 3, x.shape[1], x.shape[2], x.shape[3])
        x_xy, x_yz, x_zx = x_[:, 0], x_[:, 1], x_[:, 2]
        B, _, H, W = x.shape
    # Transpose the two spatial dimensions of each plane.
    x_zy = x_yz.permute(0, 1, 3, 2)
    x_xz = x_zx.permute(0, 1, 3, 2)
    x_yx = x_xy.permute(0, 1, 3, 2)

    # For each plane, average each of the other two (transposed) planes over one
    # spatial dimension, tile the result to the plane's resolution, and concatenate
    # along the channel dimension.
    x_zy_pz = x_zy.mean(dim=-1, keepdim=True).repeat(1, 1, 1, x_xy.shape[-1])
    x_xz_pz = x_xz.mean(dim=-2, keepdim=True).repeat(1, 1, x_xy.shape[-2], 1)
    x_xy_ = torch.cat([x_xy, x_zy_pz, x_xz_pz], 1)

    x_yx_px = x_yx.mean(dim=-2, keepdim=True).repeat(1, 1, x_yz.shape[-2], 1)
    x_xz_px = x_xz.mean(dim=-1, keepdim=True).repeat(1, 1, 1, x_yz.shape[-1])
    x_yz_ = torch.cat([x_yx_px, x_yz, x_xz_px], 1)

    x_yx_py = x_yx.mean(dim=-1, keepdim=True).repeat(1, 1, 1, x_zx.shape[-1])
    x_zy_py = x_zy.mean(dim=-2, keepdim=True).repeat(1, 1, x_zx.shape[-2], 1)
    x_zx_ = torch.cat([x_yx_py, x_zy_py, x_zx], 1)

    # Restack the three augmented planes along the batch dimension.
    x = torch.cat([x_xy_[:, None], x_yz_[:, None], x_zx_[:, None]], 1).view(B, -1, H, W)
    return x
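For reference, this is how I understand the function is meant to be called, a minimal sketch assuming the three planes are stacked along the batch dimension (toy shapes, not from the repo):

import torch

# Hypothetical example: a batch of 2 triplanes, 32 channels per plane, 64x64,
# stacked along the batch dimension in (xy, yz, zx) order per sample.
feat = torch.randn(2 * 3, 32, 64, 64)
out = aware3d(feat)   # uses the definition quoted above
print(out.shape)      # torch.Size([6, 96, 64, 64]): each plane keeps its own 32
                      # channels plus 32 pooled channels from each of the other planes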

According to the paper, when operating on the xy-plane, for example, the yz-plane and zx-plane should contribute a y-vector (pooled along the z-axis) and an x-vector (pooled along the z-axis), respectively.
However, in the implementation, x_zy_pz = x_zy.mean(dim=-1, ...) appears to pool along the y-axis.
Is that right, or am I misunderstanding something?

Looking forward to your response. Thanks in advance. :)

Hi, x_zy is $P_{yz}$ in Fig. 4(b), where the $z$ axis corresponds to the W dimension of a BCHW tensor. Hence, setting dim=-1 relates to the $z$ axis.

Hi, thanks for the response.
However, according to training/volumetric_rendering/renderer.py, i.e.

def project_onto_planes(coordinates):
    """
    Does a projection of a 3D point onto a batch of 2D planes,
    returning 2D plane coordinates.

    Takes plane axes of shape n_planes, 3, 3
    # Takes coordinates of shape N, M, 3
    # returns projections of shape N*n_planes, M, 2
    """
    # planes = generate_planes().to(coordinates.device)
    # N, M, C = coordinates.shape
    # n_planes, _, _ = planes.shape
    # coordinates = coordinates.unsqueeze(1).expand(-1, n_planes, -1, -1).reshape(N*n_planes, M, 3)
    # inv_planes = torch.linalg.inv(planes).unsqueeze(0).expand(N, -1, -1, -1).reshape(N*n_planes, 3, 3)
    # projections = torch.bmm(coordinates, inv_planes)
    # return projections[..., :2]

    N, M, _ = coordinates.shape
    xy_coords = coordinates[..., [0, 1]]
    yz_coords = coordinates[..., [1, 2]]
    zx_coords = coordinates[..., [2, 0]]
    return torch.stack([xy_coords, yz_coords, zx_coords], dim=1).reshape(N*3, M, 2)
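
For concreteness, a quick check of the ordering with a single point (a minimal sketch, assuming the function above is in scope):

import torch

# One point (x, y, z) = (1, 2, 3), shaped (N=1, M=1, 3).
coords = torch.tensor([[[1.0, 2.0, 3.0]]])
proj = project_onto_planes(coords).squeeze(1)
print(proj)  # tensor([[1., 2.], [2., 3.], [3., 1.]]) -> (x, y), (y, z), (z, x)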

The variable x_xy you define at the beginning has an $x$ axis, which corresponds to the H dimension, and a $y$ axis, which corresponds to the W dimension. Similarly, x_yz has a $y$ axis, which corresponds to the H dimension, and a $z$ axis, which corresponds to the W dimension.

Since x_zy is permuted from x_yz, its $z$ axis should correspond to the H dimension instead of the W dimension. :)

Hi, regarding the grid_sample function: the x in the sampling coordinates (x, y) corresponds to the horizontal axis, i.e. the W dimension. Since the sampling coordinates of x_yz are (y, z), its $y$ axis maps to W and its $z$ axis maps to H, so after the permute the $z$ axis of x_zy is indeed on the W dimension. There is a Chinese blog post explaining this behavior: https://blog.csdn.net/weixin_45657478/article/details/128080374
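
A minimal sketch of that behavior on a toy tensor (not from the repo):

import torch
import torch.nn.functional as F

# A (B=1, C=1, H=2, W=3) tensor with a distinct value per pixel:
# [[0., 1., 2.],
#  [3., 4., 5.]]
img = torch.arange(6, dtype=torch.float32).view(1, 1, 2, 3)

# grid_sample takes normalized coordinates in (x, y) order, where
# x indexes W (horizontal) and y indexes H (vertical).
grid = torch.tensor([[[[1.0, -1.0]]]])  # (N, H_out, W_out, 2): x=+1, y=-1
out = F.grid_sample(img, grid, align_corners=True)
print(out)  # tensor([[[[2.]]]]) -> last column (W), first row (H)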

Hi, I see my misunderstanding now. Thanks! I wasn't aware of this particular behavior.

Besides, I still have another question regarding this implementation. Based on training/networks_stylegan2.py, x_xy_, x_yz_, and x_zx_ share the same convolutional weights, albeit modulated by different styles.

However, for example, the first 32 channels of x_xy_ correspond to x_xy, while the first 32 channels of x_yz_ correspond to x_yx, i.e. the transpose of x_xy (up to pooling). Therefore, the first 32 channels of the weights are applied to both the original x_xy and its transpose. The weights are not necessarily symmetric, and the convolution itself is not transpose-invariant.
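
For example, a minimal sketch with arbitrary toy shapes:

import torch
import torch.nn.functional as F

# Convolution with a generic (non-symmetric) kernel does not commute with
# transposing the spatial dimensions of the input.
x = torch.randn(1, 1, 8, 8)
w = torch.randn(1, 1, 3, 3)

y1 = F.conv2d(x.transpose(-1, -2), w, padding=1)
y2 = F.conv2d(x, w, padding=1).transpose(-1, -2)
print(torch.allclose(y1, y2))  # False in general; only a transpose-symmetric kernel gives True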

Does this matter?

It does not matter, as the weights are modulated by different styles.
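
Roughly speaking, a simplified sketch of the modulation (hypothetical channel counts; demodulation omitted):

import torch

# A shared base weight becomes a different effective weight per plane once it
# is scaled by that plane's style.
w = torch.randn(64, 96, 3, 3)            # shared conv weight (out, in, kh, kw)
style_xy = torch.randn(96)               # per-input-channel style for the xy plane
style_yz = torch.randn(96)               # per-input-channel style for the yz plane

w_xy = w * style_xy.view(1, -1, 1, 1)    # effective weight applied to x_xy_
w_yz = w * style_yz.view(1, -1, 1, 1)    # effective weight applied to x_yz_
print(torch.allclose(w_xy, w_yz))        # False: the planes see different effective filters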

I understand. Thanks for your patient explanation. :)