A question regarding the implementation of 3D-aware convolution
Closed this issue · 6 comments
Hi! Thanks for your inspiring work.
I have a small question related to the implementation of 3D-aware convolution.
Given your implementation in networks_stylegan2.py, i.e.
def aware3d(x):
    if isinstance(x, list):
        x_xy, x_yz, x_zx = x
        B, _, H, W = x_xy.shape
        B *= 3
    else:
        x_ = x.view(-1, 3, x.shape[1], x.shape[2], x.shape[3])
        x_xy, x_yz, x_zx = x_[:, 0], x_[:, 1], x_[:, 2]
        B, _, H, W = x.shape

    x_zy = x_yz.permute(0,1,3,2)
    x_xz = x_zx.permute(0,1,3,2)
    x_yx = x_xy.permute(0,1,3,2)

    x_zy_pz = x_zy.mean(dim=-1, keepdim=True).repeat(1,1,1,x_xy.shape[-1])
    x_xz_pz = x_xz.mean(dim=-2, keepdim=True).repeat(1,1,x_xy.shape[-2],1)
    x_xy_ = torch.cat([x_xy, x_zy_pz, x_xz_pz], 1)

    x_yx_px = x_yx.mean(dim=-2, keepdim=True).repeat(1,1,x_yz.shape[-2],1)
    x_xz_px = x_xz.mean(dim=-1, keepdim=True).repeat(1,1,1,x_yz.shape[-1])
    x_yz_ = torch.cat([x_yx_px, x_yz, x_xz_px], 1)

    x_yx_py = x_yx.mean(dim=-1, keepdim=True).repeat(1,1,1,x_zx.shape[-1])
    x_zy_py = x_zy.mean(dim=-2, keepdim=True).repeat(1,1,x_zx.shape[-2],1)
    x_zx_ = torch.cat([x_yx_py, x_zy_py, x_zx], 1)

    x = torch.cat([x_xy_[:, None], x_yz_[:, None], x_zx_[:, None]], 1).view(B, -1, H, W)
    return x
According to the paper, when operating on the xy-plane, the yz-plane and the zx-plane should contribute a y-vector (pooled over the z-axis) and an x-vector (pooled over the z-axis), respectively.
However, in the implementation, x_zy_pz = x_zy.mean(dim=-1, ...) appears to pool along the y-axis instead.
Is that right, or am I misunderstanding something?
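To make what I mean concrete, here is a toy snippet (not your code; shapes are arbitrary): mean(dim=-1) on a B,C,H,W tensor always reduces the last (W) axis, so the question is really which spatial axis (y or z) the W axis of x_zy holds.

import torch

x_zy = torch.randn(2, 4, 8, 8)             # stand-in for the permuted yz-plane features
pooled = x_zy.mean(dim=-1, keepdim=True)   # shape (2, 4, 8, 1): the last (W) axis is reduced
print(pooled.shape)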
Looking forward to your response. Thanks in advance. :)
Hi, the z in x_zy is the W dimension of a BCHW tensor. Hence, when setting dim=-1, the pooling relates to the z-axis.
Hi, thanks for the response.
However, according to training/volumetric_rendering/renderer.py, i.e.
def project_onto_planes(coordinates):
    """
    Does a projection of a 3D point onto a batch of 2D planes,
    returning 2D plane coordinates.

    Takes plane axes of shape n_planes, 3, 3
    Takes coordinates of shape N, M, 3
    returns projections of shape N*n_planes, M, 2
    """
    # planes = generate_planes().to(coordinates.device)
    # N, M, C = coordinates.shape
    # n_planes, _, _ = planes.shape
    # coordinates = coordinates.unsqueeze(1).expand(-1, n_planes, -1, -1).reshape(N*n_planes, M, 3)
    # inv_planes = torch.linalg.inv(planes).unsqueeze(0).expand(N, -1, -1, -1).reshape(N*n_planes, 3, 3)
    # projections = torch.bmm(coordinates, inv_planes)
    # return projections[..., :2]
    N, M, _ = coordinates.shape
    xy_coords = coordinates[..., [0, 1]]
    yz_coords = coordinates[..., [1, 2]]
    zx_coords = coordinates[..., [2, 0]]
    return torch.stack([xy_coords, yz_coords, zx_coords], dim=1).reshape(N*3, M, 2)
Your variable x_xy defined at the beginning has x as its H dimension and y as its W dimension. Similarly, x_yz has y as its H dimension and z as its W dimension. Since x_zy is permuted from x_yz, its z is the H dimension instead of the W dimension. :)
Hi, regarding the grid_sample function, the x in the sampling coordinates (x, y) corresponds to the horizontal axis, specifically denoted as W. There is a Chinese blog post explaining this particular behavior: https://blog.csdn.net/weixin_45657478/article/details/128080374
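As a quick sanity check (a toy snippet, not from the repo), the following shows that the first grid_sample coordinate indexes the W (horizontal) axis:

import torch
import torch.nn.functional as F

feat = torch.zeros(1, 1, 4, 4)
feat[0, 0, 0, 3] = 1.0                                # nonzero entry at row H=0, column W=3
grid = torch.tensor([[[[1.0, -1.0]]]])                # one sample at normalized (x, y) = (1, -1)
print(F.grid_sample(feat, grid, align_corners=True))  # prints 1.0, so x indexes the W axis

Under this convention, x_yz (sampled with (y, z) coordinates) holds y along W and z along H, so the permuted x_zy holds z along W, and mean(dim=-1) pools the z-axis as intended.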
Hi, I see my misunderstanding now. Thanks! I wasn't aware of this particular behavior.
Besides, I have another question regarding this implementation. Based on training/networks_stylegan2, x_xy_, x_yz_, and x_zx_ share the same convolutional weights, albeit modulated by different styles.
However, the first channel group of x_xy_ is x_xy itself, whereas the first channel group of x_yz_ comes from x_yx, i.e. the transpose of x_xy (setting the pooling aside). The same weights are therefore applied to both x_xy and its transpose, yet the weights are not necessarily symmetric, and the convolution itself is not transpose-invariant (a toy check follows below).
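To illustrate the concern with a toy example (not your code):

import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 8, 8)
w = torch.randn(1, 1, 3, 3)                        # a generic, non-symmetric kernel
a = F.conv2d(x.transpose(-1, -2), w, padding=1)    # convolve the transposed map
b = F.conv2d(x, w, padding=1).transpose(-1, -2)    # transpose the convolved map
print(torch.allclose(a, b))                        # generally False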
Does this matter?
It does not matter, as the weights are modulated by different styles.
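As a rough sketch of the idea (simplified StyleGAN2-style modulation; demodulation, bias, and noise are omitted, and the values are illustrative):

import torch

out_c, in_c, k = 8, 6, 3
weight = torch.randn(out_c, in_c, k, k)          # shared base weights
style_xy = torch.randn(in_c)                     # style for the xy branch (illustrative)
style_yz = torch.randn(in_c)                     # style for the yz branch (illustrative)

w_xy = weight * style_xy.reshape(1, -1, 1, 1)    # effective kernel seen by the xy branch
w_yz = weight * style_yz.reshape(1, -1, 1, 1)    # effective kernel seen by the yz branch
print(torch.allclose(w_xy, w_yz))                # False: each plane sees its own kernel

So although the base weights are shared across the three planes, the per-plane styles give each plane a different effective kernel.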
I understand. Thanks for your patient explanation. :)