runnanchen/CLIP2Scene

Question about Semantic-guided Spatial-temporal Consistency Regularization

fang196 opened this issue · 0 comments

Thanks for the great work!
I have three questions about Semantic-guided Spatial-temporal Consistency Regularization.

  1. What is the reason for dividing the complete stitched point cloud into regular grids rather than using short-term temporality directly?
  2. What does the symbol * represent in Equation 3? Does it indicate a cross product operation?
  3. It is stated that the image is matched to the first frame of the point cloud $P_1$ using pixel-point correspondences $\{\hat{x}_i^1, \hat{p}_i^1\}_{i=1}^{\hat{M}}$. This implies that for values of $k$ ranging from 1 to $K$, we have $t_{\hat{i}}^k = t_{\hat{i}}^1$ and $\hat{x}_{\hat{i}}^k = \hat{x}_{\hat{i}}^1$. However, in Equation 4, the text embeddings are denoted as $t_{\hat{i}}^1$, while the image embeddings are denoted as $\hat{x}_{\hat{i}}^{\hat{k}}$. Why is this the case?