tudelft/event_flow

Question about the idea of average timestamp image

petercmh01 opened this issue · 8 comments

Dear authors, thank you for the great work. I am fairly new to optical flow research and I'm having a tough time understanding the average timestamp image loss in the paper. Could you explain it further? Thanks in advance.

To be specific, I can understand how events are warped to t_ref, and how the optical flow and interpolations work,

but I failed to understand the meaning of fw_iwe_pos_ts from here: fw_iwe_pos_ts = interpolate(fw_idx.long(), fw_weights * ts_list, self.res, polarity_mask=pol_mask[:, :, 0:1])

and I failed to see how the image created by fw_iwe_pos_ts /= fw_iwe_pos + 1e-9 can produce a loss that directs the learning of optical flow estimation.

Hi Peter, thanks for the question!

The lines you mention implement equation 2 in the paper, with fw_iwe_pos_ts the numerator and fw_iwe_pos the denominator (for positive polarity). Dividing these gives the image of average timestamps for this polarity. You might have understood this already.
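In case a code-level view helps, here is a minimal sketch of that numerator/denominator structure in plain PyTorch. The helper and its arguments are my own simplification (nearest-pixel accumulation instead of the repo's bilinear interpolate), but ts_sum and cnt play the roles of fw_iwe_pos_ts and fw_iwe_pos:

```python
import torch

def average_timestamp_image(px, py, ts, res):
    """Per-pixel average timestamp of already-warped events (eq. 2, one polarity).

    px, py: integer pixel coordinates of the warped events, shape (N,), long
    ts:     normalized event timestamps, shape (N,), float
    res:    (H, W) sensor resolution
    """
    H, W = res
    idx = py * W + px  # flatten 2D pixel coordinates to 1D indices

    # numerator: per-pixel sum of timestamps (the role of fw_iwe_pos_ts)
    ts_sum = torch.zeros(H * W).scatter_add_(0, idx, ts)
    # denominator: per-pixel event count (the role of fw_iwe_pos)
    cnt = torch.zeros(H * W).scatter_add_(0, idx, torch.ones_like(ts))

    # dividing gives the image of average timestamps; 1e-9 avoids division by zero
    return (ts_sum / (cnt + 1e-9)).view(H, W)
```

The actual interpolate additionally splats each event over its four neighboring pixels with the bilinear fw_weights, which is why ts_list is multiplied by the weights in the line you quoted.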

We learn optical flow by looking at the blur in a window of events (events generated by the same moving edge are spread out) and trying to remove this blur (move the events in space and time) as well as possible. If there's no remaining blur, it means we compensated correctly for the motion that generated the events, and hence we estimated optical flow correctly.

Equation 2 contributes to this in the following way: if we want to minimize the loss (eqs 3 and 4), we should minimize eq 2. The drawing below illustrates how deblurring events into fewer pixels leads to a lower loss than leaving them spread out in the image as blur.

[screenshot: sketch of three events with timestamps 0, 1, 2 spread over three pixels (blurred, higher loss) vs. warped into a single pixel (deblurred, lower loss)]
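For concreteness, here is the arithmetic behind the drawing as a tiny toy check (using the simplified summed per-pixel averages discussed further down this thread, rather than the paper's exact loss):

```python
# three events from the same moving edge, with timestamps 0, 1, 2
ts = [0.0, 1.0, 2.0]

# wrong flow (no compensation): each event stays in its own pixel,
# so each pixel's average timestamp equals that event's timestamp
loss_blurred = sum(t / 1 for t in ts)   # 0/1 + 1/1 + 2/1 = 3.0

# correct flow: all three events warp into one pixel,
# whose average timestamp is the mean of the three
loss_deblurred = sum(ts) / len(ts)      # (0 + 1 + 2) / 3 = 1.0

print(loss_blurred, loss_deblurred)     # 3.0 1.0 -> deblurring lowers the loss
```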

I hope this explains things! If not, keep asking questions :)

Thank you very much for your help!! The explanation was very clear. I was able to understand the idea of the loss, so I will close this issue for now : )

This is purely my opinion, and I believe there is no clear answer. I'm just sharing my experience as someone who has been thinking about this for quite a while. I understand @petercmh01's confusion, as I had a similar one.

Let's call this loss normalized_average_timestamp, which has the number of events in its denominator. The numerator part alone, which I call average_timestamp here, was originally proposed by Zhu et al., CVPR 2019: https://openaccess.thecvf.com/content_CVPR_2019/html/Zhu_Unsupervised_Event-Based_Learning_of_Optical_Flow_Depth_and_Egomotion_CVPR_2019_paper.html

And in this CVPR 2019 paper, the authors discuss the interpretation of average_timestamp in the last paragraph of Section 3.2:

> We can see that the gradient of the average timestamp image, (dt/dx, dt/dy), corresponds to the inverse of the flow, if we assume that all events at each pixel have the same flow.

Actually, reading this I was quite confused, because if this interpretation is true, minimizing the loss means maximizing the flow. That is why, experimentally, we observe undesired optima where the flow pushes all events out of the image plane. I guess the authors of this repo (the NeurIPS paper) are also aware of it.
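For what it's worth, here is my rough sketch of where that inverse relation comes from: if a single edge moves with constant flow $(u, v)$ and all events at a pixel share that flow, an event at pixel $(x, y)$ fires when the edge passes, i.e. $x = x_0 + u\,t$ and $y = y_0 + v\,t$, so

$$
t(x, y) = \frac{x - x_0}{u} = \frac{y - y_0}{v}, \qquad
\frac{\partial t}{\partial x} = \frac{1}{u}, \quad
\frac{\partial t}{\partial y} = \frac{1}{v}.
$$

A flatter average timestamp image (smaller gradient) then corresponds to a larger flow, consistent with the degenerate optimum described above.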

Then, coming back to this paper's normalized_average_timestamp, I am still not sure what the numerator of this loss function means. Imagine that the timestamps of every event in @Huizerd's example are all the same, say 1. Still, in the left case we get loss = 3, and loss = 1 in the right case. So this could work as a loss function without proper timestamp information, which makes me unsure how to interpret the numerator. Rather, having the denominator, the number of events, makes this loss function behave like a member of the family of contrast functions (in CMax), I guess.

You can also check our paper: https://arxiv.org/abs/2207.10022 (code); in the supplementary, Fig. 10, we analyse both average_timestamp and normalized_average_timestamp.

@shiba24 Thank you so much for your guidance!! I really appreciate it and will definitely check it out and update : )

Hi @shiba24,

Thanks for sharing your insights on the intuition behind the loss function :)

I'll try following up later with a more detailed answer, but I just wanted to highlight a small mistake in the counterexample that you provided. If all the events share the same timestamp, say 1, then in the left example from @Huizerd, the loss we would get with normalized_average_timestamp would be equal to 1, not 3: the numerator would be sum(1 + 1 + 1) = 3, and the denominator (i.e., the number of pixels with at least one event) would also be 3. This means that the loss in both cases would be the same. If this were used in an optimization framework, the problem would not be well defined: we would get the same loss everywhere, and hence the optimization could pick any flow solution. Note that this could be the expected behavior, since the input data does not contain any temporal information from which flow can be retrieved.
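To make the arithmetic explicit, here is the example as a tiny Python check (my own restatement, using the simplified non-squared per-pixel averages discussed in this thread; avg_timestamp_loss is a hypothetical helper, not code from the repo):

```python
def avg_timestamp_loss(pixels, normalized=False):
    """pixels: list of per-pixel lists of event timestamps (active pixels only)."""
    # numerator: per-pixel average timestamp (eq. 2), summed over active pixels
    numerator = sum(sum(ts) / len(ts) for ts in pixels)
    if not normalized:
        return numerator  # Zhu-style average_timestamp
    # normalized_average_timestamp: divide by the number of pixels with >= 1 event
    return numerator / len(pixels)

left  = [[1.0], [1.0], [1.0]]   # blurred: one event per pixel
right = [[1.0, 1.0, 1.0]]       # deblurred: all three events in one pixel

print(avg_timestamp_loss(left),       avg_timestamp_loss(right))        # 3.0 1.0
print(avg_timestamp_loss(left, True), avg_timestamp_loss(right, True))  # 1.0 1.0
```

With identical timestamps, the normalized loss is flat (1.0 in both configurations), matching the point above that any flow solution would score the same.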

Other CMax losses not based on timestamp information (e.g., variance, gradient, etc.) would of course behave differently on this example. They would simply converge to the solution that puts all the events together, regardless of their timestamps, as long as the warping dt = (t_ref - t_i) gives them some freedom to displace the events. However, the fact that there is a solution doesn't necessarily mean that it is a valid optical flow solution.
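As a reference point, a variance-based contrast objective is about this simple (a minimal sketch; iwe would be the per-pixel count image of warped events, i.e., the denominator image from earlier in the thread):

```python
import torch

def variance_contrast(iwe):
    """CMax objective on the image of warped events (IWE).

    iwe: (H, W) tensor of per-pixel warped event counts.
    A sharper (better motion-compensated) IWE piles events into fewer
    pixels and therefore has a higher variance.
    """
    return torch.var(iwe)
```

Maximizing this (or minimizing its negative) favors flows that concentrate events, with no reference to their timestamps, which is exactly the behavior described above.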

Thank you Fede!!
From my understanding of the image provided by @Huizerd, for timestamps of three events a, b, c (the general case), the left one becomes loss = a/1 + b/1 + c/1, which is 3 for both (a, b, c) = (0, 1, 2) and (a, b, c) = (1, 1, 1), no? Or is the example image incorrect?

What @Huizerd is showing in that image is an illustration of Zhu's original loss function (i.e., without the normalization). So what you say in your last message is correct about the left example: the original loss would be 3 in both the (0, 1, 2) and (1, 1, 1) cases.

> Then, coming back to this paper's normalized_average_timestamp, I am still not sure what the numerator of this loss function means. Imagine that the timestamps of every event in @Huizerd's example are all the same, say 1. Still, in the left case we get loss = 3, and loss = 1 in the right case. So this could work as a loss function without proper timestamp information, which makes me unsure how to interpret the numerator. Rather, having the denominator, the number of events, makes this loss function behave like a member of the family of contrast functions (in CMax), I guess.

I now realize that in this paragraph you were referring to the numerator of the normalized loss function. My apologies, then; you were right in your original message. The numerator of the normalized loss is identical to Zhu's loss.

I wasn't paying much attention to writing that part clearly 🙏.

Anyway, based on the image, the left is actually always larger than the right: a/1 + b/1 + c/1 = a + b + c > (a + b + c)/3, as long as a + b + c > 0, regardless of the actual values of (a, b, c). And for the version normalized by the area of the event pixels, I guess both left and right become (a + b + c)/3 (again regardless of a, b, c)? Please correct me if that's wrong.
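A quick numeric spot-check of those two claims (my own toy code):

```python
import random

for _ in range(5):
    a, b, c = (random.uniform(0.1, 2.0) for _ in range(3))
    left_num,  right_num  = a + b + c, (a + b + c) / 3   # numerator (Zhu-style) only
    left_norm, right_norm = left_num / 3, right_num / 1  # divided by active pixels
    assert left_num > right_num                # unnormalized: right always wins
    assert abs(left_norm - right_norm) < 1e-9  # normalized: always a tie
```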
Granted, this is not reality: real event data is not that simple.

Looking forward to your detailed reply. As noted at the beginning of my first comment, I do not have a clear answer, nor any intention of trying to reach a conclusion on this topic here.
Empirically, both approaches (the contrast functions and the timestamp-based ones) may work for flow, but the timestamp-based ones, both with and without normalization, are difficult for me to interpret in terms of why they converge to the desired flow.