AV2 End-to-End Forecasting Challenge Submission Format

Question

yihongXU opened this issue 7 months ago · 1 comments

Dear organizer,

I would like to ask about the submission format for the AV2 2024 Unified Detection, Tracking, and Forecasting Challenge (in particular the End-to-End Forecasting sub-challenge):

Which frames should I predict for the test-set submission? Do I need to predict the future 3 s at every frame (at 10 Hz, like the LiDAR)?
Do the timestamp_ns values come from the LiDAR file names in each log?
It is clear that we are asked to predict the future 3 s, but at what frequency? Is it 2 Hz (even though the LiDAR data is collected at 10 Hz)?

To make the questions concrete, take the following example.
From the LiDAR file names I collect the following timestamps, which are clearly spaced at 10 Hz:

['315969904359876000.feather',
'315969904460072000.feather',
'315969904559605000.feather',
'315969904659802000.feather',
'315969904759998000.feather',
'315969904859531000.feather',
'315969904959727000.feather',
'315969905059929000.feather',
'315969905160125000.feather',
'315969905259658000.feather',
'315969905359854000.feather',
'315969905460051000.feather',
]

Should I produce forecasting results for every timestamp (that is, every 0.1 s)? For example:

{
    <log_id>: {
        315969904359876000: [
            {
                "prediction_m": k modes of future 3s at 2Hz,
                "detection_score": <detection_score>,
                "instance_id": <instance_id>,
                "current_translation_m": <current_translation_m>,
                "label": ,
                "name": ,
                "size": ,
            }, ...
        ],
        315969904460072000: [
            {
                "prediction_m": k modes of future 3s at 2Hz,
                "score": ,
                "detection_score": <detection_score>,
                "instance_id": <instance_id>,
                "current_translation_m": <current_translation_m>,
                "label": ,
                "name": ,
                "size": ,
            }, ...
        ],
    }, ...
}

From https://argoverse.github.io/user-guide/tasks/e2e_forecasting.html, I see that there is a 'timestep_ns' field within each future prediction. What is it, or is it unused? See below:

example_forecasts = {
    '02678d04-cc9f-3148-9f95-1ba66347dff9': {
        315969904359876000: [
            {'timestep_ns': 315969905359854000,
             'current_translation_m': array([6759.4230302 , 1596.38016309]),
             'detection_score': 0.54183,
             'size': array([4.4779487, 1.7388916, 1.6963532], dtype=float32),
             'label': 0,
             'name': 'REGULAR_VEHICLE',
             'prediction_m': ...
             'score': [0.54183, 0.54183, 0.54183, 0.54183, 0.54183],
             'instance_id': 0},
            ...
        ]
        ...
    }
}

The above example gives different values for 'timestep_ns' and <timestamp_ns>, while they are the same in the baseline example: https://github.com/neeharperi/LT3D/blob/e33189aa09f282cf555b64dba1654eb0c14464a6/forecasting/linear_forecaster.py#L34

In which coordinate system should the future predictions be produced: the ego coordinate system or the global (city) coordinate system?

Thanks in advance for your clarification.

Answer 1 · 2024-04-06T03:15:40.000Z

Hi @yihongXU,
Thanks for your interest in our challenge!

No, you do not need to predict future trajectories for every frame. We only evaluate a subset of frames. Concretely, given a list of frames F0, F1, F2, F3, F4, F5, F6, F7, F8, F9, F10, F11, F12, F13, F14, F15, ..., F20, ..., F25, ..., F30, every tenth frame (F0, F10, F20, F30, ...) is expected to have a future prediction for 3 seconds (at 2 Hz). Therefore, at F0 we expect future predictions for F5, F10, F15, F20, F25, F30, and at F10 we expect future predictions for F15, F20, F25, F30, F35, F40. In summary, you are asked to make a prediction every 10 frames (the starting frames, 1 Hz), and each prediction covers the future at every 5th frame (2 Hz), for 6 steps.
Yes, the timestamp_ns attributes come from the LiDAR file name in each log.
Yes, you should report future predictions at 2 Hz.
We use the timestamp_ns attribute for evaluation.
We evaluate all future predictions in the global (city) coordinate system.
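
For readers who want to check their frame selection, the schedule described above can be sketched in a few lines of Python. This is only an illustration, not the organizers' evaluation code: the function names and constants are hypothetical, the first sweep in the log is assumed to be F0, and future steps that fall past the end of the log are simply dropped.

```python
from pathlib import Path

EVAL_STRIDE = 10      # one evaluated (starting) frame every 10 sweeps, i.e. 1 Hz
FUTURE_STRIDE = 5     # one future step every 5 sweeps, i.e. 2 Hz
NUM_FUTURE_STEPS = 6  # 3 s at 2 Hz


def timestamps_from_filenames(filenames):
    """Parse sorted integer nanosecond timestamps from LiDAR file names."""
    return sorted(int(Path(name).stem) for name in filenames)


def evaluation_schedule(timestamps_ns):
    """Map each starting-frame timestamp to the timestamps of its future steps.

    Assumption: index 0 of the sorted sweep list is F0. Future steps beyond
    the end of the log are omitted.
    """
    schedule = {}
    for start in range(0, len(timestamps_ns), EVAL_STRIDE):
        future_indices = [
            start + FUTURE_STRIDE * (step + 1) for step in range(NUM_FUTURE_STEPS)
        ]
        schedule[timestamps_ns[start]] = [
            timestamps_ns[i] for i in future_indices if i < len(timestamps_ns)
        ]
    return schedule
```

Applied to the twelve file names listed in the question, this selects the 1st and 11th sweeps as starting frames; a full log would yield six future timestamps per starting frame wherever enough sweeps remain.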

Here is a link to a sample submission for the end-to-end forecasting val set to help you get started.
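
Since predictions are evaluated in the global (city) coordinate system, a model that forecasts in the ego frame has to transform its outputs before writing them into the submission. The sketch below shows one way to do that with NumPy. It rests on several assumptions that are not confirmed in this thread: the pose is treated as planar (a 2x2 rotation plus a 2D translation), `prediction_m` is taken to have shape (K, 6, 2) for K modes, `timestep_ns` is set equal to the frame timestamp as the question reports for the baseline, and the helper names are hypothetical. The remaining fields (`detection_score`, `size`, `label`, `name`, `instance_id`) are left out for brevity.

```python
import numpy as np


def ego_to_city_xy(points_ego_xy, city_R_ego, city_t_ego):
    """Apply p_city = R @ p_ego + t to the last axis of an (..., 2) array."""
    points = np.asarray(points_ego_xy, dtype=float)
    return points @ np.asarray(city_R_ego, dtype=float).T + np.asarray(city_t_ego, dtype=float)


def make_entry(prediction_ego, mode_scores, current_ego_xy,
               city_R_ego, city_t_ego, timestamp_ns):
    """Build a partial forecast entry with positions expressed in the city frame."""
    prediction_ego = np.asarray(prediction_ego, dtype=float)
    # Assumed layout: K modes x 6 future steps (3 s at 2 Hz) x (x, y).
    if prediction_ego.ndim != 3 or prediction_ego.shape[1:] != (6, 2):
        raise ValueError("expected prediction of shape (K, 6, 2)")
    if len(mode_scores) != prediction_ego.shape[0]:
        raise ValueError("expected one score per mode")
    return {
        "timestep_ns": timestamp_ns,  # assumption: same as the frame timestamp
        "current_translation_m": ego_to_city_xy(current_ego_xy, city_R_ego, city_t_ego),
        "prediction_m": ego_to_city_xy(prediction_ego, city_R_ego, city_t_ego),
        "score": list(mode_scores),
    }
```

A real AV2 pose is a full 3D rigid transform, so the planar shortcut here is only appropriate as an illustration of where the transform belongs in the pipeline.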