Some questions about the open-loop evaluation framework.
flclain opened this issue · 4 comments
Congratulations on your great work, and thanks for sharing the code.
In the paper "Rethinking the Open-Loop Evaluation of End-to-End Autonomous Driving in nuScenes", they designed an MLP-based method that takes raw sensor data (e.g., past trajectory, velocity, etc.) as input and directly outputs the future trajectory of the ego vehicle, without using any perception or prediction information such as camera images or LiDAR.
Surprisingly, such a simple method achieves state-of-the-art end-to-end planning performance on the nuScenes dataset, reducing the average L2 error by about 30%.
They concluded that maybe we need to rethink the current open-loop evaluation scheme of end-to-end autonomous driving in nuScenes.
What do you think of this experiment? https://github.com/E2E-AD/AD-MLP
Is there a problem with their experimental results, or do we need a new open-loop/closed-loop evaluation framework?
Thank you for your interest, and you've posed an excellent question that has indeed been widely debated. I will try to address your question from four perspectives:
Firstly, open-loop evaluation is a frequently employed measure in machine learning-based planning and has a strong positive correlation with closed-loop metrics. For example, in the nuPlan competition, the open-loop score is used as one of the reference indicators, and most algorithms show good consistency between their open-loop and closed-loop results.
Secondly, the introduction of the vehicle's historical state information in the process of imitation learning can cause a phenomenon known as 'causal confusion'. While incorporating history does indeed enhance open-loop indicators, it significantly reduces closed-loop performance, thereby undermining its reference value. Similarly, in the experiment you mentioned, the introduction of historical information has caused the open-loop indicators to lose their reference value.
Thirdly, our choice of the nuScenes dataset was driven by it being the largest and most comprehensive dataset currently available for end-to-end scenarios. However, this dataset is still too small for planning purposes. We hope to see the emergence of larger-scale end-to-end autonomous driving datasets in the future (such as nuPlan) to facilitate further end-to-end related research.
Lastly, there is no well-established evaluation method, based on real data, for end-to-end autonomous driving algorithms. Most algorithms are only tested on open-source simulators such as CARLA. However, there is a considerable 'domain gap' between CARLA's simulated data and real data, which limits what the evaluation metrics imply. Additionally, there is a lack of expert data for training in the CARLA simulator. Current research involves manually creating a very naive rule-based expert system to generate Ground Truth (GT), but this expert system is far removed from human driver behavior. This significantly limits the benchmark, which therefore does not effectively validate imitation learning algorithms. The industry has introduced many excellent evaluation systems based on real data, such as NVIDIA's DRIVE Sim and Waabi's UniSim. These are commendable initiatives, but sadly, they are not open-source, so we cannot test or experiment on these simulators.
Personally, I believe that closed-loop evaluations based on real data will be crucial in addressing autonomous driving issues in the future. We also have plans in this area for the future, and we hope to see more open-source work in this field to aid academic research.
Thank you very much for your detailed reply!
I'm closing this issue as it's settled. Feel free to reopen it if needed, @flclain.
A few points should be stated clearly:
UniAD reports the planning results for each timestamp, i.e., L2_2s is the L2 error at timestamp 2s.
However, the planning metrics in VAD and AD-MLP adopt the code from ST-P3. For the reported L2 and Col at each timestamp, they average the results from time 0 to the timestamp in question, i.e., L2_2s = avg(L2_0.5s, L2_1s, L2_1.5s, L2_2s), and the final Avg. L2 value is the average of the three already-averaged values.
This will not change the finding that AD-MLP still gets very good results over carefully designed models (its Avg. L2 under the UniAD setting is the L2_3s in the AD-MLP paper). This just means directly taking numbers from the papers will lead to some inconsistency.
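To make the difference between the two conventions concrete, here is a minimal sketch in Python. It is not the actual UniAD or ST-P3 evaluation code: the function names and the per-step errors are made up for illustration, and it assumes waypoints every 0.5s with reported horizons of 1s, 2s and 3s.

```python
import numpy as np

DT = 0.5  # assumed spacing between planned waypoints, in seconds


def l2_at_horizon(step_l2, horizon_s, dt=DT):
    """Per-timestamp convention: the L2 error at exactly `horizon_s`."""
    idx = int(round(horizon_s / dt)) - 1
    return float(step_l2[idx])


def l2_avg_to_horizon(step_l2, horizon_s, dt=DT):
    """Cumulative convention: mean L2 error over all steps up to `horizon_s`."""
    n = int(round(horizon_s / dt))
    return float(np.mean(step_l2[:n]))


# Hypothetical per-step L2 errors (metres) at 0.5s, 1.0s, ..., 3.0s.
step_l2 = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
horizons = (1.0, 2.0, 3.0)

per_timestamp = [l2_at_horizon(step_l2, h) for h in horizons]   # ~[0.2, 0.4, 0.6]
cumulative = [l2_avg_to_horizon(step_l2, h) for h in horizons]  # ~[0.15, 0.25, 0.35]

avg_per_timestamp = float(np.mean(per_timestamp))  # ~0.4
avg_cumulative = float(np.mean(cumulative))        # ~0.25

print("per-timestamp:", per_timestamp, "avg:", avg_per_timestamp)
print("cumulative:   ", cumulative, "avg:", avg_cumulative)
```

For the same predicted trajectory, the cumulative convention yields smaller numbers whenever the error grows with the horizon, which is why values copied from papers using different conventions are not directly comparable.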