Evaluation script bugs?
voldemortX opened this issue · 5 comments
Hi guys, thanks for the amazing dataset!
However, my colleagues and I have encountered several issues with your evaluation script that prevent us from getting 100% accuracy when testing GT against GT:
- The distance to an invisible point (annotated invisible for GT, or out-of-range invisible for pred) is set to `dist_th`:
https://github.com/OpenPerceptionX/OpenLane/blob/f74ecca299e032e100c0ca200a3299c1745de084/eval/LANE_evaluation/lane3d/eval_3D_lane.py#L159
https://github.com/OpenPerceptionX/OpenLane/blob/f74ecca299e032e100c0ca200a3299c1745de084/eval/LANE_evaluation/lane3d/eval_3D_lane.py#L179
https://github.com/OpenPerceptionX/OpenLane/blob/f74ecca299e032e100c0ca200a3299c1745de084/eval/LANE_evaluation/lane3d/eval_3D_lane.py#L190
So the x & z error counting will be off: the error will be at least `dist_th = 1.5` for invisible points. I'm guessing these distances should be ignored here (a sketch of a possible fix follows this list of issues).
- Because of 1, if a GT line is entirely invisible, any pred's distance to it will be exactly `dist_th = 1.5`, so it won't pass the initial check here:
https://github.com/OpenPerceptionX/OpenLane/blob/f74ecca299e032e100c0ca200a3299c1745de084/eval/LANE_evaluation/lane3d/eval_3D_lane.py#L203
and will be accumulated as an FP/FN error. Simply removing this check could have other consequences, like division by 0 later in:
https://github.com/OpenPerceptionX/OpenLane/blob/f74ecca299e032e100c0ca200a3299c1745de084/eval/LANE_evaluation/lane3d/eval_3D_lane.py#L208
Anyway, this problem should not show up, because the script filters lines to have at least 2 visible points. However, the x range filtering is inconsistent between:
https://github.com/OpenPerceptionX/OpenLane/blob/f74ecca299e032e100c0ca200a3299c1745de084/eval/LANE_evaluation/lane3d/eval_3D_lane.py#L104
and
Also, there is no filtering after interpolation: if a line has 2 visible points before interpolation but not afterwards, it will also produce entirely invisible lines. For example, a line with y coordinates [23.5, 23.8] is valid beforehand, but since `y_samples` are only integers, it won't be valid after (ex)interpolation (see the sketch below).
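To make the suggestion concrete, here is a minimal sketch of the handling I have in mind. The names (`lane_errors`, `keep_lane`, `vis_gt`, ...) are hypothetical, not the actual variables in eval_3D_lane.py; the point is only to show masking invisible points out of the error statistics and re-checking visibility after interpolation.

```python
import numpy as np

dist_th = 1.5  # the same matching threshold the script uses

def lane_errors(x_gt, z_gt, vis_gt, x_pred, z_pred, vis_pred):
    """Per-point x/z errors for one GT/pred lane pair, both already
    resampled at the same y_samples. Points invisible in either lane
    are excluded from the statistics instead of being charged dist_th."""
    both_vis = np.logical_and(vis_gt.astype(bool), vis_pred.astype(bool))
    x_err = np.abs(x_gt - x_pred)[both_vis]
    z_err = np.abs(z_gt - z_pred)[both_vis]
    return x_err, z_err

def keep_lane(vis, min_visible=2):
    """Visibility check applied AFTER interpolation onto y_samples, so a
    lane like y = [23.5, 23.8] (valid before resampling, empty after) is
    ignored instead of becoming an entirely invisible lane in the matching."""
    return np.count_nonzero(vis) >= min_visible
```

Skipping lanes that fail `keep_lane` before the matching step would also sidestep the division by 0 above, because an entirely invisible lane never reaches the ratio computation.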
Btw, when testing GT against GT, I can only get around 87% F1 (I saved the GT after the coordinate transform and filtering). If you could clarify the intended ignore mechanism, I can make a pull request to fix this for you. There are two popular ignore mechanisms in metrics; I think the first one sounds better and aligns more with your original metric (only suggestions here, with a sketch after the list):
- Ignore the GT entirely and let the prediction predict anything there (e.g., the 255 ignore index in segmentation datasets).
- Neither encourage nor discourage a pred, provided it matches an ignored GT (e.g., in MOTChallenge, predictions matched to ignored non-pedestrian classes with IoU 0.5 are dropped; otherwise the pred counts as an FP).
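To make the second (MOT-style) option concrete at the lane level, here is a rough greedy sketch; the first option essentially corresponds to the point-level masking sketched earlier. All names (`count_matches`, `cost`, `is_ignored_gt`, ...) are hypothetical, and the real script uses its own matching routine.

```python
import numpy as np

def count_matches(cost, is_ignored_gt, dist_th=1.5):
    """cost[i, j]: matching cost between pred lane i and GT lane j.
    is_ignored_gt[j]: True if GT lane j should be ignored (e.g., fully
    invisible). A pred matched to an ignored GT is neither TP nor FP."""
    num_pred, num_gt = cost.shape
    matched_pred, matched_gt = set(), set()
    tp = 0
    for flat in np.argsort(cost, axis=None):  # greedy: cheapest pairs first
        i, j = np.unravel_index(flat, cost.shape)
        if cost[i, j] >= dist_th:
            break  # every remaining pair is at least as costly
        if i in matched_pred or j in matched_gt:
            continue
        matched_pred.add(i)
        matched_gt.add(j)
        if not is_ignored_gt[j]:
            tp += 1  # matches to ignored GT are simply dropped
    fp = num_pred - len(matched_pred)  # unmatched preds
    fn = sum(1 for j in range(num_gt)
             if j not in matched_gt and not is_ignored_gt[j])  # unmatched real GT
    return tp, fp, fn
```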
I think these issues may have been inherited from the synthetic benchmark, and they could non-trivially influence your already evaluated results.
Thank you for raising this issue. We'll check it and reply to you later.
- For lanes that are entirely invisible, yes, the problem exists. We plan to fix this by simply ignoring them in evaluation (a rough sketch follows this list). But we still believe those annotations are meaningful, since invisible lanes are part of the local map, so we will keep them in the GT json.
- For the inconsistent x range filtering, we're checking whether it affects the evaluation result. But yes, it is inconsistent.
- For filtering after interpolation, we're going to test it to see the difference in results.
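A rough sketch of what we mean by "ignoring them in evaluation" while keeping them in the GT json (hypothetical names, not the final implementation):

```python
def filter_gt_for_eval(gt_lanes, gt_visibility, min_visible=2):
    """Drop fully (or nearly) invisible GT lanes inside the evaluator only;
    the json annotations themselves are left untouched."""
    kept = [(lane, vis) for lane, vis in zip(gt_lanes, gt_visibility)
            if sum(1 for v in vis if v) >= min_visible]
    return [lane for lane, _ in kept], [vis for _, vis in kept]
```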
Again, thank you for pointing out these issues. Previously we intended to stay consistent with the Apollo evaluation code so that adaptation would be easy. We're fixing these bugs now; could you make a pull request about them so we can double-check?
@ChonghaoSima So far I've only fixed the F score, not the x & z errors. I will make a WIP pull request for you to cross-check.
We've fixed this issue, and GT-against-GT evaluation now matches perfectly. We will update all related results in our paper. Thank you for pointing this out.
Great!