Can't reproduce the BLEU, METEOR and ROUGE_L numbers from Page 6, Table 5 (Evaluation on Point Cloud-Text Tasks)
zhurob opened this issue · 2 comments
I followed the instructions in https://github.com/csuhan/OneLLM/blob/main/docs/Evaluation.md:
Point-Text Evaluation
PointLLM Caption
Download PointLLM data from this link
Fill pretrained_path in eval/point_cap_pointllm.py and run: python eval/point_cap_pointllm.py.
Evaluate with eval/caption_eval.py. The annotation file is at datasets/Eval/point/pointllm_test_cococap.json
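For context, my understanding is that eval/caption_eval.py scores predictions with the standard COCO caption metrics; the sketch below is that pipeline under this assumption (the predictions file name is a placeholder, not the script's actual output path).

```python
# Minimal sketch of the scoring step, assuming eval/caption_eval.py follows the
# standard pycocoevalcap COCO caption protocol. The annotation path is the one
# from the docs above; the predictions file name is a placeholder for the
# output of eval/point_cap_pointllm.py.
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

annotation_file = "datasets/Eval/point/pointllm_test_cococap.json"  # ground-truth captions
results_file = "point_cap_predictions.json"  # placeholder name for the generated captions

coco = COCO(annotation_file)
coco_result = coco.loadRes(results_file)

coco_eval = COCOEvalCap(coco, coco_result)
coco_eval.params["image_id"] = coco_result.getImgIds()  # score only ids present in the results
coco_eval.evaluate()

# Prints Bleu_1..4, METEOR, ROUGE_L, CIDEr, SPICE
for metric, score in coco_eval.eval.items():
    print(f"{metric}: {score:.3f}")
```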
Several of my team members and I tried to reproduce your Table 5 numbers for OneLLM, and we all got similarly low BLEU, METEOR and ROUGE_L scores (see below); CIDEr is also zero. Can you please double-check? We believe we are using the same point cloud files, scripts, and model. Thank you. Rob
Bleu_1: 0.104
Bleu_2: 0.065
Bleu_3: 0.045
Bleu_4: 0.034
METEOR: 0.131
ROUGE_L: 0.175
CIDEr: 0.000
SPICE: 0.094
From https://arxiv.org/pdf/2312.03700, Page 6, Table 5: Evaluation on Point Cloud-Text Tasks. The evaluation dataset is from Objaverse [16], following the data split in PointLLM [92]. InstructBLIP takes a single-view image as input, while PointLLM and OneLLM take a point cloud as input. GPT4-Acc.: GPT4 as the accuracy evaluator [92].
Model                   | Captioning              | Classification
                        | BLEU-1  ROUGE-L  METEOR | GPT4-Acc.
InstructBLIP-7B [15]    | 11.2    13.9     14.9   | 38.5
InstructBLIP-13B [15]   | 12.6    15.0     16.0   | 35.5
PointLLM-7B [92]        |  8.0    11.1     15.2   | 47.5
PointLLM-13B [92]       |  9.7    12.8     15.3   | 45.0
OneLLM-7B (Ours)        | 42.2    45.3     20.3   | 44.5
Our point cloud caption results are evaluated with the Phase II model (Multimodal Alignment). The final model after instruction tuning tends to output long, detailed responses, while the caption benchmark expects a short sentence, so it performs poorly on this benchmark.
A simple way to improve this is to change the task prompt from "What is this?" to "Provide a one-sentence caption":
https://github.com/csuhan/OneLLM/blob/73393b17a14fa58a179b450a2fe2d2d640dd61fc/eval/point_cap_pointllm.py#L38C21-L38C34
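For illustration, the change at the linked line would look roughly like this (the variable name is a stand-in; only the two prompt strings come from the suggestion above):

```python
# Illustrative only: the suggested prompt change at the linked line of
# eval/point_cap_pointllm.py. The variable name is hypothetical; only the two
# prompt strings are from the suggestion above.

# Before: a generic question leads to long, detailed answers that match the
# short reference captions poorly.
# prompt = "What is this?"

# After: explicitly asking for a one-sentence caption keeps the output short.
prompt = "Provide a one-sentence caption"
```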
Good fix, thank you very much. Verified, it works.