Several issues regarding the BLIP3 code
suikei-wang opened this issue · 6 comments
Hi there,
Thanks for the great work!
I found several issues when I tried to set up BLIP3 locally.

- `pip install -e .` fails with `FileNotFoundError: [Errno 2] No such file or directory: 'requirements-eval.txt'`, which is referenced in `setup.py`.
- There is no `inference.ipynb` provided, although the readme mentions it as the example inference code.
- I set the model up through the Hugging Face repo and tested it on an example, asking it to return a short answer. However, it generates a lot of meaningless content:
```
==> prediciton: The 8-directional moving head is a black, cylindrical component located at the top of this electric shaver. It features multiple rotating blades for effective hair cutting and styling in various directions: upwards (1), downward(2) , left sideways to right sidesidesidewaysidewarsidetowards backsidedown towards fronttowsdownupleftrightbackforwardfacing forwardfront facing backwardsand upshiftingedshifttoslowlymoving slowlymovedirectionally directionallongitudinallongitudeverticalverticallyvaryinglyvariablyvariablevaried variableyieldedly yieldingsystematicallysystematic systematicalystematicsymmetrysymmetric symmetryysimilarsimilaritysimilartransformationstransformationaltransformationstranformedtranforms transformative transformationistheoreticetheoretictheoriescientificsciencescience scientificsscientiossignificationsignified signifiesignedigitaldigits digitized digitalizationdigitised digitisationdigressionregressionregulation regulated regulatoryregular regularitiesrigorous rigourous righteouslyjusticejusjustioustruthful truthfullyhonest honest honestyintegrated
```
I only asked it to be succinct, in one sentence, and it produced this. I am not sure if these are paddings?
Thanks again :)
Hi @suikei-wang, thank you for trying out our code and the feedback.
- We have updated `setup.py` by removing `requirements-eval.txt`. `requirements-eval.txt` is specified in the original open-flamingo evaluation code, but it is not used in our SFT evaluation.
- `inference.ipynb`: thank you for pointing this out. The file was not pushed to the repo because it unexpectedly matched a pattern in `.gitignore`. We have now uploaded the script.
- Which inference code were you using for the Hugging Face model? We have provided this script in our model hub, and I haven't encountered the issue when running it in my local environment. Could you check your `transformers` version? (The one I'm using is `4.41.2`; I suppose any version higher than this would also work.)
Thanks again for your feedback!
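For reference, the `FileNotFoundError` in the original report is the typical failure mode when `setup.py` reads a requirements file eagerly. A minimal sketch of that pattern with a guard for a missing file (hypothetical helper, not the actual BLIP3 `setup.py`):

```python
# Sketch of a requirements-reading pattern in setup.py (hypothetical,
# not the actual BLIP3 setup.py). An unguarded open() of a file that
# was never committed raises the FileNotFoundError from the report.
from pathlib import Path

def read_requirements(fname: str) -> list:
    """Return requirement lines, or an empty list if the file is absent."""
    path = Path(fname)
    if not path.exists():  # guard: an optional extras file may be missing
        return []
    lines = path.read_text().splitlines()
    # Drop blank lines and comments.
    return [ln.strip() for ln in lines if ln.strip() and not ln.startswith("#")]

# In setup.py this would feed setuptools.setup(), e.g.:
# setup(..., extras_require={"eval": read_requirements("requirements-eval.txt")})
```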
Thanks for the quick response!
So I am using `xgen-mm-phi3-mini-base-r-v1.5` and the exact demo script mentioned in that HF notebook; I just put it in a `.py` file to run. I am using the latest version of `transformers`, as I only set up the env today, so it should be `4.44.1`.
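As a quick sanity check against the maintainer's `4.41.2` baseline, the installed version can be compared with the standard library (a sketch, not part of the original scripts; it assumes a plain `X.Y.Z` version string with no pre-release suffix):

```python
# Compare the installed transformers version against a minimum version.
from importlib.metadata import version, PackageNotFoundError

def at_least(installed: str, minimum: str) -> bool:
    """Numeric tuple comparison, e.g. at_least('4.44.1', '4.41.2') -> True."""
    to_tuple = lambda v: tuple(int(p) for p in v.split("."))
    return to_tuple(installed) >= to_tuple(minimum)

try:
    installed = version("transformers")
    print("transformers", installed,
          "ok" if at_least(installed, "4.41.2") else "too old")
except PackageNotFoundError:
    print("transformers is not installed")
```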
I tried different prompts on a shaver image. I first asked it to generate a description, and that works well. When I try to force a short output, it becomes unstable: sometimes it still outputs a long sentence, and sometimes it pads with these random words.
I am running it on an NVIDIA V100 32GB with Ubuntu 24.04. Not sure if that matters.
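Aside from prompting, degenerate repetition like the output quoted earlier can often be damped at decoding time. A hedged sketch using standard Hugging Face `transformers` `generate()` kwargs (the values are illustrative, not settings from this repo):

```python
# Decoding settings that commonly suppress degenerate word salad
# (generic transformers generate() kwargs; values are illustrative,
# not taken from the BLIP3 demo scripts).
gen_kwargs = {
    "max_new_tokens": 64,        # hard cap so a runaway answer stays short
    "repetition_penalty": 1.2,   # penalize tokens that were already generated
    "no_repeat_ngram_size": 3,   # forbid repeating any 3-gram verbatim
    "do_sample": False,          # greedy decoding for reproducibility
}

# Usage (assuming `model` and `inputs` come from the demo script):
# output_ids = model.generate(**inputs, **gen_kwargs)
```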
Hi @suikei-wang Thanks for your interest in our model! Could you explain how you force it to give a short output? The base model doesn't know how to follow instructions, and the way we train the model doesn't encourage short output. You may need to feed in 1 or 2 in-context examples for the model to follow the instruction. See the figure below or check our model card here. I am using a template like `<image>\n\nAnswer: XXX\n<image>\n\nAnswer: YYY\n<image>\n\nAnswer: ZZZ`.
Note that a prefix such as `Answer:` or `Output:` may help stabilize the output format.
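The in-context template above can be assembled programmatically. A sketch under one plausible reading of the template, where the query image ends with a bare `Answer:` prefix (the answers are placeholders; the real script would interleave actual images with the text):

```python
# Build a few-shot prompt from the maintainer's template: one
# "<image>\n\nAnswer: <text>\n" block per in-context example, then the
# query image followed by a bare "Answer:" prefix to stabilize the format.
def build_fewshot_prompt(example_answers: list) -> str:
    shots = "".join(f"<image>\n\nAnswer: {a}\n" for a in example_answers)
    return shots + "<image>\n\nAnswer:"

prompt = build_fewshot_prompt(["XXX", "YYY"])
# -> "<image>\n\nAnswer: XXX\n<image>\n\nAnswer: YYY\n<image>\n\nAnswer:"
```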
@zzxslp Thanks for the reply! I use the following prompt to force it:

> Please provide a description of the 8-directional moving head in this image. Please be short, in 1 sentence. Do not include any other feature description.

It produces the output above.
I also tried the few-shot approach with 3 examples. The output is short but does not make sense... For example, when I ask for a description of A, it returns the description of B.
But anyway thanks for the work!
Hi @suikei-wang, The base model doesn't know to follow instructions. Could you try our instruct model instead when feeding instructions?
As for few-shot learning, it is quite sensitive to prompts. Would you mind sharing the exact few-shot template you are using? (It would also be great to attach the images so I can try to reproduce on my side.)
Hi @zzxslp, sorry for my very late reply; it was tied up with a recent submission.
So basically I realized that this is a common issue in LLMs/MLLMs, due to their hallucinations. For example, when I input an image and ask for a description of a part (e.g. "can you segment the 8-directional moving head of the shaver in this image?"), most MLLMs give me a description of the entire shaver. Also, for some system prompts as mentioned in this issue, they cannot follow it correctly (generate a short response / reply yes or no only, etc.). Only a few commercial models handle this well (Pixtral 12B, Qwen 2-VL, Claude 3.5, GPT 4o). Anyway, I am happy to discuss more privately! Let me know if you are open to this :)