Moondream for Object Localization
E5GEN2 opened this issue · 4 comments
I am wondering if Moondream can be used for grounding tasks such as object localization - something similar to what CogAgent does with GUIs, but trained on my own data. If I fine-tune Moondream on my custom dataset of images + bounding boxes + text, is there a chance it would work?
Yes - the current version of moondream can detect one object per image. If you query with
`Bounding box: {object}`
it will return an array of 4 floating point numbers that indicate the relative (x1, y1) and (x2, y2) positions for the top-left and bottom-right corners. The next release will add support for multiple objects and may also change the output format. I'll post an update here when it's out.
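For reference, a minimal sketch of that query using the HuggingFace moondream2 usage pattern (the model ID, the `encode_image` / `answer_question` API, and the exact reply format are assumptions on my part and may differ):

```python
# Minimal sketch, assuming the HuggingFace moondream2 checkpoint loaded
# with trust_remote_code, and that the model replies with a JSON-style
# array of four relative floats.
import json

from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vikhyatk/moondream2"  # assumed checkpoint name
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("example.jpg")
enc_image = model.encode_image(image)

# Query format described above: "Bounding box: {object}"
answer = model.answer_question(enc_image, "Bounding box: dog", tokenizer)

# Expected reply, e.g. "[0.12, 0.34, 0.56, 0.78]" -> relative (x1, y1)
# top-left and (x2, y2) bottom-right corners.
x1, y1, x2, y2 = json.loads(answer)

# Scale relative coordinates to pixel coordinates.
w, h = image.size
print((x1 * w, y1 * h, x2 * w, y2 * h))
```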
What if I have a dataset of images + actions, i.e.
{"x1": 420, "x2": 378, "y1": 1042, "y2": 245, "action": "swipe", "duration": 200}
Would it be able to predict such actions if I train it on my dataset?
Is it possible to predict the next action from a sequence of images + actions? If not, what if I create a collage image of the previous images + actions - would it be able to learn such a task?
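For concreteness, here is roughly the sample format I have in mind (the prompt string and field names are placeholders I made up, not anything Moondream expects):

```python
# Rough sketch: flatten each action record into a text target so the
# decoder could learn to emit it as a string. The prompt wording and
# keys are hypothetical, not a format Moondream knows about.
import json

def make_sample(image_path: str, action: dict) -> dict:
    return {
        "image": image_path,
        "prompt": "Next action:",  # made-up prompt
        "target": json.dumps(action, sort_keys=True),
    }

sample = make_sample(
    "frame_0001.png",
    {"x1": 420, "x2": 378, "y1": 1042, "y2": 245,
     "action": "swipe", "duration": 200},
)
print(sample["target"])
# {"action": "swipe", "duration": 200, "x1": 420, "x2": 378, "y1": 1042, "y2": 245}
```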
While preparing a dataset for fine-tuning, what format should the coordinates be given in? Also, are you using a separate regression loss, or is the text decoder model itself emitting the coordinates as a string?
I have the same question.