How do you handle multi-round conversations in the training and inference stages?
laserwave opened this issue · 1 comments
Nice work. I have a question about how you handle multi-round conversations in both the training and inference stages. Do you have to extract the image features again, given that a follow-up question may require the abilities of different experts?
The situation becomes more complicated for multi-image comprehension, since different images may need different experts. For example:
According to the text in image1 <image> and the region [0.3, 0.2, 0.5, 0.4] of image2 <image>, what can we infer?
Training:
Given an image
Inference:
In the multi-round conversation scenario, the features of experts that have already been used can be cached. If the assigned vision expert was activated in an earlier round, we just feed the cached feature and the current instruction to the MoV-Adapter. If a vision expert has never been activated but is assigned for the current instruction, we need to extract its features during this forward pass.
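
The caching behaviour described above could be sketched roughly as follows. This is only an illustration, not the repository's code: `ExpertFeatureCache`, `features_for_round`, and the toy experts are hypothetical names, and keying the cache by (image, expert) pair is an assumption made here so that the multi-image case is covered as well.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins: a "feature" is whatever an expert returns for an image.
Feature = List[float]
Expert = Callable[[str], Feature]


class ExpertFeatureCache:
    """Caches expert features per (image, expert) across conversation rounds."""

    def __init__(self, experts: Dict[str, Expert]) -> None:
        self.experts = experts
        self._cache: Dict[Tuple[str, str], Feature] = {}
        self.extractions = 0  # number of real expert forward passes so far

    def get(self, image_id: str, expert_name: str) -> Feature:
        key = (image_id, expert_name)
        if key not in self._cache:
            # Expert never activated for this image: extract its features now.
            self._cache[key] = self.experts[expert_name](image_id)
            self.extractions += 1
        # Otherwise the cached feature is reused as-is.
        return self._cache[key]

    def features_for_round(
        self, image_id: str, assigned: List[str]
    ) -> Dict[str, Feature]:
        """Return features for the experts assigned to the current instruction."""
        return {name: self.get(image_id, name) for name in assigned}


# Toy experts (assumptions for illustration only).
toy_experts: Dict[str, Expert] = {
    "ocr": lambda image_id: [1.0, 0.0],
    "detection": lambda image_id: [0.0, 1.0],
}

cache = ExpertFeatureCache(toy_experts)
round1 = cache.features_for_round("image1", ["ocr"])               # extracts ocr
round2 = cache.features_for_round("image1", ["ocr", "detection"])  # reuses ocr
```

In this sketch the second round runs only the newly assigned expert; the features returned for each round would then be passed, together with the current instruction, to the adapter.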