Cross-modality matching for VCR

Without fine-tuning

With fine-tuning