ViLLA: Fine-grained vision-language representation learning from real-world data
Primary LanguagePythonMIT LicenseMIT