Audio-aware Spatio-Temporal Prototype Matching for Text-Video Retrieval