Official implementation of "Answering Diverse Questions via Text Attached with Key Audio-Visual Clues"