/Multimodal-Graph-Script-Learning

Non-Sequential Graph Script Induction via Multimedia Grounding (ACL 2023)

Yu Zhou, Sha Li, Manling Li, Xudong Lin, Shih-Fu Chang, Mohit Bansal and Heng Ji
ACL 2023

Abstract

Online resources such as WikiHow compile a wide range of scripts for performing everyday tasks, which can assist models in learning to reason about procedures. However, the scripts are always presented in a linear manner, which does not reflect the flexibility displayed by people executing tasks in real life. For example, in the CrossTask Dataset, 64.5% of consecutive step pairs are also observed in the reverse order, suggesting their ordering is not fixed. In addition, each step has an average of 2.56 frequent next steps, demonstrating "branching". In this paper, we propose the new challenging task of non-sequential graph script induction, aiming to capture optional and interchangeable steps in procedural planning. To automate the induction of such graph scripts for given tasks, we propose to take advantage of loosely aligned videos of people performing the tasks. In particular, we design a multimodal framework to ground procedural videos to WikiHow textual steps and thus transform each video into an observed step path on the latent ground truth graph script. This key transformation enables us to train a script knowledge model capable of both generating explicit graph scripts for learnt tasks and predicting future steps given a partial step sequence. Our best model outperforms the strongest pure text/vision baselines by 17.52% absolute gains on F1@3 for next step prediction and 13.8% absolute gains on Acc@1 for partial sequence completion. Human evaluation shows our model outperforming the WikiHow linear baseline by 48.76% absolute gains in capturing sequential and non-sequential step relationships.
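The two statistics quoted in the abstract (consecutive step pairs that also occur in reverse order, and the number of frequent next steps per step) can be illustrated on toy data. The sketch below is not the paper's measurement code: the step paths are invented, the helper names are hypothetical, and both the "distinct pair" counting and the `min_count` frequency threshold are assumptions about how such statistics might be defined.

```python
from collections import Counter, defaultdict

# Hypothetical observed step paths, one list per video (not taken from CrossTask).
paths = [
    ["pour water", "add coffee", "stir", "serve"],
    ["add coffee", "pour water", "stir", "serve"],
    ["pour water", "add coffee", "add sugar", "stir", "serve"],
]


def consecutive_pairs(paths):
    """Count every (step, next_step) pair that occurs back to back."""
    counts = Counter()
    for path in paths:
        counts.update(zip(path, path[1:]))
    return counts


def reversible_fraction(pair_counts):
    """Fraction of distinct consecutive pairs also observed in reverse order."""
    if not pair_counts:
        return 0.0
    reversible = sum(1 for (a, b) in pair_counts if (b, a) in pair_counts)
    return reversible / len(pair_counts)


def mean_frequent_next_steps(pair_counts, min_count=1):
    """Average number of distinct successors seen at least min_count times.

    min_count is an assumed threshold for what counts as "frequent".
    """
    successors = defaultdict(set)
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            successors[a].add(b)
    if not successors:
        return 0.0
    return sum(len(s) for s in successors.values()) / len(successors)


pair_counts = consecutive_pairs(paths)
print(f"reversible pairs: {reversible_fraction(pair_counts):.2%}")
print(f"mean next steps:  {mean_frequent_next_steps(pair_counts):.2f}")
```

On real data, a reversible fraction well above zero and a mean successor count above one are the signals of interchangeable steps and branching that motivate a graph rather than a linear script.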

Reproduce

Requirements

The models in this paper can be run on a single NVIDIA V100 GPU with CUDA 12.0.

Please see environment.yml for specific package requirements.


Training

For pre-training our models on HowTo100M, please refer to pretrain.py.

For finetuning models on CrossTask, please refer to finetune.py.

The corresponding data can be downloaded from their respective webpages: HowTo100M and CrossTask.


Evaluation

To generate probabilistic schema graphs as shown in outputs, please refer to graph.py.
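For intuition about what a probabilistic schema graph contains, here is a minimal count-based sketch. It is not the method used in graph.py, which relies on the trained script knowledge model; it only shows the shape of the result, assuming edge probabilities are estimated by normalizing transition counts over observed step paths. The boundary markers, function names, and the pruning threshold are all hypothetical.

```python
from collections import Counter, defaultdict

START, END = "<start>", "<end>"  # hypothetical boundary markers


def build_transition_graph(paths):
    """Return {step: {next_step: probability}} from observed step paths."""
    counts = defaultdict(Counter)
    for path in paths:
        padded = [START, *path, END]
        for src, dst in zip(padded, padded[1:]):
            counts[src][dst] += 1
    graph = {}
    for src, nxt in counts.items():
        total = sum(nxt.values())
        graph[src] = {dst: n / total for dst, n in nxt.items()}
    return graph


def prune(graph, min_prob=0.2):
    """Drop edges below an assumed probability threshold."""
    return {
        src: {dst: p for dst, p in nxt.items() if p >= min_prob}
        for src, nxt in graph.items()
    }


# Toy step paths using placeholder step IDs.
toy_paths = [["A", "B", "C"], ["B", "A", "C"], ["A", "B", "C"]]
toy_graph = build_transition_graph(toy_paths)
for src, nxt in toy_graph.items():
    for dst, p in sorted(nxt.items(), key=lambda kv: -kv[1]):
        print(f"{src} -> {dst}: {p:.2f}")
```

Because both `A -> B` and `B -> A` receive non-zero probability here, the resulting graph can express that the two steps are interchangeable, which a linear script cannot.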

For evaluating trained models on Next Step Prediction and Partial Sequence Completion, please refer to next_step_prediction.py and partial_sequence_completion.py.
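The abstract reports F1@3 for next step prediction and Acc@1 for partial sequence completion. The sketch below shows generic top-k versions of these metrics for a single example. The exact definitions used in next_step_prediction.py and partial_sequence_completion.py may differ, and the function names and inputs here are assumptions made for illustration.

```python
def f1_at_k(ranked, gold, k=3):
    """F1 between the top-k ranked predictions and the set of gold next steps."""
    top_k = set(ranked[:k])
    gold = set(gold)
    hits = len(top_k & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def acc_at_1(ranked, gold_step):
    """1.0 if the highest-ranked prediction equals the gold step, else 0.0."""
    return float(bool(ranked) and ranked[0] == gold_step)


# Toy example: the model ranks four candidate steps, two of which are valid.
ranked_steps = ["add coffee", "stir", "serve", "pour water"]
print(f1_at_k(ranked_steps, {"add coffee", "pour water"}, k=3))
print(acc_at_1(ranked_steps, "add coffee"))
```

Dataset-level scores would then typically be the mean of these per-example values.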

BibTeX

If you find the code in this repo useful, please consider citing our paper:

@inproceedings{zhou2023non,
  title={Non-Sequential Graph Script Induction via Multimedia Grounding},
  author={Zhou, Yu and Li, Sha and Li, Manling and Lin, Xudong and Chang, Shih-Fu and Bansal, Mohit and Ji, Heng},
  booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)},
  year={2023}
}