ScriptWriter
Neural network that fills in blank spaces in movie scripts
Data:
Pre-processing:
- txt format -> clean txt format (removing useless tabs, repeated \n etc. as little format noise as possible) !!!
- Dividing scripts into tokenized torch arrays randomly (needs to be more way efficient) !!!
Models:
-
Contrastive input: context, fragment output: similarity score
-
Generative input: masked context (mask where fragment should start) output: context + fragment or context + new word + shifted mask
Notes:
- Can't really use fully pretrained transformers since diff vocab because of formatting
- Use some pretrained layers from good transformers
Idea:
Contrastive model = value net
Generative model = policy net
Combine them with MCTS and generate many path's that the fragment could take
TODO:
- DATA TASKS:
- Data extraction pipeline improve efficiency !
- Data cleaning !!
- TOKENIZATION !!!
- How to tokenize data
- How to include whitespace
- In general figure out tokenization step
- Do we need new vocab or can we just use one
Useful papers for this project:
-
Transformers: https://arxiv.org/pdf/1706.03762.pdf
-
LayerNorm: https://arxiv.org/pdf/1607.06450.pdf