This is an implementation of Transformer and LSTM models for quotation prediction: given text with quotation marks removed, the trained model predicts the positions where single or double quotes should be inserted. The LSTM model is character-based and trained from scratch. In addition to the LSTM, BERT and T5 models are used. In our experiments, T5 outperformed the other models. Refer to the report for implementation details and results.
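To make the task concrete, here is a minimal, hypothetical sketch of how quote-free inputs and per-character position labels could be derived from raw text. This is an illustration only, not the repository's actual preprocessing code:

```python
QUOTES = {"'", '"'}

def make_example(text: str):
    """Strip quotes from `text` and label each remaining character with 1
    if a quote immediately preceded it in the original, else 0.
    (A quote at the very end of the string is dropped in this sketch.)"""
    chars, labels = [], []
    pending_quote = False
    for ch in text:
        if ch in QUOTES:
            pending_quote = True
        else:
            chars.append(ch)
            labels.append(1 if pending_quote else 0)
            pending_quote = False
    return "".join(chars), labels

x, y = make_example('He said "hi" to her.')
# x == 'He said hi to her.'
# y flags the characters that follow a removed quote: the 'h' of "hi"
# and the space right after it.
```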
To gather the dataset, run:

python data.py --min-len 50 --max-len 500 --silicone --wiki --output-path ./dataset/
optional arguments:
  -h, --help            show this help message and exit
  --min-len MIN_LEN     discard sentences with length less than this value
  --max-len MAX_LEN     discard sentences with length greater than this value
  --silicone            include the informal datasets
  --wiki                include the formal dataset
  --output-path OUTPUT_PATH
                        output path for the gathered data
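For intuition, the length filters above amount to something like the following (assuming length is measured in characters, which data.py may define differently):

```python
def keep_sentence(sentence: str, min_len: int, max_len: int) -> bool:
    """Keep only sentences whose length lies in [min_len, max_len],
    mirroring the --min-len/--max-len flags."""
    return min_len <= len(sentence) <= max_len
```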
Please check 'config.py' for all configuration options; they should be self-explanatory. There are also example configurations for all three models in the configs/ folder.
For example, to train a BERT model:
python main.py --config configs/bert.yml
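A minimal sketch of how such a YAML config might be loaded on the Python side (assuming PyYAML is installed; the field names below are hypothetical, so consult config.py for the actual option names):

```python
import yaml  # PyYAML, assumed to be installed

def load_config(path: str) -> dict:
    """Read a training configuration such as configs/bert.yml."""
    with open(path) as f:
        return yaml.safe_load(f)

cfg = load_config("configs/bert.yml")
# Hypothetical field names for illustration only; see config.py
# for the real options.
model_type = cfg.get("model", "bert")
learning_rate = cfg.get("learning_rate", 3e-5)
```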