/multi_implicit_cot

A solution to this challenge: https://yuntiandeng.com/cv/challenge.png


The research is not my own; credit belongs here

The model demo notebook, with instructions and visuals, is stored here

Simply run the notebook to explore the results. The notebook requests user input for a custom prediction, but otherwise it's fully automated.

Data Format

The training and test datasets follow this format:

[input 1a] $ [input 1b]||[CoT 1a] $ [CoT 1b] #### [output 1a] $ [output 1b]
[input 2a] $ [input 2b]||[CoT 2a] $ [CoT 2b] #### [output 2a] $ [output 2b]
[input 3a] $ [input 3b]||[CoT 3a] $ [CoT 3b] #### [output 3a] $ [output 3b]

Example entry:

1 7 * 1 3 $ 6 2 * 6 3||1 7 0 + 0 3 1 2 $ 6 5 1 + 0 8 7 0 #### 1 0 2 2 $ 6 3 9 0

Each multiplication is delimited by $. Digits are written least-significant first, so 1 7 * 1 3 corresponds to 71 * 31, the CoT 1 7 0 + 0 3 1 2 corresponds to the partial products 71 + 2130, and the output 1 0 2 2 corresponds to 2201.
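As a minimal sketch of how an entry in this format can be read back, the snippet below splits a line on the `||`, `####`, `$`, `*` and `+` delimiters and decodes the reversed digits. The helper names `decode_number` and `parse_entry` are hypothetical and are not taken from the notebook.

```python
def decode_number(tokens: str) -> int:
    """Turn space-separated, least-significant-first digits into an int."""
    digits = tokens.split()
    return int("".join(reversed(digits)))


def parse_entry(line: str) -> list:
    """Split one dataset line into one record per multiplication."""
    inputs_part, rest = line.split("||")
    cot_part, outputs_part = rest.split("####")
    records = []
    # Inputs, CoT steps and outputs are aligned by position between "$" marks.
    for inp, cot, out in zip(
        inputs_part.split("$"), cot_part.split("$"), outputs_part.split("$")
    ):
        a, b = (decode_number(x) for x in inp.split("*"))
        partials = [decode_number(x) for x in cot.split("+")]
        records.append(
            {"a": a, "b": b, "partials": partials, "product": decode_number(out)}
        )
    return records


example = "1 7 * 1 3 $ 6 2 * 6 3||1 7 0 + 0 3 1 2 $ 6 5 1 + 0 8 7 0 #### 1 0 2 2 $ 6 3 9 0"
for record in parse_entry(example):
    # The partial products sum to the final product, which equals a * b.
    assert sum(record["partials"]) == record["product"] == record["a"] * record["b"]
    print(record)
```

Running this on the example entry gives 71 * 31 = 2201 and 26 * 36 = 936, which is a quick sanity check that a dataset line is internally consistent.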

The dataset is dynamically generated and saved: here
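For illustration, here is one way a line in this format could be produced from pairs of operands. This is a sketch, not the actual generation script: the function names are hypothetical, and the zero-padding widths (one extra digit per partial product, and a product as wide as both operands combined) are assumptions inferred from the example entry above.

```python
def encode_number(value: int, width: int) -> str:
    """Zero-pad to `width` digits, then write them least-significant first."""
    digits = str(value).zfill(width)
    return " ".join(reversed(digits))


def format_problem(a: int, b: int) -> tuple:
    """Return the (input, CoT, output) strings for a single multiplication."""
    a_width, b_width = len(str(a)), len(str(b))
    inp = f"{encode_number(a, a_width)} * {encode_number(b, b_width)}"
    partials = []
    # One partial product per digit of b, starting from the lowest digit.
    for i, digit in enumerate(reversed(str(b))):
        partial = a * int(digit) * 10 ** i
        # Assumed width: operand width + 1 carry digit + i shifted zeros.
        partials.append(encode_number(partial, a_width + 1 + i))
    cot = " + ".join(partials)
    out = encode_number(a * b, a_width + b_width)
    return inp, cot, out


def format_entry(pairs) -> str:
    """Join several multiplications into one dataset line."""
    parts = [format_problem(a, b) for a, b in pairs]
    inputs = " $ ".join(p[0] for p in parts)
    cots = " $ ".join(p[1] for p in parts)
    outputs = " $ ".join(p[2] for p in parts)
    return f"{inputs}||{cots} #### {outputs}"


print(format_entry([(71, 31), (26, 36)]))
```

With the pairs (71, 31) and (26, 36), this reproduces the example entry shown above.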

Referenced paper: here

Results with shared states

Used GPT-2 small (12 layers) with a 777k training dataset and a 77k test dataset:

| Model | Loss / Perplexity | Test Accuracy | Training Accuracy |
| --- | --- | --- | --- |
| Teacher | Perplexity: 1.000465 | 0.997169 | 0.999882 |
| ThoughtEmulator | Loss: 4.369609 | 0.977900 (quasi) | 0.977773 (quasi) |
| MindReadingEmulator | Perplexity: 1.000601 | 0.996688 | 0.999745 |
| ImplicitStudent | Perplexity: 1.000000 | 1.000000 | 1.000000 |
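The results mix two kinds of numbers in the loss column: a raw loss for the ThoughtEmulator and a perplexity for the other models. As a rough guide to reading the perplexity values, the sketch below assumes the usual definition, the exponential of the mean per-token cross-entropy loss. The notebook may compute it differently, and the `perplexity` helper is hypothetical.

```python
import math


def perplexity(token_losses) -> float:
    """Perplexity as exp(mean cross-entropy); assumed, not taken from the notebook."""
    return math.exp(sum(token_losses) / len(token_losses))


# A model that assigns probability 1 to every correct token has zero loss,
# so its perplexity is exactly 1, the lowest possible value.
print(perplexity([0.0, 0.0, 0.0]))

# Going the other way, a perplexity p corresponds to a mean loss of ln(p).
print(math.log(1.000465))
```

Under this definition, a perplexity very close to 1 means the mean per-token loss is very close to 0.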

Results with simultaneous implicit inference

Used GPT-2 small (12 layers) with a 777k training dataset and a 77k test dataset:

| Model | Loss / Perplexity | Test Accuracy | Training Accuracy |
| --- | --- | --- | --- |
| Teacher | Perplexity: 1.000000 | 0.998857 | 0.999954 |
| ThoughtEmulator | Loss: 613.826971 | 0.759698 (quasi) | 0.653752 (quasi) |
| MindReadingEmulator | Perplexity: 1.012221 | 0.931091 | 0.994157 |
| ImplicitStudent | Perplexity: 1.003074 | 0.988571 | 0.999043 |

Notes

The implicit student model performed exceptionally well after being retrained on the training data, and its accuracy statistics are accurate to 6 decimal places.

The teacher model can perform much better with a higher eta (learning rate). I multiplied eta by 8/5 and saw 0.999831 test accuracy and 0.999995 training accuracy, both higher than the teacher results reported above. This makes sense because the implicit student performed better than the teacher when trained on the exact same data, so the teacher had room to improve.