multi_implicit_cot: A Python repository from EnronMusk

Research credit is not my own and belongs here

Model demo notebook is stored with instructions and visuals: here

Simply run the notebook to explore results. The notebook does request user input for a custom prediction, but otherwise its fully automated.

Data Format

The format of training and test datasets follow this format:

[input 1a] $ [input 1b]||[CoT 1a] $ [CoT 1b] #### [output 1a] $ [output 1b]
[input 2a] $ [input 2b]||[CoT 2a] $ [CoT 2b] #### [output 2a] $ [output 2b]
[input 3a] $ [input 3b]||[CoT 3a] $ [CoT 3b] #### [output 3a] $ [output 3b]

Example entry:

1 7 * 1 3 $ 6 2 * 6 3||1 7 0 + 0 3 1 2 $ 6 5 1 + 0 8 7 0 #### 1 0 2 2 $ 6 3 9 0

Each multiplication is delimited by $. The 1 7 * 1 3 corresponds to 31 * 71 and 1 7 0 + 0 3 1 2 corresponds to 2130 + 71 and 1 0 2 2 corresponds to 2201

Dataset is dynamically generated and saved: here

Referenced paper: here

Results with shared states

Used gpt-2 small (12 layers) and 777k training dataset and 77k test dataset:

Model	Loss	Test	Train
Teacher	Perplexitity: 1.000465	Test Accuracy: 0.997169	Training Accuracy: 0.999882
ThoughtEmulator	Loss: 4.369609	Quasi Test Accuracy 0.977900	Quasi Training Accuracy: 0.977773
MindReadingEmulator	Perplexitity: 1.000601	Test Accuracy: 0.996688	Training Accuracy: 0.999745
ImplicitStudent	Perplexitity: 1.000000	Test Accuracy: 1.000000	Training Accuracy: 1.000000

Results with simultaneous implicit inference

Used gpt-2 small (12 layers) and 777k training dataset and 77k test dataset:

Model	Loss	Test	Train
Teacher	Perplexitity: 1.000000	Test Accuracy: 0.998857	Training Accuracy: 0.999954
ThoughtEmulator	Loss: 613.826971	Quasi Test Accuracy 0.759698	Quasi Training Accuracy: 0.653752
MindReadingEmulator	Perplexitity: 1.012221	Test Accuracy: 0.931091	Training Accuracy: 0.994157
ImplicitStudent	Perplexitity: 1.003074	Test Accuracy: 0.988571	Training Accuracy: 0.999043

Notes

The implicit student model performed exceptionally well after being retrained on the train data and its accuracy statistics are accurate to 6 decimal places.

The teacher model can have much better performance with higher eta (learning rate). I multipled eta by 8/5 and saw 0.999831 test accuracy and 0.999995 training, which are both significantly higher. This makes sense because our implicit student performed better than the teacher, when it was trained on the exact same data.

EnronMusk/multi_implicit_cot

Research credit is not my own and belongs here

Data Format

Results with shared states

Results with simultaneous implicit inference

Notes