Setup
- Make sure you have Miniconda installed
  - Conda is a package manager that sandboxes your project’s dependencies in a virtual environment
  - Miniconda contains Conda and its dependencies with no extra packages by default (as opposed to Anaconda, which installs some extra packages)
- cd into `src`, run `conda env create -f environment.yml`
  - This creates a Conda environment called `squad`
- Run `conda activate squad`
  - This activates the `squad` environment
  - Do this each time you want to write/test your code
- Run `python setup.py`
  - This downloads the SQuAD 2.0 training and dev sets, as well as the GloVe 300-dimensional word vectors (840B)
  - This also pre-processes the dataset for efficient data loading
  - For a MacBook Pro on the Stanford network, `setup.py` takes around 30 minutes total
- Browse the code in `train.py`
  - The `train.py` script is the entry point for training a model. It reads command-line arguments, loads the SQuAD dataset, and trains a model.
  - You may find it helpful to browse the arguments provided by the starter code. Either look directly at the `parser.add_argument` lines in the source code, or run `python train.py -h` (see the sketch after this list).
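For orientation, here is a minimal, hypothetical sketch of how flags like the ones used in the experiments below could be declared with `argparse`. It is not the starter code itself; the real definitions and defaults live in `train.py`, and the defaults shown here are placeholders.

```python
# sketch_args.py -- illustrative only, not the actual starter code.
# Shows how flags such as --hidden_size and --drop_prob (referenced in the
# experiments table below) are typically declared with parser.add_argument.
import argparse


def get_train_args():
    """Parse the subset of training flags referenced in this README."""
    parser = argparse.ArgumentParser('Train a model on SQuAD 2.0')
    parser.add_argument('--hidden_size', type=int, default=100,
                        help='Hidden size of the encoder layers.')
    parser.add_argument('--drop_prob', type=float, default=0.2,
                        help='Dropout probability applied outside the encoder.')
    parser.add_argument('--l2_wd', type=float, default=0.0,
                        help='L2 weight decay passed to the optimizer.')
    parser.add_argument('--batch_size', type=int, default=64,
                        help='Mini-batch size (reduce it for larger models).')
    parser.add_argument('--load_path', type=str, default=None,
                        help='Checkpoint to resume from, e.g. save/train/.../best.pth.tar')
    return parser.parse_args()


if __name__ == '__main__':
    print(get_train_args())
```

Running `python sketch_args.py -h` prints the same kind of help text that `python train.py -h` produces for the full argument set.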
Experiments
Id | Code | Description | Notes | load_path | train-NLL | train-F1 | train-EM | train-AvNA | dev-NLL | dev-F1 | dev-EM | dev-AvNA |
---|---|---|---|---|---|---|---|---|---|---|---|---|
baseline | models.py | Baseline | No hyperparameter tuning, hence the dev set is used like a test set. | save/train/baseline-05/best.pth.tar | 1.73 | 76.40 | 68.82 | 85.71 | 3.35 | 57.31 | 53.87 | 64.44 |
baseline_plus_1 | models.py | Baseline with a larger model size and more layers. --hidden_size 150, --drop_prob 0.3 | The model appears to have higher bias than the baseline despite its larger capacity, due to the higher dropout probability. | save/train/baseline_plus_1-01/best.pth.tar | 2.27 | 68.44 | 60.75 | 80.29 | 3.22 | 56.62 | 53.25 | 63.75 |
baseline_char_embed | models.py | Baseline with character embedding. | The model comes with higher capacity and overfits after 3 epochs; some regularization is necessary. | save/train/baseline_char_embed-04/best.pth.tar | 5.27 | 52.19 | 52.19 | 52.14 | | | | |
baseline_char_embed | models.py | Baseline with character embedding and higher dropout (0.3 vs. 0.2). --drop_prob 0.3 | | save/train/baseline_char_embed-05/best.pth.tar | 1.69 | 76.95 | 69.09 | 86.17 | 3.37 | 58.02 | 54.34 | 65.33 |
baseline_char_embed | models.py | Baseline with character embedding and higher dropout (0.4 vs. 0.2). --drop_prob 0.4 | | save/train/baseline_char_embed-06/best.pth.tar | 2.31 | 67.67 | 59.99 | 79.78 | 3.07 | 57.13 | 53.79 | 64.68 |
baseline_char_embed_bert_enc | models.py | Baseline with character embedding and higher dropout (0.3). --drop_prob 0.3 and a BERT-encoder --hidden_size 256 instead of the RNN encoder. | See the encoder sketch below the table. | save/train/baseline_char_embed_bert_enc-19/best.pth.tar | 1.59 | 78.12 | 70.96 | 86.72 | 3.36 | 58.89 | 55.82 | 65.60 |
baseline_char_embed_bert_enc | models.py | Baseline with character embedding, dropout (0.2) leaving BERT-encoder dropout at 0.1: --drop_prob 0.2 and a BERT-encoder --hidden_size 256 instead of the RNN encoder. | Train/dev F1 gap 89.95 - 60.30 ≈ 30 -> low model bias, high variance; there is material to work on. | save/train/baseline_char_embed_bert_enc-18/best.pth.tar | 0.84 | 89.95 | 83.69 | 94.87 | 3.65 | 60.30 | 57.02 | 67.72 |
baseline_char_embed_bert_enc | models.py | Baseline with character embedding, dropout (0.2) leaving BERT-encoder dropout at 0.1: --drop_prob 0.2 and a BERT-encoder --hidden_size 256 instead of the RNN encoder. The BERT-encoder has 3 layers (instead of 6) and d_ff, the feed-forward inner dimension, is roughly the embedding size (instead of 2048). | Train/dev F1 gap 73.98 - 60.41 ≈ 13. | save/train/baseline_char_embed_bert_enc-17/best.pth.tar | 1.92 | 73.98 | 66.93 | 83.74 | 2.89 | 60.41 | 57.39 | 66.80 |
baseline_char_embed_bert_enc | models.py | Baseline with character embedding, dropout (0.3) leaving BERT-encoder dropout at 0.1: --drop_prob 0.3 and a BERT-encoder --hidden_size 256 instead of the RNN encoder. | | save/train/baseline_char_embed_bert_enc-20/best.pth.tar | 1.44 | 80.66 | 73.53 | 88.42 | 3.19 | 60.03 | 56.86 | 66.85 |
baseline_char_embed_bert_enc | models.py | Baseline with character embedding, dropout (0.2) leaving BERT-encoder dropout at 0.1: --drop_prob 0.2 and a BERT-encoder --hidden_size 256 instead of the RNN encoder. L2 regularization --l2_wd 0.1. | L2 regularization and dropout do not seem to work well together. | save/train/baseline_char_embed_bert_enc-22/best.pth.tar | 6.80 | 33.37 | 33.37 | 33.36 | 5.70 | 52.19 | 52.19 | 52.14 |
baseline_char_embed_bert_enc | models.py | Baseline with character embedding, dropout (0.2) leaving BERT-encoder dropout at 0.1: --drop_prob 0.2 and a BERT-encoder --hidden_size 256 instead of the RNN encoder. The BERT-encoder has 3 layers (instead of 6) and d_ff, the feed-forward inner dimension, is roughly the embedding size (instead of 2048). Plus twist embeddings. | | save/train/baseline_char_embed_bert_enc-27/best.pth.tar | 1.63 | 78.05 | 70.89 | 86.54 | 3.12 | 59.88 | 56.80 | 66.64 |
baseline_char_embed_bert_enc_bert_mod | models.py | Baseline with character embedding, dropout (0.2) leaving BERT-encoder dropout at 0.1: --drop_prob 0.2 and a BERT-encoder --hidden_size 256 with 4 layers instead of the RNN encoder. The modeling layer is a BERT-encoder as well (6 layers). Requires --batch_size 16. | Perhaps the capacity of the network is too high? | save/train/baseline_char_embed_bert_enc_bert_mod-05/best.pth.tar | 6.59 | 33.37 | 33.37 | 33.36 | 5.42 | 52.19 | 52.19 | 52.14 |
baseline_char_embed_bert_enc_bert_mod | models.py | Baseline with character embedding, dropout (0.2) leaving BERT-encoder dropout at 0.1: --drop_prob 0.2 and a BERT-encoder --hidden_size 256 instead of the RNN encoder. The BERT-encoder has 3 layers (instead of 6) and d_ff, the feed-forward inner dimension, is roughly the embedding size (instead of 2048). The modeling layer comes with another BERT-encoder (3 layers). --batch_size 32. | | save/train/baseline_char_embed_bert_enc_bert_mod-07/best.pth.tar | 2.88 | 67.81 | 59.19 | 81.80 | 4.45 | 53.49 | 49.22 | 61.92 |
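Several rows above swap the baseline's RNN encoder for a BERT-style (Transformer) encoder with --hidden_size 256, 3 layers, internal dropout 0.1, and d_ff roughly equal to the embedding size. The snippet below is a minimal sketch of that idea using PyTorch's built-in `nn.TransformerEncoder`; it is not the code in `models.py`, and details such as the class name, `num_heads=4`, and the omitted positional encoding are assumptions for illustration (`batch_first=True` needs PyTorch >= 1.9).

```python
# Sketch only: a BERT-style encoder stack standing in for the RNN encoder.
import torch
import torch.nn as nn


class TransformerEncoder(nn.Module):
    """Stack of Transformer encoder layers used in place of the RNN encoder.

    Defaults mirror the settings described in the table: hidden size 256,
    3 layers, d_ff close to the embedding size, internal dropout 0.1.
    """

    def __init__(self, hidden_size=256, num_layers=3, d_ff=300,
                 num_heads=4, drop_prob=0.1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden_size,
                                           nhead=num_heads,
                                           dim_feedforward=d_ff,
                                           dropout=drop_prob,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x, mask):
        # x: (batch, seq_len, hidden_size); mask: True at real tokens.
        # nn.TransformerEncoder expects True at *padding* positions instead.
        return self.encoder(x, src_key_padding_mask=~mask)


# Usage sketch: encode a dummy batch of two length-10 sequences.
x = torch.randn(2, 10, 256)
mask = torch.ones(2, 10, dtype=torch.bool)
print(TransformerEncoder()(x, mask).shape)  # torch.Size([2, 10, 256])
```

The padding mask is passed so self-attention ignores padded positions.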