by Vijay Viswanathan (based on the previous minbert-assignment)
In this assignment, you will implement some important components of the Llama2 model to better understand its architecture. You will then perform sentence classification on the sst dataset and the cfimdb dataset with this model.
The code to implement can be found in llama.py, classifier.py and optimizer.py. You are responsible for writing core components of Llama2 (one of the leading open source language models). In doing so, you will gain a strong understanding of neural language modeling. We will load pretrained weights for your language model from stories42M.pt, an 8-layer, 42M parameter language model pretrained on the TinyStories dataset (a dataset of machine-generated children's stories). This model is small enough that it can be trained (slowly) without a GPU. You are encouraged to use Colab or a personal GPU machine (e.g. a MacBook) to be able to iterate more quickly.
Once you have implemented these components, you will test out your model in 3 settings:
- Generate a text completion (starting with the sentence "I have wanted to see this thriller for a while, and it didn't disappoint. Keanu Reeves, playing the hero John Wick, is"). You should see coherent, grammatical English being generated (though the content and topicality of the completion may be absurd, since this LM was pretrained exclusively on children's stories).
- Perform zero-shot, prompt-based sentiment analysis on two datasets (SST-5 and CFIMDB). This will give bad results (roughly equal to choosing a random target class).
- Perform task-specific finetuning of your Llama2 model, after implementing a classification head in classifier.py. This will give much stronger classification results.
- If you've done #1-3 well, you will get an A! However, since you've come this far, try implementing something new on top of your hand-written language modeling system! If your method provides strong empirical improvements or demonstrates exceptional creativity, you'll get an A+ on this assignment.
- Follow setup.sh to properly set up the environment and install dependencies.
- There is a detailed description of the code structure in structure.md, including a description of which parts you will need to implement.
- You are only allowed to use libraries that are installed by setup.sh; no other external libraries are allowed (e.g., transformers).
- We will run your code with the commands below (under "Reference outputs/accuracies"), so make sure that your best results are reproducible using these commands.
- Do not change any of the existing command options (including defaults) or add any new required parameters.
Text Continuation (python run_llama.py --option generate)
You should see continuations of the sentence "I have wanted to see this thriller for a while, and it didn't disappoint. Keanu Reeves, playing the hero John Wick, is...". We will generate two continuations: one with temperature 0.0 (which should have a reasonably coherent, if unusual, completion) and one with temperature 1.0 (which is likely to be logically inconsistent and may contain some coherence or grammar errors).
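As a rough illustration of what the two temperature settings mean, the sketch below shows one common way to pick the next token from a vector of logits: greedy argmax when the temperature is 0, and sampling from the temperature-scaled softmax otherwise. The function name `sample_next_token` and the toy logits are assumptions made for illustration; they are not taken from the assignment code.

```python
import math
import random


def sample_next_token(logits, temperature, rng=random):
    """Pick a token index from raw logits (hypothetical helper, not from llama.py)."""
    if temperature == 0.0:
        # Temperature 0: deterministic, always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Scale the logits, then apply a numerically stable softmax (unnormalized).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


toy_logits = [2.0, 0.5, -1.0]  # made-up scores for a 3-token vocabulary
greedy = sample_next_token(toy_logits, temperature=0.0)
sampled = sample_next_token(toy_logits, temperature=1.0, rng=random.Random(0))
```

With temperature 0.0 the same token is chosen on every run, which is why that continuation is reproducible; with temperature 1.0 lower-scoring tokens are sometimes drawn, which tends to make the text less consistent.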
Zero Shot Prompting
Zero-Shot Prompting for SST:
python run_llama.py --option prompt --batch_size 10 --train data/sst-train.txt --dev data/sst-dev.txt --test data/sst-test.txt --label-names data/sst-label-mapping.json --dev_out sst-dev-prompting-output.txt --test_out sst-test-prompting-output.txt [--use_gpu]
Prompting for SST: Dev Accuracy: 0.213 (0.000) Test Accuracy: 0.224 (0.000)
Zero-Shot Prompting for CFIMDB:
python run_llama.py --option prompt --batch_size 10 --train data/cfimdb-train.txt --dev data/cfimdb-dev.txt --test data/cfimdb-test.txt --label-names data/cfimdb-label-mapping.json --dev_out cfimdb-dev-prompting-output.txt --test_out cfimdb-test-prompting-output.txt [--use_gpu]
Prompting for CFIMDB: Dev Accuracy: 0.498 (0.000) Test Accuracy: -
Classification Finetuning
python run_llama.py --option finetune --epochs 5 --lr 2e-5 --batch_size 80 --train data/sst-train.txt --dev data/sst-dev.txt --test data/sst-test.txt --label-names data/sst-label-mapping.json --dev_out sst-dev-finetuning-output.txt --test_out sst-test-finetuning-output.txt [--use_gpu]
Finetuning for SST: Dev Accuracy: 0.414 (0.014) Test Accuracy: 0.418 (0.017)
python run_llama.py --option finetune --epochs 5 --lr 2e-5 --batch_size 10 --train data/cfimdb-train.txt --dev data/cfimdb-dev.txt --test data/cfimdb-test.txt --label-names data/cfimdb-label-mapping.json --dev_out cfimdb-dev-finetuning-output.txt --test_out cfimdb-test-finetuning-output.txt [--use_gpu]
Finetuning for CFIMDB: Dev Accuracy: 0.800 (0.115) Test Accuracy: -
The numbers above are mean reference accuracies over 10 random seeds, with their standard deviations shown in parentheses.
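To compare your own runs against these numbers, a mean and standard deviation over seeds can be computed as in the sketch below. The accuracies listed are made-up placeholders, not results from this assignment, and the source does not say whether the reference numbers use the sample or the population standard deviation, so the sample version here is an assumption.

```python
import statistics

# Placeholder dev accuracies from several hypothetical seeds (not real results).
accuracies = [0.40, 0.42, 0.41, 0.43, 0.39]

mean_acc = statistics.mean(accuracies)
std_acc = statistics.stdev(accuracies)  # sample standard deviation (assumption)

# Same "mean (std)" layout as the reference lines above.
summary = f"Dev Accuracy: {mean_acc:.3f} ({std_acc:.3f})"
print(summary)
```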
Code: You will submit a full code package, with output files, on Cogniterra.
Report (optional): Your zip file can include a pdf file, named ANDREWID-report.pdf, if (1) you've implemented something else on top of the requirements and further improved accuracy for possible extra points (see "Grading" below), and/or (2) your best results are with some hyperparameters other than the default, and you want to specify how we should run your code. If you're doing (1), we expect your report to be 1-2 pages, and no more than 3 pages. If you're doing (2), the report can be very brief.
For submission via Cogniterra, the submission file should be a zip file with the following structure (assuming the lowercase Andrew ID is ANDREWID):
ANDREWID/
├── run_llama.py
├── base_llama.py
├── llama.py
├── rope.py
├── classifier.py
├── config.py
├── optimizer.py
├── sanity_check.py
├── tokenizer.py
├── utils.py
├── README.md
├── structure.md
├── sanity_check.data
├── generated-sentence-temp-0.txt
├── generated-sentence-temp-1.txt
├── sst-dev-prompting-output.txt
├── sst-test-prompting-output.txt
├── sst-dev-finetuning-output.txt
├── sst-test-finetuning-output.txt
├── cfimdb-dev-prompting-output.txt
├── cfimdb-test-prompting-output.txt
├── cfimdb-dev-finetuning-output.txt
├── cfimdb-test-finetuning-output.txt
└── setup.sh
prepare_submit.py can help to create (1) or check (2) the to-be-submitted zip file. It will throw assertion errors if the format is not as expected, and submissions that fail this check will be graded down.
Usage:
- To create and check a zip file with your outputs, run
python3 prepare_submit.py path/to/your/output/dir ANDREWID
- To check your zip file, run
python3 prepare_submit.py path/to/your/submit/zip/file.zip ANDREWID
Please double check this before you submit to Canvas; most recently we had about 10/100 students lose a 1/3 letter grade because of an improper submission format.
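If you want a quick look at your zip before running the official script, the sketch below lists which expected files are missing from the ANDREWID/ folder inside an archive. It is only an illustration: the helper name `missing_from_zip` and the shortened file list are assumptions, and prepare_submit.py may perform different or stricter checks, so it remains the authoritative test.

```python
import io
import zipfile

# A few of the files from the listing above; extend to the full list as needed.
REQUIRED_FILES = ["run_llama.py", "llama.py", "classifier.py", "optimizer.py"]


def missing_from_zip(zip_file, andrew_id, required=REQUIRED_FILES):
    """Return the required files that are absent from ANDREWID/ in the zip.

    Hypothetical helper for illustration; prepare_submit.py may check differently.
    `zip_file` can be a path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_file) as zf:
        names = set(zf.namelist())
    return [name for name in required if f"{andrew_id}/{name}" not in names]


# Demo on an in-memory zip that deliberately lacks optimizer.py.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    for name in ["run_llama.py", "llama.py", "classifier.py"]:
        zf.writestr(f"ANDREWID/{name}", "")
buffer.seek(0)
missing = missing_from_zip(buffer, "ANDREWID")
```

On a real submission you would pass the path to your zip file instead of the in-memory buffer; an empty list means every file in `REQUIRED_FILES` was found.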
- A+: You additionally implement something else on top of the requirements for A, and achieve significant accuracy improvements or demonstrate exceptional creativity. This improvement can be in either the zero-shot setting (no task-specific finetuning required) or in the finetuning setting (improving over our current finetuning implementation). Please write down the things you implemented and experiments you performed in the report. You are also welcome to provide additional materials such as commands to run your code in a script and training logs.
- perform continued pre-training using the language modeling objective to do domain adaptation
- enable zero-shot prompting using a more principled inference algorithm than our current implementation. For example, we did not include an attention mask despite left-padding all inputs (to enable batch prediction); this could be improved.
- perform prompt-based finetuning
- add regularization to our finetuning process
- try parameter-efficient finetuning (see Section 2.2 here for an overview)
- try alternative fine-tuning algorithms e.g. SMART or WiSE-FT
- add other model components on top of the model
- A: You implement all the missing pieces and the original classifier.py with --option prompt and --option finetune code such that coherent text (i.e. mostly grammatically well-formed) can be generated and the model achieves comparable accuracy (within 0.05 accuracy for SST or 0.15 accuracy for CFIMDB) to our reference implementation.
- A-: You implement all the missing pieces and the original classifier.py with --option prompt and --option finetune code but coherent text is not generated (i.e. generated text is not well-formed English) or accuracy is not comparable to the reference (accuracy is more than 0.05 or 0.15 away from our reference scores, for SST and CFIMDB, respectively).
- B+: All missing pieces are implemented and pass tests in sanity_check.py (llama implementation) and optimizer_test.py (optimizer implementation).
- B or below: Some parts of the missing pieces are not implemented.
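One of the A+ ideas above notes that inputs are left-padded for batch prediction without an attention mask. As a minimal sketch of what such a mask looks like, the code below left-pads a batch of token-id lists and builds a matching 0/1 mask. The function name, the `pad_id` value, and the toy token ids are assumptions for illustration; how the mask is then consumed inside the attention computation is left to your implementation.

```python
def left_pad_with_mask(batch, pad_id=0):
    """Left-pad token-id lists to equal length and return (padded, mask).

    mask[i][j] is 1 for real tokens and 0 for padding positions.
    Hypothetical helper; not part of the assignment code.
    """
    max_len = max(len(ids) for ids in batch)
    padded, mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        # Padding goes on the left so the last position is always a real token.
        padded.append([pad_id] * n_pad + list(ids))
        mask.append([0] * n_pad + [1] * len(ids))
    return padded, mask


padded, mask = left_pad_with_mask([[5, 6, 7], [8]], pad_id=0)
```

Left-padding keeps the final position of every row aligned on a real token, which is convenient when reading next-token predictions for a whole batch; the mask records which earlier positions should be ignored.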
If your results can be confirmed through the submitted files, but there are problems with your code submitted through Cogniterra, such as not being properly formatted, not executing in the appropriate amount of time, etc., you will be graded down 1/3 grade (e.g. A+ -> A or A- -> B+).
All assignments must be done individually and we will be running plagiarism detection on your code. If we confirm that any code was plagiarized from that of other students in the class, you will be subject to strict measures according to CMU's academic integrity policy. That being said, you are free to use publicly available resources (e.g. papers or open-source code), but you must provide proper attribution.
This is an exercise forked and slightly changed from an exercise from Carnegie Mellon University's CS11-711 Advanced NLP course.
This code is based on llama2.c by Andrej Karpathy. Parts of the code are also from the transformers library (Apache License 2.0).