flan-t5-experiments

Evaluating Flan-T5 models on the SAT and MMLU benchmarks.

To run:

python evaluate.py [model_size] -b [benchmark] -n [number of questions]

model_size options: 'small', 'base', 'large', 'xl', 'xxl'

benchmark options: 'sat', 'mmlu'
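
For example, substituting options from the lists above, running Flan-T5 large on 100 MMLU questions looks like:

python evaluate.py large -b mmlu -n 100

Flan-T5 checkpoints for all five sizes are published on the Hugging Face Hub under names of the form google/flan-t5-<size>. As a rough sketch of how a benchmark question can be posed to the model (illustrative only, not the exact code in evaluate.py):

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # Checkpoint names follow the google/flan-t5-<size> pattern on the Hugging Face Hub
    model_name = "google/flan-t5-small"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Pose a multiple-choice question as a plain text prompt (hypothetical example question)
    prompt = (
        "Answer the following multiple-choice question with the letter of the correct option.\n"
        "Question: What is the capital of France?\n"
        "A) Berlin  B) Paris  C) Madrid  D) Rome\n"
        "Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=5)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))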