songmzhang/DSKD
Repo for the EMNLP'24 paper "Dual-Space Knowledge Distillation for Large Language Models".
Python
Issues
- Is vanilla KD for same vocab equivalent to Minimum Edit Distance for different vocab? (#22, opened by survivebycoding, 1 comment)
- The code works only with dev and train set, and not with test set. Right? (#21, opened by survivebycoding, 6 comments)
- load 72B teacher model (#13, opened by ypw-lbj, 2 comments)
- Files for token mapping (#20, opened by ntsw2001, 5 comments)
- Quantify difference in vocabulary (#19, opened by srikhetramohanty, 4 comments)
- Failed to reproduce KD results (#18, opened by cpsu00, 9 comments)
- Reproduction of results (#15, opened by mathamateur, 2 comments)
- GPT2-1.5B Pretrained Teacher on Dolly (#17, opened by cpsu00, 2 comments)
- Evaluation script error with TinyLlama (#12, opened by srikhetramohanty, 6 comments)
- using mistral from (#14, opened by survivebycoding, 15 comments)
- Concern regarding performance (#10, opened by survivebycoding, 2 comments)
- Can we use this code for CPU? (#6, opened by survivebycoding, 2 comments)
- Usage with other model combinations (#3, opened by botox-100, 3 comments)
- About SeqKD with different vocabularies (#2, opened by 2018cx, 1 comment)
- About the computation of AKL (#1, opened by wutaiqiang)