[Chatllama] Evaluation Function and Loop with metrics
PierpaoloSorbellini opened this issue · 0 comments
PierpaoloSorbellini commented
Description
Currently each training loop has an evaluation loop, but it has not been debugged or used so far.
It needs to be generalised so that it can also be launched outside the training activities, and to support specific language-modelling metrics.
It would also be nice if a report could be generated highlighting the performance achieved, including a comparison with other models.
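As a minimal sketch, a standalone evaluation function could iterate over a held-out set and report loss and perplexity. The snippet below assumes a Hugging Face-style causal LM that returns a `.loss` when `labels` are passed; the actual ChatLLaMA model and dataloader interfaces may differ.

```python
import math
import torch


@torch.no_grad()
def evaluate(model, dataloader, device="cuda"):
    """Run the model over an evaluation set and return language-modelling metrics."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for batch in dataloader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        # Causal-LM loss: labels are the inputs, shifted internally by the model.
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)
        n_tokens = attention_mask.sum().item()
        total_loss += outputs.loss.item() * n_tokens
        total_tokens += n_tokens
    avg_loss = total_loss / max(total_tokens, 1)
    return {"loss": avg_loss, "perplexity": math.exp(avg_loss)}
```

Because it only needs a model and a dataloader, the same function could be called from inside the training loop or from a separate evaluation script.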
TODO
- Investigate whether libraries such as openai/evals or FastChat can be adapted for use as an evaluation tool.
- Debug the evaluation of the model.
- Collect and compute relevant metrics.
- Allow the evaluation loop to be launched outside of training.
- Produce a meaningful report that can compare the performance of one or more models (a possible starting point is sketched below).
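For the report, one simple starting point could be to collect the metric dictionaries produced by an evaluation function like the one above and render them as a Markdown table, so several models can be compared side by side. The model names and metrics below are purely illustrative.

```python
def metrics_report(results: dict[str, dict[str, float]]) -> str:
    """Render a Markdown table comparing metrics across models.

    `results` maps a model name to its metrics, e.g.
    {"llama-7b": {"loss": 2.1, "perplexity": 8.2}, "llama-13b": {...}}.
    """
    metric_names = sorted({m for metrics in results.values() for m in metrics})
    header = "| model | " + " | ".join(metric_names) + " |"
    separator = "|" + "---|" * (len(metric_names) + 1)
    rows = [
        "| " + name + " | "
        + " | ".join(f"{metrics.get(m, float('nan')):.4f}" for m in metric_names)
        + " |"
        for name, metrics in results.items()
    ]
    return "\n".join([header, separator, *rows])
```

The same dictionaries could also be fed to whichever external tool (openai/evals, FastChat, etc.) ends up being adopted, if any.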