bigscience-workshop/evaluation

Setup testing


#56 set up a basic unit test, but we have to consider what kinds of tests we want to run. This is especially important given that GitHub Actions workflows do not have GPU support and will thus take a non-trivial amount of time to complete even a simple benchmark run. The proposal is to ideate ways to make tests modular and reasonably fast.
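For instance, a rough sketch (assuming the suite uses pytest and torch, neither of which is a given here): GPU-dependent tests could be marked so CPU-only CI runners skip them rather than time out, while the fast unit tests still run on every push.

```python
import pytest
import torch

# Mark GPU-dependent tests so CPU-only CI runners skip them instead of
# timing out; fast CPU unit tests still run on every push.
requires_gpu = pytest.mark.skipif(
    not torch.cuda.is_available(),
    reason="no GPU on GitHub-hosted runners",
)

@requires_gpu
def test_full_benchmark_run():
    ...  # slow, GPU-backed benchmark; run locally or on a GPU runner

def test_metric_computation():
    ...  # small, fast check that always runs in CI
```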

Also CC'ing @tianjianjiang for potential suggestions and input.

@jaketae:

> Also CC'ing @tianjianjiang for potential suggestions and input.
>
> … take a non-trivial amount of time to complete even a simple benchmark run. The proposal is to ideate ways to make tests modular and reasonably fast.

Thanks for looping me in.

My two cents:

Personally, I tend to test only the public interfaces at the highest level of a project's user stories. In this case, perhaps a separation of concerns: validate the data of each task on its own, and test the evaluation path with small or even mocked data, as long as it is representative. A sketch of what that could look like is below.
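Something like this, self-contained pytest style (every name here is a hypothetical stand-in, not the repo's actual API):

```python
MOCK_SAMPLES = [
    {"text": "The cat sat on the mat.", "label": 1},
    {"text": "Dogs bark loudly.", "label": 0},
]

def dummy_evaluate(samples):
    # Stand-in for the repo's real public eval entry point; here it just
    # "predicts" label 1 for every sample and reports accuracy.
    correct = sum(s["label"] == 1 for s in samples)
    return {"accuracy": correct / len(samples)}

def test_task_data_schema():
    # Data validation: every sample carries exactly the fields the task expects.
    for sample in MOCK_SAMPLES:
        assert set(sample) == {"text", "label"}

def test_eval_on_mock_data():
    # End-to-end eval on tiny, representative mock data; runs in milliseconds
    # on a CPU-only CI runner.
    result = dummy_evaluate(MOCK_SAMPLES)
    assert 0.0 <= result["accuracy"] <= 1.0
```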

A benchmark run here can indeed be quite slow. Based on my recent experience with the modeling-metadata repo, a test that ran roughly five training iterations on CPU via huggingface/accelerate took almost a minute to complete.

That said, I noticed that we could try load_dataset(..., streaming=True) for some datasets (as long as they are NOT .gz archives), and sometimes we can even commit cached files via git-lfs as test artifacts.
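For example, a rough sketch of the streaming approach (the dataset name is just a placeholder for whichever task dataset is under test):

```python
from itertools import islice
from datasets import load_dataset

# Stream the dataset instead of downloading it in full; "wikitext" is only
# an example, not one of the repo's actual task datasets.
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train", streaming=True)

# Pull just a handful of records, enough to exercise the pipeline quickly.
samples = list(islice(ds, 8))
assert all("text" in s for s in samples)
```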