google-research/FLAN

Recommended caching method

gahdritz opened this issue · 1 comment

I'm trying to fine-tune T5 (via T5X) on the FLAN Collection, exactly as in Longpre et al. (2023) and Chung et al. (2022). I'm starting with the small checkpoint.

Could you recommend a data caching scheme? run_example.py suggests storing examples on disk and then mixing them manually, while the SeqIO docs recommend seqio.CacheDatasetPlaceholder, and I see that this placeholder appears in the FLAN Collection source code. It's not clear to me how many examples from each mixture to store in either case, especially once packing is taken into account. Any tips?
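For concreteness, here is a minimal sketch of the SeqIO-placeholder route (the task name, file paths, vocabulary path, and TSV parser are hypothetical, not the FLAN Collection's actual definitions). Everything before the CacheDatasetPlaceholder is materialized by the offline caching job; everything after it still runs on the fly. As far as I understand, packing is not a preprocessor at all but is applied later by the feature converter, so it shouldn't affect what gets cached:

```python
import seqio
import tensorflow as tf

# Hypothetical vocabulary; substitute the one used by the FLAN Collection.
VOCAB = seqio.SentencePieceVocabulary("/path/to/sentencepiece.model")


def _parse_tsv(ds):
  """Splits tab-separated lines into 'inputs'/'targets' string features."""
  def _parse(line):
    fields = tf.strings.split(line, "\t")
    return {"inputs": fields[0], "targets": fields[1]}
  return ds.map(_parse, num_parallel_calls=tf.data.AUTOTUNE)


seqio.TaskRegistry.add(
    "my_flan_subtask",  # hypothetical task name
    source=seqio.TextLineDataSource(
        split_to_filepattern={"train": "/path/to/train.tsv"}),
    preprocessors=[
        _parse_tsv,
        seqio.preprocessors.tokenize,
        # Everything above this placeholder is written to disk by the
        # offline caching job; everything below still runs at load time.
        seqio.CacheDatasetPlaceholder(),
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features={
        "inputs": seqio.Feature(vocabulary=VOCAB, add_eos=True),
        "targets": seqio.Feature(vocabulary=VOCAB, add_eos=True),
    },
)
```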

Hey @gahdritz, thank you for your question! Yes, we did use caching in our internal repo, since our infra is also based on SeqIO, T5X, etc. This link points to the caching script and explains how to run it: https://github.com/google/seqio#optional-offline-caching.
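In case it helps others, a minimal sketch of that workflow under the linked README (cache directory, task name, and the module that registers the tasks are placeholders): run the offline caching job once with Apache Beam, then point SeqIO at the cache directory and pass use_cached=True when building the dataset. Packing happens afterwards in the feature converter, so the cache holds unpacked tokenized examples:

```python
# Offline caching job (Apache Beam), per the README linked above:
#
#   python -m seqio.scripts.cache_tasks_main \
#     --module_import=flan.tasks \
#     --tasks=my_flan_subtask \
#     --output_cache_dir=/path/to/cache_dir \
#     --alsologtostderr
#
# (`flan.tasks` and the task name are hypothetical placeholders for whatever
#  module registers the FLAN tasks you want to cache.)
#
# Afterwards, training code reads from the cache like this:
import seqio

seqio.add_global_cache_dirs(["/path/to/cache_dir"])

task = seqio.get_mixture_or_task("my_flan_subtask")
lengths = {"inputs": 2048, "targets": 512}
ds = task.get_dataset(
    sequence_length=lengths,
    split="train",
    use_cached=True,  # read the offline cache instead of recomputing
    shuffle=True,
)

# Packing is applied here, after the cached examples are read.
converter = seqio.EncDecFeatureConverter(pack=True)
packed_ds = converter(ds, task_feature_lengths=lengths)
```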