/capataz

Primary Language: Python

CaPaTaZ – Dataset preparation for NLP tasks

Adapted from scripts in the mesh-transformer-jax and gpt-neo repositories.

Clean, prepare, tokenize, and z(s)plit your huge dataset into smaller files (currently only .pt is supported) ready for training.
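
To make the four stages concrete, here is a minimal sketch of a clean, tokenize, and split pipeline. It is not CaPaTaZ's actual code or interface: every function name, the toy whitespace vocabulary, the `shard_size` parameter, and the `shard_00000.pt` naming scheme are assumptions made for illustration. The only real library call is `torch.save`, which is imported lazily so the rest of the example runs without PyTorch installed.

```python
import re
from pathlib import Path
from typing import Iterable, Iterator


def clean(text: str) -> str:
    """Collapse whitespace runs and trim the ends (a stand-in for real cleaning)."""
    return re.sub(r"\s+", " ", text).strip()


def build_vocab(docs: Iterable[str]) -> dict[str, int]:
    """Toy whitespace vocabulary; a real run would use a trained tokenizer."""
    vocab: dict[str, int] = {}
    for doc in docs:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def tokenize(doc: str, vocab: dict[str, int]) -> list[int]:
    """Map each whitespace-separated word to its integer id."""
    return [vocab[word] for word in doc.split()]


def split_into_shards(token_ids: list[int], shard_size: int) -> Iterator[list[int]]:
    """Yield consecutive chunks of at most shard_size tokens."""
    for start in range(0, len(token_ids), shard_size):
        yield token_ids[start:start + shard_size]


def save_shards(shards: Iterable[list[int]], out_dir: str) -> list[Path]:
    """Write each shard as a .pt file (hypothetical naming scheme)."""
    import torch  # imported lazily: only needed when actually writing files

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, shard in enumerate(shards):
        path = out / f"shard_{i:05d}.pt"
        torch.save(torch.tensor(shard, dtype=torch.long), path)
        paths.append(path)
    return paths


raw_docs = ["Hello   world!\n", "  hello again,\tworld! "]
docs = [clean(d) for d in raw_docs]
vocab = build_vocab(docs)
token_ids = [tid for d in docs for tid in tokenize(d, vocab)]
shards = list(split_into_shards(token_ids, shard_size=2))
print(shards)  # [[0, 1], [2, 3], [1]]
```

Calling `save_shards(shards, "out")` would then write one `.pt` file per shard. For a dataset too large to fit in memory, the same generator-based split could be fed from a streaming reader instead of an in-memory list.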