CaPaTaZ – Dataset preparation for NLP tasks, adapted from the mesh-transformer-jax and gpt-neo repo scripts

Clean and Prepare and Tokenize and Z(s)plit your huge dataset into smaller files ready for training. Currently only .pt output is supported.
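
The clean, tokenize, and split steps named above can be pictured with a minimal sketch. This is not the CaPaTaZ code itself: the function names, the whitespace tokenizer, and the `chunk_size` parameter are all hypothetical stand-ins chosen for illustration, and a real run would use a proper tokenizer and write each chunk to a .pt file.

```python
import re
from typing import Dict, Iterable, Iterator, List


def clean(text: str) -> str:
    """Collapse runs of whitespace and strip the ends (illustrative cleaning only)."""
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Hypothetical whitespace tokenizer: each unseen word gets the next integer id."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def split_into_chunks(token_ids: List[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most chunk_size tokens."""
    for start in range(0, len(token_ids), chunk_size):
        yield token_ids[start:start + chunk_size]


def prepare(documents: Iterable[str], chunk_size: int) -> List[List[int]]:
    """Clean each document, drop empty ones, tokenize, and split the token stream."""
    vocab: Dict[str, int] = {}
    stream: List[int] = []
    for doc in documents:
        cleaned = clean(doc)
        if cleaned:  # skip documents that are empty after cleaning
            stream.extend(tokenize(cleaned, vocab))
    return list(split_into_chunks(stream, chunk_size))


chunks = prepare(["Hello   world\n", "", "hello again  world"], chunk_size=2)
print(chunks)

# Assumption: with PyTorch installed, each chunk could then be written out as a
# .pt file, e.g. torch.save(torch.tensor(chunk), f"chunk_{i}.pt").
```

Splitting the token stream rather than the raw text keeps every output file at a predictable number of tokens, which is usually what a training loop wants; only the final chunk may be shorter.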