Is there any way to make a customized dataset?
dongrixinyu opened this issue · 0 comments
dongrixinyu commented
I have tested the example from the tutorial using train_gpt2fp32cu. Here are the dataset files it reads, downloaded from Hugging Face:
// read in the (optional) command line arguments
const char* train_data_pattern = "dev/data/tinyshakespeare/tiny_shakespeare_train.bin";
const char* val_data_pattern = "dev/data/tinyshakespeare/tiny_shakespeare_val.bin";
As you know, the structure of the .bin file is fairly complicated and tedious to produce by hand. Is there an easy way to create a customized dataset .bin file from a purely raw text dataset?
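To make the question concrete, here is a minimal sketch of what I imagine the writer would look like. It rests on assumptions that should be checked against the repository's data scripts: that the file starts with a header of 256 int32 values (magic number 20240520, version 1, token count, the rest zero), followed by the token ids as uint16 in native little-endian byte order. The helper name write_token_file is hypothetical, and the raw text would still need to be tokenized with the GPT-2 tokenizer beforehand, which is not shown here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

// Assumed layout (verify against the repo's data scripts before relying on it):
//   header: 256 x int32 -> [0] magic, [1] version, [2] token count, rest zero
//   body:   token ids stored as uint16, native (little-endian) byte order
#define HEADER_INTS     256
#define ASSUMED_MAGIC   20240520
#define ASSUMED_VERSION 1

// Hypothetical helper: writes already-tokenized ids to `path`.
// Returns 0 on success, -1 on any I/O failure.
int write_token_file(const char* path, const uint16_t* tokens, int32_t n_tokens) {
    assert(tokens != NULL && n_tokens >= 0);

    int32_t header[HEADER_INTS] = {0};
    header[0] = ASSUMED_MAGIC;
    header[1] = ASSUMED_VERSION;
    header[2] = n_tokens;

    FILE* f = fopen(path, "wb");
    if (f == NULL) return -1;

    // Write the header first, then the token stream; stop at the first failure.
    int ok = fwrite(header, sizeof(int32_t), HEADER_INTS, f) == HEADER_INTS
          && fwrite(tokens, sizeof(uint16_t), (size_t)n_tokens, f) == (size_t)n_tokens;
    if (fclose(f) != 0) ok = 0;
    return ok ? 0 : -1;
}
```

If the layout is right, calling write_token_file once for the training split and once for the validation split should give files that can be passed in place of the two paths above.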