karpathy/llm.c

Is there any way to make a customized dataset?

dongrixinyu opened this issue · 0 comments

I have tested the tutorial example with train_gpt2_fp32.cu. These are the dataset files it points to, which I downloaded from Hugging Face:

    // read in the (optional) command line arguments
    const char* train_data_pattern = "dev/data/tinyshakespeare/tiny_shakespeare_train.bin";
    const char* val_data_pattern = "dev/data/tinyshakespeare/tiny_shakespeare_val.bin";

As you know, the structure of the .bin file is fairly complicated and non-trivial. Is there an easy way to build a customized dataset .bin file from a purely raw text dataset?
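To make the question concrete, here is a minimal sketch of the kind of writer I have in mind. It rests on assumptions that should be checked against the repo's data-preparation scripts and dataloader: that the file starts with a header of 256 int32 values (magic number 20240520, version 1, token count) and is followed by the token ids as uint16 values in little-endian order. The function name `write_token_file` and the three macros are hypothetical. The sketch does not tokenize anything; the raw text would first have to be converted to GPT-2 token ids by some other tool.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define HEADER_INTS 256          // assumed header length, in int32 entries
#define ASSUMED_MAGIC 20240520   // assumed magic number for GPT-2 token files
#define ASSUMED_VERSION 1        // assumed format version

// Hypothetical helper: writes `num_tokens` token ids to `path` using the
// assumed layout (int32 header, then uint16 tokens, native byte order).
// Returns 0 on success, -1 on any I/O failure.
int write_token_file(const char* path, const uint16_t* tokens, int32_t num_tokens) {
    assert(num_tokens >= 0);
    assert(tokens != NULL || num_tokens == 0);

    // Zero-filled header; only the first three slots are set here.
    int32_t header[HEADER_INTS] = {0};
    header[0] = ASSUMED_MAGIC;
    header[1] = ASSUMED_VERSION;
    header[2] = num_tokens;

    FILE* f = fopen(path, "wb");
    if (f == NULL) return -1;

    int ok = fwrite(header, sizeof(int32_t), HEADER_INTS, f) == HEADER_INTS;
    if (ok && num_tokens > 0) {
        ok = fwrite(tokens, sizeof(uint16_t), (size_t)num_tokens, f) == (size_t)num_tokens;
    }
    if (fclose(f) != 0) ok = 0;
    return ok ? 0 : -1;
}
```

With token ids already in memory, a call such as `write_token_file("my_train.bin", tokens, n)` would produce a file whose path could then be used in place of `train_data_pattern`, provided the assumed layout matches what the dataloader expects.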