
about data


Could you explain how the sharegpt_clean.json file was converted into openchat_v3.2_super.train.parquet? I noticed that the two differ substantially: some conversations were truncated for being too long, and some garbled entries were discarded. However, many entries in sharegpt_clean.json have a Model field that is not marked as GPT3.5 or GPT4. How did you determine whether these entries belong to GPT3.5 or GPT4? Or were they all treated as GPT3.5?