Training data distribution

Question

pluiez opened this issue a year ago · 1 comment

Hi, the paper is very detailed in most aspects, but it does not describe the training data in as much detail.

Specifically, I am interested in the following:

The composition of the training dataset, including the types of data (e.g., text, code, images) and the sources of the data.
How the sampling ratio for each subset is determined, e.g., which principle is followed.

Answer 1 · 2024-02-04T15:43:30.000Z

The details we can reveal so far are already in the paper.