deepseek-ai/DeepSeek-LLM

Training data distribution

pluiez opened this issue · 1 comment

pluiez commented

Hi, the paper is very detailed in most aspects, but the training data is not described in as much detail.

Specifically, I am interested in the following:

  • The composition of the training dataset, including the types of data (e.g., text, code, images) and the sources of the data.
  • How the sampling ratio for each subset is determined, e.g., which principle is followed.

The details we can reveal so far are already in the paper.