bigcode-project/starcoder2

What does "unique tokens" mean (in the paper)?

yucc-leon opened this issue · 2 comments

For example, on page 16 the paper says: "This leads to a dataset of 622B+ unique tokens. For the 7B, we include OpenWebMath, Wikipedia, and Arxiv, leading to a slightly larger dataset of 658B+ unique tokens. For the 15B, we include the-stack-v2-train-full dataset and all extra data sources listed in §2, resulting in a dataset with 913B+ unique tokens. The size of this dataset is 4× the size of the training dataset for StarCoderBase."
The question is: does "unique tokens" mean the total number of tokens in the dataset after deduplication, or does it mean that tokenizing the whole dataset with StarCoder2's tokenizer yields a vocabulary dictionary of that size?

#3
The term "unique tokens" refers to the total number of tokens obtained by tokenizing the deduplicated dataset, not the size of the tokenizer's vocabulary. When the paper says "622B+ unique tokens," it means that running StarCoder2's tokenizer over the dataset, after processing and deduplication, produces over 622 billion tokens in total. "Unique" describes the deduplicated data, not distinct token types; the number of tokens actually consumed during training can be larger than this figure, since the data is repeated over multiple epochs.
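
To make the distinction concrete, here is a minimal sketch using the Hugging Face `transformers` library and the `bigcode/starcoder2-15b` tokenizer (the exact checkpoint name and the toy corpus are assumptions for illustration; any StarCoder2 tokenizer shows the same contrast). It compares the vocabulary size, which is fixed, with the token count of a corpus, which grows with the amount of text:

```python
# Minimal sketch contrasting vocabulary size with corpus token count.
# Assumes `pip install transformers` and access to the bigcode/starcoder2-15b
# tokenizer on the Hugging Face Hub.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-15b")

# Vocabulary size: the number of distinct token types the tokenizer knows.
# This is fixed by the tokenizer, regardless of how much data you tokenize.
print("vocab size:", len(tokenizer))

# Corpus token count: the total number of tokens produced by tokenizing the
# (deduplicated) data. This is what "622B+ unique tokens" counts; here the
# corpus is just two toy documents.
corpus = [
    "def add(a, b):\n    return a + b\n",
    "print('hello, starcoder2')\n",
]
total_tokens = sum(len(tokenizer(doc)["input_ids"]) for doc in corpus)
print("total tokens in corpus:", total_tokens)
```

The vocabulary size stays constant no matter how large the corpus grows, while the total token count scales with the data; the paper's "622B+" figure is the latter.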

Thanks for the clarification!