Size mismatch
Closed this issue · 4 comments
Hello,
According to the documentation, the Turkish part of OSCAR-2301 should contain 26,654,330 lines. However, when I go over the dataset, I see only around 13,300,000 lines. Were some files forgotten somehow?
Hello! How did you arrive at this line count?
I'm iterating over all the data for cleaning purposes. I basically went over the corpus in big batches, processing each instance in the batch. Something like this:
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm
import json

from process_document2 import process_document  # my cleaning code


def collat3(data_dict):
    # Collate a list of dataset records into parallel lists of texts and ids.
    texts = [dd["text"] for dd in data_dict]
    ids = [dd["id"] for dd in data_dict]
    return (texts, ids)


dataset = load_dataset("oscar-corpus/OSCAR-2301", use_auth_token=True,
                       language="tr", streaming=True, split="train")
dataloader = DataLoader(dataset, batch_size=500000, collate_fn=collat3)

batch_no = 1
file_name = "out.jsonl"
for batch in dataloader:
    print(batch_no, "started!")
    texts, ids = batch
    with open(file_name, "a+") as ofile:
        # Go over all the instances in the batch.
        for text, idn in tqdm(zip(texts, ids), total=500000):
            processed_doc = process_document(text)
            if processed_doc is not None:
                minijs = {"id": idn, "text": processed_doc}
                ofile.write(json.dumps(minijs, ensure_ascii=False) + "\n")
    print(batch_no, "finished")
    batch_no += 1
In the end the script finished, and the final id from the dataset was a bit more than 13M (I keep the original IDs from the corpus). Accordingly, the batch counter stopped at 27 (batch size is 0.5M, so 27 × 0.5M ≈ 13.5M). Hope I'm not missing anything!
Each JSON Lines document holds multiple lines, separated by \n. Could you try to count those, and not the documents?
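For example, here is a minimal sketch of counting this way; it reuses the streaming load_dataset call from the script above, and the counting loop itself is an illustration rather than an official snippet:

from datasets import load_dataset

# Count \n-separated lines inside each document, not the documents themselves.
dataset = load_dataset("oscar-corpus/OSCAR-2301", use_auth_token=True,
                       language="tr", streaming=True, split="train")

doc_count = 0
line_count = 0
for doc in dataset:
    doc_count += 1
    # A document with N newline separators contains N + 1 lines.
    line_count += doc["text"].count("\n") + 1

print(doc_count, "documents,", line_count, "lines")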
Oh shucks, then things change a bit, and most probably they'll add up to the original number. I thought lines == instances; thanks for the clarification! We can close the thread.