Size mismatch
Closed this issue · 4 comments
Hello,
According to the documentation, the Turkish part of OSCAR-2301 should contain 26,654,330 lines. However, when I go over the dataset, I see only around 13,300,000 lines. Were some files forgotten somehow?
Hello! How did you arrive at this line count?
I'm iterating over all the data for cleaning purposes. I basically went over the corpus in big batches, processing each instance in the batch. Something like this:
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm
import json

from process_document2 import process_document  # my cleaning code


def collat3(data_dict):
    # Collate a list of dataset records into parallel lists of texts and ids.
    texts = [dd["text"] for dd in data_dict]
    ids = [dd["id"] for dd in data_dict]
    return (texts, ids)


dataset = load_dataset("oscar-corpus/OSCAR-2301", use_auth_token=True,
                       language="tr", streaming=True, split="train")
dataloader = DataLoader(dataset, batch_size=500000, collate_fn=collat3)

batch_no = 1
file_name = "out.jsonl"
for batch in dataloader:
    print(batch_no, "started!")
    texts, ids = batch
    with open(file_name, "a+") as ofile:
        # Go over all the instances in the batch.
        for text, idn in tqdm(zip(texts, ids), total=500000):
            processed_doc = process_document(text)
            if processed_doc is not None:
                minijs = {"id": idn, "text": processed_doc}
                ofile.write(json.dumps(minijs, ensure_ascii=False) + "\n")
    print(batch_no, "finished")
    batch_no += 1
In the end the script finished, and the final id from the dataset was a bit more than 13M (I keep the original IDs from the corpus). Accordingly, the batch counter stopped at 27 (batch size is 0.5M, so 27 × 0.5M ≈ 13.5M). Hope I'm not missing anything!
Each JSON Lines document holds multiple lines, separated by \n. Could you try to count those, and not the documents?
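For example, here is a minimal sketch of counting this way; it reuses the streaming load_dataset call from the script above, and the counting loop itself is an illustration rather than an official snippet:

from datasets import load_dataset

# Count \n-separated lines inside each document, not the documents themselves.
dataset = load_dataset("oscar-corpus/OSCAR-2301", use_auth_token=True,
                       language="tr", streaming=True, split="train")

doc_count = 0
line_count = 0
for doc in dataset:
    doc_count += 1
    # A document with N newline separators contains N + 1 lines.
    line_count += doc["text"].count("\n") + 1

print(doc_count, "documents,", line_count, "lines")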
Oh shucks, then things change a bit, and most probably they'll add up to the original number. I thought lines == instances; thanks for the clarification! We can close the thread.