Luodian/Otter

Problem with `LA.parquet` size

StrangeTcy opened this issue · 3 comments

Demo_Data.yaml references many files that aren't published anywhere, so we trimmed it down to this version:

IMAGE_TEXT: # Group name must be one of [IMAGE_TEXT, TEXT_ONLY, IMAGE_TEXT_IN_CONTEXT]
  LADD: # LLaVA Detailed Description; the dataset name can be anything you want
    mimicit_path: otter_data/json/LA/LADD_instructions.json # Path of the instruction json file
    images_path: otter_data/Parquets/LA.parquet # Path of the image parquet file
    num_samples: -1 # Number of samples to use; -1 means use all samples. Defaults to -1 if not set.
  LACR_T2T:
    mimicit_path: otter_data/json/LA/LACR_T2T_instructions.json
    images_path: otter_data/Parquets/LA.parquet
    num_samples: -1

Running the finetuning script on it leads to this error:

pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2218271688

Please refer to the updated code here. Loading the data iteratively avoids the capacity error.

https://github.com/Luodian/Otter/blob/8b386816ec67b15833cde3dcd1d7ca6a752d2451/pipeline/mimicit_utils/mimicit_dataset.py#L222C27-L229

Nice, that helped, thanks!

You should add something like `import pyarrow.parquet as pq` near the top of the file; otherwise `pq` is undefined when the new loading code runs.