Problem with `LA.parquet` size
StrangeTcy opened this issue · 3 comments
StrangeTcy commented
`Demo_Data.yaml` references a lot of files that aren't published anywhere, so we edited it a little and created this version:
IMAGE_TEXT: # Group name should be in [IMAGE_TEXT, TEXT_ONLY, IMAGE_TEXT_IN_CONTEXT]
  LADD: # LLaVA Detailed Description, dataset name can be assigned at any name you want
    mimicit_path: otter_data/json/LA/LADD_instructions.json # Path of the instruction json file
    images_path: otter_data/Parquets/LA.parquet # Path of the image parquet file
    num_samples: -1 # Number of samples you want to use, -1 means use all samples, if not set, default is -1.
  LACR_T2T:
    mimicit_path: otter_data/json/LA/LACR_T2T_instructions.json
    images_path: otter_data/Parquets/LA.parquet
    num_samples: -1
and running the finetuning script on it leads to this error:
pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2218271688
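For context, the two numbers in that message can be checked by hand. This is only a quick sketch, assuming the limit comes from the signed 32-bit offsets that Arrow's default string and binary arrays use; the variable names are illustrative and not taken from any library.

```python
# Limit quoted in the error message; it equals 2**31 - 2.
ARROW_LIMIT_BYTES = 2_147_483_646
# Number of bytes the loader tried to place into a single array.
requested_bytes = 2_218_271_688

# How far past the limit the single array would have been.
overflow_bytes = requested_bytes - ARROW_LIMIT_BYTES

print(ARROW_LIMIT_BYTES == 2**31 - 2)  # True
print(overflow_bytes)                  # 70788042
```

So the array is only about 70.8 MB over the limit, which suggests that reading the file in smaller pieces should be enough to get under it.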
Luodian commented
Please refer to the updated code here. Loading the data iteratively avoids the capacity error.
StrangeTcy commented
Nice, that helped, thanks!
StrangeTcy commented
You should add something like `import pyarrow.parquet as pq` at some point, otherwise the interpreter complains that `pq` is not defined.