lucas-ventura/CoVR

[Question] Double checking, but the images for toptee, shirt, and dress shouldn't be separate, right?

Double checking, but the images for toptee, shirt, and dress shouldn't be separate, right?

Also, if some of the links don't work, will that cause problems, or will the data just be less accurate? (Trying to get it to run first before worrying about accuracy.)

Originally posted by @Agarciahunter in #13 (comment)

Hi!

Regarding the images for toptee, shirt, and dress, I had all the images in the same directory (see the fashioniq-base.yaml config file). If your setup has them in separate directories, you could try updating the img_dirs paths in each config/data/fashioniq-<split>.yaml like this:

img_dirs:
  train: ${data.dataset_dir}/images/<split>
  val: ${data.dataset_dir}/images/<split>
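
For instance, if your directories are named after the splits (just an example; use your actual directory names), the shirt config would become:

img_dirs:
  train: ${data.dataset_dir}/images/shirt
  val: ${data.dataset_dir}/images/shirt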

I'm not entirely sure if this will work when testing, but it should be fine for training.


For missing image links, it seems that FashionIQDataset expects all the images to be present. You can handle missing images in several ways:

  1. Redownload the missing images using the links provided in the README.
  2. Adapt the FashionIQDataset to skip missing files, similar to what I did in WebVidCoVRDataset (see the sketch after this list).
  3. Alternatively, create a reduced annotation set: cp -r annotation/fashion-iq annotation/fashion-iq_small, remove the annotations for images you don't have, and point configs/data/fashioniq-base.yaml to the new annotation directory.
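
For option 2, a minimal sketch of what the skipping could look like (the "candidate"/"target" keys and the .jpg naming are assumptions here; adapt them to how FashionIQDataset actually loads its annotations):

from pathlib import Path

def filter_missing_images(annotations, img_dir):
    # Keep only annotation entries whose candidate and target images exist on disk.
    # Assumes each entry is a dict with "candidate" and "target" image IDs and that
    # images are stored as <id>.jpg; adjust to the real annotation format.
    img_dir = Path(img_dir)
    kept = [
        ann
        for ann in annotations
        if (img_dir / f"{ann['candidate']}.jpg").exists()
        and (img_dir / f"{ann['target']}.jpg").exists()
    ]
    print(f"Kept {len(kept)}/{len(annotations)} annotations with existing images")
    return kept

You would call something like this right after the annotations are read in the dataset's __init__, before any assertions on image paths.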

Best of luck!

I was able to download the missing images. Though after running train.py, I seem to have come across another issue. Was there a download for 2M that I missed, or is it something else? I noticed that 2M doesn't come with validation links either.

[2024-05-09 09:49:22,736][torch.distributed.distributed_c10d][INFO] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Error executing job with overrides: []
Error executing job with overrides: []
Error in call to target 'src.data.webvid_covr.WebVidCoVRDataModule':
AssertionError('Embedding directory /work/user/CoVR/datasets//WebVid/2M/blip-vid-embs-large-all does not exist')

Edit: also, using the checkpoints you provided suggests that everything else is working. The only problem now is this:

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 10.16 GiB (GPU 1; 31.74 GiB total capacity; 19.99 GiB already allocated; 3.09 GiB free; 28.20 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

So I'll just need to try either setting up a SLURM file or adjusting the batch size.

Update on test.py and FashionIQ: running the test code results in errors like this, which is weird since the images are in the directory:

AssertionError('Path to candidate B00CZ7QJUG not found in /work/user/CoVR/datasets/fashion-iq/images')
full_key: test.fashioniq-shirt
(covr) [gpu033: CoVR]$ find . -name "B00CZ7QJUG*"
./datasets/fashion-iq/MissingImages/JPG_Images/B00CZ7QJUG.jpg
./datasets/fashion-iq/images/B00CZ7QJUG.jpg

AssertionError('Path to candidate B005X4PL1G not found in /work/user/CoVR/datasets/fashion-iq/images')
full_key: test.fashioniq-dress
(covr) [gpu033: CoVR]$ find . -name "B005X4PL1G*"
./datasets/fashion-iq/images/B005X4PL1G.jpg

AssertionError('Path to candidate B008CFZW76 not found in /work/user/CoVR/datasets/fashion-iq/images')
full_key: test.fashioniq-toptee
(covr) [gpu033: CoVR]$ find . -name "B008CFZW76*"
./datasets/fashion-iq/images/B008CFZW76.jpg

Any ideas?

Hi @Agarciahunter,

It looks like you're encountering two different issues:

  1. Embeddings Extraction: Based on your error logs, it seems that you haven't extracted the target embeddings for WebVid or FashionIQ. You can do this with the following commands:

    # This will compute the embeddings for the WebVid-CoVR videos. 
    # Note that you can use multiple GPUs with --num_shards and --shard_id
    python tools/embs/save_blip_embs_vids.py --video_dir datasets/WebVid/2M/train --todo_ids annotation/webvid-covr/webvid2m-covr_train.csv 
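    # For example, to split the work across 4 GPUs (a hypothetical shard count), launch one process per shard, e.g. shard 0:
    # python tools/embs/save_blip_embs_vids.py --video_dir datasets/WebVid/2M/train --todo_ids annotation/webvid-covr/webvid2m-covr_train.csv --num_shards 4 --shard_id 0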
    
    # This will compute the embeddings for the WebVid-CoVR-Test videos.
    python tools/embs/save_blip_embs_vids.py --video_dir datasets/WebVid/8M/train --todo_ids annotation/webvid-covr/webvid8m-covr_test.csv
    
    # This will compute the embeddings for FashionIQ images.
    python tools/embs/save_blip_embs_imgs.py --image_dir datasets/fashion-iq/images/
  2. Memory Management: To manage GPU memory more efficiently and avoid the CUDA out of memory error, you can adjust the number of devices and batch sizes. Here’s how you can modify these settings:

    trainer.devices=X  # replace X with the number of GPUs
    machine.batch_size=Y  # replace Y with a suitable batch size

    I used trainer=ddp with SLURM for distributed training.
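
    For example, a possible invocation (placeholder values only; pick a GPU count and batch size that fit your memory):

    # Hypothetical values; adjust to your hardware.
    python train.py trainer=ddp trainer.devices=2 machine.batch_size=512

    # Optionally, as the error message itself suggests, reduce allocator fragmentation:
    PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python train.py trainer=ddp trainer.devices=2 machine.batch_size=512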

Please try these suggestions and let me know if you encounter further issues.

Unfortunately, that didn't seem to fix the missing fashion images issue. Granted, I restarted it to check whether any changes I made to the files are causing the problem. I'll keep you posted if it doesn't work.

Also, on the plus side, I was able to get the code to run without a SLURM job by setting the batch size to 256.

Side note: what does changing num_workers affect? I haven't seen a difference going from 4 to 8.

Did you fix the issues?