RyannDaGreat/MAGICK

Dead links

Closed this issue · 7 comments

Thank you for the amazing dataset! However, the server at https://vision.cs.stonybrook.edu seems to be misconfigured, and all links to the dataset are unreachable. It is linked from "Explore MAGICK" here: https://ryanndagreat.github.io/MAGICK/

I guess you already trained a neural network for alpha matting using this dataset. Could you tell us a bit about it? Did it work well?

Thank you for the interest! We're going to get it back online as soon as we can; someone in our university's IT department seems to have taken it down. We're also going to try uploading it to Huggingface - please stay tuned! And yes, in fact I will be posting the model and code for a ControlNet too. It worked quite well!

Very nice! Looking forward to it!

Hey! I have an update about the MAGICK dataset:
The dataset explorer is back online; please check out this URL: https://vision.cs.stonybrook.edu/ryan_adobe/magick_dataset_explorer_jun21_2024.html
(ignore the browser warnings, our university server doesn't have a TLS certificate at the moment)
Again, our project page is https://ryanndagreat.github.io/MAGICK
Soon we will complete an upload of this dataset to Huggingface, along with the ControlNet model discussed in the paper. Please stay tuned - I'll email you again with new updates!
Best,
Ryan Burgert

Great! I've checked a few random URLs, and so far they all work! This issue is resolved, so I'll close it, but feel free to update me about the Huggingface dataset and ControlNet model. Thanks again!

Hey, we have another update:
The entire MAGICK dataset has been uploaded to Huggingface as of one hour ago!
Please see https://huggingface.co/datasets/OneOverZero/MAGICK
Please let me know if you have any difficulty using it - I am happy to make it easier for you!
Best,
Ryan

I have already started downloading the files from the department server, but they are named differently from the files on Huggingface. Is there some way to match the filenames? I could probably do it myself by extracting just the SHA-256 checksums from the Huggingface repository; I'm currently looking into how to do that. I should do that anyway, to make sure there was no file corruption during the download.

EDIT: I figured it out! It is possible to clone the repository without downloading the files managed by Git LFS:

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/OneOverZero/MAGICK

Each image file then contains a small pointer text like the following instead of the image bytes:

version https://git-lfs.github.com/spec/v1
oid sha256:01c9760f469a9436bc5d321bba14318e9aa9224ad5889e89ebe0cfb43013f5d7
size 1231608
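
Here's a rough sketch of how I plan to do the matching, assuming the pointer-only clone is in ./MAGICK and the partial download from the department server is in ./download (both paths are just examples):

# Build a sorted "<sha256> <huggingface path>" index from the LFS pointer files
grep -r 'oid sha256:' MAGICK/images \
  | awk -F'oid sha256:' '{sub(/:$/, "", $1); print $2, $1}' \
  | sort > hf_index.txt

# Hash the locally downloaded files, giving sorted "<sha256>  <local path>" lines
find download -type f -exec sha256sum {} + | sort > local_sums.txt

# Join on the hash: matches print "<sha256> <hf path> <local path>";
# the -v 2 pass lists local files with no counterpart, i.e. corrupted downloads
join hf_index.txt local_sums.txt
join -v 2 hf_index.txt local_sums.txt

Once the mapping exists, individual images can be materialized from the pointer clone with git lfs pull --include='<pattern>' instead of fetching everything.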

EDIT2: When I came back to my computer, I noticed that the download from the department website had failed after about 100 GB, so I cloned the Huggingface repository instead. But git clone https://huggingface.co/datasets/OneOverZero/MAGICK was not the most efficient way to do it, since Git LFS stores the data twice (once in .git/lfs and once as checked-out files):

$ du -h -d 2
3,2M    ./MAGICK/explorer
179G    ./MAGICK/images
179G    ./MAGICK/.git
357G    ./MAGICK
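
For anyone hitting the same problem, two ways around the duplication that I'm aware of - both untested sketches, not from this repository's docs:

# Option 1: keep only the checked-out files and drop the duplicate LFS object store
# (after this, git can no longer verify or re-checkout the LFS files)
rm -rf MAGICK/.git/lfs/objects

# Option 2: skip git entirely and let huggingface_hub fetch the files directly
# (assumes a reasonably recent huggingface_hub release)
pip install -U huggingface_hub
huggingface-cli download OneOverZero/MAGICK --repo-type dataset --local-dir MAGICK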

Yes! The structure is due to Huggingface not letting us keep folders with more than 10k entries. It's reorganized so that each image is put in a subfolder named after the first two characters of the image's filename. That's detailed in the README.
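
For example, with a made-up filename, an image path would resolve like this:

# "abcd1234.png" lives under the subfolder named after its first two characters
name="abcd1234.png"
path="images/${name:0:2}/${name}"
echo "$path"   # prints images/ab/abcd1234.png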