cneud/ocr-gt

IIT CDIP 1.0 (Illinois Institute of Technology Complex Document Information Processing Test Collection, version 1.0)

kba opened this issue · 38 comments

cneud commented

added in 07ac808

@kba are you able to access the dataset? I get a "permission denied" error

Hi, did you solve this problem? I also got a "forbidden" error.

@kba or anyone, could you possibly provide a mirror?

cneud commented

@Bonjour123 I am afraid we don't have access either, this file just gathers the metadata for the datasets.

An excerpt from CDIP seems to be available publicly here https://www.cs.cmu.edu/~aharley/rvl-cdip/, but this also links back to the URL for the main datasets that returns a 403 now.

Yes, the rvl-cdip site does state that it's publicly available at ir.nist.gov/cdip, so I think it was public at one point and was later removed from public access. But I still hope that someone has a copy somewhere.

Did anyone find a copy? Would really appreciate it if you could post the link.

Anyone manage to get your hands on the data? I emailed one of the authors yesterday but no response so far. If anyone else would like to email them you can track them down on researchgate.

https://www.researchgate.net/publication/221299542_Building_a_test_collection_for_complex_document_information_processing

Hi Brian, have you received any reply?

Hi, this is Ian Soboroff from NIST. We do indeed still host the collection, but only open access to the files on request. Web crawlers kept getting trapped in the image file directories and loading down our server.

kba commented

Hi Ian, thanks for reaching out! I would like to get access to it :) Perhaps uploading it in bulk to Zenodo might be a good idea, since they have lots of bandwidth. If you want, I can also host a mirror.

cneud commented

Zenodo would be a good idea for hosting since they will provide a DOI as well.

Otherwise, perhaps some documentation could be added again as https://ir.nist.gov/cdip/README.txt with the information that files are available upon request?

I changed how the directory is protected, so that everyone can get to the README file. That file describes that you can get the OCR data, and to contact me for access to the raw TIFF images (which are about 1.3TB).

@isoboroff I want to download the raw TIFF images (which are about 1.3TB). Can you share the dataset? Thank you.

@isoboroff Hi, Can you share the dataset? Thank you.

I am working out an alternate hosting setup that will hopefully be easier on all of us.

I've been asked if this hosting setup could include demo code. Would anyone here like to help out?

@isoboroff
Is the alternate hosting setup done??
Also, for the demo code, if it's in Python I can help with that.

mark

Those files are DVD disk images, and they contain the OCR text from the original page scans. The collection was originally distributed on DVD. Ian

On Mon, Aug 2, 2021 at 12:56 AM TaekyungKi wrote: @isoboroff Thanks for the reply anyway! I downloaded the CDIP_x.cdr files from https://ir.nist.gov/cdip/, but I'm not sure these are the right files because of their size. Can you tell me how I can use these files?
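
Since the CDIP_x.cdr files are described as DVD disk images, one quick sanity check on a downloaded file is to look for a disc filesystem signature. The sketch below checks for the ISO 9660 identifier `CD001`, which sits one byte into the primary volume descriptor at sector 16. That these particular images use ISO 9660 is an assumption and is not confirmed in this thread; the function name `looks_like_iso9660` is hypothetical.

```python
ISO9660_MAGIC = b"CD001"
# The primary volume descriptor starts at sector 16 (2048-byte sectors);
# the identifier follows a one-byte descriptor type.
ISO9660_MAGIC_OFFSET = 16 * 2048 + 1


def looks_like_iso9660(path):
    """Return True if the file carries the ISO 9660 signature.

    Assumption: the disk image uses ISO 9660. A False result only means
    the signature was not found at the expected offset.
    """
    with open(path, "rb") as handle:
        handle.seek(ISO9660_MAGIC_OFFSET)
        return handle.read(len(ISO9660_MAGIC)) == ISO9660_MAGIC
```

If the check passes, the image can usually be mounted or opened with ordinary disk image tools to reach the OCR text files inside.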

@isoboroff I want the source images. How can I contact you? I can't see your email address.

@isoboroff
Ian, my colleague sent an email asking about access to the IIT-CDIP test dataset,
but there has been no answer, so we are asking here.
Is there a hosting site for accessing the IIT-CDIP dataset where the image set includes the OCR text data?
How can I get that data?
Thanks

@Cogdof Did you get the data already? I want to use the data for research. Can you share the data if you got them?

@etrigger
I contacted Ian by email.
Also, this dataset is huge, and there is no efficient way to send it.
I think emailing him is your best option.

@Cogdof So there is still no download link, and the only option is to email him, right?

Hi, folks. We are still working on the alternate hosting (hopefully in Amazon S3) but these things take time. I will try to respond to download requests as quickly as I can... if you don't get a reply from me, it's ok to ping me again.

@isoboroff
I have sent you two emails this week from "at amazon dot com". It would be helpful if you could take a quick look. Thanks.

At long last, the image data has been re-hosted. You can now find it at https://data.nist.gov/od/id/mds2-2531. Hopefully, transfers will now be faster (but it is quite big, so it will take some time!)

Excuse me~
This URL (https://data.nist.gov/od/id/mds2-2531) does not seem to be accessible. https://data.nist.gov/pdr/od/id shows "Empty Record".
Is this dataset locked again? Could you please tell me how I can get the source images, which are 1.3TB?
Looking forward to your reply, thank you~

It works! Thank you so much!
Have a nice day!

I checked these files in that link.

Looking at the XML file that comes out after decompressing cdip-n.tar, it looks like each image has its own XML file.
But I couldn't find the XML file for each image.
I'd like to know where the XML files for each image are.

Can you give me some information about this?

Thank you for your support :)
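
One way to look for per-image XML files without unpacking a whole archive is to list its members. This is a minimal sketch using Python's standard `tarfile` module; the function name `list_xml_members` is hypothetical, and nothing here assumes a particular directory layout inside the cdip-n.tar files.

```python
import tarfile


def list_xml_members(tar_path):
    """Return the sorted names of all regular .xml files in a tar archive."""
    # Mode "r" lets tarfile detect compression (if any) transparently.
    with tarfile.open(tar_path, "r") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.lower().endswith(".xml")
        )
```

Calling `list_xml_members` on one of the downloaded archives shows which XML paths exist, which can then be compared against the image file names.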

Hi,
Is the link https://data.nist.gov/od/id/mds2-2531 still valid? I cannot access the data!

Is this dataset still publicly available? Is there any valid way to download this data?

The link https://data.nist.gov/od/id/mds2-2531 can still be accessed; you can download the IIT-CDIP data there.

Is there any way I can access the dataset?

Hello, I can download the IIT-CDIP-annotations. How do the images correspond to the JSON files?
I want to download the images corresponding to the JSON files.