downloading datasets

Question

Closed this issue 3 years ago · 4 comments

Hello,
Thanks for making your code available; your work is very clear and clean.
I was able to run the Colab demo on my end.

I'm trying to reproduce the datasets through Colab, but I'm receiving the following error (and similar ones) when trying to download. Any advice on this?

Connecting to acl-arc.comp.nus.edu.sg (acl-arc.comp.nus.edu.sg)|137.132.84.180|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2021-11-21 19:55:07 ERROR 403: Forbidden.

Answer 1 · 2021-11-22T08:27:18.000Z

That's the ACL Anthology Reference Corpus (ACL ARC), and apparently the original source is offline. See:

https://catalog.ldc.upenn.edu/docs/LDC2009T29/lrec_08/

Maybe contact the authors of the ACL ARC. In the meantime, I'll try to upload the dataset somewhere else.

Answer 2 · 2021-11-22T08:32:36.000Z

If you do not want to recreate our dataset but just want to reproduce our experiments, running this should be sufficient:

from nlp import load_dataset

# Training data for first CV split
train_dataset = load_dataset(
    './datasets/cord19_docrel/cord19_docrel.py',
    name='relations',
    split='fold_1_train'
)

Answer 3 · 2021-11-22T16:06:33.000Z

I see, thanks!

I found this: DATA_URL = "http://datasets.fiq.de/acl_docrel.tar.gz". This should include the main ACL ARC corpus you are using, right?

Answer 4 · 2021-11-22T16:28:28.000Z

It's not the full ACL ARC, but it contains all the paper data needed for training the models.
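
For anyone who wants to fetch that archive by hand, here is a minimal sketch using only the Python standard library. The helper name `download_and_extract` and the target directory are made up for illustration and are not part of the repository; the URL is the one quoted above, and there is no guarantee it is still online. It needs Python 3.9+.

```python
import shutil
import tarfile
import tempfile
import urllib.request
from pathlib import Path

# URL quoted earlier in this thread; it may no longer be reachable.
DATA_URL = "http://datasets.fiq.de/acl_docrel.tar.gz"


def download_and_extract(url: str, dest_dir: str) -> list[str]:
    """Download a .tar.gz archive from `url` and unpack it into `dest_dir`.

    Hypothetical helper for illustration only. Returns the archive's member names.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)

    with tempfile.TemporaryDirectory() as tmp:
        archive_path = Path(tmp) / "archive.tar.gz"

        # Stream the response to disk instead of holding it all in memory.
        with urllib.request.urlopen(url) as response, open(archive_path, "wb") as out:
            shutil.copyfileobj(response, out)

        with tarfile.open(archive_path, "r:gz") as tar:
            names = tar.getnames()
            # Refuse members that would be written outside the target directory.
            for name in names:
                if not (dest / name).resolve().is_relative_to(dest):
                    raise ValueError(f"unsafe path in archive: {name}")
            tar.extractall(dest)

    return names
```

Calling `download_and_extract(DATA_URL, "./datasets/acl_docrel")` would unpack the files into that (assumed) directory. If the server answers with 403 or 404, `urlopen` raises `urllib.error.HTTPError`, which is the same failure as the wget output in the question.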