ismir/mir-datasets

Fix testing URLs

stefan-balke opened this issue · 14 comments

The test is failing, mainly due to 403s.

This can be fixed by using fake-useragent in the call:

    import urllib.request
    from fake_useragent import UserAgent

    # urlopen() has no `headers` argument; wrap the URL in a Request so the
    # randomized User-Agent is actually sent (servers 403 the default
    # Python-urllib agent).
    ua = UserAgent()
    req = urllib.request.Request(values['url'], headers={'User-Agent': ua.random})
    code = urllib.request.urlopen(req, timeout=5).getcode()

The only one missing is:

ERROR: tests/test_datasets.py::test_url[HJDB]
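For context, test_url is presumably parametrized over the dataset records, so each URL gets its own test id like the one above — roughly this sketch, assuming a datasets.yaml mapping of records that each carry a url field (not necessarily the repo's actual code):

    import urllib.request

    import pytest
    import yaml
    from fake_useragent import UserAgent

    with open('datasets.yaml') as fp:
        DATASETS = yaml.safe_load(fp)

    @pytest.mark.parametrize('name', sorted(DATASETS))
    def test_url(name):
        # Each dataset key becomes a test id, e.g. test_url[HJDB].
        ua = UserAgent()
        req = urllib.request.Request(DATASETS[name]['url'],
                                     headers={'User-Agent': ua.random})
        assert urllib.request.urlopen(req, timeout=5).getcode() == 200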

Sent him a mail.

Yes, Jason responded. It's part of the MIREX downbeat task and no longer available...

holzapfel:onset is also no longer working since his website is offline.

It's here:
"https://kth.box.com/s/o151l3rqtglhmeszah06wmvpcmpat6w9

Das sind die Daten wie wir sie im folgenden Artikel verwendet haben: 
Holzapfel, A., Stylianou, Y., Gedik, A. C., & Bozkurt, B. (2010). Three 
dimensions of pitched instrument onset detection. IEEE Transactions on 
Audio, Speech, and Language Processing, 18(6), 1517–1527"

Okay, fixed that one. Will put the rest into an archive.yaml

Or another binary field which says "available: True/False"?
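For illustration, a record might then look something like this (field names are hypothetical, not necessarily the current schema):

    HJDB:
      url: http://example.com/hjdb
      available: false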

Pro: It's all in one file.

Con: Naive consumers will probably not take this flag into account...

Opinions?

doing a little housekeeping between this and ismir-home, and noticed that there are a bunch of dataset URLs (13/348) that seem to be broken. It seems that there's a higher-level issue here than when we started, namely:

how do we triage broken URLs when they get stepped on?

There are generally two cases where this will come up:

  1. A contributor trying to do something else triggers a build that shakes out a new failure
  2. A scheduled build (we should run it every week) finds a new broken link

We could have different strategies for 1 & 2, but it seems like one answer is fine for now. Some (not mutually exclusive) options that address both cases:

  • tag failing URLs as "stale / not available / missing", maybe mark the rows as red / gray?
  • re-order the dataset table to have two sections, one for available and one for unavailable / archive
  • add a "maintainer" field to each database record so that we know who to bother when things break
  • don't condition a passing build on all URLs being available.

In general, I vote for one YAML file, at least flagging missing / unavailable datasets in the output JS table and markdown, and maybe blessing Travis-CI with the ability to update the YAML based on what's healthy. In the meantime we could run this process manually. Whether we use pytest or a separate script to update / modify the YAML is a matter of philosophy, since it's probably bad practice for py.test to modify files it takes as input (but maybe that's just too strict).
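As a minimal sketch of what such a separate script could look like, assuming one top-level YAML mapping of records that each carry a url field (file and field names hypothetical):

    import urllib.request

    import yaml
    from fake_useragent import UserAgent

    def check_url(url, timeout=5):
        # A URL counts as healthy if it answers without an HTTP/network error.
        req = urllib.request.Request(url,
                                     headers={'User-Agent': UserAgent().random})
        try:
            urllib.request.urlopen(req, timeout=timeout)
            return True
        except Exception:
            return False

    def update_health(infile='datasets.yaml'):
        # Check every record's URL and write the verdicts back into the YAML.
        with open(infile) as fp:
            datasets = yaml.safe_load(fp)
        for record in datasets.values():
            record['available'] = check_url(record['url'])
        with open(infile, 'w') as fp:
            yaml.safe_dump(datasets, fp, default_flow_style=False)

Travis could then run update_health on a weekly cron and commit the result, keeping pytest itself read-only.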

I would vote for keeping them all in the list, but marking them somehow as you suggested (my old website used strike-through). It would be good to mark them as offline as of the date of the test, as I noticed that many university dataset pages tend to go offline surprisingly regularly but then come back online eventually.
A maintainer field would implicitly be there once we add the DOI/paper reference as planned, so I don't think it's explicitly needed.

ah, good call on the DOI / paper reference, I'd forgotten about that. That'll more than fulfill the need to chase someone down about broken / stale links... maybe also incentivize getting datasets hosted on Zenodo.

so it sounds like we have a proposal then:

  • bad links don't break the build
  • links get marked as green/available or red/unavailable in the resulting table
  • link health is ephemeral and not logged in the yaml
  • we can provide a top-level timestamp as to when the table was last verified (see the sketch below)
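Something like this could implement the last three points at render time — reusing check_url from the sketch above, using the strike-through idea for dead links, and never writing health back to the YAML (names hypothetical):

    from datetime import datetime, timezone

    def render_markdown(datasets):
        # Stamp the output with when the links were verified, and strike
        # through any link that is currently unreachable.
        lines = ['Last verified: {}'.format(datetime.now(timezone.utc).date()),
                 '']
        for name in sorted(datasets):
            link = '[{}]({})'.format(name, datasets[name]['url'])
            lines.append('- ' + (link if check_url(datasets[name]['url'])
                                 else '~~' + link + '~~'))
        return '\n'.join(lines)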

sound like a plan? anything missing on this issue?

well done @ejhumphrey
#28 got merged!