ismir/mir-datasets

Fix testing URLs

stefan-balke opened this issue · 14 comments

The test is failing, mainly due to 403s.

This can be fixed by using fake-useragent in the call:

    import urllib.request
    from fake_useragent import UserAgent

    # urlopen() has no `headers` argument; wrap the URL in a Request so the
    # randomized User-Agent is actually sent (servers 403 the default
    # Python-urllib agent).
    ua = UserAgent()
    req = urllib.request.Request(values['url'], headers={'User-Agent': ua.random})
    code = urllib.request.urlopen(req, timeout=5).getcode()

The only one missing is:

ERROR: tests/test_datasets.py::test_url[HJDB]
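For context, test_url is presumably parametrized over the dataset records, so each URL gets its own test id like the one above — roughly this sketch, assuming a datasets.yaml mapping of records that each carry a url field (not necessarily the repo's actual code):

    import urllib.request

    import pytest
    import yaml
    from fake_useragent import UserAgent

    with open('datasets.yaml') as fp:
        DATASETS = yaml.safe_load(fp)

    @pytest.mark.parametrize('name', sorted(DATASETS))
    def test_url(name):
        # Each dataset key becomes a test id, e.g. test_url[HJDB].
        ua = UserAgent()
        req = urllib.request.Request(DATASETS[name]['url'],
                                     headers={'User-Agent': ua.random})
        assert urllib.request.urlopen(req, timeout=5).getcode() == 200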

Sent him a mail.

Yes, Jason responded. It's part of the MIREX downbeat task and no longer available...

holzapfel:onset is also no longer working since his website is offline.

It's here:
"https://kth.box.com/s/o151l3rqtglhmeszah06wmvpcmpat6w9

Das sind die Daten wie wir sie im folgenden Artikel verwendet haben: 
Holzapfel, A., Stylianou, Y., Gedik, A. C., & Bozkurt, B. (2010). Three 
dimensions of pitched instrument onset detection. IEEE Transactions on 
Audio, Speech, and Language Processing, 18(6), 1517–1527"

Okay, fixed that one. Will put the rest into an archive.yaml

Or another binary field which says "available: True/False"?
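For illustration, a record might then look something like this (field names are hypothetical, not necessarily the current schema):

    HJDB:
      url: http://example.com/hjdb
      available: false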

Pro: It's all in one file.

Con: Naive consumers will probably not take this flag into account...

Opinions?

doing a little housekeeping between this and ismir-home, and noticed that there are a bunch of dataset URLs (13/348) that seem to be broken. It seems that there's a higher-level issue here than when we started, namely:

how do we triage broken URLs when they get stepped on?

There are generally two cases where this will come up:

  1. A contributor trying to do something else triggers a build that shakes out a new failure
  2. A scheduled build (we should run it every week) finds a new broken link

We could have different strategies for 1 & 2, but it seems like one answer is fine for now. Some (not mutually exclusive) options that address both cases:

  • tag failing URLs as "stale / not available / missing", maybe mark the rows as red / gray?
  • re-order the dataset table to have two sections, one for available and one for unavailable / archive
  • add a "maintainer" field to each database record so that we know who to bother when things break
  • don't condition a passing build on all URLs being available.

In general, I vote for one YAML file, at least flagging missing / unavailable datasets in the output JS table and markdown, and maybe blessing Travis-CI with the ability to update the YAML based on what's healthy. In the meantime we could run this process manually. Whether we use pytest or a separate script to update / modify the YAML is a matter of philosophy, since it's probably bad practice for py.test to modify files it takes as input (but maybe that's just too strict).
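As a minimal sketch of what such a separate script could look like, assuming one top-level YAML mapping of records that each carry a url field (file and field names hypothetical):

    import urllib.request

    import yaml
    from fake_useragent import UserAgent

    def check_url(url, timeout=5):
        # A URL counts as healthy if it answers without an HTTP/network error.
        req = urllib.request.Request(url,
                                     headers={'User-Agent': UserAgent().random})
        try:
            urllib.request.urlopen(req, timeout=timeout)
            return True
        except Exception:
            return False

    def update_health(infile='datasets.yaml'):
        # Check every record's URL and write the verdicts back into the YAML.
        with open(infile) as fp:
            datasets = yaml.safe_load(fp)
        for record in datasets.values():
            record['available'] = check_url(record['url'])
        with open(infile, 'w') as fp:
            yaml.safe_dump(datasets, fp, default_flow_style=False)

Travis could then run update_health on a weekly cron and commit the result, keeping pytest itself read-only.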

I would vote for keeping them all in the list, but marking them somehow as you suggested (my old website used strike-through). It would be good to mark them as offline as of the date of the test, as I noticed that many university dataset pages tend to go offline surprisingly regularly but then come back online eventually.
A maintainer field would implicitly be there once we add the DOI/paper reference as planned, so I don't think it's explicitly needed.

ah, good call on the DOI / paper reference, I'd forgotten about that. That'll more than fulfill the need to chase someone down about broken / stale links... maybe also incentivize getting datasets hosted on Zenodo.

so it sounds like we have a proposal then:

  • bad links don't break the build
  • links get marked as green/available or red/unavailable in the resulting table
  • link health is ephemeral and not logged in the yaml
  • we can provide a top-level timestamp as to when the table was last verified (see the sketch below)
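Something like this could implement the last three points at render time — reusing check_url from the sketch above, using the strike-through idea for dead links, and never writing health back to the YAML (names hypothetical):

    from datetime import datetime, timezone

    def render_markdown(datasets):
        # Stamp the output with when the links were verified, and strike
        # through any link that is currently unreachable.
        lines = ['Last verified: {}'.format(datetime.now(timezone.utc).date()),
                 '']
        for name in sorted(datasets):
            link = '[{}]({})'.format(name, datasets[name]['url'])
            lines.append('- ' + (link if check_url(datasets[name]['url'])
                                 else '~~' + link + '~~'))
        return '\n'.join(lines)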

sound like a plan? anything missing on this issue?

well done @ejhumphrey
#28 got merged!