edgi-govdata-archiving/web-monitoring-processing

Put wayback API in a separate package.

danielballan opened this issue · 9 comments

The code that interacts with the CDX API and the Memento API would be a useful unit for applications other than ours. It might also be something that the folks at IA would be willing to adopt or co-maintain with us. The scope of this issue is to separate that code from the code that interacts with web-monitoring-db and its schema, and then put it in a separate repo and publish it as a separate package on PyPI.

This package would be complementary to https://archive.org/services/docs/api/internetarchive/ which addresses the Item API to the archive at large. This new package would focus on APIs to Wayback specifically. We cover CDX and Memento; we might eventually add timemap. There is an existing package with a similar scope, https://pypi.org/project/wayback/, but it seems to be abandoned (last update 2013).

Pretty much all the code involved here (excepting tests and so on) is already in the internetarchive module, too.

I think the pieces that don’t belong (or that should possibly be different) are:

  • format_version()
  • WaybackClient.timestamped_uri_to_version() (which mostly just ties WaybackClient.get_memento() together with format_version())

One other question I have in my head is: WaybackClient.get_memento() returns an actual HTTP response object from the requests package. I’m not totally confident that’s a good level of abstraction for a more public package (e.g. what if we wanted to change the HTTP library we use?), but I think we can also address that question after extracting this into a package.
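To make the abstraction question concrete, here is a minimal sketch of one option: copy the fields callers need into a small value object instead of handing back the HTTP library's response. The name `MementoResult` and its fields are hypothetical and are not part of the existing code; the sketch only assumes the response object exposes `url`, `status_code`, `headers`, and `content`, as a requests response does.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Mapping


@dataclass(frozen=True)
class MementoResult:
    """Hypothetical library-agnostic container for a fetched memento."""
    url: str
    status_code: int
    headers: Mapping[str, str]
    content: bytes

    @classmethod
    def from_response(cls, response):
        # Copy only the attributes we rely on, so callers never touch the
        # underlying HTTP library's response type directly.
        return cls(
            url=response.url,
            status_code=response.status_code,
            headers=dict(response.headers),
            content=response.content,
        )


# Stand-in for an HTTP response, used here only to demonstrate the conversion.
fake_response = SimpleNamespace(
    url="https://web.archive.org/web/20190101000000id_/https://example.com/",
    status_code=200,
    headers={"Content-Type": "text/html"},
    content=b"<html></html>",
)
memento = MementoResult.from_response(fake_response)
```

With something like this, swapping the HTTP library would only require changing the conversion step, not every caller.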

Also, for reference, the conversations on Internet Archive’s Slack about this:

Per discussion today on naming…

  • We’ll call this package wayback-client (or waybackclient?)
  • @Mr0grog will try and get in touch with the maintainer of wayback and see if we can take that over. It hasn’t been updated in 6 years and covers similar functionality (but without all the extra tooling we’ve added for robustness here).

Progress! I had left off weeks ago mid-struggle with Circle CI caches. Now Circle builds pass.

We agreed to adopt ReadTheDocs instead of the system we currently use for web-monitoring-processing involving Travis-CI and GitHub Pages orchestrated by doctr. I have added ReadTheDocs configuration; the docs now automatically build on RTD and they are published at https://wayback.readthedocs.io/en/latest/.

I have not yet done any of the API rework discussed in this issue. I still intend to do that, but now that CI and docs are working other contributions can be made in parallel.

I tagged an alpha release and attempted to publish it as a prerelease on PyPI and got this error:

$ twine upload dist/*
Enter your username: danielballan
Enter your password: 
Uploading distributions to https://upload.pypi.org/legacy/
Uploading wayback-0.1.0a2-py3-none-any.whl
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25.1k/25.1k [00:00<00:00, 50.0kB/s]
NOTE: Try --verbose to see response content.
HTTPError: 403 Client Error: The credential associated with user 'danielballan' isn't allowed to upload to project 'wayback'. See https://pypi.org/help/#project-name for more information. for url: https://upload.pypi.org/legacy/

The linked section of the docs discusses claiming squatted/abandoned names. The URL http://pypi.org/project/wayback gives a 404. Any idea what I can do from here, @Mr0grog?

Building on this comment, maybe list_versions should be reworked as well. It provides a "simplified API", which arguably serves to obfuscate the options that the CDX API makes available, and it offers the option to skip repeats. Maybe it could be replaced by search_unique, which wraps search (same signature, not a "simplified" one) and does the skip-repeats thing.
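As a rough illustration of the search_unique idea, here is a minimal sketch. It assumes that a "repeat" means consecutive results with the same content digest, and that search is a callable yielding records with a `digest` attribute; the `Record` type and `fake_search` below are hypothetical stand-ins, not the real client code.

```python
from collections import namedtuple

# Hypothetical stand-in for a CDX search result; only the fields used here.
Record = namedtuple("Record", ["timestamp", "url", "digest"])


def search_unique(search, *args, **kwargs):
    """Yield results of search(*args, **kwargs), skipping consecutive repeats.

    Takes the same arguments as the wrapped search and passes them through
    unchanged, so no CDX options are hidden from the caller.
    """
    previous_digest = None
    for record in search(*args, **kwargs):
        if record.digest == previous_digest:
            continue  # Same content as the previous capture: skip it.
        previous_digest = record.digest
        yield record


def fake_search(url, **params):
    # Hypothetical search function returning canned results for the demo.
    yield Record("20190101000000", url, "AAA")
    yield Record("20190102000000", url, "AAA")
    yield Record("20190103000000", url, "BBB")
    yield Record("20190104000000", url, "AAA")


unique = list(search_unique(fake_search, "https://example.com/"))
```

Note that only consecutive duplicates are dropped here, so content that changes and later reverts still shows up as a separate result.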

> I tagged an alpha release and attempted to publish it as a prerelease on PyPI and got this error:

I didn’t know your PyPI account name, so you’ll need to log in with the envirodgi account and use it to add yourself (I sent you the password back when we set it up; let me know if you need it again).

> Building on this comment maybe list_versions should be reworked as well.

I’d actually be in favor of just removing it. It originally made a lot of sense for our use case, right up until we realized it was anti-helpful for our use case :P (Today it just supports a feature we never actually use!)

Right, I had forgotten. I scrolled up in Slack and refreshed my memory on all this. Thanks for your patience. Released as https://pypi.org/project/wayback/0.2.0a1/ since, as you noted, the previous package named wayback left off at 0.1.

Sure, that works for me. Seems like format_version, WaybackClient.timestamped_uri_to_version and WaybackClient.list_versions could all be removed, then. I don't think they have generic utility. We can of course keep utility functions in web-monitoring-processing for transforming the output of WaybackClient.get_memento into our web-monitoring-db object structure.

👍 makes sense to me!

Closed by #511. The ideas for the future of the code in wayback raised in this issue have been captured as issues in https://github.com/edgi-govdata-archiving/wayback/issues.