edgi-govdata-archiving/web-monitoring-processing

Extract diffs and diff server into separate repo & Python package

Mr0grog opened this issue · 7 comments

Following on the heels of #206 and #477, I believe it’s time to create a separate repo and Python package for the diffs and the diff server (i.e. for web_monitoring.diff, web_monitoring.diff_server, and the wm-diffing-server command-line tool).

Why?

This is a big change, but it’s important and worthwhile because:

  • We’ve proven through time and through third-party usage that the diff tools here are reasonably abstract from EDGI’s Web Monitoring work and useful to others. (For example, the Internet Archive now runs the diff server as part of the Wayback Machine.)

  • Having it mixed in with other tools here means the dependencies are less clear, and the architectural and release needs of different parts of this project sometimes collide and get in each others’ way.

  • Outside users (read: the Internet Archive) have repeatedly asked for clearer release management and versioning, which has been tough to do well with all the cross-concerns in this repo (see above point).

  • EDGI’s Web Monitoring team is moving towards releasing more data publicly, and I’d like to be able to tell a clearer story about how to analyze that data, reproduce our diffs for analysis, etc. That’s much easier to do when there is a focused Python package people can install and use.

The New Repo/Package

The new repo and package won’t be prefixed with “web-monitoring-” because one of the motivating factors here is that we are abstracting these tools from the Web Monitoring project. (This is also why the Wayback package is not “web-monitoring-wayback.”)

However, the new name is still up in the air. Possible candidates (please suggest others in the comments if you have thoughts):

  • webdiffer
  • web-diff
  • webdiffs
  • webdiff-suite
  • diffweb

I would really love some suggestions here; “webdiff” is an existing project on PyPI, and I’m somewhat worried all these names are too close to that. (“Web” is in most of these to try and describe a little more clearly the area our diff algorithms are focused on, but also that they are more broad than just HTML. There’s probably a better choice out there that @danielballan and I haven’t thought of yet.)

Other important features of the new project:

  • The server should be an optional addition. It’s in the codebase, but you’d run:

    $ pip install webdiffer[server]

    To install it instead of:

    $ pip install webdiffer

  • Some of the experimental diffs that we don’t actively use should also be optional dependencies. Specifically: https://github.com/anastasia/htmldiffer and https://github.com/danielballan/htmltreediff. They haven’t been especially active projects, and I think there are some dependency conflicts.

  • Should handle docs like the Wayback package does rather than how they are handled here.

  • Should be released as a package on PyPI, with clear versioning.

  • For now, this project should probably depend on the new one and re-export it, so things continue to work while external users update. After the Internet Archive and EDGI’s deployments have all migrated, we’ll remove it from here.
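One way the temporary re-export could work is sketched below, assuming the old modules forward attribute lookups to the new package through a module-level `__getattr__` (PEP 562) and emit a `DeprecationWarning` along the way. The helper name `make_forwarding_getattr` is hypothetical, and since the new package name is still undecided, the module name in the usage comment is a placeholder.

```python
import importlib
import warnings


def make_forwarding_getattr(old_name, new_name):
    """Return a module-level __getattr__ that forwards lookups to `new_name`.

    Hypothetical helper: both module names are supplied by the caller.
    """
    def __getattr__(attribute):
        new_module = importlib.import_module(new_name)
        try:
            value = getattr(new_module, attribute)
        except AttributeError:
            raise AttributeError(
                f"module {old_name!r} has no attribute {attribute!r}"
            ) from None
        # Tell callers where the name lives now, pointing at their own code.
        warnings.warn(
            f"{old_name}.{attribute} has moved to {new_name}.{attribute}",
            DeprecationWarning,
            stacklevel=2,
        )
        return value

    return __getattr__


# In an old module (for example web_monitoring/diff/__init__.py) this
# would be wired up roughly as:
#   __getattr__ = make_forwarding_getattr(__name__, "<new package name>")
```

Existing imports keep working with this approach, while each use points people at the new location before the old modules are removed.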

Timeframe

I’m looking to do this next week! I know that’s soon, but I’d like to do this before the Web Monitoring team does a public data release, and hopefully have time to iron out any issues.

To-Do

  • Create new repo
  • Filter history of this repo to just the relevant bits
  • Rename package and internal references
  • Upload to new repo
  • Separate diff & server requirements
  • Rewrite README
  • Write minimal docs
  • Set up readthedocs.org
  • Get tests running in Circle
  • Get docker builds running/publishing in Circle
  • Move all relevant issues to the new repo
  • Publish v0.1.0 to PyPI
  • Deprecate diffs and diff server in this package

Just my 2 cents on the naming: Using differ over diff feels better because there's the additional server component. diff makes me think that it's just a function/package that only diffs files. So I'd vote for webdiffer, though I like web-differ more, which wasn't suggested. Just a preference, no good reason to include the dash.

Also, if we're thinking more broadly about file types (especially if we get to PDF diffing), we could consider something like: file-differ or doc-differ? Just throwing those out there

doc-diff is an existing package, so it has similar name-collision issues to web-diff et al: https://pypi.org/project/doc-diff/ (That doesn’t disqualify it; just that it’s also a name that has to be tread carefully around.)

I’m not super-keen on “file,” since we don’t really deal with files or filesystems, but it’s definitely an option. file-diff is also an existing project: https://pypi.org/project/file-diff/

Yea, those do all have the same collision issues. It just hit me, but how about diff-scanner? Searched PyPI and didn't find anything similar. Makes some sense 'cause it helps scan diffs (I think this was also a connotation from the original name?) and pays homage to the WM project =)

After more conversations with other folks on Slack, I’m leaning towards the less decision-laden “web-monitoring-diff.” It doesn't really name it as a stand-alone thing, but I think everybody’s having trouble coming up with a name that feels good and doesn’t conflict with anything already out there.

OK, a significant amount of this is now done at https://github.com/edgi-govdata-archiving/web-monitoring-diff

I still need to:

  • Set up new docs (nothing from the old processing docs was actually applicable to the diffs/server).
  • Get the docs building on readthedocs.org
  • Split up the requirements:
    • Basic package requirements
    • Server requirements (should be installable as an extra, i.e. pip install web-monitoring-diff[server])
    • Experimental diffs we install via git (the wrapper functions for these need to be moved out of basic_diffs.py, too)
    • Make dev requirements installable as an extra, i.e. pip install web-monitoring-diff[test]
  • Get tests running on CircleCI
  • Get docker builds/releases running on CircleCI

What I have done:

  • History has been rewritten.
  • The README has been almost entirely rewritten.
  • setup.py has been updated for this package.
  • Versioneer has been updated.
  • Requirements have been slimmed down to just what's needed for this codebase.
  • Requirements can be installed via pip install web-monitoring-diff or pip install web-monitoring-diff[server,dev]. requirements.txt files are not needed except for Docker and "experimental" diffs.
  • The server script has been renamed to web-monitoring-diff-server to match the package name. It's long, but now that we are treating this as more abstract, I'm leery of abbreviating as much.
  • Modules have been moved around and renamed. web_monitoring_diff.diff.<whatever> and web_monitoring_diff.diff_server both seemed pretty redundant, so we now have:
    • web_monitoring_diff.<diff_function> -- all the diff functions are exposed directly at the top level.
    • The actual submodules containing diff functions are named <whatever>_diff.py:
      • basic_diffs.py instead of differs to be more clear that these are relatively simple functions all thrown together.
      • html_render_diff.py instead of html_diff_render.py to fit the convention.
      • html_links_diff.py instead of links_diff.py to clarify that this is for HTML documents.
      • experimental_diffs.py contains the diffs that are no longer actively used and were never especially well supported. They have to be installed via git instead of PyPI, so setup is also kind of special. They get wrapped with a fancy bit of tooling that adds a .supported property to them indicating whether the underlying experimental package has been installed.
    • web_monitoring_diff.exceptions as a common standard for where our exception types belong.
    • web_monitoring_diff.server for the server. It's a subpackage since the content of the server module is so big. I think it's a pretty obvious future plan to refactor that megamodule into smaller files and this lets us do so.
  • deployment.md has been removed -- it's no longer relevant.

@danielballan @lightandluck while I’m getting the last bits done, I’d love any feedback on what’s finished, especially the new README and all the renaming described above.


Updated: Requirements have been split up! Experimental, poorly supported diffs are sequestered away in experimental_diffs with some extra tooling.

Updated Oct. 26: docs are now working and building on RTD.

Update: the different groups of dependencies have been split up:

  • The core dependencies: pip install web-monitoring-diff
  • The server dependencies: pip install web-monitoring-diff[server]
  • The dev/test dependencies: pip install web-monitoring-diff[dev]
  • The “experimental” diffs: pip install -r requirements-experimental.txt (have to use a requirements file since they are git URLs)
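For readers unfamiliar with extras, here is a rough sketch of how groups like these are usually declared with setuptools' `install_requires` and `extras_require`. The dependency names are hypothetical placeholders, not the project's actual requirements.

```python
# Hypothetical dependency names; the real lists live in the project's setup.py.
CORE_REQUIREMENTS = ["example-diff-library"]

EXTRAS = {
    # Installed by: pip install web-monitoring-diff[server]
    "server": ["example-web-framework"],
    # Installed by: pip install web-monitoring-diff[dev]
    "dev": ["example-test-runner"],
}

SETUP_KWARGS = {
    "name": "web-monitoring-diff",
    "install_requires": CORE_REQUIREMENTS,
    "extras_require": EXTRAS,
}

# A real setup.py would finish with:
#   from setuptools import setup
#   setup(**SETUP_KWARGS)
```

The experimental diffs stay out of this structure because, as noted above, they are git URLs and go in a requirements file instead.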

The experimental diffs have some fancy tooling; not sure if I went overboard here. https://github.com/edgi-govdata-archiving/web-monitoring-diff/blob/b84ad467510fe47f6e3ef1a0ac6a3d408232cf9a/web_monitoring_diff/experimental_diffs.py#L22-L43

With this, you can still import them, but check whether they are supported before running. For example:

from web_monitoring_diff.experimental_diffs import html_tree_diff
html_tree_diff.supported  # whether the underlying experimental package is installed
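To illustrate the general idea (this is not the code linked above), here is a minimal sketch of a decorator that attaches a `.supported` flag to a function depending on whether an optional module can be imported. The decorator name `requires_optional` and both example functions are hypothetical, and standard-library modules stand in for the experimental packages.

```python
import importlib
from functools import wraps


def requires_optional(module_name):
    """Decorate a diff function that depends on an optional module."""
    try:
        optional_module = importlib.import_module(module_name)
    except ImportError:
        optional_module = None

    def decorator(diff_function):
        @wraps(diff_function)
        def wrapper(*args, **kwargs):
            if optional_module is None:
                raise RuntimeError(
                    f"{diff_function.__name__} needs the optional "
                    f"module {module_name!r}, which is not installed"
                )
            # Hand the imported module to the wrapped function.
            return diff_function(optional_module, *args, **kwargs)

        # Callers can check this flag before running the diff.
        wrapper.supported = optional_module is not None
        return wrapper

    return decorator


# Stand-in examples: `json` is always available, the second module is not.
@requires_optional("json")
def available_diff(json_module, a, b):
    return json_module.dumps({"changed": a != b})


@requires_optional("module_that_is_not_installed")
def unavailable_diff(missing_module, a, b):
    return missing_module.diff(a, b)
```

Importing either function always succeeds; only calling an unsupported one raises, which is what lets a caller check `.supported` first.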

Not sure if I should have just kept it simpler and put each one in a separate module, and let importing those modules fail catastrophically. (And had the server wrap the imports in a try/except.) If that might be better, I’ll change it.
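For comparison, the simpler alternative might look something like the sketch below, assuming the server builds its table of diff routes by attempting each import and skipping the ones that fail. The function name `load_available_diffs` and the route names are hypothetical, and standard-library modules stand in for the experimental ones.

```python
import importlib


def load_available_diffs(candidates):
    """Map route names to diff functions, skipping modules that fail to import.

    `candidates` maps a route name to a (module name, function name) pair.
    """
    routes = {}
    for route, (module_name, function_name) in candidates.items():
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Optional dependency is missing: leave this diff unregistered.
            continue
        routes[route] = getattr(module, function_name)
    return routes


# Stand-in modules for illustration only.
ROUTES = load_available_diffs({
    "present": ("json", "dumps"),
    "absent": ("module_that_is_not_installed", "diff"),
})
```

The trade-off is that code outside the server has to handle the `ImportError` itself, rather than checking a flag on an always-importable function.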

Docs are now running on RTD.org: https://web-monitoring-diff.readthedocs.io/en/latest/

This took a little funny business due to RTD not supporting pycurl, but luckily they have a functional workaround.