Extract diffs and diff server into separate repo & Python package
Mr0grog opened this issue · 7 comments
Following on the heels of #206 and #477, I believe it’s time to create a separate repo and Python package for the diffs and the diff server (i.e. for web_monitoring.diff, web_monitoring.diff_server, and the wm-diffing-server command-line tool).
Why?
This is a big change, but it’s important and worthwhile because:
- We’ve proven through time and through third-party usage that the diff tools here are reasonably abstract from EDGI’s Web Monitoring work and useful to others. (For example, the Internet Archive now runs the diff server as part of the Wayback Machine.)
- Having it mixed in with other tools here means the dependencies are less clear, and the architectural and release needs of different parts of this project sometimes collide and get in each other’s way.
- Outside users (read: the Internet Archive) have repeatedly asked for clearer release management and versioning, which has been tough to do well with all the cross-concerns in this repo (see above point).
- EDGI’s Web Monitoring team is moving towards releasing more data publicly, and I’d like to be able to tell a clearer story about how to analyze that data, reproduce our diffs for analysis, etc. That’s much easier to do when there is a focused Python package people can install and use.
The New Repo/Package
The new repo and package won’t be prefixed with “web-monitoring-” because one of the motivating factors here is that we are abstracting these tools from the Web Monitoring project. (This is also why the Wayback package is not “web-monitoring-wayback.”)
However, the new name is still up in the air. Possible candidates (please suggest others in the comments if you have thoughts):
- webdiffer
- web-diff
- webdiffs
- webdiff-suite
- diffweb
I would really love some suggestions here; “webdiff” is an existing project on PyPI, and I’m somewhat worried all these names are too close to that. (“Web” is in most of these to try and describe a little more clearly the area our diff algorithms are focused on, but also that they are more broad than just HTML. There’s probably a better choice out there that @danielballan and I haven’t thought of yet.)
Other important features of the new project:
- The server should be an optional addition. It’s in the codebase, but you’d run:
  $ pip install webdiffer[server]
  to install it, instead of:
  $ pip install webdiffer
- Some of the experimental diffs that we don’t actively use should also be optional dependencies. Specifically: https://github.com/anastasia/htmldiffer and https://github.com/danielballan/htmltreediff. They haven’t been especially active projects, and I think there are some dependency conflicts.
- Should handle docs like the Wayback package does rather than how they are handled here.
- Should be released as a package on PyPI, with clear versioning.
- For now, this project should probably depend on the new one and re-export it, so things continue to work while external users update. After the Internet Archive and EDGI’s deployments have all migrated, we’ll remove it from here.
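For the interim re-export step in the last point, here is a minimal sketch of one way the old package could keep exposing a function while warning callers to migrate. Everything named here is an assumption for illustration: deprecated_reexport is a hypothetical helper, _new_package_text_diff is a stand-in for a function that would really be imported from the new package, and the dotted paths are placeholders rather than real module paths.

```python
import functools
import warnings


def deprecated_reexport(func, old_path, new_path):
    """Return a wrapper around ``func`` that warns before delegating to it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_path} is deprecated; import {new_path} instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not this wrapper
        )
        return func(*args, **kwargs)
    return wrapper


# Stand-in for a diff function that would live in the new package.
# (Hypothetical: in practice this would be imported, not defined here.)
def _new_package_text_diff(a, b):
    return a != b


# What the old module could expose while external users migrate.
# Both dotted paths are illustrative placeholders.
text_diff = deprecated_reexport(
    _new_package_text_diff,
    old_path="old_package.text_diff",
    new_path="new_package.text_diff",
)
```

Calling text_diff still works exactly as before, but emits a DeprecationWarning each time, which gives deployments a visible signal before the old import path is removed.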
Timeframe
I’m looking to do this next week! I know that’s soon, but I’d like to do this before the Web Monitoring team does a public data release, and hopefully have time to iron out any issues.
To-Do
- Create new repo
- Filter history of this repo to just the relevant bits
- Rename package and internal references
- Upload to new repo
- Separate diff & server requirements
- Rewrite README
- Write minimal docs
- Set up readthedocs.org
- Get tests running in Circle
- Get docker builds running/publishing in Circle
- Move all relevant issues to the new repo
- Publish v0.1.0 to PyPI
- Deprecate diffs and diff server in this package
Just my 2 cents on the naming: Using differ over diff feels better because there's the additional server component. diff makes me think that it's just a function/package that only diffs files. So I'd vote for webdiffer, though I like web-differ more, which wasn't suggested. Just a preference, no good reason to include the dash.
Also, if we're thinking more broadly about file types (especially if we get to PDF diffing), we could consider something like: file-differ or doc-differ? Just throwing those out there.
doc-diff is an existing package, so it has similar name-collision issues as web-diff et al: https://pypi.org/project/doc-diff/ (That doesn’t disqualify it; just that it’s also a name that has to be tread carefully around.)
I’m not super-keen on “file,” since we don’t really deal with files or filesystems, but it’s definitely an option. file-diff is also an existing project: https://pypi.org/project/file-diff/
Yea, those do all have the same collision issues. It just hit me, but how about diff-scanner? Searched PyPI and didn't find anything similar. Makes some sense 'cause it helps scan diffs (I think this was also a connotation from the original name?) and pays homage to the WM project =)
After more conversations with other folks on Slack, I’m leaning towards the less decision-laden “web-monitoring-diff.” It doesn't really name it as a stand-alone thing, but I think everybody’s having trouble coming up with a name that feels good and doesn’t conflict with anything already out there.
OK, a significant amount of this is now done at https://github.com/edgi-govdata-archiving/web-monitoring-diff
I still need to:
- Set up new docs (nothing from the old processing docs was actually applicable to the diffs/server).
- Get the docs building on readthedocs.org
- Split up the requirements:
  - Basic package requirements
  - Server requirements (should be installable as an extra, i.e. pip install web-monitoring-diff[server])
  - Experimental diffs we install via git (the wrapper functions for these need to be moved out of basic_diffs.py, too)
  - Make dev requirements installable as an extra, i.e. pip install web-monitoring-diff[test]
- Get tests running on CircleCI
- Get docker builds/releases running on CircleCI
What I have done:
- REWRITTEN HISTORY
- The README has been almost entirely rewritten.
- setup.py has been updated for this package.
- Versioneer has been updated.
- Requirements have been slimmed down to just what's needed for this codebase.
- Requirements can be installed via pip install web-monitoring-diff or pip install web-monitoring-diff[server,dev]. requirements.txt files are not needed except for Docker and "experimental" diffs.
- The server script has been renamed to web-monitoring-diff-server to match the package name. It's long, but now that we are treating this as more abstract, I'm leery of abbreviating as much.
- Modules have been moved around and renamed. web_monitoring_diff.diff.<whatever> and web_monitoring_diff.diff_server both seemed pretty redundant, so we now have:
  - web_monitoring_diff.<diff_function> -- all the diff functions are exposed directly at the top level.
  - The actual submodules containing diff functions are named <whatever>_diff.py:
    - basic_diffs.py instead of differs to be more clear that these are relatively simple functions all thrown together.
    - html_render_diff.py instead of html_diff_render.py to fit the convention.
    - html_links_diff.py instead of links_diff.py to clarify that this is for HTML documents.
    - experimental_diffs.py contains the diffs that are no longer actively used and were never especially well supported. They have to be installed via git instead of PyPI, so setup is also kind of special. They get wrapped with a fancy bit of tooling that adds a .supported property to them indicating whether the underlying experimental package has been installed.
  - web_monitoring_diff.exceptions as a common standard for where our exception types belong.
  - web_monitoring_diff.server for the server. It's a subpackage since the content of the server module is so big. I think it's a pretty obvious future plan to refactor that megamodule into smaller files and this lets us do so.
- deployment.md has been removed -- it's no longer relevant.
@danielballan @lightandluck while I’m getting the last bits done, I’d love any feedback on what’s finished, especially the new README and all the renaming described above.
Updated: Requirements have been split up! Experimental, poorly supported diffs are sequestered away in experimental_diffs with some extra tooling.
Updated Oct. 26: docs are now working and building on RTD.
Update: the different groups of dependencies have been split up:
- The core dependencies: pip install web-monitoring-diff
- The server dependencies: pip install web-monitoring-diff[server]
- The dev/test dependencies: pip install web-monitoring-diff[dev]
- The “experimental” diffs: pip install -r requirements-experimental.txt (have to use a requirements file since they are git URLs)
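As a rough illustration of how the core/server/dev groups above map onto setuptools, here is a small sketch. The requirement names are hypothetical placeholders (the real lists live in the project's setup.py), and requirements_for is a made-up helper that only approximates how the groups combine when you ask pip for extras.

```python
# Hypothetical requirement names, used only to show the shape of the split.
CORE_REQUIRES = ["example-diff-lib"]
EXTRAS_REQUIRE = {
    "server": ["example-web-framework"],
    "dev": ["example-test-runner"],
}

# In a setup.py, these two values would be handed to setuptools as:
#     setup(..., install_requires=CORE_REQUIRES, extras_require=EXTRAS_REQUIRE)


def requirements_for(extras=()):
    """List the requirements implied by installing the package with ``extras``."""
    requirements = list(CORE_REQUIRES)  # core is always included
    for extra in extras:
        requirements.extend(EXTRAS_REQUIRE[extra])
    return requirements
```

With this layout, requirements_for() corresponds to the plain install and requirements_for(["server", "dev"]) to the [server,dev] form. The git-URL experimental diffs stay outside this structure, in their own requirements file.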
The experimental diffs have some fancy tooling; not sure if I went overboard here. https://github.com/edgi-govdata-archiving/web-monitoring-diff/blob/b84ad467510fe47f6e3ef1a0ac6a3d408232cf9a/web_monitoring_diff/experimental_diffs.py#L22-L43
With this, you can still import them, but check whether they are supported before running. For example:
from web_monitoring_diff.experimental_diffs import html_tree_diff
html_tree_diff.supported
Not sure if I should have just kept it simpler and put each one in a separate module, and let importing those modules fail catastrophically. (And had the server wrap the imports in a try/catch.) If that might be better, I’ll change it.
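For anyone weighing the two approaches, here is a minimal sketch of the general "wrap it and expose .supported" idea. To be clear, this is not the code at the link above: requires_optional is a hypothetical decorator, the diff functions are stand-ins, and "json" is used only as an always-available stdlib module so the example runs anywhere.

```python
import functools
import importlib.util


def requires_optional(module_name):
    """Decorate a diff function that depends on an optional top-level module.

    The decorated function gains a ``supported`` attribute, and raises
    RuntimeError when called while the optional module is not installed.
    """
    def decorator(func):
        # find_spec returns None when a top-level module cannot be found.
        supported = importlib.util.find_spec(module_name) is not None

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not supported:
                raise RuntimeError(
                    f"{func.__name__} needs the optional '{module_name}' package."
                )
            return func(*args, **kwargs)

        wrapper.supported = supported
        return wrapper
    return decorator


@requires_optional("json")  # stdlib stand-in for an installed optional package
def available_diff(a, b):
    return {"changed": a != b}


@requires_optional("not_a_real_optional_package")  # deliberately missing
def unavailable_diff(a, b):
    return {"changed": a != b}
```

A server could then check available_diff.supported before registering a route, instead of wrapping each import in try/except. The trade-off is a bit more machinery in exchange for imports that never fail.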
Docs are now running on RTD.org: https://web-monitoring-diff.readthedocs.io/en/latest/
This took a little funny business due to RTD not supporting pycurl, but luckily they have a functional workaround.