timtadh/zhang-shasha

Background on per-forest operations matrices?

Closed this issue · 9 comments

@erickrf, I'd be interested to know how you came to the approach of using a temporary, per-forest matrix of arrays of operations to accumulate operations in the forestdist computation. Is there another implementation out there that takes that approach?

Strikes me that there's a potentially tunable trade-off between allocating matrices of empty arrays and leaving a bigger operations[|T_1|][|T_2|] with integer-encoded operation types in memory along with treedist.

I'm thinking about this for a possible WASM implementation for the browser.

It just implemented what seemed to me as the simplest solution. I suppose there are more elaborate solutions with less overhead, as you suggest.

Thanks! And great job. Aesthetically, I really like how your approach tracks the space optimization of forestdist.

@kemitchell for that it is worth it looks like the original implementation of the js version you linked earlier was based on an earlier version of this repo (just based on the similarity of the code in its original distance function) -- so it should be "relatively easy" to integrate @erickrf approach to computing operations into it.

@timtadh, npm:edit-distance uses a single, global matrix of enumerated operation-type constants, and backtracks through it to create an operation sequence once distance has been computed. Otherwise, I agree it looks like a faithful port of your Python work, before @erickrf's contribution. Looks like the JS author then implemented operation tracking in their own way.

I've already started a new JS implementation, and have it working both with and without temporary operations matrices. I'm working through test suites and some comparisons to see whether they differ in the mappings they produce.

For your purposes, the most relevant part of my work may be bringing together an edit distance test suite as a JSON data file. Medium term, I want something like that for testing an eventual WASM implementation Zhang-Shasha, and potentially Pawlik-Augsten/APTED, if I become convinced it isn't patent-encumbered.

If you are worried about patents I would just ask Augsten directly (his email is on the TED site). Since he doesn't mention patents anywhere on the site I think you are in the clear. Otherwise, the complexity of the newer algorithms is significantly higher (see their APTED implementation). It seems easier to me to mechanically convert it to JS than to re-implement it. The reason for the added complexity (if I recall correctly) is to achieve good results it dynamically adjusts its approach based on the trees as it is computing the distance.

Also, if you are concerned about scalablity you might want to look into approximation methods -- Augsten has you covered there as well :-p . For instance, check out the PQ-Gram algorithm https://github.com/timtadh/PyGram

One final note, in a their most recent paper, they compare the number of subproblems (proxy for performance) between AP-TED+ and Zhang-Shasha (ZS):

image

Note, AP-TED+ is better on some trees but not others. As with many algorithms of this type (I work in Frequent Subgraph Mining) the data you are working with has a huge impact on the real world performance. That said, as the runtime figure shows, their AP-TED+ implementation is really good:

image

@timtadh, thanks for the info! I'll definitely drop Augsten a line re: patents.

I saw the Java APTED reference impl, and you're absolutely right: Not really looking forward to porting that, if that's what I need to do. Dynamic programming in not-so-dynamic languages. Always fun times.

Appreciate the link to your PQ-Gram implementation, as well. Looks like someone's already ported that to JS: https://www.npmjs.com/package/jqgram.

Do I correctly gather that PyGram will not return edit-operation sequences?

That is correct. It will give you an approximation of how different the trees are. I used it for approximate nearest neighbor indexing (using a VP-Tree) of abstract syntax tree fragments.