Strange result
Hi, thanks for the great library. I am currently testing it a bit, and one result confuses me.
import pydiff
l1 = "Hello John Doe"
l2 = "Hello Sue"
pydiff.diff_words(l1,l2)
[{'count': 2, 'value': 'Hello '}, {'count': 3, 'added': None, 'removed': True, 'value': 'John Doe'}, {'count': 1, 'added': True, 'removed': None, 'value': 'Sue'}]
Why is the count 2 for the first block, and 3 for the removed block? The count of 1 for the last block makes sense. Thanks for your time.
Hi, it's nice to hear that the library is useful. In general, you can think of the count field as the number of adjacent tokens that the algorithm merged in order to produce the largest block possible.
So in your example, the count 2 means that the first diff block consists of 2 adjacent tokens: the word 'Hello' + a delimiter (a single space in your case). The same holds for the second block, i.e. 'John' + ' ' + 'Doe'.
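To make the counting concrete, here is a minimal sketch that reproduces the counts from the example above. It assumes tokens are simply runs of whitespace and runs of non-whitespace; the helper count_tokens is hypothetical and is not the library's actual tokenizer, which may split differently (e.g. around punctuation).

```python
import re


def count_tokens(value):
    # Hypothetical helper: treat every run of whitespace and every run of
    # non-whitespace as one token. This only approximates the library.
    return len(re.findall(r"\s+|\S+", value))


# The three block values from the diff_words output above.
blocks = ["Hello ", "John Doe", "Sue"]
counts = [count_tokens(block) for block in blocks]
print(counts)  # [2, 3, 1]
```

Under that assumption, 'Hello ' is the word plus a space (2 tokens) and 'John Doe' is two words plus the space between them (3 tokens).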
Because I am using the diff_words and not the diff_words_with_spaces method, I was expecting the counting to be different. Is there a way to get a count where only words are counted?
I wanted to calculate how similar two documents are using the Jaccard-Index and I am interested in how many words are different, not counting whitespace.
You can get a Jaccard index from the "bag of words" of each document. You just need to tokenize the documents (e.g. split over whitespace) and make sets out of the lists of words. Then you are able to compare them in the Jaccard fashion and apply any variation of the metric you need.
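A minimal sketch of that suggestion, assuming plain whitespace tokenization with no lowercasing or punctuation handling. The function name jaccard_index and the choice to return 1.0 for two empty documents are assumptions made for illustration.

```python
def jaccard_index(doc_a, doc_b):
    # Build the "bag of words" of each document as a set of
    # whitespace-separated tokens.
    words_a = set(doc_a.split())
    words_b = set(doc_b.split())
    union = words_a | words_b
    if not union:
        # Assumption: two empty documents count as identical.
        return 1.0
    return len(words_a & words_b) / len(union)


similarity = jaccard_index("Hello John Doe", "Hello Sue")
print(similarity)  # 0.25
```

Here the two sets share only 'Hello' out of four distinct words, so the index is 1/4.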
Using a diff for this seems like over-engineering.
Of course you are completely right. For calculating the Jaccard index I can do as you suggested; I missed that. I am showing the difference between two versions of a report, which is why I am using jsdiff in the browser, and I wanted to compute the diffs with PyReadableDiff to store them in the database. Other metrics I wanted to use are "words added" and "words deleted". My idea was just to iterate over the list of change dicts and add up the counts. Anyway, I can just tokenize each value on whitespace and get the word count.
Thanks a lot to both of you for your time and have a good day!
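For reference, the "words added" and "words deleted" idea could be sketched roughly as follows. It assumes the change list has the shape shown in the diff_words output at the top of the thread; word_change_counts is a hypothetical helper, not part of the library, and it counts whitespace-separated words rather than using the count field.

```python
def word_change_counts(changes):
    # Hypothetical helper: sum the whitespace-separated words in the
    # added and removed blocks, ignoring the library's own 'count' field.
    added = 0
    removed = 0
    for change in changes:
        n_words = len(change["value"].split())
        if change.get("added"):
            added += n_words
        elif change.get("removed"):
            removed += n_words
    return added, removed


# The change list from the first post, copied verbatim.
changes = [
    {'count': 2, 'value': 'Hello '},
    {'count': 3, 'added': None, 'removed': True, 'value': 'John Doe'},
    {'count': 1, 'added': True, 'removed': None, 'value': 'Sue'},
]
words_added, words_removed = word_change_counts(changes)
print(words_added, words_removed)  # 1 2
```

Unchanged blocks such as 'Hello ' carry neither flag and are skipped.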