Index out of range panic in DiffCharsToLines on large JSON diff
krylovsk opened this issue ยท 16 comments
I've encountered this issue while using https://github.com/src-d/go-git, but the bug is easily reproducible with the code snippet below and the JSON file in attachment.
package main
import (
"fmt"
"io/ioutil"
"os"
"github.com/sergi/go-diff/diffmatchpatch"
)
func main() {
f, err := os.Open("data.txt")
defer f.Close()
checkErr(err)
data, err := ioutil.ReadAll(f)
checkErr(err)
// from https://github.com/src-d/go-git/blob/v4.0.0/utils/diff/diff.go#L17
dmp := diffmatchpatch.New()
wSrc, wDst, warray := dmp.DiffLinesToChars(string(data), "")
diffs := dmp.DiffMain(wSrc, wDst, false)
diffs = dmp.DiffCharsToLines(diffs, warray)
fmt.Println(diffs)
}
func checkErr(err error) {
if err != nil {
panic(err)
}
}
Output:
$ go run main.go
panic: runtime error: index out of range
goroutine 1 [running]:
github.com/sergi/go-diff/diffmatchpatch.(*DiffMatchPatch).DiffCharsToLines(0xc420044ee8, 0xc420078390, 0x1, 0x2, 0xc4202ce000, 0xd802, 0xec00, 0x1, 0x2, 0xc4202ce000)
/Users/krylovsk/src/github.com/sergi/go-diff/diffmatchpatch/diff.go:414 +0x394
main.main()
/tmp/go-diff-debug/main.go:24 +0x29e
exit status 2
Hi, I've also met this issue while iterating on large repositories (wanting to analyze diffs) with github.com/src-d/go-git
@krylovsk did you manage to find a workaround ? (I guess the problem is the same, it might be a large diff, and as I was testing this in linux git repository, it is hard and very long to reproduce)
@clafoutis42 no sorry I haven't figured that one out. Eventually I switched to https://github.com/libgit2/git2go
We encountered the same problem at source{d}, and @mcuadros is currently debugging it and looking for the fix.
I limited the scope of the problem with one file (64k lines), it contains serveral language from Thai Script to Arabic.
This is the fixture: https://gist.github.com/mcuadros/6369103d4838a9042613596ee212f32c#file-fixture-go
And the code to replicate the issues is:
package main
import (
"io/ioutil"
"log"
"github.com/sergi/go-diff/diffmatchpatch"
)
func main() {
sNew, err := ioutil.ReadFile("/tmp/dst")
if err != nil {
panic(err)
}
dmp := diffmatchpatch.New()
t1, t2, t := dmp.DiffLinesToChars("", string(sNew))
diffs := dmp.DiffMain(t1, t2, false)
diffs = dmp.DiffCharsToLines(diffs, t)
for _, diff := range diffs {
log.Println(diff.Type, diff.Text)
}
}
Anyone know if @sergi is still active? I was hoping to submit a PR for this fix (unless someone else is or has a fix). I'm running into this now consistently :(
I too faced the issue and had a look on why the issue is occurring. I was able to know what the issue but for the fix its not that easy as you need to know most of the logic. Its better people who wrote it can fix it better than me.
Please have a look on this example.
Here is a sample code of what happens when a rune array is converted to string and back to rune array. Sample: https://play.golang.org/p/XFVKayUhV27
func main() {
r := []rune{55296}
s := string(r)
r1 := []rune(s)
fmt.Println(r, s, r1)
}
Output :
[55296] ๏ฟฝ [65533]
In go-diff , similar thing is done and issue occurs because of this usage.
Issue:
If my file has lines is 55296 or more this issue arrises, and max 65530 should be there.
How the issue occurs?
In go-diff the lineArray index is stored in the rune array ( In diff.go , diffLinesToRunesMunge() function ).
This is converted to string and stored in 'Diff' structure 'Text' variable. (In diff.go , diffMainRunes function)
There is come processing with this Diff.Text and logic is a bit difficult to understand :) Time Constraint too
Now here in diff.go , DiffCharsToLines method we convert the Diff.Text to rune array and use it (as index) to get the line from line array. and this is when it get the 'index out of bound' error
I hope , I had put some light on this issue. :)
Thank You.... P@i
Hi folks,
Sorry about the silence. I'm back and I'll be looking into this issue soon, but I'd definitely appreciate it if one of you wants to contribute a PR to it.
Hi All,
I have done a fix for this issue and with my testing its working.
Issue:
As I explained above issue is storing the index value in the rune.
Fix:
Moved the index stored from rune to string array. Using delimeter "," the string array is combined to a string . This string is converted to rune array and returned for DiffLinesToRunes.
DiffCharsToLines method, split the index string, which is the diff using the delimeter and then use the index to fetch data from linearray.
The fix is in the last two commits of my go-diff fork. I have also changed the test cases and added test to check the a huge line diff testcase .
Branch: https://github.com/r-pai/go-diff/
Any comments before I create a PR.
Thanks
Pai
Hi all,
tl;dr I have a solution for this issue, but it requires lots of changes and also increase memory consumption by a factor of 2-4.
You can find it here.
Long analysis:
As @r-pai described, the issue is caused due to the limit on the values a rune
type can hold.
So, we need to change the "character" type passed to the diff functions to something which can hold larger values (i.e. int
).
But this is not enough - when using checklines mode (which is where the problem exists), the returned diffs are not "normal" text diffs, they are consist of runes which (supposedly) encode line numbers. So we also need a way to encode our new character type (i.e. int
) inside strings to be returned as encoded diffs.
As I see it, there are two possible ways to solve it:
- Duplicate each function being called from DiffMainRunes to have a second version receiving
[]int
instead of[]rune
as input, and returning[]int
instead ofstring
as diffs. Call those functions on diffLineMode. - Change all relevant functions to use
[]int
instead of[]rune
for everything, convert input from[]rune
on external functions, and restore output back to[]rune
at the end.
In addition, pass abool
argument to all relevant functions, stating whether the diffs should be encoded strings (eachint
encoded as 4 bytes) or "normal" text.
This is how I've implemented it - mostly because I wanted to avoid code duplication.
This explanation sounds much more complicated than the real changes...
So please take a look at https://github.com/ronhab/go-diff/tree/FixCrashUsingDiffChar .
BTW, it might be possible to choose option 1 without code duplication (at least not human-written code duplication...) by generating the two versions of the function automatically.
This way we can avoid some of the performance/memory consumption hits of my solution - but I don't have time to work on it right now.
For comparison, here are the benchmark results on both current master and my branch (running on my Windows 10 Intel i7-6700HQ laptop):
current master -
BenchmarkDiffCommonPrefix-8 5675502 222 ns/op 0 B/op 0 allocs/op
BenchmarkDiffCommonSuffix-8 5783984 199 ns/op 0 B/op 0 allocs/op
BenchmarkCommonLength/prefix/empty-8 249642283 5.07 ns/op 0 B/op 0 allocs/op
BenchmarkCommonLength/prefix/short-8 178517660 7.26 ns/op 0 B/op 0 allocs/op
BenchmarkCommonLength/prefix/long-8 1313544 956 ns/op 0 B/op 0 allocs/op
BenchmarkCommonLength/suffix/empty-8 189780298 5.70 ns/op 0 B/op 0 allocs/op
BenchmarkCommonLength/suffix/short-8 131496415 8.71 ns/op 0 B/op 0 allocs/op
BenchmarkCommonLength/suffix/long-8 1000000 1234 ns/op 0 B/op 0 allocs/op
BenchmarkDiffHalfMatch-8 7064 176486 ns/op 106496 B/op 2 allocs/op
BenchmarkDiffCleanupSemantic-8 775 1328013 ns/op 162149 B/op 538 allocs/op
BenchmarkDiffMain-8 1 1012293800 ns/op 16087728 B/op 81 allocs/op
BenchmarkDiffMainLarge-8 8 148727300 ns/op 4838944 B/op 45376 allocs/op
BenchmarkDiffMainRunesLargeLines-8 1366 811892 ns/op 156550 B/op 1056 allocs/op
My branch:
BenchmarkDiffCommonPrefix-8 3260402 369 ns/op 480 B/op 2 allocs/op
BenchmarkDiffCommonSuffix-8 3260413 365 ns/op 480 B/op 2 allocs/op
BenchmarkCommonLength/prefix/empty-8 298529592 4.09 ns/op 0 B/op 0 allocs/op
BenchmarkCommonLength/prefix/short-8 209983783 5.72 ns/op 0 B/op 0 allocs/op
BenchmarkCommonLength/prefix/long-8 1643733 727 ns/op 0 B/op 0 allocs/op
BenchmarkCommonLength/suffix/empty-8 280116555 4.30 ns/op 0 B/op 0 allocs/op
BenchmarkCommonLength/suffix/short-8 174342531 6.85 ns/op 0 B/op 0 allocs/op
BenchmarkCommonLength/suffix/long-8 1000000 1010 ns/op 0 B/op 0 allocs/op
BenchmarkDiffHalfMatch-8 5728 215256 ns/op 311298 B/op 4 allocs/op
BenchmarkDiffCleanupSemantic-8 1054 1150403 ns/op 355219 B/op 875 allocs/op
BenchmarkDiffMain-8 1 1020316100 ns/op 177532736 B/op 32877 allocs/op
BenchmarkDiffMainLarge-8 8 127642438 ns/op 7230068 B/op 84529 allocs/op
BenchmarkDiffMainRunesLargeLines-8 1800 661175 ns/op 227841 B/op 3860 allocs/op
It seems like some of the operations are actually faster, but memory allocations are much higher.
We just ran into this in the Go language server (gopls
): golang/go#42927. Is anybody working on a fix?
Hi @stamblerre, I have a potential fix for this that I plan to apply later today if all tests pass.
Awesome, thank you for the quick update!
Thank you for the quick fix on this, @sergi! Do you plan to tag a new release with this fix?