[Bug]: snapshots of big slices could not be loaded back
Closed this issue · 5 comments
Description
The bufio scanner instantiated in getPrevSnapshot() uses default settings, which means the maximum token size for scanning a single line is capped at 64*1024 bytes.
I have a test snapshot storing a slice of 5k string elements, which gets dumped into a single line of ~130k characters.
As a temporary workaround I'm splitting the slice into smaller chunks and snapshotting each of them, but I think the snapshot creation should take the MaxScanTokenSize
limit into account and break the dumped data into multiple lines.
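The workaround described above can be sketched roughly as follows. Note that chunk is a hypothetical helper and 1000 is an arbitrary chunk size, not the actual code from my test suite; in the real test each part would be passed to snaps.MatchSnapshot separately.

```go
package main

import "fmt"

// chunk splits a slice into pieces of at most size elements.
// Hypothetical helper sketching the workaround: each piece is
// snapshotted on its own, so no single dumped line exceeds the
// scanner's 64KB default limit.
func chunk(xs []string, size int) [][]string {
	var out [][]string
	for len(xs) > size {
		out = append(out, xs[:size])
		xs = xs[size:]
	}
	if len(xs) > 0 {
		out = append(out, xs)
	}
	return out
}

func main() {
	data := make([]string, 5000) // the 5k-element slice from the issue
	parts := chunk(data, 1000)   // arbitrary chunk size
	fmt.Println(len(parts), len(parts[0]))
}
```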
Steps to Reproduce
- Generate a slice whose content would be dumped into a single line of >65K characters
- use it to produce a snapshot
- Attempt to reload the snapshot;
getPrevSnapshot()
will return errSnapNotFound
Expected Behavior
The snapshot creation should take into account the scan buffer size limits, and split data among multiple lines.
Hey 👋, thanks for opening this very detailed issue, I didn't know there was such a limit in bufio. I was indeed able to reproduce this with
// CI=true go test -v -count=1 ./... -run TestLongLine
func TestLongLine(t *testing.T) {
t.Run("should capture long lines", func(t *testing.T) {
snaps.MatchSnapshot(t, strings.Repeat("hello world", 10000))
})
}
The snapshot creation should take into account the scan buffer size limits, and split data among multiple lines.
This is a good idea, but I tried to fix the error by simply increasing the MaxScanTokenSize
s.Buffer([]byte{}, math.MaxInt)
do you think there is any concern doing this 🤔?
Hi!
thanks for opening this very detailed issue, I didn't know there was such a limit in bufio.
Thank you for taking the time to review the issue as well.
The limit on bufio's Scan()
is something I discovered myself while investigating a test failure that randomly started happening with a snapshot that had grown over time.
This is a good idea, but I tried to fix the error by simply increasing the MaxScanTokenSize
do you think there is any concern doing this 🤔?
Well, math.MaxInt
seems big enough for any practical snapshot, and the change will surely help.
I can think about two possible concerns:
- Memory allocation: the size we're talking about here is for the buffer being built for a single line written to the destination file, and it grows in your process's memory. This means that, with a big enough snapshot to process (or resources that are too constrained), the operating system might decide to kill your process for requesting too much memory.
- Performance: Scan()
iteratively doubles its buffer size until the content fits, so I guess it aims for a logarithmic number of allocations in the input length. I don't think it will noticeably impact performance, but enlarging the buffer might still be a suboptimal solution.
Your proposal looks way simpler and desirable, and would effectively work as a quick fix; I just fear it might come back to bite us in the long term. The solution I proposed looks safer overall, but I understand it might require some non-trivial refactoring.
One last note: I'm not a JestJS expert (I found it thanks to your project's reference), so I'm curious how it handles this case. I couldn't find much about it.
The limit on bufio's Scan() is something I've discovered myself as well investigating a test failure
I should definitely have checked the scanner's error and returned it; that would have made the failure easier to track. I will make sure to return it instead of just errSnapNotFound.
One last note: I'm not a JestJS expert (I found it thanks to your project's reference), so I'm curious how it handles this case. I couldn't find much about it.
I couldn't find a definitive answer either, but under the hood Jest uses new Function
to parse the whole snapshot file, so I would guess the limit is again the system's. A good related issue is jestjs/jest#8551
Your proposal looks way simpler and desirable, and would effectively work as a quick fix; I just fear it might come back to bite us in the long term. The solution I proposed looks safer overall, but I understand it might require some non-trivial refactoring.
I'm setting the limit as high as possible rather than splitting the input because the library would struggle in other places even if it could parse the line (e.g. consolidating the snapshot and creating a diff). Also, beyond a certain size snapshot testing is probably not very practical anyway, as the snapshot is likely not readable or easy to review.
By the way, not sure if this is apparent, but this method does not allocate a buffer of size math.MaxInt.
It sets the initial buffer to use when scanning and the maximum size of buffer that may be allocated during scanning.
The initial size here is zero and the maximum is math.MaxInt,
so it will behave the same, just be allowed to grow bigger.
s.Buffer([]byte{}, math.MaxInt)
Implementation https://github.com/golang/go/blob/master/src/bufio/scan.go#L258-L273
The initial size here is zero and the maximum is math.MaxInt, so it will behave the same, just be allowed to grow bigger.
Yes, that was what I meant with:
Scan() iteratively doubles its buffer size until the content fits, so I guess it aims for a logarithmic number of allocations in the input length.
The fix looks ok. Thanks for the prompt action :)