blevesearch/segment

Error on large readers

Opened this issue · 15 comments

Hi! My environment is Go 1.14, macOS Catalina 10.15.3.

When reading a large file (over approx 65K), I am getting bufio.Segmenter: token too long

To reproduce: here’s a repo. Run the tests with go test -v.

I believe this error is incorrect, since no token in the file is that long. I expect to be able to use a reader of arbitrary length.

So either the returned error is inaccurate, or there is a bug in buffering/scanning the bytes. Thanks.

If curious I am using the segmenter here: https://github.com/clipperhouse/jargon. Thanks for creating it!

Thanks for the repro, makes this easier.

Unfortunately, here is what I've found. There doesn't appear to be an issue with the buffering/scanning of bytes. Basically, the buffering is working as expected: we keep the buffer filled and hand these bytes over to the "segment words" black-box function. I say black box because that is the Ragel state machine, with logic ported from Java that will take a while for someone to reload into their head.

When calling into this black box we pass two things: the buffer, and a boolean indicating whether we're at the end of input.

So, here is what happens when processing this data (both the short and long files); a simplified sketch of the underlying loop follows the list.

  1. First 4096 bytes are read into the buffer.
  2. "segment words" is called with buffer(4096), and false (more data available)
  3. This returns the first token '{'
  4. "segment words" is called again with buffer(4095), same data, but slicing off the first byte we already consumed, and false (more data available)
  5. This returns the next token, also '{'
  6. "segment words" is called again with buffer (4094), again same data, but now we've sliced of the first 2 bytes, and false (more data available)
  7. This returns no token available.
  8. But, since there is more data available, we grow the buffer, read more data and try again.
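
Here is a simplified, self-contained sketch of that loop (the names and the toy "segment words" function are illustrative, not the actual segment package code). It shows how a "no token yet" answer with more data available leads to refilling and doubling the buffer, until either a token is returned or the size limit produces the "token too long" error:

    package main

    import (
        "fmt"
        "io"
        "strings"
    )

    // segmentWordsSketch stands in for the real "segment words" black box: it
    // either returns (advance, token), or (0, nil) meaning "give me more data".
    // This toy version splits on spaces, and refuses to emit a token unless the
    // token is followed by a space or atEOF is true.
    func segmentWordsSketch(data []byte, atEOF bool) (int, []byte) {
        for i, b := range data {
            if b == ' ' {
                return i + 1, data[:i]
            }
        }
        if atEOF && len(data) > 0 {
            return len(data), data
        }
        return 0, nil // can't decide yet; caller should supply more data
    }

    // nextToken sketches the calling loop: keep the buffer filled, hand it to the
    // black box, and grow the buffer whenever the black box declines to return a
    // token while more input remains. Growing past maxSize is the "token too
    // long" failure mode seen in this issue.
    func nextToken(r io.Reader, maxSize int) ([]byte, error) {
        buf := make([]byte, 0, 16)
        atEOF := false
        for {
            if _, tok := segmentWordsSketch(buf, atEOF); tok != nil {
                return tok, nil // a real implementation would also keep the remainder
            }
            if atEOF {
                return nil, io.EOF
            }
            if len(buf) == cap(buf) {
                if cap(buf) >= maxSize {
                    return nil, fmt.Errorf("token too long")
                }
                buf = append(make([]byte, 0, 2*cap(buf)), buf...) // double the buffer
            }
            n, err := r.Read(buf[len(buf):cap(buf)])
            buf = buf[:len(buf)+n]
            if err == io.EOF {
                atEOF = true
            } else if err != nil {
                return nil, err
            }
        }
    }

    func main() {
        tok, err := nextToken(strings.NewReader("hello world"), 64)
        fmt.Println(string(tok), err) // hello <nil>
    }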

At this point the two paths diverge. For the short file, the entire data is read into memory; once this happens, "segment words" is called with the second argument true (no more data), and it correctly returns the next token, "redirect2".

For the long file, the entire file cannot be read into memory without exceeding a configured limit.

So, the question is, why is "segment words" behaving this way?

Most likely there is some improperly handled corner case in the logic. It appears to be on some greedy path, trying to build a really big token; only atEOF=true convinces it that it cannot build a larger token, and only then does it return the correct one.

So, that is my understanding at the moment: the bug is in the Ragel state machine code. But to be honest I'm not sure when I'll be able to look deeper.

Here are the detailed log messages, showing everything behaving the same right up until the short case has consumed the entire file and the long case exceeds the buffer. Notice how, in the short case, once everything has been consumed, it returns a 9-byte token ("redirect2").

=== RUN   TestShort					      |	=== RUN   TestLong

Segment called, start: 0 end: 0					Segment called, start: 0 end: 0
main loop begins again: 0					main loop begins again: 0
buffer empty							buffer empty
buffer not full							buffer not full
reading into buffer						reading into buffer
read 4096 bytes							read 4096 bytes
main loop begins again: 1					main loop begins again: 1
trying to segment buffer between start and end			trying to segment buffer between start and end
got advance: 1, token len 1, type: 0, err: <nil>		got advance: 1, token len 1, type: 0, err: <nil>
looped3 2 token len: 1						looped3 2 token len: 1

Segment called, start: 1 end: 4096				Segment called, start: 1 end: 4096
main loop begins again: 0					main loop begins again: 0
trying to segment buffer between start and end			trying to segment buffer between start and end
got advance: 1, token len 1, type: 0, err: <nil>		got advance: 1, token len 1, type: 0, err: <nil>
looped3 1 token len: 1						looped3 1 token len: 1

Segment called, start: 2 end: 4096				Segment called, start: 2 end: 4096
main loop begins again: 0					main loop begins again: 0
trying to segment buffer between start and end			trying to segment buffer between start and end
got advance: 0, token len 0, type: 0, err: <nil>		got advance: 0, token len 0, type: 0, err: <nil>
moving data up in buffer to make room, start: 2 end: 4096 buf	moving data up in buffer to make room, start: 2 end: 4096 buf
now start: 0, end: 4094						now start: 0, end: 4094
buffer not full							buffer not full
reading into buffer						reading into buffer
read 2 bytes							read 2 bytes
main loop begins again: 1					main loop begins again: 1
trying to segment buffer between start and end			trying to segment buffer between start and end
got advance: 0, token len 0, type: 0, err: <nil>		got advance: 0, token len 0, type: 0, err: <nil>
buffer full							buffer full
growing the buffer						growing the buffer
main loop begins again: 2					main loop begins again: 2
trying to segment buffer between start and end			trying to segment buffer between start and end
got advance: 0, token len 0, type: 0, err: <nil>		got advance: 0, token len 0, type: 0, err: <nil>
buffer not full							buffer not full
reading into buffer						reading into buffer
read 4096 bytes							read 4096 bytes
main loop begins again: 3					main loop begins again: 3
trying to segment buffer between start and end			trying to segment buffer between start and end
got advance: 0, token len 0, type: 0, err: <nil>		got advance: 0, token len 0, type: 0, err: <nil>
buffer full							buffer full
growing the buffer						growing the buffer
main loop begins again: 4					main loop begins again: 4
trying to segment buffer between start and end			trying to segment buffer between start and end
got advance: 0, token len 0, type: 0, err: <nil>		got advance: 0, token len 0, type: 0, err: <nil>
buffer not full							buffer not full
reading into buffer						reading into buffer
read 8192 bytes							read 8192 bytes
main loop begins again: 5					main loop begins again: 5
trying to segment buffer between start and end			trying to segment buffer between start and end
got advance: 0, token len 0, type: 0, err: <nil>		got advance: 0, token len 0, type: 0, err: <nil>
buffer full							buffer full
growing the buffer						growing the buffer
main loop begins again: 6					main loop begins again: 6
trying to segment buffer between start and end			trying to segment buffer between start and end
got advance: 0, token len 0, type: 0, err: <nil>		got advance: 0, token len 0, type: 0, err: <nil>
buffer not full							buffer not full
reading into buffer						reading into buffer
read 16384 bytes						read 16384 bytes
main loop begins again: 7					main loop begins again: 7
trying to segment buffer between start and end			trying to segment buffer between start and end
got advance: 0, token len 0, type: 0, err: <nil>		got advance: 0, token len 0, type: 0, err: <nil>
buffer full							buffer full
growing the buffer						growing the buffer
main loop begins again: 8					main loop begins again: 8
trying to segment buffer between start and end			trying to segment buffer between start and end
got advance: 0, token len 0, type: 0, err: <nil>		got advance: 0, token len 0, type: 0, err: <nil>
buffer not full							buffer not full
reading into buffer						reading into buffer
read 32118 bytes					      |	read 32768 bytes
main loop begins again: 9					main loop begins again: 9
trying to segment buffer between start and end			trying to segment buffer between start and end
got advance: 0, token len 0, type: 0, err: <nil>		got advance: 0, token len 0, type: 0, err: <nil>
buffer not full						      |	buffer full
reading into buffer					      |	buff len is %!s(int=65536)
read 0 bytes						      |	--- FAIL: TestLong (0.00s)
got err reading: EOF					      |	    segmenter_test.go:26: bufio.Segmenter: token too long
main loop begins again: 10				      |	    segmenter_test.go:29: saw 2 segments
trying to segment buffer between start and end		      |	FAIL
got advance: 9, token len 9, type: 2, err: <nil>	      |	exit status 1
looped3 11 token len: 9					      |	FAIL	github.com/clipperhouse/segmenter-repro	0.003s
							      <

Thanks for the quick response! On my end, a workaround might be to chunk the incoming reader and new up a segmenter every 64K bytes.

Another path would perhaps be to regenerate using the latest Ragel, in the hope that the bug might have been found and fixed in the interim.

Or perhaps to hack a bit of logic into segmentWords to return on whitespace instead of waiting for EOF.

Yeah, I think chunking from the outside would work, but obviously that's undesirable because you might chunk in the middle of a token. But maybe finding whitespace near your desired chunk boundary is a reasonable compromise, depending on your input.
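
For what it's worth, here is a rough sketch of that whitespace-near-the-boundary compromise (illustrative only; the chunk size, the helper name, and the "feed each chunk to a fresh segmenter" step are my own, not part of this library): read roughly 64K at a time, cut each chunk back to the last ASCII whitespace so tokens aren't split mid-word, and carry the remainder into the next chunk.

    package main

    import (
        "bytes"
        "fmt"
        "io"
        "strings"
    )

    const chunkSize = 64 * 1024

    // nextChunk reads up to chunkSize bytes, appends them to any carried-over
    // remainder, and trims the result back to the last ASCII whitespace. The
    // trimmed-off tail is returned as the carry for the next call.
    func nextChunk(r io.Reader, carry []byte) (chunk, newCarry []byte, err error) {
        buf := make([]byte, chunkSize)
        n, err := io.ReadFull(r, buf)
        data := append(carry, buf[:n]...)
        if err == io.EOF || err == io.ErrUnexpectedEOF {
            return data, nil, io.EOF // last chunk: no need to trim
        }
        if err != nil {
            return nil, nil, err
        }
        cut := bytes.LastIndexAny(data, " \t\r\n")
        if cut < 0 {
            cut = len(data) // no whitespace at all: give up and cut here anyway
        }
        return data[:cut], append([]byte(nil), data[cut:]...), nil
    }

    func main() {
        r := strings.NewReader(strings.Repeat("some words here ", 10000))
        var carry []byte
        for {
            chunk, c, err := nextChunk(r, carry)
            carry = c
            // ...run a fresh segmenter over bytes.NewReader(chunk) here...
            fmt.Println("chunk bytes:", len(chunk))
            if err == io.EOF {
                break
            }
            if err != nil {
                fmt.Println("read error:", err)
                return
            }
        }
    }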

Also, not to discourage trying newer Ragel, but I suspect the bug is in our Ragel source file, not in the Ragel engine. If you look at this file:

https://github.com/blevesearch/segment/blob/master/segment_words.rl

You'll see the rules where, based on seeing various runes, it tries to build up a token of a particular type. You can see how the atEOF flag changes behavior in a number of places. My guess is that the bug is in this file, though I don't even remember how to go about debugging it. Parts of it turn into the generated Go code, so regular logging should work, but I believe that is limited to the actions; there might be some Ragel options to add state debugging when generating the code.

Hopefully that gives you a few ideas.

Hi -- I took a quick peek, trying to follow this thread -- but am also struggling with the ragel.

I am sheepish to share this, but one thought that popped to mind was whether perhaps a brute-force bubble-gum band-aid might work. So, I hacked my local blevesearch/segment.MaxScanTokenSize from 64KB to 640KB (which ought to be enough for anyone?). And then TestLong() unsurprisingly passed.

So one idea is perhaps to change segment.MaxScanTokenSize from a const into a var, so that apps might be able to tweak it as an init()-ish escape valve until we have a real answer? Or add maxScanTokenSize as a param in a new variant of NewSegmenter(), or as a tweakable knob?

@steveyen sure, you can increase MaxScanTokenSize, I thought that was implied, but you're right it should be explicitly stated.

The real issue though is that if you expose an API that takes a Reader, it presumes some attempt to process the data without holding the entire thing in memory. What this bug demonstrates (I think; I have not fully verified this) is that in some cases (perhaps all) the method actually reads in all the data anyway. Presumably we didn't notice this because it does behave correctly as long as the input fits into the buffer. But it is actually quite a severe bug.

@steveyen perhaps I should have addressed your suggested changes. While I'm not strongly opposed to them, I think they don't really help. We already have another method that works on []byte. So if you know that our implementation is going to read in all the bytes anyway (due to the bug), you could just do that yourself and use the alternate method.

https://github.com/blevesearch/segment/blob/master/segment.go#L84

So, I can't help but continue to look into this...

I have not looked at this code in a really long time, but this line here seems completely wrong:

https://github.com/blevesearch/segment/blob/master/segment_words.rl#L121

As I understand it, when atEOF is false, we always return without the token. Then, because of the way the calling code works, it just keeps retrying with a larger and larger buffer until either it all fits and is eventually called with atEOF true, or it fails by exceeding the max size.

What seems like it should happen instead is that the code should check that we're either atEOF or there is at least some more data beyond this token (ensuring there is no possibility that a longer token could be formed if only we saw more of the data). That check is slightly more complicated because we might also need to ensure there are at least 4 more bytes beyond the end of the token being constructed, to allow for a complete Unicode code point.

I have some vague recollection of considering this issue, but don't know how the code ended up as is. It's also possible I'm misunderstanding this code now.

Hehe, I thought you might not be able to resist. Happy to help you reason through it, though you obviously are more familiar than I.

A useful, grokkable goal might simply be to return (block) per token. No need to buffer ahead more than one, in all likelihood. Perf-wise, buffering matters most with I/O — can you just exploit the std lib bufio for that, and let emitted tokens be non-buffered?

OTOH, some small amount of buffering tokens (not just bytes) might turn out to be necessary. To ensure a whole code point, you’ll have to either look forward or look back.

(Pardon if I am stating the obvious.)

@clipperhouse the problem is that while this bug is severe, it doesn't affect the main consumer of this library (bleve) because it always does have all the data in memory. And for that use case, it's not so much that we buffer the tokens as we minimize the number of slices allocated by doing everything in a single pass. Avoiding/minimizing unnecessary allocations has been crucial to making bleve perform acceptably.

So, while I'd like to fix this bug, I do not want to break the other mode where we can find all the tokens in a single pass.

I got ragel built and installed again, so I'll see about getting everything building end-to-end again tonight.

OK, so I started to take deeper look at this today.

First, the suspected problem was this section: https://github.com/blevesearch/segment/blob/master/segment_words.rl#L120-L122

Basically, at this point the grammar has found something matching the definition of a word. The question is: could it have built a longer word if it could see more of the data? Normally the rules already try to do that, but since we're sourcing data from a reader, we didn't present the Ragel state machine all of the data, just some portion of it. So this particular section of code I linked above takes an extremely conservative view: unless we're really at the end of the data (atEOF=true), pretend we couldn't match anything yet, with the expectation that the caller will try again with more data.

But, why take such a conservative view? Couldn't we just see if the token we matched went all the way to the end of the buffer? If our token went right to the edge of the buffer, then sure, be conservative, but otherwise we should be fine. So I tried this:

    if endPos+1 == pe && !atEOF {
      return val, types, totalConsumed, nil
    }

But this doesn't work. One of the Unicode-provided test cases is 0x61, 0xe2, 0x81, 0xa0, and it tells us this should return a single token. But when the buffering presents us the data 0x61, 0xe2, we return 0x61 as its own token (we weren't at the end of the buffer, we saw 0xe2).

OK, but we can fix this case too. What we can do is not just check that we're at the end of the buffer, but also make sure there is a complete valid Unicode codepoint beyond the match. So, we change it to this:

    if endPos+1 == pe && !atEOF {
      return val, types, totalConsumed, nil
    } else if dr, size := utf8.DecodeRune(data[endPos+1:]); dr == utf8.RuneError && size == 1 {
      return val, types, totalConsumed, nil
    }
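
To spell out what that second branch relies on (standard library behavior, not anything from this package): utf8.DecodeRune reports utf8.RuneError with size 1 when the buffer ends in the middle of a multi-byte sequence, such as the truncated 0xe2, but decodes the full 0xe2, 0x81, 0xa0 sequence as U+2060.

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        r, size := utf8.DecodeRune([]byte{0xe2}) // truncated start of U+2060
        fmt.Println(r == utf8.RuneError, size)   // true 1

        r, size = utf8.DecodeRune([]byte{0xe2, 0x81, 0xa0}) // complete sequence
        fmt.Printf("%U %d\n", r, size)                      // U+2060 3
    }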

That fixes the case above, but it still fails on yet another Unicode-provided test case: 0x61, 0x3a, 0x41, which we're told should also return as a single token. But even when our buffer contains 0x61, 0x3a, we still return 0x61 as a token on its own. That's incorrect, even though we saw a full Unicode codepoint beyond our match. The reason is that one of the segmentation rules, WB6, says that a colon (0x3a) is allowed to occur between two letters.

So, where does that leave us? It seems to me the correct solution needs some proper lookahead or backtracking (I'm not a grammar expert). And I'm probably not the right person to build that.

The other option would be to have some sort of user configurable option, either in number of bytes or number of runes, that would simply say, if we matched a word, and we have at least N more bytes/runes ahead in the buffer, assume that it's probably safe to return the token.

OK, I wanted to see how many runes ahead we would have to look to at least pass the tests provided by the Unicode spec.

I created a helper function, the intent being that after we've found a word token, we want to make sure the buffer beyond the end of our token has at least N valid runes.

    // hasAtLeastNValidRunes reports whether data begins with at least n
    // complete, valid UTF-8 runes.
    func hasAtLeastNValidRunes(data []byte, n int) bool {
        var runeCount, consumedBytes int
        for runeCount < n {
            dr, size := utf8.DecodeRune(data[consumedBytes:])
            if dr == utf8.RuneError {
                // Ran out of data, or hit an invalid/truncated sequence.
                return false
            }
            runeCount++
            consumedBytes += size
        }
        return true
    }
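
(A few example calls, for illustration only: hasAtLeastNValidRunes([]byte("a:b"), 4) returns false, since there are only 3 runes; hasAtLeastNValidRunes([]byte{0x61, 0xe2}, 2) returns false, because the trailing 0xe2 is a truncated multi-byte sequence; hasAtLeastNValidRunes([]byte("a:b c"), 4) returns true.)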

I then tweaked the ragel action to do something like this:

    if !atEOF && !hasAtLeastNValidRunes(data[endPos+1:], N) {
      return val, types, totalConsumed, nil
    }

So, if at the point we matched this token, we are not at EOF, make sure we have at least N more valid runes.

And then I ran the test suite, and the answer is that N must be set to 4 in order for the tests to pass.

What I don't know is if that is correct, or just the longest case exposed by this test data set.

I think it would be pretty straightforward to expose this as something optional, and keep the old behavior by default. I think the goal of this library should be to be useful in practical situations. As is, the Reader version is arguably useless.

Even if you set N to 100, you'd have at most 400 bytes extra read into your buffer. And presumably that would be correct on most inputs you'd actually encounter in the real world.

Wow, thanks for digging in. So, trying to hypothesize what N represents here. Is it the least number of lookahead runes needed to disambiguate all cases — in which case 4 could be hard-coded with some confidence — or is it a coincidence relating to this test set?

To test the latter, maybe try an artificial example, perhaps including a colon as in WB6, with a very long token, and N much smaller than that token.
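
Here is a sketch of that artificial case, assuming the Segmenter is used the way the repro uses it, i.e. mirroring bufio.Scanner with NewSegmenter / Segment / Bytes / Err (adjust to the real API if any of those names differ): build one very long WB6-style word like "a:a:a:…:a", segment it from a Reader, and see whether it comes back as a single segment or errors out.

    package main

    import (
        "fmt"
        "strings"

        "github.com/blevesearch/segment"
    )

    func main() {
        long := strings.Repeat("a:", 5000) + "a" // one ~10KB word per WB6/WB7
        seg := segment.NewSegmenter(strings.NewReader(long + " tail"))
        count := 0
        for seg.Segment() {
            count++
            fmt.Println("segment length:", len(seg.Bytes()))
        }
        if err := seg.Err(); err != nil {
            fmt.Println("error:", err)
        }
        fmt.Println("segments:", count)
    }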

To make sure I understand, N needs to be at least 4 to pass, or exactly 4? Sounds like the former. If this were user configurable, that would de facto represent a max token size, I think. Which would be workable for me. I’d even go to 1024, I am happy to pay 4K of memory, if that’s the worst-case impact.

Two more questions, if you’ll indulge. Does your current experiment pass my long test? And, if we take the user-configurable N approach, does that get us ‘proper’ buffering, i.e., not-unlimited memory use?

Really appreciate your effort.

So, there are a number of cases where you can successfully match a word token, but if you had only seen N more Unicode runes, you would have matched an even longer token. This matters for the Reader case doing proper buffering, because we're not passing the entire data to the Ragel state machine at once; in fact we may never have the entire data in memory at once. So, back to the changes I'm proposing: you, as the user of the Segmenter, would get to choose the value of N you are comfortable with (any value of N is a heuristic, possibly leading to tokenization behavior not described by the Unicode spec). You have to choose N greater than or equal to 4 just to pass the Unicode-provided tests. Choosing a larger value of N may further decrease the chance of tokenization errors, but it also means more data in the buffer before returning tokens (up to N * 4 bytes).

N does not directly relate to the maximum token length that can be matched. It's the maximum number of Unicode runes we must read after a potential token to make sure we couldn't produce an even longer token.

With these changes in place your TestLong passes.

Yes, I believe the buffering logic was already working correctly; it was the segmenter itself demanding more data, and the buffer just growing to accommodate it until it couldn't grow any more. With the segmenter fixed to only look N runes ahead (not requiring it to go all the way to EOF), I believe the buffering works as originally intended.

Thumbs-up from me on such a fix.

The case you describe, missing the even-longer-token, is that an edge case due to certain Unicode rules that I as a Latin-script-biased person would be unfamiliar with? Or is it broadly true of any token?

The configurable N would be great regardless. A token larger than 100 or 1000 is not a use case for a tool like mine, that would be some weird data.