bufio.Scanner: token too long
suntong opened this issue · 10 comments
I'm using the awk package on a real problem for the first time, but the program aborted with nothing more than this error message:
bufio.Scanner: token too long
The thing is that my real problem involves XML files containing lines with huge base64-encoded strings that easily run from several megabytes to tens of megabytes. (Update: a line can be 22,326,102 or 22,294,550 characters long, or even longer.)
Can the awk package handle this?
PS. I'm not processing the huge base64-encoded strings for now. All I'm trying to do is skip over them and focus on the lines that are fairly short:
s.AppendStmt(func(s *awk.Script) bool {
	return s.F(1).Match("<Comment")
}, func(s *awk.Script) { ... })
Furthermore, if I pre-filter those huge lines out of the input, my program works correctly, which confirms my guess.
I didn't know this until just now, but according to the bufio source code,
// MaxScanTokenSize is the maximum size used to buffer a token
// unless the user provides an explicit buffer with Scan.Buffer.
// The actual maximum token size may be smaller as the buffer
// may need to include, for instance, a newline.
MaxScanTokenSize = 64 * 1024
Because the awk package doesn't specify otherwise, tokens are limited to 64KB. I believe I can simply invoke Scanner.Buffer to increase the maximum token size. What do you think of my adding a numeric field to awk.Script that a program can use to specify the maximum token (i.e., record or field) length?
— Scott
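For readers who want to see the limit in isolation, here is a minimal sketch using only the standard library. It is not the awk package's code; countLines is a hypothetical helper, and the 1 MiB line and 32 MiB cap are arbitrary values chosen for illustration. With the default buffer the scan stops at the long line with "token too long"; after a call to Scanner.Buffer with a larger maximum, the same input scans cleanly.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// countLines (hypothetical helper) scans input line by line and returns the
// number of lines read. If maxSize > 0, the scanner may grow its buffer up to
// maxSize bytes; otherwise it keeps the default limit, bufio.MaxScanTokenSize.
func countLines(input string, maxSize int) (int, error) {
	sc := bufio.NewScanner(strings.NewReader(input))
	if maxSize > 0 {
		// Buffer must be called before the first Scan.
		// Start small and let the scanner grow the buffer as needed.
		sc.Buffer(make([]byte, 0, 64*1024), maxSize)
	}
	n := 0
	for sc.Scan() {
		n++
	}
	// Scan returns false on both end of input and error, so check Err.
	return n, sc.Err()
}

func main() {
	// One short line followed by a 1 MiB line (a stand-in for a base64 blob).
	input := "<Comment>short</Comment>\n" + strings.Repeat("A", 1<<20) + "\n"

	_, err := countLines(input, 0)
	fmt.Println("default limit:", err)

	n, err := countLines(input, 32<<20) // allow tokens up to 32 MiB
	fmt.Println("raised limit:", n, err)
}
```

Note that Scanner.Buffer only sets an upper bound. The scanner allocates more memory only when a token actually needs it, so a generous maximum costs nothing on inputs with short lines.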
Yeah, good idea, that's exactly what I have in mind while reading your message. Please do. Thanks a lot!
Of course, make its default value zero, and only change the Scanner.Buffer maximum token size when the field is nonzero.
You can now set an awk.Script's MaxRecordSize to the maximum record size and its MaxFieldSize to the maximum field size you want to use. I decided not to make zero special, though. Instead, awk.NewScript initializes those fields to their actual default values. This enables programs to query the current sizes.
Thanks a lot, Scott.
Yes, of course, that makes perfect sense.
I was just having a problem updating the package. I tried several times but always got the same error:
$ go get -v -u github.com/spakin/awk
github.com/spakin/awk (download)
github.com/spakin/awk
# github.com/spakin/awk
/export/repo/go-arch/src/github.com/spakin/awk/script.go:776: fsScanner.Buffer undefined (type *bufio.Scanner has no field or method Buffer)
/export/repo/go-arch/src/github.com/spakin/awk/script.go:822: sc.rsScanner.Buffer undefined (type *bufio.Scanner has no field or method Buffer)
/export/repo/go-arch/src/github.com/spakin/awk/script.go:863: s.rsScanner.Buffer undefined (type *bufio.Scanner has no field or method Buffer)
$ go version
go version go1.5.1 linux/amd64
UPDATE: Fixed after upgrading to Go 1.6. Sorry about the noise.
However, I have a very strange error using it, and have sent you an off-line email to the address you listed on https://github.com/spakin/awk/. Please check it out. Thanks.
I, too, received "type *bufio.Scanner has no field or method Buffer" errors when I first introduced the call to Scanner.Buffer. It turns out that Scanner.Buffer is a recent addition to the standard library. All you need to do is upgrade to the latest version of Go (v1.6), and the error should go away.
— Scott
It looks like it went into my spam folder. I see it now. When I get a chance I'll try to diagnose the behavior you're seeing.
— Scott
Yep, the problem has now been fixed, exactly as you described in the email.
FYI, my script went through that ~50M test file (with a ~22M-character line) just fine.
To recap,
set s.MaxRecordSize if line is too long, and
set s.MaxFieldSize if column is too long.
Thanks a lot Scott
set s.MaxRecordSize if line is too long, and
set s.MaxFieldSize if column is too long.
Yep, that's it. Glad your code is working now.
For the record, for others reading this thread, the problem was that @suntong was processing a file with very long lines. awk could handle the long lines when s.MaxRecordSize was set, but not the long fields within the lines unless s.MaxFieldSize was also set. Alas, awk failed to propagate overflow errors induced by splitting a record into fields, which led to some puzzling behavior (e.g., fields being replaced by other fields). As of commit 93b2afc, Script.Run properly returns an error code if a field in an input record overflows s.MaxFieldSize.
— Scott
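To illustrate why an unpropagated field-overflow error is easy to miss, here is a rough standard-library sketch. It is not the awk package's implementation; splitFields and the sizes used are hypothetical. A second scanner splits one record into whitespace-separated fields with its own size limit, and the scanner's error is returned to the caller instead of being dropped, which would otherwise leave a silently truncated field list.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

// splitFields (hypothetical helper) splits one record into
// whitespace-separated fields, letting the scanner buffer grow to at most
// maxField bytes. It returns the scanner's error rather than a partial list.
func splitFields(record string, maxField int) ([]string, error) {
	sc := bufio.NewScanner(strings.NewReader(record))
	sc.Split(bufio.ScanWords)
	// A nil initial buffer lets the scanner allocate its own, capped at maxField.
	sc.Buffer(nil, maxField)

	var fields []string
	for sc.Scan() {
		fields = append(fields, sc.Text())
	}
	// Scan returns false both at end of input and on error, so without this
	// check an oversized field would simply look like the end of the record.
	if err := sc.Err(); err != nil {
		return nil, fmt.Errorf("splitting record into fields: %w", err)
	}
	return fields, nil
}

func main() {
	// A record with a short field, a 1000-byte field, and another short field.
	record := "id42 " + strings.Repeat("A", 1000) + " tail"

	if _, err := splitFields(record, 64); err != nil {
		fmt.Println("too small:", errors.Is(err, bufio.ErrTooLong))
	}

	fields, err := splitFields(record, 4096)
	fmt.Println("large enough:", len(fields), err)
}
```

The same pattern applies to any nested scanning: each scanner has its own limit, so raising the record limit alone does not help when a single field inside the record exceeds the field limit.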