spakin/awk

bufio.Scanner: token too long

suntong opened this issue

I'm using the awk package on my real problem for the first time, but the program aborted with only this error message:

bufio.Scanner: token too long

The thing is, my real input is XML files whose lines contain huge base64-encoded strings that easily run from several million to tens of millions of characters. (Update: a line can be 22,326,102 or 22,294,550 characters long, or even longer.)

Can the awk package handle this?

PS. I wasn't processing the huge base64-encoded strings for now. All I was trying to do was skip over them and focus on the lines that are rather short:

    s.AppendStmt(func(s *awk.Script) bool {
        return s.F(1).Match("<Comment")
    }, func(s *awk.Script) { ... })

Further, if I pre-filter those huge lines out of the input, my program works correctly, which confirms my guess.

I didn't know this until just now, but according to the bufio source code,

// MaxScanTokenSize is the maximum size used to buffer a token
// unless the user provides an explicit buffer with Scan.Buffer.
// The actual maximum token size may be smaller as the buffer
// may need to include, for instance, a newline.
MaxScanTokenSize = 64 * 1024

Because the awk package doesn't specify otherwise, tokens are limited to 64KB. I believe I can simply invoke Scanner.Buffer to increase the maximum token size. What do you think of my adding a numeric field to awk.Script that a program can use to specify the maximum token (i.e., record or field) length?
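For context, here is a minimal standalone sketch of that standard-library mechanism, independent of the awk package; the 64 MB cap is an arbitrary figure chosen just for illustration:

    package main

    import (
        "bufio"
        "fmt"
        "os"
    )

    func main() {
        sc := bufio.NewScanner(os.Stdin)
        // Keep the initial buffer small but allow individual tokens
        // (here, lines) of up to 64 MB instead of the 64 KB default.
        sc.Buffer(make([]byte, 0, 64*1024), 64*1024*1024)
        for sc.Scan() {
            fmt.Println(len(sc.Text())) // e.g., report each line's length
        }
        if err := sc.Err(); err != nil {
            fmt.Fprintln(os.Stderr, err)
        }
    }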

— Scott

Yeah, good idea; that's exactly what I had in mind while reading your message. Please do. Thanks a lot!

Of course, make its default value zero, and only change Scanner.Buffer's maximum token size when the field is set to something nonzero.

You can now set an awk.Script's MaxRecordSize to the maximum record size and its MaxFieldSize to the maximum field size you want to use. I decided not to make zero special, though. Instead, awk.NewScript initializes those fields to their actual default values. This enables programs to query the current sizes.
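For anyone reading along later, here is a rough sketch of how those two fields might be used, following the pattern from earlier in this thread and the usual NewScript/AppendStmt/Run flow; the 64 MB and 32 MB limits are placeholder values, not recommendations:

    package main

    import (
        "log"
        "os"

        "github.com/spakin/awk"
    )

    func main() {
        s := awk.NewScript()

        // Raise both limits well above the 64 KB default so that
        // multi-megabyte records and fields don't abort the scan.
        s.MaxRecordSize = 64 * 1024 * 1024
        s.MaxFieldSize = 32 * 1024 * 1024

        // Same idea as the snippet above: act only on lines whose
        // first field looks like a <Comment ...> element.
        s.AppendStmt(func(s *awk.Script) bool {
            return s.F(1).Match("<Comment")
        }, func(s *awk.Script) {
            // ... process the short lines of interest ...
        })

        if err := s.Run(os.Stdin); err != nil {
            log.Fatal(err)
        }
    }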

Thanks a lot, Scott.

Yes, of course, that makes perfect sense.

It's just that I was having problems updating the package. I tried several times but always got the same error:

$ go get -v -u github.com/spakin/awk
github.com/spakin/awk (download)
github.com/spakin/awk
# github.com/spakin/awk
/export/repo/go-arch/src/github.com/spakin/awk/script.go:776: fsScanner.Buffer undefined (type *bufio.Scanner has no field or method Buffer)
/export/repo/go-arch/src/github.com/spakin/awk/script.go:822: sc.rsScanner.Buffer undefined (type *bufio.Scanner has no field or method Buffer)
/export/repo/go-arch/src/github.com/spakin/awk/script.go:863: s.rsScanner.Buffer undefined (type *bufio.Scanner has no field or method Buffer)

$ go version
go version go1.5.1 linux/amd64

UPDATE: Fixed after upgrading to Go 1.6. Sorry about the noise.

However, I am getting a very strange error when using it and have sent you an offline email to the address you listed on https://github.com/spakin/awk/. Please check it out. Thanks.

I, too, received "type *bufio.Scanner has no field or method Buffer" errors when I first introduced the call to Scanner.Buffer. It turns out that Scanner.Buffer is a recent addition to the standard library. All you need to do is upgrade to the latest version of Go (v1.6), and the error should go away.

— Scott

It looks like it went into my spam folder. I see it now. When I get a chance I'll try to diagnose the behavior you're seeing.

— Scott

Yep, the problem has now been fixed, exactly as you described in the email.
FYI, my script went through that ~50 MB test file (with a line ~22 million characters long) just fine.

To recap,

set s.MaxRecordSize if a line is too long, and
set s.MaxFieldSize if a field (column) is too long.

Thanks a lot, Scott.

> set s.MaxRecordSize if a line is too long, and
> set s.MaxFieldSize if a field (column) is too long.

Yep, that's it. Glad your code is working now.

For the record, for others reading this thread, the problem was that @suntong was processing a file with very long lines. awk could handle the long lines when s.MaxRecordSize was set, but not the long fields within the lines unless s.MaxFieldSize was also set. Alas, awk failed to propagate overflow errors induced by splitting a record into fields, which led to some puzzling behavior (e.g., fields being replaced by other fields). As of commit 93b2afc, Script.Run properly returns an error code if a field in an input record overflows s.MaxFieldSize.
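In practice that means a too-small s.MaxFieldSize now shows up as an error returned by Script.Run rather than as mysteriously shuffled fields. A small sketch of the check; the deliberately tiny limit is for illustration only, and the nil pattern is assumed to match every record, as in AWK:

    package main

    import (
        "log"
        "os"

        "github.com/spakin/awk"
    )

    func main() {
        s := awk.NewScript()
        s.MaxFieldSize = 1024 // deliberately tight, to provoke overflow on wide input

        // nil pattern: assumed to match every record, as in AWK.
        s.AppendStmt(nil, func(s *awk.Script) { /* per-record work */ })

        // As of commit 93b2afc, a field that exceeds s.MaxFieldSize
        // surfaces here instead of silently corrupting other fields.
        if err := s.Run(os.Stdin); err != nil {
            log.Fatalf("awk: %v (did a field exceed s.MaxFieldSize?)", err)
        }
    }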

— Scott