Use []byte and buffered readers to parse mapping files and process files

Question

ericcornelissen opened this issue 4 years ago · 1 comment

Implementation update

wordrow version: v0.4.0-beta

Description

There is no real reason to work with strings instead of []byte, so it would be nice if the source could be rewritten to work with the []byte read from a file all the way through to writing []byte back to files.

This may also improve performance, as we would no longer need to convert between the two (?)

Answer 1 · 2020-12-17T14:30:12.000Z

Given the progress on the refactor/byte-slices branch (up to 88351c6), this suggestion will be slightly adjusted. Benchmarks of Proof of Concept implementations using bufio.Reader and/or []byte suggest that using readers gives a large performance boost (roughly 35 to 50% less time per operation in the parser benchmarks below), whereas using []byte instead of strings gives only a minor improvement or none at all.

More specifically, I created the following Proof of Concept implementations:

- Reader-based CSV parser into a []byte-based mapping struct (*)
- Reader-based CSV parser into a map[string]string
- Reader-based MarkDown parser into a []byte-based mapping struct (*)
- Reader-based MarkDown parser into a map[string]string
- A []byte-based mapping implementation of the replace package (*)

I benchmarked these against each other and against the current implementation. The results for the parsers are as follows (**):

| Component | Current | Reader + `[][]byte` | Reader + `map[string]string` |
| --- | --- | --- | --- |
| CSV parser | 3572 ns/op, 22 allocs/op | 2068 ns/op, 12 allocs/op | 1750 ns/op, 12 allocs/op |
| MarkDown parser | 4450 ns/op, 20 allocs/op | 2690 ns/op, 7 allocs/op | 2941 ns/op, 9 allocs/op |
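For readers who want to produce numbers of the same shape (ns/op and allocs/op), here is a minimal sketch using `testing.Benchmark` from the standard library. The function `countLines` is a hypothetical stand-in for a parser under test, not one of the Proof of Concept parsers, and the input is made up.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
	"testing"
)

// countLines is a hypothetical stand-in for a parser under test: it counts
// the non-empty lines read through a buffered scanner.
func countLines(input string) int {
	scanner := bufio.NewScanner(strings.NewReader(input))
	n := 0
	for scanner.Scan() {
		if len(scanner.Bytes()) > 0 {
			n++
		}
	}
	return n
}

func main() {
	input := "cat,dog\nhorse,zebra\n"
	// testing.Benchmark runs the function with increasing b.N until the
	// timing is stable, outside of the "go test" command.
	result := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			countLines(input)
		}
	})
	fmt.Printf("%d ns/op, %d allocs/op\n", result.NsPerOp(), result.AllocsPerOp())
}
```

The absolute numbers depend on the machine and input, so only comparisons made on the same setup are meaningful.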

Similarly, the results for the replace package are as follows (**). Here, using buffered readers is not beneficial, as the text is passed over multiple times and edited each time.

| Component | Current (`map[string]string`) | `[][]byte` |
| --- | --- | --- |
| `package replace` | 19121 ns/op, 15 allocs/op | 17355 ns/op, 18 allocs/op |
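To make the "multiple passes" point concrete, here is a simplified sketch of both variants. It is not the actual replace package: the function names are hypothetical, and the `[][]byte` layout (alternating from and to values) is an assumption made for the example.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// replaceAllString applies every from/to pair in turn. Each pass walks the
// whole text and yields a new string. Map iteration order is unspecified,
// so overlapping pairs could give different results between runs.
func replaceAllString(text string, mapping map[string]string) string {
	for from, to := range mapping {
		text = strings.ReplaceAll(text, from, to)
	}
	return text
}

// replaceAllBytes does the same for pairs stored in the assumed layout
// [][]byte{from1, to1, from2, to2, ...}.
func replaceAllBytes(text []byte, pairs [][]byte) []byte {
	for i := 0; i+1 < len(pairs); i += 2 {
		text = bytes.ReplaceAll(text, pairs[i], pairs[i+1])
	}
	return text
}

func main() {
	mapping := map[string]string{"cat": "dog", "zebra": "horse"}
	fmt.Println(replaceAllString("a cat and a zebra", mapping)) // a dog and a horse

	pairs := [][]byte{[]byte("cat"), []byte("dog"), []byte("zebra"), []byte("horse")}
	fmt.Printf("%s\n", replaceAllBytes([]byte("a cat and a zebra"), pairs)) // a dog and a horse
}
```

Because every pass needs the full, already-edited text, streaming the input through a reader does not help here.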

From this it is clear that the main advantage in the parsers comes from using readers. The MarkDown parser results suggest that map[string]string has a potential disadvantage (2941 ns/op and 9 allocs/op versus 2690 ns/op and 7 allocs/op for [][]byte), one too small to show up in the far simpler CSV parser.

However, the difference between map[string]string and [][]byte in the replace package is rather minor. Since that advantage is small, and even smaller for the parsers, and since map[string]string is much more intuitive for defining a mapping programmatically, the scope of this issue will be reduced to just using bufio.Reader for parsing mapping files.

(*): map[[]byte][]byte is not possible in Go, because slices are not comparable and so cannot be used as map keys.
(**): on an 8-core i7 laptop
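Regarding footnote (*), the sketch below shows one possible shape for a []byte-based mapping struct, plus the common workaround of keying a map by `string(key)`. The type `byteMapping` and its methods are hypothetical and are not the struct used in the Proof of Concept.

```go
package main

import (
	"bytes"
	"fmt"
)

// byteMapping is a hypothetical []byte-based mapping that stores the from
// and to values in parallel slices, since []byte cannot be a map key.
type byteMapping struct {
	from [][]byte
	to   [][]byte
}

// add appends a from/to pair to the mapping.
func (m *byteMapping) add(from, to []byte) {
	m.from = append(m.from, from)
	m.to = append(m.to, to)
}

// get does a linear search for from and reports whether it was found.
func (m *byteMapping) get(from []byte) ([]byte, bool) {
	for i, f := range m.from {
		if bytes.Equal(f, from) {
			return m.to[i], true
		}
	}
	return nil, false
}

func main() {
	var m byteMapping
	m.add([]byte("cat"), []byte("dog"))
	to, ok := m.get([]byte("cat"))
	fmt.Printf("%s %v\n", to, ok) // dog true

	// Workaround: convert the []byte key to a string for the lookup.
	lookup := map[string][]byte{"cat": []byte("dog")}
	key := []byte("cat")
	fmt.Printf("%s\n", lookup[string(key)]) // dog
}
```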