ericcornelissen/wordrow

Use []byte and buffered readers to parse mapping files and process files

ericcornelissen opened this issue · 1 comments

Implementation update

  • wordrow version: v0.4.0-beta

Description

There is no real reason to work with strings instead of []byte, so it would be nice if the source could be rewritten to work with the []byte read from a file all the way through to writing []byte back to files.

This may also improve performance, as we would not need to convert between the two (?)

Given the progress on the refactor/byte-slices branch (up to 88351c6), this suggestion will be slightly adjusted. Benchmarking of Proof of Concept implementations using bufio.Reader and/or []byte suggests that using readers gives a huge performance boost, whereas using []byte instead of strings gives only a minor performance improvement or none at all.

More specifically, I created the following Proof of Concept implementations:

  • Reader-based CSV parser into a []byte-based mapping struct (*)
  • Reader-based CSV parser into a map[string]string
  • Reader-based MarkDown parser into a []byte-based mapping struct (*)
  • Reader-based MarkDown parser into a map[string]string
  • A []byte-based mapping implementation of the replace package (*)

I then benchmarked them against each other and against the current implementation. The results for the parsers are as follows (**):

| Component | Current | Reader + [][]byte | Reader + map[string]string |
| --- | --- | --- | --- |
| CSV parser | 3572 ns/op, 22 allocs/op | 2068 ns/op, 12 allocs/op | 1750 ns/op, 12 allocs/op |
| MarkDown parser | 4450 ns/op, 20 allocs/op | 2690 ns/op, 7 allocs/op | 2941 ns/op, 9 allocs/op |

Similarly, the results for the replace package are as follows (**). Here, using buffered readers is not beneficial because the text is traversed multiple times and edited on each pass.

| Component | Current (map[string]string) | [][]byte |
| --- | --- | --- |
| package replace | 19121 ns/op, 15 allocs/op | 17355 ns/op, 18 allocs/op |

From these results it is clear that the main advantage in the parsers comes from using readers. The MarkDown parser results suggest that map[string]string has some potential disadvantages (too insignificant to show up in the far simpler CSV parser).

However, the difference between map[string]string and [][]byte in the replace package is rather minor. Since the advantage there is small (and even smaller for the parsers), and since map[string]string is much more intuitive to use when defining a mapping programmatically, the refactoring scope of this issue will be reduced to just using bufio.Reader for parsing mapping files.


(*): map[[]byte][]byte is not possible in Go because slices are not comparable and therefore cannot be used as map keys.
(**): on an 8-core i7 laptop