Use []byte and buffered readers to parse mapping files and process files
ericcornelissen opened this issue · 1 comments
Implementation update
- wordrow version: v0.4.0-beta
Description
There is not really a reason to work with `string`s instead of `[]byte`s, so it would be nice if the source could be rewritten to work with the `[]byte`s read from a file, all the way to writing `[]byte`s back to files.
This may also improve performance as we don't need to switch between the two (?)
Given the progress on `refactor/byte-slices` (up to 88351c6), this suggestion will be slightly adjusted. Benchmarks of Proof of Concept implementations using `bufio.Reader` and/or `[]byte` suggest that using readers gives a large performance boost, whereas using `[]byte` instead of `string`s gives only a minor performance improvement or none at all.
More specifically, I created the following Proof of Concept implementations:
- Reader-based CSV parser into a `[]byte`-based mapping struct (*)
- Reader-based CSV parser into a `map[string]string`
- Reader-based MarkDown parser into a `[]byte`-based mapping struct (*)
- Reader-based MarkDown parser into a `map[string]string`
- A `[]byte`-based mapping implementation of the replace package (*)
I benchmarked them against each other and against the current implementation. The results for the parsers are as follows (**):
Component | Current | Reader + `[][]byte` | Reader + `map[string]string` |
---|---|---|---|
CSV parser | 3572 ns/op, 22 allocs/op | 2068 ns/op, 12 allocs/op | 1750 ns/op, 12 allocs/op |
MarkDown parser | 4450 ns/op, 20 allocs/op | 2690 ns/op, 7 allocs/op | 2941 ns/op, 9 allocs/op |
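For context, figures like the ns/op and allocs/op above are what Go's `testing` package reports for a benchmark function. The following is only a minimal sketch of that mechanism: `countLines` is a hypothetical stand-in for a parser, not one of the Proof of Concept implementations, and the input is made up.

```go
package main

import (
	"bufio"
	"strings"
	"testing"
)

// countLines is a hypothetical stand-in for the code under test.
// It reads its input line by line through a buffered scanner.
func countLines(input string) int {
	scanner := bufio.NewScanner(strings.NewReader(input))
	n := 0
	for scanner.Scan() {
		n++
	}
	return n
}

// BenchmarkCountLines reports time per operation and, because of
// ReportAllocs, allocations per operation.
func BenchmarkCountLines(b *testing.B) {
	input := strings.Repeat("from,to\n", 100) // made-up input
	b.ReportAllocs()
	b.ResetTimer() // exclude the setup above from the measurement
	for i := 0; i < b.N; i++ {
		countLines(input)
	}
}
```

Placed in a `_test.go` file, a function like this is run with `go test -bench . -benchmem`.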
Similarly, the results for the replace package are as follows (**). Here, using buffered readers is not beneficial, as the text is passed over multiple times and edited each time.
Component | Current (`map[string]string`) | `[][]byte` |
---|---|---|
package replace | 19121 ns/op, 15 allocs/op | 17355 ns/op, 18 allocs/op |
From this it is clear that the main advantage in the parsers comes from using readers. The results for the MarkDown parser suggest that `map[string]string` has some potential disadvantages (not visible in the far simpler CSV parser, where the `map[string]string` variant was actually the faster of the two).
However, the difference between `map[string]string` and `[][]byte` in the replace package is rather minor. Since the advantage there is small (and even smaller for the parsers), and since `map[string]string` is much more intuitive for defining a mapping programmatically, the refactoring scope of this issue will be reduced to just using `bufio.Reader` for parsing mapping files.
(*): `map[[]byte][]byte` is not possible in Go, because slices are not comparable and therefore cannot be used as map keys.
(**): on an 8-core i7 laptop