sbinet/npyio

writeData is very slow for array and slice

Closed this issue · 8 comments

case reflect.Array, reflect.Slice:
        // generic path: one reflect.Value and one recursive writeData call per element
        n := rv.Len()
        for i := 0; i < n; i++ {
            elem := rv.Index(i)
            err := writeData(w, elem)
            if err != nil {
                return err
            }
        }
        return nil
    }

Looping over the array/slice and calling writeData (which goes through reflection) for each element is slow. Consider using Go code generation to emit a specialized writeData for each data type.
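
For illustration, a specialized path could type-switch on a few concrete slice types before falling back to reflection. A rough sketch (hypothetical helper, not the library's actual code):

import (
    "encoding/binary"
    "io"
    "reflect"
)

// writeSliceFast is a hypothetical specialized path: it handles a couple of
// concrete slice types with a single binary.Write call and reports whether it
// did, so the caller can fall back to the reflection-based loop otherwise.
func writeSliceFast(w io.Writer, rv reflect.Value) (handled bool, err error) {
    switch v := rv.Interface().(type) {
    case []float64:
        return true, binary.Write(w, binary.LittleEndian, v)
    case []int64:
        return true, binary.Write(w, binary.LittleEndian, v)
    default:
        return false, nil // not handled; caller uses the generic reflect path
    }
}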

would you have a small (or as small as possible) repro?
(phrased another way: what are the timings you experience for a []float64 slice of N elements?)
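
for instance, a benchmark along these lines would do as a repro (just a sketch; the slice length and the use of io.Discard are arbitrary choices):

import (
    "io"
    "testing"

    "github.com/sbinet/npyio"
)

// BenchmarkWriteFloatSlice times npyio.Write for a fixed-size []float64.
func BenchmarkWriteFloatSlice(b *testing.B) {
    data := make([]float64, 1000)
    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        if err := npyio.Write(io.Discard, data); err != nil {
            b.Fatal(err)
        }
    }
}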

could you test 94eda44 ?
I get this kind of improvement:

benchmark                      old ns/op     new ns/op     delta
BenchmarkWriteFloatSlice-4     232706        35049         -84.94%

amazing, thanks!
I think you don't have to do

case []int:
        for _, vv := range v {
            err := binary.Write(w, ble, int64(vv))
            if err != nil {
                return err
            }
        }
        return nil

You can probably just do

case []int:
        err := binary.Write(w, ble, []int(v))

Could you add support for all the slice types that binary.Write supports? Looking at the binary package source, these are the slice types it handles (a rough sketch of wiring them up follows the list):

        case []int8:
        case []uint8:
        case []int16:
        case []uint16:
        case []int32:
        case []uint32:
        case []int64:
        case []uint64:
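
A rough sketch of what I have in mind, collapsing those slice types into one case that hands the whole slice to binary.Write (with w and ble as in the snippets above):

switch v := rv.Interface().(type) {
case []int8, []uint8, []int16, []uint16,
    []int32, []uint32, []int64, []uint64:
    // fixed-size element types: let binary.Write encode the whole slice at once
    return binary.Write(w, ble, v)
}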

My use case: I am serializing a few GB of grayscale images (uint8) into an npy file for training a CNN.
Btw, I really appreciate the library! Being able to do as much as possible in Go has saved me a lot of time.

I tested type-switching on a few slice types (starting with []float64) but it doesn't help,
at least not when I then call binary.Write(w, binary.LittleEndian, v), as binary.Write will type-switch/use reflection under the hood anyway.

so, unless I reach into the guts of binary.Write, that's as far as we can get, ATM.
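
for the record, "reaching into the guts" would mean encoding the bytes by hand, roughly like this for []float64 (a sketch, assuming little-endian output):

import (
    "encoding/binary"
    "io"
    "math"
)

// sketch: encode a []float64 by hand into a single buffer, bypassing the
// type switch/reflection inside binary.Write entirely.
func writeFloat64sRaw(w io.Writer, vs []float64) error {
    buf := make([]byte, 8*len(vs))
    for i, v := range vs {
        binary.LittleEndian.PutUint64(buf[i*8:], math.Float64bits(v))
    }
    _, err := w.Write(buf)
    return err
}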

do you have any timings for before/after the (tentative) fix with these GB-size []uint8 slices?

(alternatively, if you have a code snippet + some input data, I am willing to do the benchmark myself)

With the old implementation, writing 1.3GB took about 40min on a MacBook Pro, at which point I killed the process (I have a few more GB to go and can't wait that long). Instead I generated 700MB of data in 20min, and right now I am training with that 700MB. So those are my very rough numbers.

My code is very trivial, similar to

data := make([]uint8, 1024*1024*1024) // 1 GB
npyio.Write(f, data)

In practice I fill data with grayscale image pixels, but even without that it should produce similar timings.

ok. I've added a fast-path for writing []uint8:

benchmark                      old ns/op     new ns/op     delta
BenchmarkWriteUint8Slice-4     2418          1977          -18.24%

benchmark                      old allocs    new allocs    delta
BenchmarkWriteUint8Slice-4     19            17            -10.53%

benchmark                      old bytes     new bytes     delta
BenchmarkWriteUint8Slice-4     1472          440           -70.11%
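
the gist of that fast-path: []uint8 needs no endianness conversion, so the slice can go straight to the underlying writer in one call. schematically (a sketch of the idea, not the exact committed code):

case []uint8:
    // raw bytes: no per-element conversion, write the whole slice at once
    _, err := w.Write(v)
    return err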

feel free to re-open if that's not sufficiently fast for your use-case.