np.fromfile() is very slow
OleksiiMatiash opened this issue · 9 comments
I'm porting an app from Python to C# and am now trying to choose a .NET numpy equivalent. The options are NumpyDotNet and NumSharp.
NumpyDotNet is the obvious winner because NumSharp has "not implemented" here and there, and no documentation or samples. But my app needs to read and write lots of data as fast as possible, and here is the problem: NumpyDotNet's np.fromfile() is very slow compared to NumSharp's. Here is a benchmark (throughput in MB/s):
NumpyDotNet:

np.fromfile(fullFilePath, np.UInt8);  // ~150 MB/s
np.fromfile(fullFilePath, np.UInt16); // ~140 MB/s
np.fromfile(fullFilePath, np.UInt32); // ~260 MB/s

byte[] bytes = File.ReadAllBytes(fullFilePath);
return np.frombuffer(bytes, np.UInt8, metadata.dataSize, metadata.dataOffset);  // ~580 MB/s

byte[] bytes = File.ReadAllBytes(fullFilePath);
return np.frombuffer(bytes, np.UInt16, metadata.dataSize, metadata.dataOffset); // ~690 MB/s

byte[] bytes = File.ReadAllBytes(fullFilePath);
return np.frombuffer(bytes, np.UInt32, metadata.dataSize, metadata.dataOffset); // ~690 MB/s
(Don't mind the offset and size; almost the whole array is read.)
NumSharp:
np.fromfile(fullFilePath, NPTypeCode.Int32); // ~2700 MB/s
Reading as NPTypeCode.Int16 is not implemented in NumSharp, so I was unable to measure it.
Python numpy:
np.fromfile(file, np.uint16)  # ~2700 MB/s
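For reference, throughput numbers like these can be measured with a small sketch (Python, hypothetical file name; actual numbers depend heavily on OS file caching):

```python
import time
import numpy as np

def throughput_mb_s(path: str, dtype) -> float:
    # Time a single np.fromfile call and report MB/s.
    t0 = time.perf_counter()
    a = np.fromfile(path, dtype=dtype)
    dt = time.perf_counter() - t0
    return a.nbytes / dt / 1e6

# Create a ~10 MB test file, then measure a read of it.
np.arange(5_000_000, dtype=np.uint16).tofile("bench.bin")
print(f"uint16 read: {throughput_mb_s('bench.bin', np.uint16):.0f} MB/s")
```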
My app mostly works with UInt16, with some Float32 in the middle of the calculation chain, so I need efficient reading/writing of UInt16.
I am on vacation, so I don't have a lot of time to work on this.
I recommend using the ToSerializable/FromSerializable methods to save/restore an ndarray. Then you can use standard .NET XML/JSON serialization to save/restore the data.
If you really need to use fromfile for some reason and the performance does not meet your needs, I suggest writing your own code to open a file and parse/save it.
Use either ndarray.ToSerializable() or np.ToSerializable(ndarray a).
Please look at issue 48 in this repository for example code.
I'm sorry for not mentioning that I need to read/write binary files, i.e. not XML/JSON. To be precise: I need to read a file, create an ndarray from it starting at some small offset and running to the end of the file, do calculations, and write the new data back to the same file at the same offset. Typical file size is 100 MB; typical offset is 100 KB.
In python app I'm doing this:
def readImageData(fileName: str, offset: int, length: int) -> ndarray:
    return np.fromfile(fileName, dtype=np.uint16, count=length, offset=offset)
So it seems to me that To\FromSerializable is not the right choice.
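For reference, the matching write step in Python is just a seek plus tobytes(). A minimal sketch (hypothetical helper name, assuming the header bytes before the offset must be preserved):

```python
import numpy as np

def writeImageData(fileName: str, offset: int, data: np.ndarray) -> None:
    # "r+b" keeps the existing header bytes before `offset` intact.
    with open(fileName, "r+b") as f:
        f.seek(offset)
        f.write(np.ascontiguousarray(data, dtype=np.uint16).tobytes())

# Round trip: an 8-byte fake header, then 16 uint16 values.
arr = np.arange(16, dtype=np.uint16)
with open("demo.bin", "wb") as f:
    f.write(b"\x00" * 8)
    f.write(arr.tobytes())

writeImageData("demo.bin", 8, arr * 2)
back = np.fromfile("demo.bin", dtype=np.uint16, count=16, offset=8)
```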
Below is basically what I am doing internally. I don't have time to write the tofile completely today.
If you can make this go faster, then you have a solution.
[TestMethod]
public void test_OleksiiMatiash_1()
{
    string fileName = "xyz.bin";
    ndarray x = np.arange(0, 25, dtype: np.Int16);
    tofile(x, fileName);

    int length = 100;
    int offset = 10;
    fromfile(fileName, length, offset);
}

private void tofile(ndarray x, string fileName)
{
    System.IO.FileInfo fp = new System.IO.FileInfo(fileName);
    //using (var fs = fp.Create())
    //{
    //    //return NpyArray_ToBinaryStream(self, fs);
    //    //using (var binaryWriter = new System.IO.BinaryWriter(fs))
    //}
}

private ndarray fromfile(string fileName, int length, int offset)
{
    System.IO.FileInfo fp = new System.IO.FileInfo(fileName);
    Int16[] data = new Int16[length - offset];
    using (var fs = fp.OpenRead())
    {
        fs.Seek(offset * sizeof(Int16), System.IO.SeekOrigin.Begin);
        using (System.IO.BinaryReader sr = new System.IO.BinaryReader(fs))
        {
            // ReadInt16 is called once per element, which is slow.
            for (int i = 0; i < data.Length; i++)
            {
                data[i] = sr.ReadInt16();
            }
        }
    }
    return np.array(data);
}
One big difference is that Python/C code can very quickly cast an array of Int16 values to a byte array and do a single fast write of the data. .NET does not like it if you try to cast Int16[] to byte[], so you have to write each value in a loop. That will be slower.
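To illustrate the point about casting: on the numpy side the reinterpretation is a zero-copy view, so the raw buffer can be handed straight to the OS without a per-element loop (a minimal sketch):

```python
import numpy as np

a = np.arange(5, dtype=np.int16)
b = a.view(np.uint8)                 # zero-copy reinterpretation as raw bytes
assert np.shares_memory(a, b)        # same underlying buffer, no copy made
assert b.nbytes == a.nbytes == 10    # 5 int16 values = 10 bytes either way
assert bytes(b) == a.tobytes()
```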
Got it, thank you. Now I'm wondering whether I can get enough speed with .NET at all :(
Here is another idea. Convert the array to bytes first and then write it to disk. See the example below.
I will leave it to you to measure the performance.
[TestMethod]
public void test_OleksiiMatiash_1()
{
    string fileName = "xyz.bin";
    ndarray x = np.arange(0, 25, dtype: np.Int16);
    tofile(x, fileName);

    int length = 100;
    int offset = 10;
    ndarray y = fromfile(fileName, length, offset);
    return;
}

private void tofile(ndarray x, string fileName)
{
    System.IO.FileInfo fp = new System.IO.FileInfo(fileName);
    byte[] b = x.tobytes();
    using (var fs = fp.Create())
    {
        using (var binaryWriter = new System.IO.BinaryWriter(fs))
        {
            binaryWriter.Write(b);
        }
    }
}

private ndarray fromfile(string fileName, int length, int offset)
{
    System.IO.FileInfo fp = new System.IO.FileInfo(fileName);
    byte[] data = null;
    using (var fs = fp.OpenRead())
    {
        fs.Seek(offset * sizeof(Int16), System.IO.SeekOrigin.Begin);
        using (System.IO.BinaryReader sr = new System.IO.BinaryReader(fs))
        {
            // One bulk ReadBytes instead of a per-element loop.
            data = sr.ReadBytes((length - offset) * sizeof(Int16));
        }
    }
    return np.frombuffer(data, dtype: np.Int16);
}
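For comparison, the same seek-then-frombuffer pattern in Python terms (my sketch, mirroring the C# helper's length/offset semantics):

```python
import numpy as np

def fromfile_via_buffer(path: str, length: int, offset: int) -> np.ndarray:
    itemsize = np.dtype(np.int16).itemsize
    with open(path, "rb") as f:
        f.seek(offset * itemsize)                    # skip `offset` elements
        data = f.read((length - offset) * itemsize)  # read the rest as raw bytes
    return np.frombuffer(data, dtype=np.int16)       # reinterpret, no per-element loop

np.arange(25, dtype=np.int16).tofile("xyz.bin")
y = fromfile_via_buffer("xyz.bin", 25, 10)           # elements 10..24
```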
This is the fastest method of all within NumpyDotNet; it achieved 810 MB/s.
So reading is fast enough. I'm unable to test writing speed right now, but I hope it will be on par with reading.
Thank you!