np.fromfile() is very slow
OleksiiMatiash opened this issue · 9 comments
I'm porting an app from Python to C# and am now trying to choose a .NET numpy equivalent. The options are NumpyDotNet and NumSharp.
NumpyDotNet is the obvious winner because NumSharp has "not implemented" here and there, and no documentation or samples. But my app needs to read and write lots of data as fast as possible, and here is the problem: NumpyDotNet's np.fromfile() is very slow compared to NumSharp's. Here is a benchmark (throughput in MB/s):
NumpyDotNet:

np.fromfile(fullFilePath, np.UInt8);  // ~150 MB/s
np.fromfile(fullFilePath, np.UInt16); // ~140 MB/s
np.fromfile(fullFilePath, np.UInt32); // ~260 MB/s

byte[] bytes = File.ReadAllBytes(fullFilePath);
return np.frombuffer(bytes, np.UInt8, metadata.dataSize, metadata.dataOffset);  // ~580 MB/s

byte[] bytes = File.ReadAllBytes(fullFilePath);
return np.frombuffer(bytes, np.UInt16, metadata.dataSize, metadata.dataOffset); // ~690 MB/s

byte[] bytes = File.ReadAllBytes(fullFilePath);
return np.frombuffer(bytes, np.UInt32, metadata.dataSize, metadata.dataOffset); // ~690 MB/s
(Don't mind the offset and size; almost the whole array is read.)
NumSharp:
np.fromfile(fullFilePath, NPTypeCode.Int32); // ~2700 MB/s
Reading as NPTypeCode.Int16 is not implemented in NumSharp, so I was unable to measure it.
Python numpy:
np.fromfile(file, np.uint16)  # ~2700 MB/s
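For reference, throughput numbers like these can be measured with a small sketch (Python, hypothetical file name; actual numbers depend heavily on OS file caching):

```python
import time
import numpy as np

def throughput_mb_s(path: str, dtype) -> float:
    # Time a single np.fromfile call and report MB/s.
    t0 = time.perf_counter()
    a = np.fromfile(path, dtype=dtype)
    dt = time.perf_counter() - t0
    return a.nbytes / dt / 1e6

# Create a ~10 MB test file, then measure a read of it.
np.arange(5_000_000, dtype=np.uint16).tofile("bench.bin")
print(f"uint16 read: {throughput_mb_s('bench.bin', np.uint16):.0f} MB/s")
```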
My app mostly works with UInt16, with some Float32 in the middle of the calculation chain, so I need efficient reading/writing of UInt16.
I am on vacation, so I don't have a lot of time to work on this.
I recommend using the ToSerializable/FromSerializable methods to save/restore an ndarray. Then you can use standard .NET XML/JSON serialization to save/restore the data.
If you really need to use fromfile for some reason and the performance does not meet your needs, I suggest writing your own code to open a file and parse/save it.
Use either ndarray.ToSerializable() or np.ToSerializable(ndarray a).
Please look at issue 48 in this repository for example code.
I'm sorry for not mentioning that I need to read/write binary files, i.e. not XML/JSON. To be precise: I need to read a file, create an ndarray from it starting at some small offset and running to the end of the file, do calculations, and write the new data back to the same file at the same offset. Typical file size is 100 MB; typical offset is 100 KB.
In python app I'm doing this:
def readImageData(fileName: str, offset: int, length: int) -> ndarray:
    return np.fromfile(fileName, dtype=np.uint16, count=length, offset=offset)
So it seems to me that To\FromSerializable is not the right choice.
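For reference, the matching write step in Python is just a seek plus tobytes(). A minimal sketch (hypothetical helper name, assuming the header bytes before the offset must be preserved):

```python
import numpy as np

def writeImageData(fileName: str, offset: int, data: np.ndarray) -> None:
    # "r+b" keeps the existing header bytes before `offset` intact.
    with open(fileName, "r+b") as f:
        f.seek(offset)
        f.write(np.ascontiguousarray(data, dtype=np.uint16).tobytes())

# Round trip: an 8-byte fake header, then 16 uint16 values.
arr = np.arange(16, dtype=np.uint16)
with open("demo.bin", "wb") as f:
    f.write(b"\x00" * 8)
    f.write(arr.tobytes())

writeImageData("demo.bin", 8, arr * 2)
back = np.fromfile("demo.bin", dtype=np.uint16, count=16, offset=8)
```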
Below is basically what I am doing internally. I don't have time to write the tofile completely today.
If you can make this go faster, then you have a solution.
[TestMethod]
public void test_OleksiiMatiash_1()
{
    string fileName = "xyz.bin";
    ndarray x = np.arange(0, 25, dtype: np.Int16);
    tofile(x, fileName);

    int length = 100;
    int offset = 10;
    fromfile(fileName, length, offset);
}

private void tofile(ndarray x, string fileName)
{
    System.IO.FileInfo fp = new System.IO.FileInfo(fileName);
    //using (var fs = fp.Create())
    //{
    //    //return NpyArray_ToBinaryStream(self, fs);
    //    //using (var binaryWriter = new System.IO.BinaryWriter(fs))
    //}
}

private ndarray fromfile(string fileName, int length, int offset)
{
    System.IO.FileInfo fp = new System.IO.FileInfo(fileName);
    Int16[] data = new Int16[length - offset];
    using (var fs = fp.OpenRead())
    {
        fs.Seek(offset * sizeof(Int16), System.IO.SeekOrigin.Begin);
        using (System.IO.BinaryReader sr = new System.IO.BinaryReader(fs))
        {
            // ReadInt16 is called once per element, which is slow.
            for (int i = 0; i < data.Length; i++)
            {
                data[i] = sr.ReadInt16();
            }
        }
    }
    return np.array(data);
}
One big difference is that Python/C code can very quickly cast an array of Int16 values to a byte array and do a single fast write of the data. .NET does not like it if you try to cast Int16[] to byte[], so you have to write each value in a loop. That will be slower.
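To illustrate the point about casting: on the numpy side the reinterpretation is a zero-copy view, so the raw buffer can be handed straight to the OS without a per-element loop (a minimal sketch):

```python
import numpy as np

a = np.arange(5, dtype=np.int16)
b = a.view(np.uint8)                 # zero-copy reinterpretation as raw bytes
assert np.shares_memory(a, b)        # same underlying buffer, no copy made
assert b.nbytes == a.nbytes == 10    # 5 int16 values = 10 bytes either way
assert bytes(b) == a.tobytes()
```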
Got it, thank you. Now I'm wondering whether I can get enough speed with .NET at all :(
Here is another idea. Convert the array to bytes first and then write it to disk. See the example below.
I will leave it to you to measure the performance.
[TestMethod]
public void test_OleksiiMatiash_1()
{
    string fileName = "xyz.bin";
    ndarray x = np.arange(0, 25, dtype: np.Int16);
    tofile(x, fileName);

    int length = 100;
    int offset = 10;
    ndarray y = fromfile(fileName, length, offset);
    return;
}

private void tofile(ndarray x, string fileName)
{
    System.IO.FileInfo fp = new System.IO.FileInfo(fileName);
    byte[] b = x.tobytes();
    using (var fs = fp.Create())
    {
        using (var binaryWriter = new System.IO.BinaryWriter(fs))
        {
            binaryWriter.Write(b);
        }
    }
}

private ndarray fromfile(string fileName, int length, int offset)
{
    System.IO.FileInfo fp = new System.IO.FileInfo(fileName);
    byte[] data = null;
    using (var fs = fp.OpenRead())
    {
        fs.Seek(offset * sizeof(Int16), System.IO.SeekOrigin.Begin);
        using (System.IO.BinaryReader sr = new System.IO.BinaryReader(fs))
        {
            // One bulk ReadBytes instead of a per-element loop.
            data = sr.ReadBytes((length - offset) * sizeof(Int16));
        }
    }
    return np.frombuffer(data, dtype: np.Int16);
}
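For comparison, the same seek-then-frombuffer pattern in Python terms (my sketch, mirroring the C# helper's length/offset semantics):

```python
import numpy as np

def fromfile_via_buffer(path: str, length: int, offset: int) -> np.ndarray:
    itemsize = np.dtype(np.int16).itemsize
    with open(path, "rb") as f:
        f.seek(offset * itemsize)                    # skip `offset` elements
        data = f.read((length - offset) * itemsize)  # read the rest as raw bytes
    return np.frombuffer(data, dtype=np.int16)       # reinterpret, no per-element loop

np.arange(25, dtype=np.int16).tofile("xyz.bin")
y = fromfile_via_buffer("xyz.bin", 25, 10)           # elements 10..24
```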
This is the fastest method of all within NumpyDotNet; it achieved 810 MB/s.
So reading is fast enough. I'm unable to test writing speed right now, but I hope it will be on par with reading.
Thank you!