Memory Exception When Merging Large Volumes of Waveform Data Files Using wfdb.wrsamp()
DishanH opened this issue · 2 comments
I'm trying to merge multiple waveform data (.dat) files into a single file using the wfdb.wrsamp() function. There are approximately 10,000 files, each with 3 channels. I've tried several times, but every attempt results in a memory exception, requiring more than 40 GB of memory. I'm unsure whether I'm doing something incorrectly.
I've been unable to find a way to write the files incrementally. My current approach is to read each record, combine all the signals into one array, and write it out. This works fine with a small number of files, but I'm having difficulties with larger datasets. Each file contains over 6 minutes of data.
Any assistance, insights, or suggestions on this matter would be highly appreciated.
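For reference, one generic way to write a large binary file incrementally, outside the library, is to pack each record's samples to little-endian 16-bit bytes and append them to the output file, so only one record is held in memory at a time. A minimal sketch with made-up arrays (this is plain file I/O, not the wfdb API):

```python
import numpy as np

# Illustrative pieces standing in for per-record signal arrays
pieces = [np.array([[0, 1], [2, 3]], dtype=np.int16),
          np.array([[4, 5]], dtype=np.int16)]

with open("merged.dat", "wb") as f:
    for piece in pieces:
        # Append this piece's raw little-endian int16 bytes; nothing
        # else stays in memory between iterations
        f.write(piece.astype("<i2").tobytes())

# Reading the file back recovers the concatenated samples
merged = np.fromfile("merged.dat", dtype="<i2").reshape(-1, 2)
```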
I have modified the library to process the data in chunks instead of concatenating everything at once, cutting memory usage to roughly a third of the original.
chunk_size = 1_000_000
n_chunks = -(-len(d_signal) // chunk_size)  # ceiling division
b_write = np.zeros((0,), dtype=np.uint8)
for p, i in enumerate(range(0, len(d_signal), chunk_size)):
    print(f"{p} of {n_chunks}")
    chunk = d_signal[i:i + chunk_size]
    b1 = chunk & [255] * tsamps_per_frame
    b2 = (chunk & [65280] * tsamps_per_frame) >> 8
    # Interleave the bytes so that each sample's bytes are consecutive
    b1 = b1.reshape((-1, 1))
    b2 = b2.reshape((-1, 1))
    chunk_bytes = np.concatenate((b1, b2), axis=1).reshape(-1)
    # Convert to unsigned 8-bit dtype to write
    chunk_bytes = chunk_bytes.astype("uint8")
    b_write = np.concatenate((b_write, chunk_bytes))
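One side note on the chunking code above: growing b_write with np.concatenate on every iteration re-copies all previously accumulated bytes each time, so the total work is quadratic in the output size. Appending the chunks to a list and concatenating once at the end gives the same result with a single final copy. A small self-contained illustration (not the library's code):

```python
import numpy as np

chunks = [np.arange(4, dtype=np.uint8) for _ in range(3)]

# Growing pattern: each concatenate copies everything accumulated so far
grown = np.zeros((0,), dtype=np.uint8)
for c in chunks:
    grown = np.concatenate((grown, c))

# List-then-join pattern: one copy at the end, same result
joined = np.concatenate(chunks)
```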
Thanks! Just to be clear, I assume you're talking about the function wr_dat_file, and your code would replace the code at lines 2381 to 2392 (following elif fmt == "16").
The existing code looks to me like it's a lot more complicated than it needs to be. I'm sure that your replacement code is more efficient, but I also suspect that the entire thing could be replaced with just one or two numpy function calls - there's no need to make so many copies of the data.
Compare this with how format 80 is handled (see the code under if fmt == "80"). Format 16 could probably be handled in a very similar way - we don't need to add an offset in that case, but we do need to convert the samples to little-endian 16-bit integers and then reinterpret them as an array of bytes.
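That conversion can indeed be expressed in a couple of numpy calls. A hypothetical helper sketching the idea (not wfdb's actual wr_dat_file code): astype("<i2") coerces the samples to little-endian 16-bit integers in one copy, and view(np.uint8) reinterprets that same buffer as bytes without copying again.

```python
import numpy as np

def pack_fmt16(d_signal):
    """Pack digital samples as little-endian 16-bit two's complement bytes.

    Hypothetical sketch of the suggestion above, not library code.
    """
    # astype makes one little-endian int16 copy; view reinterprets it
    # in place as raw bytes; reshape flattens to the byte stream to write
    return d_signal.astype("<i2").view(np.uint8).reshape(-1)
```

For example, pack_fmt16(np.array([1, 256])) yields the bytes [1, 0, 0, 1]: each sample's low byte comes first, which is exactly the interleaving the masking-and-shifting code builds by hand.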
Please consider opening a pull request with your changes.