Opening and destroying the same entry causes high memory usage
noreng opened this issue · 4 comments
I need to read the first row of a zipped CSV file to create a table in a PostgreSQL database, and then pg-copy the whole CSV file into the table.
I open the same zip entry twice with zipFile.openReadStream. After I read the first chunk, I destroy the stream (with readStream.destroy()) and open it again. Everything works as I expected, except the memory usage:
I have a ~1.6 GB CSV file compressed into a 110 MB zip. When I open the entry for the first time, the maximum memory usage (RSS) is 120 MB. After re-opening the same entry, the RSS climbs from 120 MB to 1.7 GB. If I don't destroy the stream on the first run and instead let it finish, the maximum RSS is 150 MB and it stays that low after I reopen the entry, so I don't think the problem is with re-opening the entry. There must be something wrong with how I destroy it.
My system: Windows 10, Node 10.
An excerpt of my code:
const yauzl = require('yauzl');
const fs = require('fs');
const promisify = require('es6-promisify');

(async function () {
  const fileBuffer = await promisify(fs.readFile)('./large.zip');
  console.log('started');

  const yauzlFromBuffer = promisify(yauzl.fromBuffer);
  const zipfile = await yauzlFromBuffer(fileBuffer, { lazyEntries: true });
  const openReadStream = promisify(zipfile.openReadStream.bind(zipfile));

  zipfile.readEntry();
  zipfile.on('entry', async (entry) => {
    console.log('1) read the entry');
    const stream = await openReadStream(entry);
    // read the first chunk only
    const justFirstChunk = true;
    await readEntry(stream, entry.uncompressedSize, justFirstChunk);
    // This is where I handle the first row

    console.log('2) read the same entry again');
    const stream2 = await openReadStream(entry);
    await readEntry(stream2, entry.uncompressedSize);
    zipfile.readEntry();
  });

  await new Promise(resolve => {
    zipfile.on('end', () => {
      console.log('finished');
      resolve();
    });
  });
}());

async function readEntry(readStream, entrySize, justFirstChunk) {
  logMemoryUsage();
  const onData = getProgressHandler(entrySize);
  await new Promise((resolve, reject) => {
    readStream
      .on('error', reject)
      .on('data', chunk => {
        // destroy the stream after the first chunk
        if (justFirstChunk) {
          readStream.destroy();
          console.log('readEntry destroyed');
          logMemoryUsage();
          resolve();
        } else {
          onData(chunk);
        }
      })
      .on('end', () => {
        console.log('readEntry end');
        resolve();
      });
  });
}

function getProgressHandler(entrySize) {
  let bytes = 0;
  let step = 0.1;
  let nextProgressStep = bytes + step;
  return function (chunk) {
    bytes += chunk.length;
    const progress = Math.round(bytes / entrySize * 100) / 100;
    if (progress >= nextProgressStep) {
      console.log(progress);
      logMemoryUsage();
      nextProgressStep = ((nextProgressStep / step) + 1) * step;
    }
  };
}

function logMemoryUsage() {
  const memoryMB = Math.round(process.memoryUsage().rss / (1024 * 1024));
  console.log('memoryUsage:', memoryMB, 'MB');
}
Memory usage:
started
1) read the entry
memoryUsage: 121 MB
readEntry destroyed
memoryUsage: 121 MB
2) read the same entry again
memoryUsage: 121 MB
0.1
memoryUsage: 311 MB
0.2
memoryUsage: 479 MB
0.31
memoryUsage: 669 MB
0.4
memoryUsage: 818 MB
0.5
memoryUsage: 982 MB
0.61
memoryUsage: 1157 MB
0.71
memoryUsage: 1339 MB
0.81
memoryUsage: 1506 MB
0.91
memoryUsage: 1672 MB
readEntry end
finished
I've reproduced the issue. I'll look into this and get back to you.
I was able to resolve the issue by modifying the BufferSlicer.prototype.createReadStream implementation in yauzl's fd-slicer dependency. It appears that the problem comes from giving the zlib.createInflateRaw() stream a very large buffer in one chunk. If you cut the buffer up into 0x10000-byte chunks, there is no problem.
I'll make a PR against fd-slicer with this feature, and then update yauzl to depend on it.
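To illustrate the idea described above, here is a minimal, hypothetical sketch (not the actual fd-slicer patch): a Readable that serves a large in-memory buffer in 0x10000-byte chunks, which can then be piped into zlib.createInflateRaw() so zlib never receives the whole buffer in a single write. The ChunkedBufferStream name and the chunkSize parameter are illustrative assumptions, not yauzl or fd-slicer API.

const zlib = require('zlib');
const { Readable } = require('stream');

// Hypothetical example, not the real fd-slicer implementation:
// emit a large buffer as a sequence of small chunks instead of one big push.
class ChunkedBufferStream extends Readable {
  constructor(buffer, chunkSize = 0x10000) {
    super();
    this.buffer = buffer;
    this.chunkSize = chunkSize;
    this.offset = 0;
  }
  _read() {
    if (this.offset >= this.buffer.length) {
      this.push(null); // signal end of stream
      return;
    }
    const end = Math.min(this.offset + this.chunkSize, this.buffer.length);
    this.push(this.buffer.slice(this.offset, end));
    this.offset = end;
  }
}

// Usage: pipe the chunked stream through inflateRaw so zlib only ever
// sees 64 KiB of compressed data at a time, e.g.:
// new ChunkedBufferStream(compressedBuffer).pipe(zlib.createInflateRaw());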
Published yauzl 2.9.2, which should fix this issue. Give it a try and let me know what you observe.
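For anyone following along, picking up the fix should just be a matter of installing the new release (assuming a standard npm setup):

npm install yauzl@2.9.2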
I tested it and it works fine, even with larger files (tested with a 300 MB zip). The memory usage stays steady and low. Many thanks!