zemirco/json2csv

Streaming API output is 3x larger if streamed into express.response

zlibmaster opened this issue · 5 comments

Hi!
I would appreciate some help with optimizing the generation and download of a CSV.
Inside an Express route I'm converting data to CSV with the following steps:

Option A:

  1. Pipe objects from a Mongo cursor stream into json2csv.
  2. Pipe the json2csv stream into the Express response.
    The downloaded output file is ~36MB.

If I add the following steps between 1) and 2), the file is only ~22MB:

Option B:

  1. Same as above
  2. Pipe json2csv into fs.createWriteStream.
  3. Read the created file with fs.createReadStream.
  4. Pipe the read stream into the Express response.

Bonus Option C:

Generating an xlsx file with the same data using ExcelJS comes to only 12MB. How come? I thought CSV was supposed to be lightweight.
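For reference, Option C is generated roughly like this (a sketch only, assuming ExcelJS's streaming WorkbookWriter and the same fields array as the CSV; my actual code may differ):

  // Sketch: stream rows into an xlsx file using ExcelJS's WorkbookWriter.
  // Assumes fields is an array of field names and cursor is the same Mongo cursor.
  const ExcelJS = require('exceljs');

  const workbook = new ExcelJS.stream.xlsx.WorkbookWriter({ filename: './output.xlsx' });
  const sheet = workbook.addWorksheet('Data');
  sheet.columns = fields.map(f => ({ header: f, key: f }));

  while (await cursor.hasNext()) {
    const dataRow = await cursor.next();
    sheet.addRow(dataRow).commit(); // flush each row as it is written
  }

  sheet.commit();
  await workbook.commit(); // finalize the zipped xlsx container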

Also, the files from Option B vary in size (20-25MB), while Options A and C always come out exactly the same size.
Why is that? Is it something with json2csv, my code, or some encoding optimization in the fs module?
(Obviously Option B has the overhead of going through fs, about 30% more computing time.)

lib version: 5.0.6
node version: 14.14.0
code:

Option A (36MB):

  const input = new Readable({ objectMode: true });
  input._read = () => {};

  const opts = { fields, withBOM: true };
  const transformOpts = {objectMode: true};
  const json2csv = new Json2Csv(opts, transformOpts);

  input.pipe(json2csv).pipe(res);  // express response
  while (await cursor.hasNext()) {
    const dataRow = await cursor.next();
    input.push(dataRow);
  }
  input.push(null);

Option B (20-25MB):

  const input = new Readable({ objectMode: true });
  input._read = () => {};
  const opts = { fields, withBOM: true };
  const transformOpts = {objectMode: true};
  const json2csv = new Json2Csv(opts, transformOpts);
  const write = fs.createWriteStream('./output.csv');
  input.pipe(json2csv).pipe(write);
 
  while (await cursor.hasNext()) {
    const dataRow = await cursor.next();
    input.push(dataRow);
  }
  input.push(null);

  const read = fs.createReadStream('./output.csv');
  read.pipe(res); // express response

Express server configuration:

const server = express();
server.use(bodyParser.urlencoded({extended: false}));
server.use(bodyParser.json());
server.use(helmet());
server.use(compression());

I don't think it matters, but the request to the route is made from the browser with a GET request using fetch, converting the response to a blob, and using window.URL.createObjectURL(file).
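Roughly like this on the browser side (a minimal sketch; the route path and file name are placeholders):

  // Browser side: fetch the CSV, turn it into a blob, and trigger a download.
  // '/export.csv' and 'results.csv' are placeholder names.
  const response = await fetch('/export.csv');
  const file = await response.blob();
  const url = window.URL.createObjectURL(file);

  const a = document.createElement('a');
  a.href = url;
  a.download = 'results.csv';
  a.click();
  window.URL.revokeObjectURL(url);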

Thanks!

From your example it doesn't seem like you're using the streaming API mentioned here https://github.com/zemirco/json2csv/tree/v5#json2csv-async-parser-streaming-api

Thank you for your comment!

You are correct. I'm using the transform stream from here:
https://github.com/zemirco/json2csv/tree/v5#json2csv-transform-streaming-api
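That is, something along these lines (a sketch, assuming Json2Csv in the snippets above is the Transform export from json2csv v5):

  // Assumption: Json2Csv above refers to json2csv v5's Transform streaming API.
  const { Transform } = require('json2csv');
  const json2csv = new Transform(opts, transformOpts);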

I now tested your reference as well in the following code, and it gives the same result as "Option A":

  const input = new Readable({ objectMode: true });
  input._read = () => {};
  const opts = { fields, withBOM: true };
  const transformOpts = {objectMode: true};
  const asyncParser = new AsyncParser(opts, transformOpts);
  const parsingProcessor = asyncParser.fromInput(input).toOutput(res);

  while (await cursor.hasNext()) {
    const dataRow = await cursor.next();
    input.push(dataRow);
  }
  input.push(null);

If the file is bigger, what's different about the content? Can you do a diff of the two files?

Not sure what you intended, but I tried Windows' FC and received:

Comparing files results111.csv and RESULTS222.CSV
Resync Failed.  Files are too different.

Along with poorly encoded gibberish that looks like my data, but I can't understand what the differences are.
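If it helps, a minimal Node sketch like this could locate the first differing line (assuming both files fit in memory; file names as above):

  // Compare the two CSVs line by line and report the first mismatch.
  const fs = require('fs');

  const a = fs.readFileSync('./results111.csv', 'utf8').split('\n');
  const b = fs.readFileSync('./RESULTS222.CSV', 'utf8').split('\n');

  console.log('line counts:', a.length, 'vs', b.length);
  for (let i = 0; i < Math.max(a.length, b.length); i++) {
    if (a[i] !== b[i]) {
      console.log('first difference at line', i + 1);
      console.log('A:', a[i]);
      console.log('B:', b[i]);
      break;
    }
  }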

Also, a few notes I added to my original post:

  1. I edited the sizes; the difference is actually ~1.5x, not 3x.
  2. The smaller files output by Option B vary in size from one another by 5-10%, while the larger files from Option A are always exactly the same size.
  3. The same data written to a file using ExcelJS comes to 50% of the size of Option B.

If the file is bigger, what's different about the content? Can you do a diff of the two files?

After checking again, it appears the content of Option A and Option B is not the same, and some of the data is not saved to the file in Option B.
Not sure why, but since they would presumably be the same size if they contained the same data, this issue is irrelevant.
Anyway, I'm opening a new, related issue about another optimization.

@knownasilya thanks!