adaltas/node-csv

csv-parse fails 'randomly' on multi-character column delimiter and row terminator

metarama opened this issue · 3 comments

Describe the bug

Thank you for the great work on csv-parse.

I am trying to import a CSV file (over 850 rows with 51 columns) located in an AWS S3 folder. At the moment I am processing one row at a time.

I see a column count mismatch error happening on different rows, and the failing row keeps moving around: sometimes the 11th row fails, sometimes the 439th. Whenever a row fails, I can see that the delimiter '|@|' has been taken in as part of a field. For example, a failing row comes in as '11|@|543E9108-177F-4402-B423-AF9A006A6698', '9', '00000000014621DD', ... The delimiter was not processed correctly, the first two fields got fused together, and as a result the column count does not match, hence the error. The key observation is that the location of this error moves around.

Sometimes the delimiter all by itself becomes a field, as in this example of parsed row data: '2023-11-29 11:08:25.2150201 -05:00', '|@|', 'Boot', '', 'Testing, ...

I have not looked at the code, but based on my observations I would guess that the delimiter (and terminator) detection misses the case where the delimiter (or terminator) character sequence spans the current buffer boundary. The partial delimiter sequence at the end of one buffer, or at the beginning of the next, gets handled like regular field characters and is fused into a field.

Single-character delimiters and terminators, on the other hand, are easy to process. Multi-character delimiters and record terminators need extra care, and I suspect that care is missing.

To Reproduce

import { parse, Options } from 'csv-parse';
import { Readable } from 'node:stream';

// `response` is assumed to be the S3 GetObject response whose Body is the object stream.
const parser = parse({
  delimiter: '|@|',
  record_delimiter: '|$|',
  columns: true,
  from: 1,
  relax_quotes: false,
  encoding: 'utf8',
  quote: '"',
  ignore_last_delimiters: true,
  trim: true,
  bom: true,
} as Options);

const s3Stream = Readable.from(response.Body as AsyncIterable<Uint8Array>);
const csvStream = s3Stream.pipe(parser);

for await (const row of csvStream) {
  // process the row
}
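
If my guess about the buffer boundary is right, the failure should also be reproducible deterministically, without S3, by splitting the input in the middle of a '|@|' sequence and feeding the two pieces to the parser separately. Below is a minimal sketch along those lines; the sample data and the split point are made up for illustration. If the boundary is handled correctly, both data rows should print; otherwise the column count error should appear.

    import { parse, Options } from 'csv-parse';

    // A header plus two records, using '|@|' as the field delimiter and
    // '|$|' as the record delimiter (sample data made up for illustration).
    const data = 'id|@|guid|@|name|$|1|@|543E9108|@|alpha|$|2|@|66AA1234|@|beta';

    // Split the input so the second chunk begins in the middle of a '|@|' sequence.
    const splitAt = data.indexOf('|@|', 20) + 1; // cut between '|' and '@'
    const chunks = [data.slice(0, splitAt), data.slice(splitAt)];

    const parser = parse({
      delimiter: '|@|',
      record_delimiter: '|$|',
      columns: true,
    } as Options);

    parser.on('readable', () => {
      let record;
      while ((record = parser.read()) !== null) {
        console.log(record);
      }
    });
    parser.on('error', (err) => console.error('parse error:', err.message));
    parser.on('end', () => console.log('done'));

    // Feed the chunks separately so the delimiter spans a buffer boundary.
    for (const chunk of chunks) {
      parser.write(chunk);
    }
    parser.end();

Writing the two chunks separately mimics whatever chunk sizes the S3 stream happens to deliver.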


Below is the __isDelimiter function defined at line 596 of node-csv/packages/csv-parse/lib/api/index.js.

Note the line 'if(del[j] !== buf[pos+j]) continue loop1;'. What if pos+j exceeds the buffer length? Should this line be enhanced to handle the delimiter spilling over the end of the buffer?

    __isDelimiter: function(buf, pos, chr){
      const {delimiter, ignore_last_delimiters} = this.options;
      if(ignore_last_delimiters === true && this.state.record.length === this.options.columns.length - 1){
        return 0;
      }else if(ignore_last_delimiters !== false && typeof ignore_last_delimiters === 'number' && this.state.record.length === ignore_last_delimiters - 1){
        return 0;
      }
      loop1: for(let i = 0; i < delimiter.length; i++){
        const del = delimiter[i];
        if(del[0] === chr){
          for(let j = 1; j < del.length; j++){
            if(del[j] !== buf[pos+j]) continue loop1;
          }
          return del.length;
        }
      }
      return 0;
    },

I see the identical problem in the __isRecordDelimiter function defined at line 614 of the same file mentioned above.
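
To make the question concrete, here is a self-contained sketch of the kind of boundary guard I have in mind. It is not a patch for the real __isDelimiter (which also handles ignore_last_delimiters and an array of delimiters); the helper name and the -1 'need more data' sentinel are my own.

    // Returns the delimiter length on a full match, 0 on a definite mismatch,
    // and -1 when the buffer ends before the comparison can finish (the caller
    // should then wait for more data instead of treating the partial sequence
    // as ordinary field characters).
    const matchDelimiter = (buf: Buffer, pos: number, del: Buffer): number => {
      for (let j = 0; j < del.length; j++) {
        if (pos + j >= buf.length) return -1; // delimiter may continue in the next chunk
        if (del[j] !== buf[pos + j]) return 0; // definitely not the delimiter
      }
      return del.length;
    };

    // '|@|' split across the end of a chunk:
    const del = Buffer.from('|@|');
    console.log(matchDelimiter(Buffer.from('abc|@|def'), 3, del)); // 3  -> full match
    console.log(matchDelimiter(Buffer.from('abc|@'), 3, del));     // -1 -> need more data
    console.log(matchDelimiter(Buffer.from('abc|x|def'), 3, del)); // 0  -> not the delimiter

If the matcher could report that it ran out of buffer, the caller could keep the tail of the current chunk and retry once the next chunk arrives instead of fusing the partial delimiter into a field.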

What if pos+j exceeds the buffer length?

There is a function that checks just for this. It is called __needMoreData. Look inside if you think that could be the issue.
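
Loosely sketched, the idea behind such a check is to stop consuming bytes once the read position gets within one maximum-token length of the end of the buffered data, unless the stream has already ended. The names and the exact condition below are illustrative only, not the library's actual code.

    // Illustrative only: pause parsing near the end of the buffered data so a
    // multi-character delimiter can never be cut in half by a chunk boundary.
    const needMoreData = (
      pos: number,          // current read position in the buffer
      bufLen: number,       // length of the currently buffered data
      streamEnded: boolean, // true once the source stream has finished
      maxTokenLen: number   // longest of delimiter, record_delimiter, quote, ...
    ): boolean => {
      if (streamEnded) return false;     // nothing more is coming, parse what we have
      return bufLen - pos < maxTokenLen; // too close to the edge, wait for the next chunk
    };

With delimiter '|@|' and record_delimiter '|$|' the longest token is three bytes, so the tail of each intermediate chunk would be carried over and prepended to the next one before parsing continues.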