fatkodima/sidekiq-iteration

CsvEnumerator `.size` can be wrong with quoted embedded newlines in a cell

Opened this issue · 2 comments

There's some great stuff here! I noticed one thing that might be a problem somewhere, but probably isn't in practice:

If a (quoted) CSV cell contains embedded newlines—which unfortunately can happen, especially with web form input or something like option-return in Google Sheets—then using wc naively as in this line will return a count larger than the number of rows:

       count = `wc -l < #{filepath}`.strip.to_i

It may not matter much, but it does mean that the enumerator's .size will be wrong.

Unfortunately, I don't think in general there's a good way to deal with this other than doing a first pass with the same CSV library (or equivalent parsing logic) to get the row count. But as long as size is at least as large as the number of rows, it could just be reasonable to state as a known limitation.

I noticed the array_enumerator method has the same issue, the size will be wrong if any elements are dropped by the cursor because it references the original array and not the pruned one.

- array.each_with_index.drop(cursor || 0).to_enum { array.size }
+ x = array.each_with_index.drop(cursor || 0)
+ x.to_enum { x.size }

I noticed the array_enumerator method has the same issue, the size will be wrong if any elements are dropped by the cursor because it references the original array and not the pruned one.

I basically copied the implementation from the job-iteration. It is done this way for all the enumerators. I think this was done on purpose to get the size of the whole iterable collection. Since we kinda implement and return our own enumerators, we should return the size of the enumerated collection, like ruby built it enumerators do, not depending on which position in it we are now) and it also allows to track progress by comparing current cursor to enum.size). https://github.com/Shopify/job-iteration/blob/10787b962a33ea26c2b75f6eaffd6f773eb06937/test/unit/csv_enumerator_test.rb#L47-L50