de-duplication enhancement?

Question

mrandrewmills opened this issue 10 years ago · 2 comments

Sometimes when you have a series of fixed width text files that you need to combine, you can wind up with duplicated rows of data. I'm considering adding a deduplication function. Thoughts?

Answer 1 · 2015-07-24T03:13:32.000Z

Revisited this idea tonight, and it seems much clearer than it did before.

It's very similar to the approach I used for the filter, but it uses the end result buffer as the "check against" list instead, so that RAM usage doesn't increase radically. The question is how well it scales, in terms of processing time, against large sample files.
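To make the trade-off concrete, here is a minimal Python sketch of checking each incoming row against the result buffer during import. The function names and the list-of-strings buffer are hypothetical and are not taken from this project's code. The first version uses no extra memory but scans the buffer for every row, so its cost grows roughly quadratically with the number of rows; the second keeps a set of rows already seen, which spends extra RAM to make each check constant time on average.

```python
def import_rows_dedup(lines, buffer=None):
    """Append each row to the buffer unless an identical row is already there.

    No extra memory is used, but `row not in buffer` is a linear scan,
    so total work is roughly O(n^2) for n rows.
    """
    if buffer is None:
        buffer = []
    for line in lines:
        # Strip only the line terminator; padding spaces in a
        # fixed width row are part of the data.
        row = line.rstrip("\r\n")
        if row not in buffer:
            buffer.append(row)
    return buffer


def import_rows_dedup_indexed(lines, buffer=None):
    """Same result, but membership is checked against a set of seen rows.

    Costs extra memory for the set, in exchange for O(1) average lookups.
    """
    if buffer is None:
        buffer = []
    seen = set(buffer)
    for line in lines:
        row = line.rstrip("\r\n")
        if row not in seen:
            seen.add(row)
            buffer.append(row)
    return buffer
```

Calling either function once per file with the same buffer combines the files while skipping rows that were already imported.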

Answer 2 · 2016-02-10T03:44:52.000Z

Initially, my approach was to filter out duplicates during the import process. Then I found some code on StackOverflow that made filtering out the duplicates after the import had been done trivial, so I opted for that approach instead.
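For illustration only, deduplicating after the import can be as short as the Python sketch below. This is not the StackOverflow code mentioned above, which isn't quoted in this thread; the function name and the assumption that rows are plain strings are hypothetical.

```python
def dedupe_rows(rows):
    """Return the rows with duplicates removed, keeping first occurrences in order.

    dict keys are unique and, since Python 3.7, keep insertion order,
    so this drops later repeats without reordering the data.
    """
    return list(dict.fromkeys(rows))


combined = ["AAA 001", "BBB 002", "BBB 002", "CCC 003", "AAA 001"]
unique_rows = dedupe_rows(combined)
```

Doing it after the import keeps the import code simple, at the cost of holding the duplicated rows in memory until the cleanup pass runs.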