mrandrewmills/Fixed-Width-Text-File-Toolkit

de-duplication enhancement?

mrandrewmills opened this issue · 2 comments

Sometimes when you have a series of fixed width text files that you need to combine, you can wind up with duplicated rows of data. I'm considering adding a deduplication function. Thoughts?

Revisited this idea tonight, and it seems much clearer than it did before.

Very similar to the approach I used for the filter, but using the end result buffer as the "check against" list instead. That way you don't radically increase the RAM usage. The question is how well it scales, in terms of processing time, against large sample files.
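As a rough illustration of the "check against the result buffer" idea, here is a minimal Python sketch. It is not the toolkit's code: the function name `append_unique` and the sample rows are made up for the example, and each row is assumed to be a plain string.

```python
def append_unique(buffer, rows):
    """Append each row to buffer unless an identical row is already in it."""
    for row in rows:
        # The result buffer doubles as the "check against" list, so no
        # second copy of the data is kept in memory.
        if row not in buffer:  # linear scan of everything imported so far
            buffer.append(row)
    return buffer


# Hypothetical fixed width rows from two files that share one record.
file_a = ["0001ALICE     ", "0002BOB       "]
file_b = ["0002BOB       ", "0003CAROL     "]

result = []
append_unique(result, file_a)
append_unique(result, file_b)
```

Because every membership check scans the whole buffer, the total work grows roughly with the square of the row count, which is where the scaling concern comes from. A set of rows already seen would make each check fast, at the cost of the extra memory this approach is trying to avoid.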

Initially, my approach was to filter out duplicates during the import process. Then I found some code on StackOverflow that made it trivial to filter out the duplicates after the import was done, so I opted for that approach instead.
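For anyone curious what filtering after the import can look like, here is a small Python sketch. It is only an illustration, not the StackOverflow code mentioned above: the name `dedupe_rows` and the sample data are assumptions, and rows are assumed to be plain strings.

```python
def dedupe_rows(rows):
    """Return rows with exact duplicates removed, keeping first occurrences."""
    # dict keys are unique and (in Python 3.7+) keep insertion order,
    # so this drops repeats without reordering the remaining rows.
    return list(dict.fromkeys(rows))


# Hypothetical rows as they might look after combining several files.
imported = [
    "0001ALICE     ",
    "0002BOB       ",
    "0002BOB       ",
    "0003CAROL     ",
    "0001ALICE     ",
]

unique_rows = dedupe_rows(imported)
```

This does a single pass over the data once everything is loaded. The trade-off is that the duplicates sit in memory until that pass runs.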