Differentiate Between Duplicate Headers

Question

sshaw opened this issue 4 years ago · 7 comments

Duplicate headers are a legitimate use case for end users. The header X can be "unique" within the context of its neighboring columns and/or rows. Having to come up with an alternate, artificial name can be confusing for end users and maybe even developers.

I see in v2 one can disable the duplicate headers option, but it appears there's no way to access the underlying "duplicates". Ruby's CSV library allows one to specify an offset into a Row.

Is it possible to support reading rows with duplicates in SmarterCSV 2?
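
For reference, the offset lookup in Ruby's standard CSV library looks roughly like this. It is a minimal sketch on made-up sample data; the second argument to CSV::Row#[] is the minimum column index at which the search for the header starts.

```ruby
require "csv"

# Illustrative sample data: the same header repeated three times.
csv_text = "name,name,name\ntim,joe,tom\n"

table = CSV.parse(csv_text, headers: true)
row = table[0] # first data row, as a CSV::Row

first_name  = row["name"]    # first field whose header is "name"
second_name = row["name", 1] # first "name" field at column index 1 or later
third_name  = row["name", 2] # first "name" field at column index 2 or later

puts [first_name, second_name, third_name].inspect
```

All three values stay reachable even though the header text is identical, which is the behavior the question asks about.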

Answer 1 · 2021-02-13T17:45:04.000Z

I also see #111 but it's not clear what the solution is or if it's applicable to this.

Answer 2 · 2022-04-25T03:22:10.000Z

An offset into a row seems a bit obscure.

@sshaw how about adding a suffix to the name of the duplicate header? Would that work?

e.g.

name,name,name
tim,joe,tom

=> { name: 'tim', name_2: 'joe', name_3: 'tom'}

?

Answer 3 · 2022-04-26T14:23:24.000Z

> @sshaw how about adding a suffix to the name of the duplicate header.. would that work?
> ...
> => { name: 'tim', name_2: 'joe', name_3: 'tom'}

Hi!

Seems odd, but I'm not necessarily opposed to it. What is the downside to this versus using the same header name and making the value an Array? E.g., { name: %w[tim joe tom] }? This feels more "natural", and one can more easily determine whether there are multiple values:

data[:name].is_a?(Array)
data[:name].each { ... }

vs

data.keys.find { |key| key =~ /\Aname_\d+\z/ }
data.keys.select { |key| key =~ /\Aname(?:_\d+)?\z/ }.each { |key| data[key] }

Thoughts?
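
To make the Array idea concrete, here is a rough sketch built on Ruby's standard CSV library rather than smarter_csv. The helper name row_to_grouped_hash and the sample data (including the extra city column) are made up for illustration.

```ruby
require "csv"

# Hypothetical helper (not part of smarter_csv): collapse a CSV::Row into a
# hash, collecting the values of repeated headers into an Array.
def row_to_grouped_hash(row)
  grouped = Hash.new { |hash, key| hash[key] = [] }
  row.each { |header, value| grouped[header.to_sym] << value }
  # Unwrap single-element arrays so unique headers keep a plain value.
  grouped.transform_values { |values| values.length == 1 ? values.first : values }
end

table = CSV.parse("name,name,name,city\ntim,joe,tom,Oslo\n", headers: true)
data = row_to_grouped_hash(table[0])

puts data[:name].is_a?(Array) # repeated header, so the value is an Array
puts data[:city].is_a?(Array) # unique header, so the value is a plain String
```

The trade-off is that every consumer then has to check whether a value is an Array before using it.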

Answer 4 · 2022-04-26T18:59:40.000Z

@sshaw in my understanding, typical CSV files are "flat" in the sense that all keys are on the same level.

The premise of smarter_csv is to generate a "flat" hash of the line in the CSV file, that can be used right away for processing or insertion into a data store.

Using { name: %w[tim joe tom] } seems odd in that respect.

I've never encountered CSV files with duplicate headers.
I'd probably also use key_mapping for a file with duplicate headers, and do something like this:

   data = SmarterCSV.process(filename, {duplicate_header_suffix: '', key_mapping: {name2: :best_friend, name3: :nemesis}})
   
   => { name: 'tim', best_friend: 'joe', nemesis: 'tom' }

What do you think?
Maybe it would help me to have a sample to better understand your use case.

Answer 5 · 2022-04-26T23:08:34.000Z

I guess duplicate headers in CSV files aren't illegal, although they seem irregular. It definitely breaks the hash functionality, but I wouldn't over-optimize around this use case. Could SmarterCSV return an object with both hash access (i.e. the current functionality) and array-style index offsets?
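
Something along these lines is what I have in mind. It is only a sketch: IndexedRow is a made-up class, not an existing smarter_csv type.

```ruby
# Hypothetical wrapper (not an existing smarter_csv class): a row that keeps
# column order so values can be read by header or by position.
class IndexedRow
  def initialize(headers, values)
    @headers = headers.map(&:to_sym)
    @values = values
  end

  # Integer key: value at that column index.
  # Symbol or String key: first value stored under that header, or nil.
  def [](key)
    return @values[key] if key.is_a?(Integer)

    index = @headers.index(key.to_sym)
    index && @values[index]
  end
end

row = IndexedRow.new(%w[name name name], %w[tim joe tom])
by_header = row[:name] # hash-style access returns the first match
by_offset = row[1]     # index access reaches the duplicate columns
```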

Answer 6 · 2022-04-27T01:42:30.000Z

Duplicate headers in CSV files are rare.

Default behavior: no change. Duplicate headers will raise DuplicateHeaders unless duplicate_header_suffix is defined.

If duplicate_header_suffix is defined, smarter_csv will append numbers 2..n to duplicate headers.

Why not array notation for duplicate headers?

At the end of the day, the user will need to handle the ambiguity, either by adding code to handle arrays or by re-mapping the keys to disambiguated names. Key re-mapping seems much easier because it can be done without extra code. That is why I'm going this route and not with the array notation.
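
The behavior described above can be approximated in plain Ruby as follows. This is a sketch, not smarter_csv's actual code: disambiguate_headers is a hypothetical helper, and DuplicateHeaders is a local stand-in for the error named above.

```ruby
# Local stand-in for the error raised on duplicate headers.
class DuplicateHeaders < StandardError; end

# Hypothetical helper, not smarter_csv's implementation: keep the first
# occurrence of a header as-is and append suffix + 2..n to later ones.
def disambiguate_headers(headers, duplicate_header_suffix: nil)
  counts = Hash.new(0)
  headers.map do |header|
    counts[header] += 1
    next header.to_sym if counts[header] == 1
    raise DuplicateHeaders, "duplicate header: #{header}" if duplicate_header_suffix.nil?

    :"#{header}#{duplicate_header_suffix}#{counts[header]}"
  end
end

headers = %w[name name name]
keys = disambiguate_headers(headers, duplicate_header_suffix: "_")
# keys is [:name, :name_2, :name_3]

# Re-map the generated keys to meaningful names, as key_mapping would.
key_mapping = { name_2: :best_friend, name_3: :nemesis }
mapped_keys = keys.map { |key| key_mapping.fetch(key, key) }
row = mapped_keys.zip(%w[tim joe tom]).to_h
```

Calling disambiguate_headers(headers) without a suffix raises DuplicateHeaders, which matches the unchanged default.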

Answer 7 · 2022-04-27T01:47:52.000Z

Fixed in 1.5.1.