JuliaData/CSV.jl

Error when passing as `source` a vector with fewer unique elements than files.

snovum opened this issue · 0 comments

In cases where data from n sources is collected periodically in k separate files such that k > n (e.g. patient, financial data etc.) using the source kwarg to create a source column with non unique identifier returns a BoundsError.

Suppose I have the following CSV files.

Alice1 = DataFrame(A = 1:5, B = ["M", "F", "F", "M", "F"])

Alice2 = DataFrame(A = 11:15, B = ["A", "B", "C", "D", "E"])

Bob1 = DataFrame(A = 16:21, B = ["A", "B", "C", "D", "E","T"])

Bob2 = DataFrame(A = 16:21, B = ["A", "B", "C", "D", "E","T"])

CSV.write("folder/Alice-2024-04-15.csv",Alice1)
CSV.write("folder/Alice-2024-04-16.csv", Alice2)
CSV.write("folder/Bob-2024-04-15.csv", Bob1)
CSV.write("folder/Bob-2024-04-16.csv", Bob2) 

fls = glob("*.csv", "folder")

I want to combine them into a new DataFrame while creating a new column that contains just the first part of the filename i.e. the patient names. I can create a new column with the filenames using the source kwarg

julia> DF = CSV.read(fls, DataFrame; source = "patients" => split.(basename.(fls), '-'), pool = false)
22×3 DataFrame
 Row │ A      B        patients                          
     │ Int64  String1  Array…                            
─────┼───────────────────────────────────────────────────
   1 │     1  M        SubString{String}["Alice", "2024…
   2 │     2  F        SubString{String}["Alice", "2024…
   3 │     3  F        SubString{String}["Alice", "2024…
   4 │     4  M        SubString{String}["Alice", "2024…
   5 │     5  F        SubString{String}["Alice", "2024…
   6 │    11  A        SubString{String}["Alice", "2024…
   7 │    12  B        SubString{String}["Alice", "2024…
   8 │    13  C        SubString{String}["Alice", "2024…
   9 │    14  D        SubString{String}["Alice", "2024…
  10 │    15  E        SubString{String}["Alice", "2024…
  11 │    16  A        SubString{String}["Bob", "2024",…
  12 │    17  B        SubString{String}["Bob", "2024",…
  13 │    18  C        SubString{String}["Bob", "2024",…
  14 │    19  D        SubString{String}["Bob", "2024",…
  15 │    20  E        SubString{String}["Bob", "2024",…
  16 │    21  T        SubString{String}["Bob", "2024",…
  17 │    16  A        SubString{String}["Bob", "2024",…
  18 │    17  B        SubString{String}["Bob", "2024",…
  19 │    18  C        SubString{String}["Bob", "2024",…
  20 │    19  D        SubString{String}["Bob", "2024",…
  21 │    20  E        SubString{String}["Bob", "2024",…
  22 │    21  T        SubString{String}["Bob", "2024",…

However if I try to create the column by taking the first element of each substring via first.(split.(basename.(fls),'-')) . The following error is returned.

CSV.read(fls,DataFrame; source = "patients" => first.(split.(basename.(fls), '-')), pool = false)
ERROR: BoundsError: attempt to access 2-element Vector{SubString{String}} at index [4]
Stacktrace:
 [1] setindex!
   @ ./array.jl:1021 [inlined]
 [2] setindex!
   @ ./multidimensional.jl:698 [inlined]
 [3] _invert(d::Dict{SubString{String}, UInt32})
   @ PooledArrays ~/.julia/packages/PooledArrays/Vy2X0/src/PooledArrays.jl:26
 [4] PooledArray
   @ ~/.julia/packages/PooledArrays/Vy2X0/src/PooledArrays.jl:87 [inlined]
 [5] CSV.File(sources::Vector{String}; source::Pair{String, Vector{SubString{String}}}, kw::@Kwargs{pool::Bool})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/file.jl:941
 [6] File
   @ ~/.julia/packages/CSV/tmZyn/src/file.jl:901 [inlined]
 [7] read(source::Vector{String}, sink::Type; copycols::Bool, kwargs::@Kwargs{source::Pair{String, Vector{…}}, pool::Bool})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/CSV.jl:117
 [8] top-level scope
   @ REPL[367]:1
Some type information was truncated. Use `show(err)` to see complete types.

as already pointed out here by @nilshg, this is due to the fact that there are fewer unique elements than files, which can be helpful in identifying unique sources.

In this case the desired result is the creation of a source column of non - unique identifiers.

DF.patients = first.(DF.patients); 
DF
22×3 DataFrame
 Row │ A      B        patients  
     │ Int64  String1  SubStrin… 
─────┼───────────────────────────
   1 │     1  M        Alice
   2 │     2  F        Alice
   3 │     3  F        Alice
   4 │     4  M        Alice
   5 │     5  F        Alice
   6 │    11  A        Alice
   7 │    12  B        Alice
   8 │    13  C        Alice
   9 │    14  D        Alice
  10 │    15  E        Alice
  11 │    16  A        Bob
  12 │    17  B        Bob
  13 │    18  C        Bob
  14 │    19  D        Bob
  15 │    20  E        Bob
  16 │    21  T        Bob
  17 │    16  A        Bob
  18 │    17  B        Bob
  19 │    18  C        Bob
  20 │    19  D        Bob
  21 │    20  E        Bob
  22 │    21  T        Bob

It might be nice to incorporate the possibility of non-unique sources with the source kwarg because it seems rather intuitive especially in cases where data may have accumulated over separate files meaningfully over time.

Alice-2024-04-15.csv
Alice-2024-04-16.csv
Bob-2024-04-15.csv
Bob-2024-04-16.csv