Error when passing as `source` a vector with fewer unique elements than files.
snovum opened this issue · 0 comments
In cases where data from n sources is collected periodically in k separate files such that k > n (e.g. patient, financial data etc.) using the source
kwarg
to create a source column with non unique identifier returns a BoundsError.
Suppose I have the following CSV files.
Alice1 = DataFrame(A = 1:5, B = ["M", "F", "F", "M", "F"])
Alice2 = DataFrame(A = 11:15, B = ["A", "B", "C", "D", "E"])
Bob1 = DataFrame(A = 16:21, B = ["A", "B", "C", "D", "E","T"])
Bob2 = DataFrame(A = 16:21, B = ["A", "B", "C", "D", "E","T"])
CSV.write("folder/Alice-2024-04-15.csv",Alice1)
CSV.write("folder/Alice-2024-04-16.csv", Alice2)
CSV.write("folder/Bob-2024-04-15.csv", Bob1)
CSV.write("folder/Bob-2024-04-16.csv", Bob2)
fls = glob("*.csv", "folder")
I want to combine them into a new DataFrame while creating a new column that contains just the first part of the filename i.e. the patient names. I can create a new column with the filenames using the source kwarg
julia> DF = CSV.read(fls, DataFrame; source = "patients" => split.(basename.(fls), '-'), pool = false)
22×3 DataFrame
Row │ A B patients
│ Int64 String1 Array…
─────┼───────────────────────────────────────────────────
1 │ 1 M SubString{String}["Alice", "2024…
2 │ 2 F SubString{String}["Alice", "2024…
3 │ 3 F SubString{String}["Alice", "2024…
4 │ 4 M SubString{String}["Alice", "2024…
5 │ 5 F SubString{String}["Alice", "2024…
6 │ 11 A SubString{String}["Alice", "2024…
7 │ 12 B SubString{String}["Alice", "2024…
8 │ 13 C SubString{String}["Alice", "2024…
9 │ 14 D SubString{String}["Alice", "2024…
10 │ 15 E SubString{String}["Alice", "2024…
11 │ 16 A SubString{String}["Bob", "2024",…
12 │ 17 B SubString{String}["Bob", "2024",…
13 │ 18 C SubString{String}["Bob", "2024",…
14 │ 19 D SubString{String}["Bob", "2024",…
15 │ 20 E SubString{String}["Bob", "2024",…
16 │ 21 T SubString{String}["Bob", "2024",…
17 │ 16 A SubString{String}["Bob", "2024",…
18 │ 17 B SubString{String}["Bob", "2024",…
19 │ 18 C SubString{String}["Bob", "2024",…
20 │ 19 D SubString{String}["Bob", "2024",…
21 │ 20 E SubString{String}["Bob", "2024",…
22 │ 21 T SubString{String}["Bob", "2024",…
However if I try to create the column by taking the first element of each substring via first.(split.(basename.(fls),'-')) . The following error is returned.
CSV.read(fls,DataFrame; source = "patients" => first.(split.(basename.(fls), '-')), pool = false)
ERROR: BoundsError: attempt to access 2-element Vector{SubString{String}} at index [4]
Stacktrace:
[1] setindex!
@ ./array.jl:1021 [inlined]
[2] setindex!
@ ./multidimensional.jl:698 [inlined]
[3] _invert(d::Dict{SubString{String}, UInt32})
@ PooledArrays ~/.julia/packages/PooledArrays/Vy2X0/src/PooledArrays.jl:26
[4] PooledArray
@ ~/.julia/packages/PooledArrays/Vy2X0/src/PooledArrays.jl:87 [inlined]
[5] CSV.File(sources::Vector{String}; source::Pair{String, Vector{SubString{String}}}, kw::@Kwargs{pool::Bool})
@ CSV ~/.julia/packages/CSV/tmZyn/src/file.jl:941
[6] File
@ ~/.julia/packages/CSV/tmZyn/src/file.jl:901 [inlined]
[7] read(source::Vector{String}, sink::Type; copycols::Bool, kwargs::@Kwargs{source::Pair{String, Vector{…}}, pool::Bool})
@ CSV ~/.julia/packages/CSV/tmZyn/src/CSV.jl:117
[8] top-level scope
@ REPL[367]:1
Some type information was truncated. Use `show(err)` to see complete types.
as already pointed out here by @nilshg, this is due to the fact that there are fewer unique elements than files, which can be helpful in identifying unique sources.
In this case the desired result is the creation of a source column of non - unique identifiers.
DF.patients = first.(DF.patients);
DF
22×3 DataFrame
Row │ A B patients
│ Int64 String1 SubStrin…
─────┼───────────────────────────
1 │ 1 M Alice
2 │ 2 F Alice
3 │ 3 F Alice
4 │ 4 M Alice
5 │ 5 F Alice
6 │ 11 A Alice
7 │ 12 B Alice
8 │ 13 C Alice
9 │ 14 D Alice
10 │ 15 E Alice
11 │ 16 A Bob
12 │ 17 B Bob
13 │ 18 C Bob
14 │ 19 D Bob
15 │ 20 E Bob
16 │ 21 T Bob
17 │ 16 A Bob
18 │ 17 B Bob
19 │ 18 C Bob
20 │ 19 D Bob
21 │ 20 E Bob
22 │ 21 T Bob
It might be nice to incorporate the possibility of non-unique sources with the source
kwarg because it seems rather intuitive especially in cases where data may have accumulated over separate files meaningfully over time.
Alice-2024-04-15.csv
Alice-2024-04-16.csv
Bob-2024-04-15.csv
Bob-2024-04-16.csv