Unique name for datasets
Hi Ben et al.,
I would like to have a unique, short name/identifier for each dataset, which I derive from the filenames.
COSORE filenames look like, e.g.,
data_d20190626_VARGAS.csv
from which I extract "VARGAS" as an ID.
However, some authors have multiple datasets, e.g.,
data_d20200305_VARGAS.csv
I can work around this by writing a script that assigns "VARGAS_2" as the ID for the second dataset, but maybe it could be done directly in COSORE.
For example, this is already done for, e.g.,
data_d20200212_KAYE_LNE.csv
data_d20200212_KAYE_LNW.csv
or
data_d20190610_SIHI_H1.csv
data_d20190610_SIHI_H2.csv
This is not an essential change, but it could make things slightly easier for users.
Best,
Alexis
NOTE: here's what my script currently looks like:
julia> # get the path of all COSORE input files
inputs = readdir(joinpath("Input", "COSORE", "datasets"), join = true);
julia> # example of a path name
inputs[1]
"Input/COSORE/datasets/data_d20190409_ANJILELI.csv"
julia> # Retrieve a short name for each dataset
Names = []
Any[]
julia> # in the loop below,
# 38 is the index of the first character after a prefix such as "Input/COSORE/datasets/data_d20190409_" (37 characters)
# 4 is the number of characters in ".csv"
[push!(Names, inputs[i][38:end-4]) for i = 1:length(inputs)];
julia> Names
82-element Vector{Any}:
"ANJILELI"
"ZOU"
"VARNER"
"ZHANG_maple"
"ZHANG_oak"
julia> # Create a Dictionary with name => dataframe
# e.g., Data["ZOU"] is ZOU site dataframe
Data = Dict(Names .=> [[] for i in 1:length(Names)]);
julia> [push!(Data[Names[i]], DataFrame(CSV.File(inputs[i]))) for i = 1:length(Names)];
julia> # Example
Data["ZOU"][1]
82314×10 DataFrame
Row │ CSR_PORT CSR_TIMESTAMP_BEGIN CSR_TIMESTAMP_END CSR_FLUX_CO2 CSR_FLUX_CH4 CSR_ ⋯
│ Int64 String String Float64 String Stri ⋯
───────┼───────────────────────────────────────────────────────────────────────────────────────
1 │ 1 2013-12-01 00:15:58 2013-12-01 00:17:58 1.52 NA Exp ⋯
2 │ 2 2013-12-01 00:19:42 2013-12-01 00:21:42 1.55 NA Exp
3 │ 3 2013-12-01 00:23:26 2013-12-01 00:25:26 0.99 NA Lin
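The name extraction above relies on fixed character positions, so it breaks if the input directory changes. A minimal sketch of a position-independent alternative follows; it also applies the "VARGAS_2" workaround mentioned earlier. The helpers `short_name` and `unique_names` are hypothetical names introduced for illustration, not part of COSORE, and the sketch assumes every filename follows the `data_dYYYYMMDD_NAME.csv` pattern seen in the examples.

```julia
# Hypothetical helper: strip the "data_dYYYYMMDD_" prefix and the ".csv" suffix.
# Assumes the filename pattern shown in the examples above.
function short_name(path::AbstractString)
    m = match(r"^data_d\d{8}_(.+)\.csv$", basename(path))
    m === nothing && error("unexpected filename: $path")
    return String(m.captures[1])
end

# Hypothetical helper: append _2, _3, ... to names that repeat, in input order.
function unique_names(paths)
    counts = Dict{String,Int}()
    names = String[]
    for p in paths
        name = short_name(p)
        n = get(counts, name, 0) + 1
        counts[name] = n
        push!(names, n == 1 ? name : string(name, "_", n))
    end
    return names
end

paths = [
    "Input/COSORE/datasets/data_d20190626_VARGAS.csv",
    "Input/COSORE/datasets/data_d20200305_VARGAS.csv",
    "Input/COSORE/datasets/data_d20200212_KAYE_LNE.csv",
]
ids = unique_names(paths)
```

Because `readdir` returns entries in sorted order, the earliest-dated file keeps the plain name and later ones receive the numeric suffix. A generated ID could still collide with a dataset whose real name already ends in `_2`, so this remains a workaround rather than a guarantee of uniqueness.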
Hi @AlexisRenchon -
I'm not sure I understand. The existing COSORE dataset name doesn't work for you because you might get two "VARGAS" names if you just strip out the initial date part? I.e., the names aren't guaranteed to be unique after stripping the date? Just want to make sure I understand the need/use case here. Thanks!
Hi @bpbond -
Yes, you understood correctly! I would like the names to be unique after stripping out the initial date part, just as you said.
It is not a big deal, but it could be a small convenience improvement for some users in future versions.
As I explained, I load the COSORE database into a "matrix of matrices" called Data, and then I access each dataset by a short unique name, e.g., Data["VARGAS"]. I could use the full name with the date, but that would make it long to type, etc. Am I the only person doing this?
If it is just me, I could create my own Array with short names mapped to each dataset (instead of stripping the dates out of the filenames).
I am closing the issue. Feel free to change the filenames or not; these were just some thoughts =)
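The hand-written mapping mentioned above could be as small as the sketch below. The IDs "VARGAS_1" and "VARGAS_2" and the helper `dataset_path` are hypothetical choices made for illustration; the filenames and the directory layout are the ones quoted earlier in this thread.

```julia
# Hypothetical short IDs, chosen by hand, mapped to COSORE filenames.
short_ids = Dict(
    "VARGAS_1" => "data_d20190626_VARGAS.csv",
    "VARGAS_2" => "data_d20200305_VARGAS.csv",
)

# Hypothetical helper: build the full path for a short ID,
# assuming the directory layout used in the script above.
dataset_path(id) = joinpath("Input", "COSORE", "datasets", short_ids[id])

path = dataset_path("VARGAS_2")
```

A mapping like this stays stable even if filenames are renamed upstream, at the cost of being maintained by hand.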
Hi @bpbond ,
I know I closed this issue, but I am coming back to it briefly.
I am doing some work with FLUXNET, and this reminded me of their standardized site ID: e.g., AU-Ade, US-Me1, US-Me2, ...
This is pretty neat: two capital letters for the country, a dash, then 3 characters (3 letters if the site is unique, or 2 letters and 1 number if there are multiple sites).
COSORE could do something similar, as it is also a global database.
Even better, using the same convention as FLUXNET could help quickly identify sites that have both a flux tower and auto-resp, e.g., AU-Cum.
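To make the idea concrete, here is a minimal sketch that checks an ID against the shape described above and finds sites present in both databases. The pattern encodes only the description in this comment (two capital letters, a dash, three letters or digits) and is an assumption, not an official FLUXNET specification. The name `is_site_id` and both ID lists are hypothetical.

```julia
# Assumed pattern: two capital letters, a dash, then three letters or digits.
const SITE_ID_PATTERN = r"^[A-Z]{2}-[A-Za-z0-9]{3}$"

# Hypothetical helper: true if `id` has the FLUXNET-like shape described above.
is_site_id(id::AbstractString) = occursin(SITE_ID_PATTERN, id)

# Hypothetical ID lists; a shared convention makes matching a simple intersection.
fluxnet_ids = ["AU-Ade", "US-Me1", "US-Me2", "AU-Cum"]
cosore_ids = ["AU-Cum"]
shared = intersect(fluxnet_ids, cosore_ids)
```

With a shared convention, finding sites that appear in both databases needs no lookup table, only a set intersection.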