spectral-cockpit/opusreader2

Introduce top-level `read_opus()`

Closed this issue · 25 comments

I think this is a great idea.

Ultimately, this is a "driver" package, which means, in my view, that most users would really only interact with one function, read_opus.

If I think of a prototype for the function, this is what it would look like:

res <- read_opus(
  # Pass a vector of DSN, can be files or raw sources, can be more than one
  dsn = my_dsn, 
  # Option to "simplify" the output when more than 1 DSN are passed to `dsn`. 
  # If used an "analysis-ready" matrix is returned. Similar to the `simplify` option in {opusreader}
  matrix = FALSE, 
  # An option to limit the data returned to a specific type. 
  # When working with a lot of files and you only really want ScSm for example.
  type = NULL, 
  # Use parallel read when more than 1 DSN are passed to `dsn`
  parallel = TRUE,
  # Something lower level to optionally tweak chunk size etc
  # For power users only
  parallel.options = list(chunk.size = 10, ...),  
  # Print a progress bar if more than 1 DSN are passed to `dsn`
  progress = TRUE 
)

Something that is needed at the moment is to provide clarity about the different elements of data returned by the driver. Eg most users would likely be after the data contained in x$refl_no_atm_comp$data, but it's obviously not looking trivial at the moment.

you are right, the output is kind of an issue. i think @philipp-baumann thought about a filter option. but i see what you mean, we could add a param = T/F option to get the full output or just the data...
i like your idea of returning a matrix if multiple files are read :)

Yes, I imagine a short iteration of the above:

Because read_opus() is high-level, we need a few short and concise arguments. These unleash the power in straight-forward manner, both for simple and quite advanced use cases. In that sense it is worth to invest some time in design thinking.

  • On individual measurements, there is so-called "in"-filters and "out_filters", specified with filter_in and filter_out. In @pierreroudier 's handy filter above would be in the out_filter class. There is many of these filters, but they will receive an extensive and simple description.

To grab specific elements, like x$refl_no_atm_comp$data, we have already thought of an out filter wrapper.

What we also have to consider is a datatype_out option. This is essentially the counterpart driver for dsn. @pierreroudier @ThomasKnecht do you have a better name for it? I imagine writing to .Rds, standardized json but also a dedicated matrix-attribute like in-memory spectral data structure in R. For the latter, we could collaborate with @l-ramirez-lopez (ping :)).

Regarding chunking: I would go with the registered cores. Parallel resource management should be done by the user. We can further simplify the function interface by adding one main parallel argument, like parallel_args, that are individually dispatched.

so a short update of the proposal, although we probably need to update again before we get to stable:

res <- read_opus(
  # Pass a vector of DSN, can be files or raw sources, can be more than one
  dsn = my_dsn,
  filter_in = my_named_list_of_filters_in,
  filter_out = my_named_list_of_filters_out,
  datatype_out = my_list_of_datatypes_out,
  parallel_args = my_named_list_of_parallel_args,
  progress_bar = FALSE
)

and with some more special features:

  • filter_in: includes reprex filters on directory and file names to be read. in addition, there is also time_from and time_to that accepts posix-style datetime to filter specific measurements done during a time period.

you are right, the output is kind of an issue. i think @philipp-baumann thought about a filter option. but i see what you mean, we could add a param = T/F option to get the full output or just the data... i like your idea of returning a matrix if multiple files are read :)

Let's put this param early in the arg list for filter_out.

filter_out

Another iteration, based on the discussion with @ThomasKnecht . I think the previous draft was a bit too complicated

res <- read_opus(
  dsn = my_dsn,
  output_datatype,
  output_path = NULL,
  parallel = FALSE,
  progress_bar = FALSE
)

instead of ouput_datatype -> data_only = F

+1 for this Mr. Propper.

res <- read_opus(
  dsn = my_dsn,
  data_only = FALSE, # default is all blocks, `TRUE` only last spectra
  output_path = NULL, # driver for the output
  parallel = FALSE,
  progress_bar = FALSE
)

It;s unclear to me what output_path means

It;s unclear to me what output_path means

6936623

output_path optional storage path for the parsed output. Default is NULL, which only returns the parsed data as an in-memory R object.

@pierreroudier do you have a better name suggestion than output_path?

I implemented the data_only as well as the parallel parts

@philipp-baumann @pierreroudier what do you think of that?

the data_only I also implemented in the read_opus_impl -> I think this is a nice to have in the base functionality anyways...

I am not so happy about my implementation of the data_only. it seems rather complicated... maybe you have a nicer suggestion?

I also fixed a bug in the prep_spectra function ;)

I implemented the data_only as well as the parallel parts

@philipp-baumann @pierreroudier what do you think of that?

the data_only I also implemented in the read_opus_impl -> I think this is a nice to have in the base functionality anyways...

I am not so happy about my implementation of the data_only. it seems rather complicated... maybe you have a nicer suggestion?

Awesome, thanks for this! short comments:

  • we will need to do chunked, reading, so I will add some chunking logic to it (when 1 file takes about 20ms to read, the overhead is other wise too high that this has any realistic benefit
  • I will think about data_only and come to you later this week.

Sorry, I am still a bit confused by output_path, is that an option to write the data to disk? In which case, what format would be used to write to file(s)?

I'd be tempted to leave such a functionality outside read_opus, but maybe I'm missing something (large collections that don;t fit into memory?).

For data_only, what is the type returned? Matrix, or still a (trimmed) list?

Opinionated comment on the arg names -- I like to keep them short, eg progress rather than progress_bar, maybe folder rather than output_path, etc.

Sorry, I am still a bit confused by output_path, is that an option to write the data to disk? In which case, what format would be used to write to file(s)?

I'd be tempted to leave such a functionality outside read_opus, but maybe I'm missing something (large collections that don;t fit into memory?).

For data_only, what is the type returned? Matrix, or still a (trimmed) list?

Maybe it was a bit many things in one issue, or respectively the merge that followed ;-) Yes indeed, output_path is an option to write data to disk. This could be for example in .Rds format, or in .json; the latter needs another transformation module. Another option is having a zarr-backend, which would mean a chunked n-dimensional array that represents all spectral data. I would like to have such a feature. If we stream data somewhere and have some quality control pipeline afterwards, we can at some point work in cloud environments such as s3 storages etc. At the moment I am not thinking about big data with such simple spectra, but I have sometimes project directories with few thousand OPUS files to be parsed.

For data_only, at the moment it is still a list. We have to think about some good logic. @pierreroudier @ThomasKnecht shall we imply with data_only a matrix, which might be a good simplification in one argument?

Maybe it was a bit many things in one issue, or respectively the merge that followed ;-) Yes indeed, output_path is an option to write data to disk. This could be for example in .Rds format, or in .json; the latter needs another transformation module. Another option is having a zarr-backend, which would mean a chunked n-dimensional array that represents all spectral data.

Ok, got it. I think it would be best to develop the feature as a dedicated function, then once it's properly designed, consider its integration in read_opus. It's an advanced feature.

For data_only, at the moment it is still a list. We have to think about some good logic. @pierreroudier @ThomasKnecht shall we imply with data_only a matrix, which might be a good simplification in one argument?

My argument for this one is that users will either want to have the full metadata (in which case a list shall be returned), or want to quickly assemble data for modelling, in which case a matrix shall be returned.

For the latter, there is code pretty much ready to go in the old {opusreader} package.

I generally like the matrix-support. To keep track of features and the development, I will move on with opening separate issue tickets. Like this it will be more simple to keep track and have a clean git workflow without too much wizardry.

we removed the output_path. we want to make a separate function for writing the data into json files or whatever other format is possible

@pierreroudier , as @ThomasKnecht indicated above, the design decision we are making is to keep the interface of read_opus() as generic as possible and with lowest possible complexity in arguments. This will enable us to have a rock solid interface that does not change over time, that serves its purpose for a broad community of users and contributors. Over time, there will be for sure options to customize specific wishes in helper functions, that are easier to maintain and are more specific to a particular spectroscopy workflow of individual organizations and companies.

Such extended functionalities are for example #36 , #44 .
Closing because read_opus() and functions at various levels have been split by a new modular design with #47

see #47 for the solution