Open-EO/openeo-r-client

Prevent user from seeing files upon save_result

przell opened this issue · 7 comments

Title Prevent user from seeing files upon save_result
Date 2019-10-30
Issue #39
Category Usability
Description The process of downloading and manually loading results from openEO backends could be streamlined by taking this complexity and source of error away from the user, so that they can stay in the R environment for subsequent analysis.
Dependencies Standardized result naming and metadata on openEO backends following the STAC catalogue.
Links Also facilitates local prototyping
Priority High
Impact High

Hi Florian,
During two presentations of the R client, potential users asked for a way to read the results of a process graph directly into an R variable.
We had talked about this before, and with your hints I came up with a very simple use case. It works for the JSON format and a single layer.

This code block is an example request...

# establish the connection
driver_url = "https://openeo.eurac.edu"
user = "guest"
password = "guest_123"

conn = connect(host = driver_url, 
               user = user, 
               password = password, 
               login_type = "basic")

# build process graph
graph = conn %>% process_graph_builder()

# subset options
aoi = list(west = 11.63, south = 46.532, east = 11.631, north = 46.5325)
timespan = c("2016-07-01T00:00:00.000Z", "2016-07-15T00:00:00.000Z")
bands = c("B04", "B08")

# subset
data1 = graph$load_collection(id = graph$data$`SAO_S2_ST_DEM_BRDF_10m_L2A`,     
                              spatial_extent = aoi,
                              temporal_extent = timespan,  
                              bands = bands)

# filter bands
b_red = graph$filter_bands(data = data1,bands = bands[1])
b_nir = graph$filter_bands(data = data1,bands = bands[2])

# calc ndvi
ndvi = graph$normalized_difference(band1 = b_red, band2 = b_nir)

# get maximum value in timespan
reducer = graph$reduce(data = ndvi, dimension = "temporal")
cb_graph = conn %>% callback(reducer, parameter = "reducer", choice_index = 1)
cb_graph$max(data = cb_graph$data$data) %>% cb_graph$setFinalNode()

# set final node of the graph
graph$save_result(data = reducer, format = "json") %>%  # "netcdf" "GTiff"
  graph$setFinalNode()

Here is how to read the result directly into R...

tmp = openeo::compute_result(con = conn, graph = graph, format = "json")
tmp_char = rawToChar(tmp)
tmp_json = jsonlite::fromJSON(tmp_char)

Is there a sensible way to generalize this idea so that it works for all formats and also for multilayer objects? Concerning the size of the result, we could implement a limit so that R doesn't run into problems with results that are too large.
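One possible shape for such a generalization, as a minimal sketch only: the helper name read_result, its max_bytes argument and the 50 MiB default are assumptions made for illustration and are not part of the openeo package. Text formats are parsed in memory, while binary formats are written to a temporary file whose path is returned for a GDAL-based reader.

```r
# Hypothetical helper (not part of the openeo package): turn the raw vector
# returned by a synchronous request into an R object, depending on the format.
read_result <- function(raw_data, format, max_bytes = 50 * 1024^2) {
  stopifnot(is.raw(raw_data))
  # Refuse results that are too large to handle comfortably in memory.
  if (length(raw_data) > max_bytes) {
    stop("Result has ", length(raw_data), " bytes, which exceeds the limit of ",
         max_bytes, " bytes. Consider a batch job and a file download instead.")
  }
  format <- tolower(format)
  if (format == "json") {
    # Text format: decode the bytes and parse them in memory.
    return(jsonlite::fromJSON(rawToChar(raw_data)))
  }
  if (format == "csv") {
    return(utils::read.csv(text = rawToChar(raw_data)))
  }
  # Binary formats (e.g. "gtiff", "netcdf"): write the bytes to a temporary
  # file and return its path so that a format-aware reader can open it.
  ext <- switch(format, gtiff = ".tif", netcdf = ".nc", "")
  path <- tempfile(fileext = ext)
  writeBin(raw_data, path)
  path
}
```

With the objects from the example request, something like read_result(compute_result(con = conn, graph = graph, format = "json"), "json") should then replace the three lines of manual parsing.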

Best, Peter

flahn commented

Let me try to recap this, because I might not fully understand this. You want to compute sample data to see if the process graph is correct (that is the main idea of compute_result and the POST /result). For larger requests you would create a job, run it and download the results later.
But anyway, if you specify a binary file format in the process graph, you will end up with a byte stream, which should be stored in a file. Here you specified JSON as the output format, which returns plain-text JSON as the result, and by parsing it with jsonlite::fromJSON you will probably get an array of values.

I do not get the use case, why we should not store binary data in a file? For plain text like JSON it might make sense, but in most cases for binary raster data you would use GDAL or one of their internal drivers to open the data. Or do we not have the rights to write to disk?

If this is really an issue, then I have to look deeper into reading byte streams from memory into GDAL objects. That might be an approach, but I can not guarantee if this really can be done in R with reasonable effort.

> Let me try to recap this, because I might not fully understand this. You want to compute sample data to see if the process graph is correct (that is the main idea of compute_result and the POST /result).

Yes, this is my intention.

> I do not get the use case, why we should not store binary data in a file? For plain text like JSON it might make sense, but in most cases for binary raster data you would use GDAL or one of their internal drivers to open the data. Or do we not have the rights to write to disk?

There are some use cases that have been mentioned:

  • Not breaking the pure R workflow with I/O operations.
  • For small outputs (e.g. a time series plot for a pixel) it is nicer not to have to save a text file to disk, load it again, and only then generate the graph.
  • Situations where you interactively query openEO through an app or web GIS.

> If this is really an issue, then I have to look deeper into reading byte streams from memory into GDAL objects. That might be an approach, but I can not guarantee if this really can be done in R with reasonable effort.

From the little I understand, this problem is very format- and dimension-specific. For CSV and JSON it should be easier to incorporate than for formats like GeoTIFF and netCDF. Unfortunately, I still don't completely understand the process of transforming the "raw binary" data of a given format directly into an R object, or why and how it makes a difference to write it to disk first.
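A small base R illustration of the difference, offered as a sketch rather than a description of what the client does: JSON bytes decode into a string that a parser can consume directly, whereas the bytes of a binary format contain zero bytes that rawToChar() rejects, so they need a format-aware reader, which usually expects a file. The eight bytes below are a hand-written little-endian TIFF header with an example offset, not output from a real back end.

```r
# JSON is plain text, so its bytes decode cleanly into a string.
json_bytes <- charToRaw("[0.61, 0.72]")
json_text <- rawToChar(json_bytes)

# First eight bytes of a little-endian TIFF file: "II", the magic number 42,
# and an example offset of the first image directory. They contain zero bytes.
tiff_bytes <- as.raw(c(0x49, 0x49, 0x2a, 0x00, 0x08, 0x00, 0x00, 0x00))

# rawToChar() refuses embedded nul bytes, so binary data cannot take the
# same route as JSON; the error is caught here and turned into NULL.
tiff_text <- tryCatch(rawToChar(tiff_bytes), error = function(e) NULL)

# The bytes can still be written unchanged to a file for a format-aware reader.
path <- tempfile(fileext = ".tif")
writeBin(tiff_bytes, path)
```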

Maybe we can discuss the use cases and implementation constraints briefly during the UDF meeting in Münster. Since it is not a time critical point this should be fine.

flahn commented

The data will always be downloaded as a file first. Then the user can choose how to open the file and view the data. This keeps the package from depending on other packages being installed.

I will close this issue now, because there was no further activity.

Feature Request: Stream compute_result output directly to R object (don't know if it's possible). Or have some kind of convenience function for loading the result directly? Personally I like it when I don't have to load the file explicitly. Up for discussion.

Convenience function:
An st_as_stars process node, either after save_result or instead of save_result.
It would save the result either as a temporary file (in the operating system's tmp folder) or to a user-defined path.

This would also add integration with the rspatial community.

It will also help with the concept of local prototyping:
#94

flahn commented

From the user's point of view, I would expect this feature to concern only the synchronous strategy via compute_result. In most cases the result would be a raster or a raster time series. But in the future we might expect vector data to become important as well (polygonal aggregation, point sampling).

I would suggest a parameter like parsed or interpreted, set to TRUE or FALSE, to control this. Alternatively, if the output file parameter is missing, the client could try to interpret the result as well. From the technical perspective the following would happen:

  • start processing at the /result endpoint
  • store the returned value as a file (either explicitly named or as a temporary file)
  • open the file and create an R object from it
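The three steps could be sketched roughly as follows. This is an assumption-laden outline, not the package's implementation: compute_parsed, fetch and reader are hypothetical names, fetch stands in for whatever call returns the raw response body, and reader is whichever function opens the stored file.

```r
# Hypothetical wrapper around a synchronous request (names are illustrative).
# `fetch`  : function returning the raw response body
# `reader` : function that opens a file path and returns an R object
compute_parsed <- function(fetch, output_file = NULL,
                           parsed = is.null(output_file), reader) {
  # 1. start processing and collect the response
  raw_data <- fetch()
  # 2. store it, either under the given name or as a temporary file
  path <- if (is.null(output_file)) tempfile() else output_file
  writeBin(raw_data, path)
  # 3. either hand back the path or open the file as an R object
  if (!parsed) {
    return(path)
  }
  reader(path)
}
```

If no output file is given, parsed defaults to TRUE and the caller gets an object. If a file name is given, the caller gets the path back unless parsed = TRUE is set explicitly. For raster output, a reader such as stars::read_stars could be passed in, provided the stars package is installed.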

Notes: we will probably lose metadata along the way, because we cannot cover every back end's file exchange format. So users need to be aware of this and label the dimensions manually if necessary. The one thing we might guarantee is that the data values end up in R somehow.
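Labelling dimensions by hand can be done with base R alone. A minimal sketch, assuming the values arrive as a plain three-dimensional array; the coordinates and dates below are made up for illustration.

```r
# Values as they might arrive without metadata: 2 x 2 pixels, 3 time steps.
values <- array((1:12) / 10, dim = c(2, 2, 3))

# Attach dimension names and labels by hand (all labels here are made up).
dimnames(values) <- list(
  x = c("11.6300", "11.6305"),
  y = c("46.5320", "46.5325"),
  t = c("2016-07-01", "2016-07-08", "2016-07-15")
)

# Labelled access: maximum over time for every pixel.
max_over_time <- apply(values, c("x", "y"), max)
```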