Open-EO/openeo-r-client

Rework the process execution behaviour


flahn commented

With API v0.0.2 there are some major changes in the process management. POST /jobs no longer has an evaluate parameter. The former 'sync' call is now POST /execute, 'batch' is now triggered with PATCH /jobs/{job_id}/queue, and 'lazy' is done via POST /services.

The new 'batch' and 'lazy' calls require an already uploaded process graph and optional output parameters. Uploading is done via POST /jobs now.

These changes require some renaming and new functions in the R client. I suggest that job creation (without determining whether to run in 'batch' or 'lazy' mode) is called createJob. For the 'lazy' evaluation we then call the function toService, and queueJob will match the similarly named call on the backend for the 'batch' evaluation. A sketch of this mapping follows below.
This means the following functions will be removed: queueTask, orderResult.
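To make the proposed mapping concrete, here is a minimal sketch, assuming an httr-based client and a connection object con with a base_url field; the bodies, field names, and return values are assumptions, not the final client API:

```r
library(httr)

# createJob: upload a process graph via POST /jobs; returns the job ID.
# `con` is assumed to carry the backend's base URL.
createJob <- function(con, process_graph, output = NULL) {
  body <- list(process_graph = process_graph)
  if (!is.null(output)) body$output <- output
  res <- POST(paste0(con$base_url, "/jobs"), body = body, encode = "json")
  stop_for_status(res)
  content(res)$job_id
}

# queueJob: start the 'batch' evaluation via PATCH /jobs/{job_id}/queue.
queueJob <- function(con, job_id) {
  stop_for_status(PATCH(paste0(con$base_url, "/jobs/", job_id, "/queue")))
  invisible(job_id)
}

# toService: publish a submitted job as a W*S service via POST /services.
# The body fields are assumptions; the actual service arguments may differ.
toService <- function(con, job_id, service_type = "wms") {
  res <- POST(paste0(con$base_url, "/services"),
              body = list(job_id = job_id, service_type = service_type),
              encode = "json")
  stop_for_status(res)
  content(res)
}
```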

The naming can be changed, and I'm open to suggestions.

I'm not sure why you'd want to change the function names. I'd stick with the ones we already have.

Yes, you do need to upload the process graphs now, but that's what queueTask was made for to begin with. Asking the user to run createJob and then queueJob right afterwards doesn't make sense. I'm not sure what toService is supposed to do, though; either way, it can call createJob internally so the user doesn't have to run it, as sketched below.
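For that integrated createJob, a hypothetical wrapper on top of the sketch above could look like this:

```r
# Hypothetical convenience: accept a process graph instead of a job ID and
# create the job internally, so the user only runs a single call.
toServiceFromGraph <- function(con, process_graph, service_type = "wms") {
  job_id <- createJob(con, process_graph)
  toService(con, job_id, service_type = service_type)
}
```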

flahn commented

The main reason I thought about renaming was the term collision around 'queue'. In the API it now means executing in the former 'batch' mode, while the client would use it for 'lazy' evaluation.

Regarding toService: it takes over the job that the former queueTask would have done (creating a service from a job). Integrating createJob is a fair point and should be implemented that way.

Another change is that we now have GET /jobs/{job_id}/download to get the results of a 'batch' job. That function needs a name as well, maybe something like jobResult (sketched below), and it cannot be folded into the former orderResult function, because that would make it a synchronous call.
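A minimal sketch of such a jobResult(), continuing the assumptions above:

```r
# jobResult (proposed name): fetch the results of a finished 'batch' job
# via GET /jobs/{job_id}/download; the response is assumed to contain a
# list of download links, per the API description.
jobResult <- function(con, job_id) {
  res <- GET(paste0(con$base_url, "/jobs/", job_id, "/download"))
  stop_for_status(res)
  content(res)
}
```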

Hrm, I missed that part... What exactly does the /jobs/{job_id}/queue endpoint do? The API says it converts it to batch mode, but I thought that batch mode was supposed to become merely a special case of lazy (you call something like get_download_links() instead of a direct download()).

Is toService the /services endpoint, so it refers to the W*S services?

For the latter, there's downloadJob() already.

flahn commented

OK, the general approach for 'batch' and 'lazy' is now that you upload your job to the backend via POST /jobs. The job then remains in the 'submitted' status. Afterwards the user can decide whether to publish it as a W*S service via POST /services or to start the batch processing via PATCH /jobs/{job_id}/queue.
If you want to download the 'batch' results, you call GET /jobs/{job_id}/download.
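Under the sketches above, that flow reads roughly like this (all names hypothetical):

```r
job_id <- createJob(con, process_graph)  # POST /jobs -> status "submitted"

# Either publish the submitted job as a W*S service ...
svc <- toService(con, job_id)            # POST /services

# ... or start the batch processing and fetch the results afterwards:
queueJob(con, job_id)                    # PATCH /jobs/{job_id}/queue
links <- jobResult(con, job_id)          # GET /jobs/{job_id}/download
```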

Yea, though the API description also says that the W*S service can read from the result of the batch processing.

So does PATCH /jobs/{job_id}/queue actually start the processing? If so, I get why they'd want to have this distinction in the API (GET is supposed to be synchronous, I guess). But at the same time, that doesn't give the backend information on where and how to store the output, does it?

Ah, I looked at the examples again, and yea, it says that processing starts when the PATCH is sent; the output format data is part of the job definition.
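So in the sketched createJob() the output format would just ride along in the POST /jobs body; for example (field name and format string are assumptions):

```r
# Output format as part of the job definition, not of the queue call.
job_id <- createJob(con, process_graph,
                    output = list(format = "GTiff"))
```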

In that case, we should have one function that sends the job and starts processing (orderResult() was the idea: define the job + PATCH /jobs/{job_id}/queue), one function that gives a W*S link (this is new: define a job + POST /services), and our current executeTask() for synchronous calls. Then, for expert applications (if someone wants both a W*S and a list of download links), one could use separate functions for the two steps (defining jobs was the point of queueTask()).

I can see the issue of awkward naming, where orderResult() calls /queue and queueTask() calls /jobs, but the function names are user-facing, whereas the API is internal (and still subject to change). For a user, if they want to submit a job to get a list of files to download, that's an order. And if they just want to put up the job to later do something with it, that's a queue (arguably; defineJob or so would be fine as well, especially since it's consistent with defineUDF).
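In terms of the sketches above, orderResult() would simply compose the two steps (again hypothetical):

```r
# orderResult: define the job and immediately start the batch processing.
orderResult <- function(con, process_graph, output = NULL) {
  job_id <- createJob(con, process_graph, output = output)
  queueJob(con, job_id)
  job_id  # pass this on to jobResult() or a W*S link function later
}
```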

flahn commented

I also didn't fully understand how the services should behave. But in my opinion, it seems we are allowed to do the following:

  1. upload job -> create service
    --("submitted")----("running")
  2. upload job -> start batch evaluation -> (wait until finished) -> create service
    --("submitted")------("queued")----------------("finished")--------("running")

In 1) we create a service from a process graph directly, meaning we compute on the fly for the W*S (as a use case: calculate the NDVI of the latest xxx collection). In 2) you calculate your data via 'batch' (PATCH /jobs/{job_id}/queue) and then offer that information as a service (then there should be no computing). The values in parentheses are the allowed statuses of the job object (see the job_status object in the API).
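The 'wait until finished' step in 2) would amount to polling the job status; a sketch, assuming GET /jobs/{job_id} returns a status field matching the job_status values:

```r
# Poll until the job leaves the active states; the status values are taken
# from the discussion above, the exact set in the API may differ.
waitForJob <- function(con, job_id, interval = 5) {
  repeat {
    res <- GET(paste0(con$base_url, "/jobs/", job_id))
    stop_for_status(res)
    status <- content(res)$status
    if (status %in% c("finished", "error", "canceled")) return(status)
    Sys.sleep(interval)
  }
}
```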

Hm, right, that would make sense. But in case 2, we basically get the download links for free anyway. So we could have orderResult() return the job ID and then allow passing that to the new function (toService(); in my example we also already have getWCSLink(), but I guess something like getWebServiceLink() might be more understandable?). In case 1, we'd use defineJob() (or so) + getWebServiceLink() (or so).
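So case 2 would chain up like this (names from the hypothetical sketches above):

```r
job_id <- orderResult(con, process_graph)  # define + queue the batch job
waitForJob(con, job_id)                    # block until "finished"
svc <- toService(con, job_id)              # serve the precomputed results
```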

flahn commented

See https://github.com/Open-EO/openeo-r-client/wiki/Development-Overview-v0.2.x for an overview of endpoints, client functions, and wrapper functions. I will try my best to keep it up to date with the develop branch.

Nice, that helps.
The examples should also be updated. I'll add a new issue about that.

Hm, for listing capabilities and formats, I think the wrapper functions should start with list just like for collections and processes.
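For example (the wrapper names and exact capability endpoints are a sketch, not a confirmed API):

```r
# Consistent list* naming, matching listCollections()/listProcesses().
listCapabilities <- function(con) {
  res <- GET(paste0(con$base_url, "/capabilities"))
  stop_for_status(res)
  content(res)
}

listFormats <- function(con) {
  res <- GET(paste0(con$base_url, "/capabilities/output_formats"))
  stop_for_status(res)
  content(res)
}
```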