Seagate/cortx-motr-apps

isc_demo does not really demonstrate in-storage compute


ISC stands for In-Storage Compute (aka Function Shipping). The idea is that some computations on the data (like MIN/MAX) can be done closer to where the data is stored (directly at the storage nodes) and the resulting value is sent to the client from each server for the final computation.
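
To illustrate the intended flow, here is a minimal, self-contained sketch (not taken from isc_demo) of the client-side part: it assumes each ISC service has already computed the MIN/MAX of the data units it stores locally, and the client merely combines these partial results.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Illustration only: each ISC service is assumed to return the MIN/MAX
 * of the data units it stores locally; the client combines the partial
 * results into the final answer.
 */
struct partial_minmax {
        int64_t min;
        int64_t max;
};

int main(void)
{
        /* Partial results as they would arrive from three ISC services. */
        struct partial_minmax parts[] = {
                { .min = 7,  .max = 90 },
                { .min = -3, .max = 42 },
                { .min = 11, .max = 13 },
        };
        struct partial_minmax res = parts[0];
        unsigned i;

        for (i = 1; i < sizeof parts / sizeof parts[0]; i++) {
                if (parts[i].min < res.min)
                        res.min = parts[i].min;
                if (parts[i].max > res.max)
                        res.max = parts[i].max;
        }
        printf("MIN=%ld MAX=%ld\n", (long)res.min, (long)res.max);
        return 0;
}
```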

But currently, in isc_demo, the data for the computation is sent to servers from the client instead of fetching it from the storage. So it does not actually demonstrate the original idea of computation being done closer to where the data is stored.

I see two ways to tackle this issue in terms of how to send the computation requests:

  1. Figure out where the data units are located on the client side (the target services and cob fids are determined from the gob fid the same way the client API does it for normal I/O) and send the requests only to these target services.
  2. Send requests to all ISC services with the gob fid and let the services figure out the location of the corresponding cobs to fetch the data from. (This is what the c0isc_demo.c code does currently - it sends the request to all services. But it sends the data in the request itself instead of fetching it from the storage at the server side.)

The 1st approach seems more natural to the overall Motr architecture and philosophy. Also, the 2nd approach would generate a lot of needless requests for small objects on large clusters.

The only advantage of the 2nd approach is that it might provide a simpler API. But maybe the same can be achieved with the 1st approach too? We just need some API which, for the given gob fid, would return the list of target ioservices along with the list of cob fids on each of them - roughly like the sketch below.
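
For illustration only, such a client-side helper could look like this. This is a sketch, not an existing Motr API: m0_obj_cob_map() and struct m0_cob_map_entry are hypothetical names, and only struct m0_fid (from fid/fid.h) is an existing type.

```c
#include <stdint.h>
#include "fid/fid.h"    /* struct m0_fid */

/*
 * Hypothetical client-side helper (not an existing Motr call): given the
 * gob fid, return one entry per ioservice that holds data units of the
 * object, together with the cob fids located on that service.
 */
struct m0_cob_map_entry {
        struct m0_fid  cme_service;  /* ioservice fid */
        struct m0_fid *cme_cobs;     /* cob fids on this service */
        uint32_t       cme_cobs_nr;  /* number of cobs in cme_cobs */
};

int m0_obj_cob_map(const struct m0_fid *gob,
                   struct m0_cob_map_entry **out, uint32_t *out_nr);
```

With such a map the client would send exactly one ISC request per returned service, carrying that service's cob list.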

@nikitadanilov, @siningwuseagate, @huanghua78 what do you think?

Maybe we can combine the benefits of the two approaches. On the client side we just need the list of services where the cobs are located, i.e. where to send the computation requests. On the server side, having the gob fid, the concrete cobs can be located for data fetching. Or we can get the list of cob fids on the client side and send them to the server ready-made. (This is probably even better, as we could re-use most of the ioservice code to fetch the data on the server side.) This way we don't need to send a separate request for each cob to the same service, only one request per service.

It is desirable that the function shipping implementation is fault-tolerant and generic.

Fault-tolerance means that if some parts of the object are not available because of storage or server failures, function shipping should reconstruct the missing data and execute the specified function against the reconstructed data.

Genericity means that function shipping does not depend on the details of parity de-clustered layout. It should be possible to ship functions to objects with other types of layout.

I do not understand the point about a simpler user API in your 2nd approach. As far as I remember, the function shipping interface was supposed to be something like m0_ship(fid, function), where 'fid' is the fid of an object (or an index, or a container) and 'function' identifies the computation to be executed on the object data (or index key-value pairs, or fids of container members). This interface should not depend on the details of the function shipping implementation.

You might be interested in http://gerrit.mero.colo.seagate.com/plugins/gitiles/mero/+/refs/heads/dev/layout-plan/layout/plan.h which defines the never-implemented 'layout plan' interface that would allow a generic and fault-tolerant implementation of function shipping.

The function shipping layer should not itself be fault-tolerant, no? Doesn't it layer on top of Motr, which is the proper layer for fault-tolerance? If an object is not available due to failures, function shipping should just ask Motr for it, and Motr should do the reconstruction. We certainly shouldn't replicate this logic.

ISC stands for In-Storage Compute (aka Function Shipping). But currently the data for the computation is sent from the client instead of being fetched from the storage. So it has nothing to do with the storage at the moment, which is a shame and must be fixed.

This is about the example app. This app does not properly demonstrate how ISC works.
Actually, if we want to demo MIN/MAX computation, we should read the data from the object (cobs) and then compute MIN/MAX from that data.

This app sends some data to the server and asks the server to do something. This is some kind of in-storage computing, but it misleads the user into thinking that the data should be sent to the server by the user. This is wrong.

ISC itself, on the server side, can read data from the cobs, process it, and then write the results back to cobs.
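
To make that concrete, a conceptual server-side sketch could look as follows. This is pseudocode-level C: isc_cob_read_min() is a hypothetical helper standing in for "read the cob's local data from storage and reduce it"; only struct m0_fid is an existing Motr type.

```c
#include <stdint.h>
#include "fid/fid.h"        /* struct m0_fid */

/* Hypothetical helper: reduce one cob's local data to its minimum value. */
int isc_cob_read_min(const struct m0_fid *cob, int64_t *out_min);

/*
 * Conceptual server-side MIN computation: the ISC service reads the
 * object's local cobs from storage, reduces them and returns the
 * partial result to the client.
 */
static int isc_min_compute(const struct m0_fid *cobs, uint32_t cobs_nr,
                           int64_t *out_min)
{
        int64_t  min = INT64_MAX;
        uint32_t i;

        for (i = 0; i < cobs_nr; i++) {
                int64_t local_min;
                int     rc = isc_cob_read_min(&cobs[i], &local_min);

                if (rc != 0)
                        return rc;        /* propagate storage errors */
                if (local_min < min)
                        min = local_min;
        }
        *out_min = min;
        return 0;
}
```

The partial minimum would then be sent back to the client, which combines the per-service results as in the reduction sketch in the issue description.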

@nikitadanilov it was mainly about the internal interfaces - how to implement it. Not about the external user API yet. m0_ship(fid, function) sounds very attractive, but we are nowhere near there yet, I'm afraid.

BTW, we probably need at least one more argument in this user API - the size of the object (since Motr does not know/store it for objects): m0_ship(function, object, size).

... function shipping should reconstruct the missing data and execute the specified function against the reconstructed data.

Should this all be done at the server side?

Some thoughts from @siningwuseagate on the matter:

The idea of function shipping is moving compute to the data. If Motr were able to expose where the data is (just as the Hadoop file system does) and allow apps to read the data on the ioservice, it would be easier to integrate into other data processing frameworks such as Flink and Spark. The issue I see with the existing function shipping is that it is difficult to integrate into other frameworks, and it doesn't have task/compute scheduling, which makes it hard to fit into a data processing workflow.

Actually, if we want to demo MIN/MAX computation, we should read the data from the object (cobs) and then compute MIN/MAX from that data.

@huanghua78 yes, this is what we want to demonstrate.

Now the question is how to implement this "read data from the object (cobs)" - in particular, how to find the object's cobs. Should it be done on the client side, so that we send the requests only to those services which have the object's cobs? Or should we send the requests to all the ISC services configured in the cluster and let them find the object's cobs (if there are any on that particular service)? For the pros & cons of each approach, see my first two comments above.

Should this all be done at the server side?

@andriytk: reconstruction of a missing data block requires reading N (of N+K) blocks from multiple servers. In a200 it was done by the Motr client (in parity-verify mode), and in r2 it will also be done by the client, in case a server is late to return its data. With layout/plan.h (see my previous message), an IO (or meta-data) operation maps to a sequence of steps (its 'execution plan') that can be executed anywhere.
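
As a simplified illustration of the "read N of the N+K blocks" point, here is what rebuilding a single missing unit looks like in the degenerate K = 1 (XOR parity) case. The helper below is self-contained example code, not Motr's parity math (which, as far as I recall, uses Reed-Solomon for K > 1), but the amount of data read is the same: all N surviving units of the parity group.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Rebuild one missing unit of a parity group in the XOR (K = 1) case:
 * XOR together the N surviving units (data and parity) of the group.
 * Every surviving unit must be read from its server for this.
 */
static void reconstruct_xor(const uint8_t *const *survivors, size_t nr_units,
                            size_t unit_size, uint8_t *missing)
{
        size_t u, b;

        for (b = 0; b < unit_size; b++)
                missing[b] = 0;
        for (u = 0; u < nr_units; u++)
                for (b = 0; b < unit_size; b++)
                        missing[b] ^= survivors[u][b];
}
```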

BTW, we probably need at least one more argument in this user API - the size of the object (since Motr does not know/store it for objects): m0_ship(function, object, size)

Generally, it's useful to stop thinking about objects as byte streams and instead imagine them as arrays of blocks. A 'size' parameter won't be very helpful in the case of an object that starts with a terabyte hole followed by a single filled block. If you think that additional flexibility is warranted, something like m0_ship(fid, function, indexvec) might be appropriate.
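
For reference, a minimal sketch of such a declaration is below. m0_ship() itself does not exist; struct m0_fid, struct m0_indexvec and struct m0_bufvec are existing Motr types (fid/fid.h, lib/vec.h), and identifying the computation by a fid is just one possible convention.

```c
#include "fid/fid.h"   /* struct m0_fid */
#include "lib/vec.h"   /* struct m0_indexvec, struct m0_bufvec */

/*
 * Hypothetical function shipping entry point (sketch, not implemented):
 * execute the computation identified by 'func' against the given block
 * ranges of the object 'obj' and return the result in 'out'.  Passing
 * an index vector instead of a single size also covers sparse objects,
 * e.g. a terabyte hole followed by one filled block.
 */
int m0_ship(const struct m0_fid *obj, const struct m0_fid *func,
            const struct m0_indexvec *ranges, struct m0_bufvec *out);
```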

@nikitadanilov let me clarify my question - I did not ask where it can be executed, but rather where it should be executed.

Since we are talking about the "In-Store Computation" it would be logical to assume that the reconstruction of missing data blocks should be done at the server side along with the computation. However, such reconstruction might be quite a computationally challenging task by itself, so it might make sense to do it for the missing blocks on the client node (which is usually more powerful). What do you think?

Thank you for the plan layout reference, I've started reading it already...

You might be interested in http://gerrit.mero.colo.seagate.com/plugins/gitiles/mero/+/refs/heads/dev/layout-plan/layout/plan.h which defines the never-implemented 'layout plan' interface that would allow a generic and fault-tolerant implementation of function shipping.

@TechWriter-Mayur, please add Nikita's last sentence here to https://github.com/Seagate/cortx-motr/blob/main/doc/reading-list.md in the Function Shipping section, and copy plan.h into motr/doc and link to it. Also, change the license appropriately in plan.h. Also, the Function Shipping section links to a PDF. Ask @VenkyOS if he has already converted that PDF to MD/RST. If so, please change the link here. If not, please convert it yourself and update the link.

Since we are talking about the "In-Store Computation" it would be logical to assume that the reconstruction of missing data blocks should be done at the server side along with the computation.

@andriytk: I do not understand which server is meant here. If an object is striped across all servers in a pool and one of the servers is down, which server is it logical to use for reconstruction?

Again, I don't think the function shipping code should be doing reconstruction. Isn't this replicating functionality/logic in multiple layers of the code? Isn't that a VERY BAD THING^TM to do in software engineering?

@nikitadanilov I was mainly thinking about the situation when a drive has failed, not the whole server. And I meant the server where the failed drive is located. But you are right, in the case of a whole-server failure it's not clear which server should do this. Probably it's better to do it on the client side in that case?

@johnbent we are not saying which part of the code should do what yet. We are just trying to imagine how the system should work in general. Of course, the functionality will not be duplicated in multiple layers of the code.

Added a PR for layout/plan.h (updated for the new Motr API) - Seagate/cortx-motr#411 - so that we can explore and discuss the interface there.

Motr layout access plan.pdf - a quick introduction to the proposed API.

This looks really good, @andriytk. Plus @kaustubh-d, who has been working on some ideas for asynchronous replication. One is that we do the replication at the S3 layer, which suffers from read amplification due to a lack of knowledge of the layout. If we could get the layout, however, then we could do directed byte-range reads to minimize the total amount of IO.

Donald R Bloyer commented in Jira Server:

[~535110] what else is needed here? It suggests Andriy has put together a document for this? Is it exposed now?

 

This is fixed at Seagate/cortx-motr#857.

Chandradhar Raval commented in Jira Server:

Closing this issue as the corresponding GitHub issue is closed.