thustorage/Sherman

Question about write completion and visibility

Closed this issue · 4 comments

Hi, thanks for your paper and the open-source repo of Sherman! I have a question after reading the source code: does write completion indicate that the write is globally visible? For example, based on my understanding, the workflow of Sherman's node split is:

(1) Create the sibling node and copy data
(2) Wait for write completion of (1)
(3) Insert the sibling node pointer into the parent node
(4) Wait for write completion of (3)

To ensure correctness, (1) must be globally visible before (3). When the parent node and the sibling node belong to different MSs, the RDMA ordering rules cannot be applied because the two writes go through different QPs. If write completion does not guarantee visibility (the written data may still reside in the MS's RNIC buffer), (3) may effectively be "reordered" before (1), since (1) may become visible after (3). I think an RDMA READ to the sibling node's MS should be added before (3) in this case, but I didn't find it in the source code (maybe I missed it).
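To make the concern concrete, here is a minimal toy model of the hazard in plain C++. It is only a sketch: it uses no RDMA library and is not Sherman's code. `PessimisticMS`, `write_and_wait`, `read_to_flush`, the addresses, and `split_exposes_dangling_pointer` are all hypothetical names, and the model deliberately assumes the pessimistic case where an ACKed write stays invisible until the RNIC drains it, and where an extra READ forces earlier writes on that MS to become visible.

```cpp
// Toy model only: plain C++, no libibverbs, not Sherman's code.
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical memory server (MS). PESSIMISTIC ASSUMPTION: an ACKed write
// sits in `pending` (the RNIC buffer) and is invisible until drained.
struct PessimisticMS {
    std::map<uint64_t, uint64_t> pending;  // ACKed but not yet visible
    std::map<uint64_t, uint64_t> visible;  // what any reader observes

    // Models WRITE + wait for completion: returns once the write is ACKed.
    void write_and_wait(uint64_t addr, uint64_t value) { pending[addr] = value; }

    // Models the extra RDMA READ proposed above. ASSUMPTION: it pushes every
    // earlier write on this MS out of the RNIC buffer.
    void read_to_flush() {
        std::map<uint64_t, uint64_t>::const_iterator it;
        for (it = pending.begin(); it != pending.end(); ++it)
            visible[it->first] = it->second;
        pending.clear();
    }

    // The RNIC draining on its own at some arbitrary later time.
    void drain() { read_to_flush(); }

    bool reader_sees(uint64_t addr) const { return visible.count(addr) != 0; }
};

const uint64_t kSiblingAddr = 0x1000;  // hypothetical sibling node address
const uint64_t kParentSlot = 0x2000;   // hypothetical slot in the parent node

// Runs steps (1)-(4) and returns true if a reader can observe the parent
// pointer while the sibling node is still invisible.
bool split_exposes_dangling_pointer(bool read_before_step3) {
    PessimisticMS sibling_ms, parent_ms;
    sibling_ms.write_and_wait(kSiblingAddr, 0xABCD);      // (1) + (2)
    if (read_before_step3) sibling_ms.read_to_flush();    // proposed extra READ
    parent_ms.write_and_wait(kParentSlot, kSiblingAddr);  // (3) + (4)
    parent_ms.drain();  // the parent's MS happens to drain first
    return parent_ms.reader_sees(kParentSlot) &&
           !sibling_ms.reader_sees(kSiblingAddr);
}
```

Under these assumptions, `split_exposes_dangling_pointer(false)` returns `true` (the pointer is visible before the sibling), while `split_exposes_dangling_pointer(true)` returns `false`. Whether real RNICs behave like `PessimisticMS` is exactly what I am asking.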

So my question is: does write completion guarantee visibility? I ask because some papers about RDMA+PM [1] suggest that write completion guarantees neither durability nor visibility. Thanks!

[1] Challenges and Solutions for Fast Remote Persistent Memory Access ("The test works by proving that such an RDMA write may not even be visible in the server’s memory hierarchy")

Hi,
RDMA has a strong consistency guarantee: consider a client that issues an RDMA WRITE to a remote RNIC; once it receives the ACK,
all subsequent RDMA READ requests via the same RNIC will see the write. However, due to RNIC buffering, reads from the CPU or from other RNICs may not see the write (see [1]).

In short, visibility is guaranteed within the same RNIC.
If there were no visibility guarantee within the same RNIC, all RDMA systems (e.g., DrTM, FaRM) would need to flush RNIC buffers after each request (so that subsequent requests see the writes), but that is not the case.
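
The rule above can be sketched as a small toy model in plain C++. This is an illustration under stated assumptions, not RDMA code and not Sherman's implementation: `ToyMemoryServer` and all of its methods are hypothetical names, and the model only encodes the observable effect (a READ via the same RNIC sees every ACKed WRITE, a CPU load may not), not the hardware mechanism behind it.

```cpp
// Toy model only: plain C++, no libibverbs, not Sherman's code.
#include <cassert>
#include <cstdint>
#include <map>

typedef std::map<uint64_t, uint64_t> Memory;

// Hypothetical memory server with a single RNIC in front of its DRAM.
class ToyMemoryServer {
public:
    // RDMA WRITE: the data may sit in the RNIC buffer; returning from this
    // call models the client receiving the ACK.
    void rdma_write(uint64_t addr, uint64_t value) { rnic_buffer_[addr] = value; }

    // RDMA READ through the same RNIC: buffered writes are consulted first,
    // so every ACKed write is observed.
    bool rdma_read(uint64_t addr, uint64_t* out) const {
        return lookup(rnic_buffer_, addr, out) || lookup(dram_, addr, out);
    }

    // Load issued by the server's CPU: it bypasses the RNIC buffer and sees
    // only data that has already been drained to DRAM.
    bool cpu_load(uint64_t addr, uint64_t* out) const {
        return lookup(dram_, addr, out);
    }

    // Models the RNIC draining its buffer to DRAM at some later time.
    void drain() {
        Memory::const_iterator it;
        for (it = rnic_buffer_.begin(); it != rnic_buffer_.end(); ++it)
            dram_[it->first] = it->second;
        rnic_buffer_.clear();
    }

private:
    // Copies the value at `addr` into `*out` and returns true if present.
    static bool lookup(const Memory& m, uint64_t addr, uint64_t* out) {
        Memory::const_iterator it = m.find(addr);
        if (it == m.end()) return false;
        *out = it->second;
        return true;
    }

    Memory rnic_buffer_;  // ACKed writes not yet drained
    Memory dram_;         // what the CPU observes
};
```

In this model, right after `rdma_write(0x10, 42)` a call to `rdma_read(0x10, &v)` succeeds with `v == 42`, while `cpu_load(0x10, &v)` fails until `drain()` has run.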

Thank you for your reply! But RDMA does not guarantee consistency across different RNICs. In Sherman's node split (the procedure is listed below), if the sibling node and the parent node belong to different MSs, (1) and (3) write to different RNICs. Could other clients see (3) before (1)? If so, they may read a half-written sibling node.

(1) Copy data to sibling node
(2) Wait for ACK of (1)
(3) Insert the sibling node pointer into the parent node
(4) Wait for ACK of (3)

But when another client fetches the sibling node via RDMA READ, it will see the data written in (1), because that READ goes through the same RNIC that acknowledged (1).

Thank you very much! And I apologize for my misunderstanding. RDMA's strong consistency ensures that subsequent requests via the same RNIC can see this write, regardless of whether they use the same QP as the WRITE. I'll close this issue.