Got completion with error

Hi, thanks for open-sourcing Sherman. We are happy that we can run it on our cluster to learn more about the system.

The issue

We encountered a protection error and a deadlock when running the multithreaded, multi-machine benchmarks.

Instructions executed

We run the following commands on each machine for the multithreaded, multi-machine benchmark, which produces runtime errors. The Memcached server is on a third machine.

./hugepage.sh
./restartMemc.sh
./benchmark 2 100 4

We run the following commands for the single-threaded, single-machine benchmark, which runs well:

./hugepage.sh
./restartMemc.sh
./benchmark 1 100 1

We changed the total number of huge pages in hugepage.sh to 4096 to reduce preparation time; the huge page size is 2 MiB.
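For reference, the size of the resulting huge page pool can be checked with a little arithmetic. The sketch below uses only the two numbers from the report (4096 pages, 2 MiB each); the function name is hypothetical and not part of Sherman.

```cpp
#include <cstdint>

// Total memory backed by a huge page pool, in bytes.
// Hypothetical helper for illustration; not part of Sherman.
constexpr std::uint64_t hugepage_pool_bytes(std::uint64_t pages,
                                            std::uint64_t page_bytes) {
    return pages * page_bytes;
}

// Values from the report: 4096 huge pages of 2 MiB each.
constexpr std::uint64_t kPoolBytes =
    hugepage_pool_bytes(4096, 2ULL * 1024 * 1024);

// 4096 * 2 MiB = 8 GiB = 8589934592 bytes.
static_assert(kPoolBytes == 8589934592ULL, "pool should be 8 GiB");
static_assert((kPoolBytes >> 30) == 8, "pool should be 8 GiB");
```

The result, 8589934592 bytes, is the same figure that appears in the "registering 8589934592 memory region" log line quoted later in this thread, so the pool is exactly as large as that registration.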

Error messages

We were able to run a single-thread benchmark on a single machine, but we encountered the following errors when running multithread and multi-machine tests.

Machine configuration

As shown above, the RDMA poll failed with a protection error, and a deadlock was detected. We are not sure whether this is caused by a wrong hardware configuration or by software bugs. The machine configuration is as follows:

The hardware configuration seems to meet Sherman's requirements (OFED version and firmware version).

Analysis

The protection error is caused by access to invalid memory regions, but we are not sure whether this is caused by software bugs or the wrong hardware setup. The deadlock error is also confusing because the benchmarks are read-only. Can you give us some tips to debug these errors?

Hi, ./restartMemc.sh only needs to be executed once before each run: execute it on machine 1, but not machine 2.
Besides, memcached consumes almost no system resources, so you can co-locate it with the Sherman processes.

Thanks for your quick reply. We can now run Sherman successfully!

Sorry for reopening this issue, but when running multi-machine benchmarks, we get the following errors when the thread count exceeds 4:
on machine 0

on machine 1

And if we start the two servers at almost the same time, we hit an assertion failure: Assertion page->hdr.sibling_ptr != GlobalAddress::Null() failed

Can you provide a screenshot of the entire test?

Sorry for my late reply.

I cannot see the complete output of server 0 (the right part of the screenshot).

The missing part is below

The "registering 8589934592 memory region" lines are output we added to trace Sherman's execution (these outputs are long and repetitive; I can capture them all).

Can you check if the error is triggered when performing

Sherman/src/Tree.cpp

Line 67 in 76e208b

auto root_addr = dsm->alloc(kLeafPageSize);

or

Sherman/src/Tree.cpp

Line 71 in 76e208b

dsm->write_sync(page_buffer, root_addr, kLeafPageSize);

?

Sherman/src/Tree.cpp

Line 74 in 76e208b

bool res = dsm->cas_sync(root_ptr_ptr, 0, root_addr.val, cas_buffer);

The above line triggers the error
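For readers unfamiliar with the pattern, the failing line is a compare-and-swap that installs the new root address only if the root pointer slot still holds 0. The sketch below is a purely local analogue using std::atomic; it is not Sherman's code and performs no RDMA. The helper name try_install_root is hypothetical.

```cpp
#include <atomic>
#include <cstdint>

// Local stand-in for the 8-byte root pointer slot; 0 means "no root yet".
// Hypothetical helper for illustration; Sherman's dsm->cas_sync operates on
// remote memory instead of a local std::atomic.
inline bool try_install_root(std::atomic<std::uint64_t>& root_slot,
                             std::uint64_t candidate) {
    std::uint64_t expected = 0;
    // Succeeds only if the slot still holds 0. On failure, `expected` is
    // overwritten with the value that another caller installed first.
    return root_slot.compare_exchange_strong(expected, candidate);
}
```

With this pattern, when two servers race to create the tree, exactly one call returns true and the other observes the winner's root address. The second candidate value used in the checks of such a race is arbitrary; only the first caller's value ends up in the slot.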

Is it OK when the number of threads is 2?
Can you print the information of related variables?

Unfortunately, the 2-thread benchmark currently fails too, with the same error messages (I wonder if I should reboot the machines after each run?).
I have the following variables with -O0 optimization:

The root_addr.val's hex value is 0x20000000001. It doesn't look like a valid value.
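One way to sanity-check such a value is to split it into its fields. The sketch below ASSUMES that the packed 64-bit address stores a 16-bit node ID in the low bits and a 48-bit offset in the high bits; this layout is an assumption that should be verified against GlobalAddress.h in the repository, and the names DecodedAddress and decode_global_address are hypothetical.

```cpp
#include <cstdint>

// ASSUMED layout (verify against GlobalAddress.h):
//   bits  0..15 : node ID
//   bits 16..63 : offset within that node's memory region
struct DecodedAddress {
    std::uint64_t node_id;
    std::uint64_t offset;
};

// Split a packed 64-bit address into its assumed fields.
constexpr DecodedAddress decode_global_address(std::uint64_t val) {
    return DecodedAddress{val & 0xFFFFULL, val >> 16};
}
```

Under that assumed layout, 0x20000000001 decodes to node ID 1 and offset 0x2000000 (32 MiB), which would be a plausible address rather than a corrupted one. If the real layout differs, this reading does not apply.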

How about a single thread on each machine? Please check the RDMA network state by running ib_write_bw.

Running single-thread benchmarks is sometimes OK and occasionally produces the same error.

ib_write_bw works fine and our own programs also work.

This issue is weird because we successfully ran the multithreaded benchmark on two machines once, but it currently doesn't work. Maybe it is due to some machine state issue?

Can you insert while(true) {} after

Sherman/test/benchmark.cpp

Line 258 in 76e208b

tree = new Tree(dsm);

?
Let's check if these two servers can init the tree successfully

Sorry for the very late reply; I'm currently busy with another project.
The two servers can init the tree successfully after adding the loop.

Hi, can you send your WeChat ID to q-wang18@mails.tsinghua.edu.cn? We can communicate more efficiently through WeChat.

Thank you so much for your help and I've sent my ID to you.