Got completion with error
Dicridon opened this issue · 18 comments
Hi, thanks for open-sourcing Sherman. We are happy that we can run it on our cluster to learn more about the system.
The issue
We encountered a protection error and a deadlock when running multithreaded, multi-machine benchmarks.
Instructions executed
We use the following commands on each machine to run the multithreaded, multi-machine benchmarks, which produce runtime errors. The Memcached server is on a third machine.
./hugepage.sh
./restartMemc.sh
./benchmark 2 100 4
We use the following commands to run the single-thread, single-machine benchmark, which runs correctly:
./hugepage.sh
./restartMemc.sh
./benchmark 1 100 1
The total number of huge pages in hugepage.sh is modified to 4096 to reduce preparation time, and the huge page size is 2 MiB.
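For reference, the amount of memory this setting reserves can be worked out with shell arithmetic. This is a minimal sketch that assumes only the two values stated above (4096 pages of 2 MiB each); the variable names are illustrative and are not taken from hugepage.sh.

```shell
#!/bin/sh
# Values reported in this issue; the names are hypothetical, not from hugepage.sh.
pages=4096          # number of huge pages configured
page_size_kib=2048  # 2 MiB per huge page, expressed in KiB

# Total reserved memory in MiB: pages * page size (KiB) / 1024.
total_mib=$(( pages * page_size_kib / 1024 ))
echo "${total_mib} MiB reserved"   # prints "8192 MiB reserved", i.e. 8 GiB
```

On Linux, the values actually in effect are reported in /proc/meminfo under HugePages_Total and Hugepagesize, which can be compared against this figure.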
Error messages
We were able to run a single-thread benchmark on a single machine, but we encountered the following errors when running multithread and multi-machine tests.
Machine configuration
As shown above, RDMA poll failed due to protection error, and deadlock was detected. We are not sure whether this is caused by the wrong hardware configuration or software bugs. The machine configuration is as follows:
The hardware configuration seems to meet the requirement of Sherman (OFED version and firmware version).
Analysis
The protection error is caused by access to invalid memory regions, but we are not sure whether this is caused by software bugs or the wrong hardware setup. The deadlock error is also confusing because the benchmarks are read-only. Can you give us some tips to debug these errors?
Hi, ./restartMemc.sh only needs to be executed once before each run: execute it on machine 1, but not on machine 2.
Besides, memcached consumes almost no system resources, so you can co-locate it with the Sherman processes.
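The corrected launch order can be sketched as a dry run that only prints the commands for each machine. This is an illustration, not part of the Sherman repository: `launch_plan` is a hypothetical helper, and it assumes that hugepage.sh is still run on every machine, as in the original report, while restartMemc.sh runs on machine 1 only.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the commands for one machine without
# executing them.
#   $1      machine index (1 is the machine that restarts Memcached)
#   $2...   arguments passed through to ./benchmark
launch_plan() {
    machine="$1"
    shift
    echo "./hugepage.sh"
    # Memcached is reset once per run, from machine 1 only.
    if [ "$machine" -eq 1 ]; then
        echo "./restartMemc.sh"
    fi
    echo "./benchmark $*"
}

echo "# machine 1"
launch_plan 1 2 100 4
echo "# machine 2"
launch_plan 2 2 100 4
```

Printing the plan rather than running it makes it easy to confirm that restartMemc.sh appears exactly once across all machines before starting a real run.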
Thanks for your quick reply, we now can run Sherman successfully!
Sorry for reopening this issue, but when running multi-machine benchmarks, we have the following errors when the thread number exceeds 4:
on machine 0
on machine 1
And if we start the two servers at almost the same time, we hit an assertion failure: Assertion page->hdr.sibling_ptr != GlobalAddress::Null() failed
Can you provide a screenshot of the entire test?
I cannot see the complete output of server 0 (right part of the screenshot).
Is it OK when the number of threads is 2?
Can you print the information of related variables?
How about a single thread on each machine? Please check the RDMA network state by running ibv_write_bw.
Running single-thread benchmarks sometimes works and occasionally produces the same error.
ibv_write_bw works fine and our own programs also work.
This issue is odd because we successfully ran the multithreaded benchmark on two machines once, but it currently doesn't work. Maybe it is due to some machine state issue?
Can you insert while(true) {} after line 258 in 76e208b?
Let's check if these two servers can init the tree successfully
Sorry for the late reply; I'm currently busy with another project.
The two servers can init the tree successfully after adding the loop.
Hi, can you send your WeChat ID via q-wang18@mails.tsinghua.edu.cn? We can communicate more efficiently through WeChat.
Thank you so much for your help and I've sent my ID to you.