ruihong123/dLSM

How do I run the code across multiple servers?


We have successfully run your code in a standalone (single-machine) setup. But when we try to run it across two machines, the compute node fails. In poll_completion(), the compute node repeatedly prints "number 0 got bad completion with status: 0xc, vendor syndrome: 0x81", and then the memory node prints "RDMA write failed". The call chain leading to the failure is "dLSM::DBImpl::BackgroundFlush()->dLSM::DBImpl::CompactMemTable()->dLSM::DBImpl::WriteLevel0Table()->dLSM::FlushJob::BuildTable()->dLSM::TableBuilder_ComputeSide::Finish()->dLSM::RDMA_Manager::poll_completion()".
How can we fix this?
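
For reference, the status value in the "bad completion" message is a numeric `ibv_wc_status` from libibverbs, so it can be decoded. 0xc (12) corresponds to `IBV_WC_RETRY_EXC_ERR` (transport retry counter exceeded), which generally means the sender never received an ACK or NAK from the remote QP. On a two-machine setup this often points at connectivity or addressing between the nodes rather than at the flush logic itself, though that is only a guess from the status code. Below is a minimal sketch of the decoding. `DecodeWcStatus` is a hypothetical helper, not part of dLSM, and the table covers only the first 14 enum values, copied by hand so the snippet builds without libibverbs. In real code, `ibv_wc_status_str()` from `<infiniband/verbs.h>` returns a description for any status.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Names of the first 14 ibv_wc_status values from <infiniband/verbs.h>,
// indexed by numeric value. Copied by hand (an assumption of this sketch)
// so that it compiles without libibverbs installed.
static const char* const kWcStatusNames[] = {
    "IBV_WC_SUCCESS",            // 0x0
    "IBV_WC_LOC_LEN_ERR",        // 0x1
    "IBV_WC_LOC_QP_OP_ERR",      // 0x2
    "IBV_WC_LOC_EEC_OP_ERR",     // 0x3
    "IBV_WC_LOC_PROT_ERR",       // 0x4
    "IBV_WC_WR_FLUSH_ERR",       // 0x5
    "IBV_WC_MW_BIND_ERR",        // 0x6
    "IBV_WC_BAD_RESP_ERR",       // 0x7
    "IBV_WC_LOC_ACCESS_ERR",     // 0x8
    "IBV_WC_REM_INV_REQ_ERR",    // 0x9
    "IBV_WC_REM_ACCESS_ERR",     // 0xa
    "IBV_WC_REM_OP_ERR",         // 0xb
    "IBV_WC_RETRY_EXC_ERR",      // 0xc
    "IBV_WC_RNR_RETRY_EXC_ERR",  // 0xd
};

// Hypothetical helper: maps the numeric status printed by poll_completion()
// to the enum name. Values outside the table are reported as unknown.
std::string DecodeWcStatus(std::uint32_t status) {
  const std::size_t count =
      sizeof(kWcStatusNames) / sizeof(kWcStatusNames[0]);
  if (status < count) {
    return kWcStatusNames[status];
  }
  return "unknown (not in this table)";
}
```

Calling `DecodeWcStatus(0xc)` yields `IBV_WC_RETRY_EXC_ERR`. The vendor syndrome (0x81 here) is device-specific and is not decoded by this sketch.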

Please post the full log of the error; I may be able to figure out what happened.

Mark: valgrind socket info1
searching for IB devices in host
found 2 device(s)
device not specified, using first one found: mlx5_0
New MR was registered with addr=0x7faa0b0e1010, lkey=0x1825e4, rkey=0x1825e4, flags=0xf, size=10240000, total registered size is 0
New MR was registered with addr=0x7faa0a71c010, lkey=0x17fcbc, rkey=0x17fcbc, flags=0xf, size=10240000, total registered size is 10240000
SST buffer, send&receive buffer were registered with a
maximum outstanding wr number is32768
maximum query pair number is131072
maximum completion queue number is16777216
maximum memory region number is16777216
maximum memory region size is18446744073709551615
connect to node id 0QP was created, QP number=0x25d7

Local LID = 0x0
total bytes: 23read byte: 23Remote QP number = 0x6a8
Remote LID = 0x0
Remote GID =fe:80:00:00:00:00:00:00:12:70:fd:ff:fe:2f:8f:b4
QP 0x7faa040022b8 state was change to RTS
total bytes: 1read byte: 1Finish the connection with node 0
New MR was registered with addr=0x7fa9c3fff010, lkey=0xac17, rkey=0xac17, flags=0xf, size=1073741824, total registered size is 20480000
dLSM: version 1.22
Date: Fri Aug 18 03:19:54 2023
Start to sync options
client handling thread
CPU: 80 * Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
CPUCache:
Keys: 16 bytes each
Values: 32 bytes each (16 bytes after compression)
Entries: 10000000
RawSize: 457.8 MB (estimated)
FileSize: 305.2 MB (estimated)
WARNING: Optimization is disabled: benchmarks unnecessarily slow
WARNING: Assertions are enabled; benchmarks unnecessarily slow
WARNING: Snappy compression is not enabled

DBImpl start
New MR was registered with addr=0x7fa9c1ffe010, lkey=0x33fff, rkey=0x33fff, flags=0xf, size=33554432, total registered size is 1094221824
Memory used up, Initially, allocate new one, memory pool is Version_edit, total memory this pool is 1
RDMA write successfully
communication thread created
DBImpl finished
level 0 file equals 0 marker
Version get garbage collected
version garbage collected.
level 0 file equals 0 marker
Version get garbage collected
version garbage collected.
May be schedule a background task!
DBImpl deallocated
May be schedule a background task!
May be schedule a background task!
Version get garbage collected
version garbage collected.
remained versuins number is 199344864version garbage collected.
Memtable 0x55d8be288600 deallocated
Total number of entries within the cahce is 0DBImpl start
RDMA write successfully
communication thread created
DBImpl finished
level 0 file equals 0 marker
Version get garbage collected
version garbage collected.
level 0 file equals 0 marker
Version get garbage collected
version garbage collected.
May be schedule a background task!
The second open finished.
The benchmark start.
validation write finished
start front-end threads
Wait for thread start
total bytes: 1read byte: 1sync wait time is 227873Threads start to run
Add a new file, current immtable number is 1mark in the ref
May be schedule a background task!
flushing thread pool task queue length 0
Schedule a flushing !
table picked is 1picked metable number is 1new file number for flushing is 4
New MR was registered with addr=0x7fa973fff010, lkey=0xcb18, rkey=0xcb18, flags=0xf, size=1073741824, total registered size is 1127776256
Memory used up, Initially, allocate new one, memory pool is FlushBuffer, total memory this pool is 1
New MR was registered with addr=0x7fa933ffe010, lkey=0x1c919, rkey=0x1c919, flags=0xf, size=1073741824, total registered size is 2201518080
Memory used up, Initially, allocate new one, memory pool is IndexChunk, total memory this pool is 1
Add a new file, current immtable number is 2mark in the ref
May be schedule a background task!
flushing thread pool task queue length 0
Schedule a flushing !
table picked is 1picked metable number is 1new file number for flushing is 5
New MR was registered with addr=0x7fa8f3ffd010, lkey=0x20a1a, rkey=0x20a1a, flags=0xf, size=1073741824, total registered size is 3275259904
Memory used up, Initially, allocate new one, memory pool is FilterChunk, total memory this pool is 1
Remote memory registeration, size: 1073741824
polled reply bufferr
QP was created, QP number=0x25d8

QP num to be sent = 0x25d8
Local LID = 0x0
QP was created, QP number=0x25d9
Polling reply buffer
QP num to be sent = 0x25d9
Local LID = 0x0uffer
Remote QP number=0x6a9
Remote LID = 0x0ffer
Remote GID =fe:80:00:00:00:00:00:00:12:70:fd:ff:fe:2f:8f:b4
QP 0x7fa9b8005bd8 state was change to RTS
Remote QP number=0x6aa
Remote LID = 0x0
Remote GID =fe:80:00:00:00:00:00:00:12:70:fd:ff:fe:2f:8f:b4
QP 0x7fa9b40088f8 state was change to RTS
For flush, Total number of key touched is 153846, KV left is 152656
One more local write buffer is added, now 3 total
sst offset is 9627984
For flush, Total number of key touched is 153846, KV left is 152722
One more local write buffer is added, now 3 total
sst offset is 9618133
BloomFilter block size is 190922index block size: 36543
start of the this block is0, 20, 3, 0, 0, 0, 0, 0, 0, 43, 210, 1, 377, 377, 377, 377, 377, 377, 377, 0, 303, 77, 0, 20, 4, 0, 0, 0, 0, 0,
BloomFilter block size is 190922index block size: 36444
start of the this block is0, 31, 3, 0, 0, 0, 0, 0, 0, 0, 241, 60, 60, 60, 60, 60, 60, 60, 60, 166, 1, 241, 0, 0, 0, 0, 0, 0, 0, 303,
Add a new file, current immtable number is 3mark in the ref
May be schedule a background task!
flushing thread pool task queue length 0
Schedule a flushing !
table picked is 1picked metable number is 1new file number for flushing is 6
QP was created, QP number=0x25da

QP num to be sent = 0x25da
Local LID = 0x0
Remote QP number=0x6ab
Remote LID = 0x0
Remote GID =fe:80:00:00:00:00:00:00:12:70:fd:ff:fe:2f:8f:b4
QP 0x7fa8ec005bb8 state was change to RTS
For flush, Total number of key touched is 153846, KV left is 152630
One more local write buffer is added, now 3 total
sst offset is 9626346
BloomFilter block size is 190922index block size: 36559
start of the this block is0, 30, 3, 0, 0, 0, 0, 0, 0, 41, 371, 60, 60, 60, 60, 60, 60, 60, 60, 1, 337, 350, 5, 0, 0, 0, 0, 0, 303, 77,
Add a new file, current immtable number is 4mark in the ref
May be schedule a background task!
flushing thread pool task queue length 0
Schedule a flushing !
table picked is 1picked metable number is 1new file number for flushing is 7
QP was created, QP number=0x25db

QP num to be sent = 0x25db
Local LID = 0x0
Remote QP number=0x6ac
Remote LID = 0x0
Remote GID =fe:80:00:00:00:00:00:00:12:70:fd:ff:fe:2f:8f:b4
QP 0x7fa8e4005bb8 state was change to RTS
For flush, Total number of key touched is 153846, KV left is 152749
One more local write buffer is added, now 3 total
sst offset is 9633852
BloomFilter block size is 191050index block size: 36624
start of the this block is0, 20, 3, 0, 0, 0, 0, 0, 0, 40, 135, 1, 377, 377, 377, 377, 377, 377, 377, 0, 303, 77, 0, 20, 4, 0, 0, 0, 0, 0,
Add a new file, current immtable number is 5mark in the ref
May be schedule a background task!
flushing thread pool task queue length 0
Schedule a flushing !
number 0 got bad completion with status: 0xc, vendor syndrome: 0x81
db_bench: /home/zqy2023/dLSM/util/rdma.cc:2599: int dLSM::RDMA_Manager::poll_completion(ibv_wc*, int, std::string, bool, uint8_t): Assertion `false' failed.
Aborted (core dumped)

gdb db_bench core
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007faa0bc51859 in __GI_abort () at abort.c:79
#2 0x00007faa0bc51729 in __assert_fail_base (fmt=0x7faa0bde7588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=0x55d8bd531ad1 "false", file=0x55d8bd531c68 "/home/zqy2023/dLSM/util/rdma.cc", line=2599,
function=) at assert.c:92
#3 0x00007faa0bc62fd6 in __GI___assert_fail (assertion=0x55d8bd531ad1 "false",
file=0x55d8bd531c68 "/home/zqy2023/dLSM/util/rdma.cc", line=2599,
function=0x55d8bd532b48 "int dLSM::RDMA_Manager::poll_completion(ibv_wc*, int, std::string, bool, uint8_t)")
at assert.c:101
#4 0x000055d8bd4f9b34 in dLSM::RDMA_Manager::poll_completion (this=0x55d8bd9f5170, wc_p=0x7faa08a4e750, num_entries=4,
qp_type="write_local_flush", send_cq=true, target_node_id=0 '\000') at /home/zqy2023/dLSM/util/rdma.cc:2599
#5 0x000055d8bd4e1c9d in dLSM::TableBuilder_ComputeSide::Finish (this=0x7fa9b8000ca0)
at /home/zqy2023/dLSM/table/table_builder_computeside.cc:618
#6 0x000055d8bd49f3cc in dLSM::FlushJob::BuildTable (this=0x7faa08a4eb30, dbname="/tmp/dLSMtest-1010/dbbench", env=
0x55d8bd582ce0 <dLSM::Env::Default()::env_container>, options=..., table_cache=0x55d8be2890e0, iter=0x7fa9b8000cc0,
meta=std::shared_ptr<dLSM::RemoteMemTableMetaData> (use count 2, weak count 0) = {...}, type=dLSM::Flush,
target_node_id=0 '\000') at /home/zqy2023/dLSM/db/memtable_list.cc:892
#7 0x000055d8bd46cd2e in dLSM::DBImpl::WriteLevel0Table (this=0x55d8be288600, job=0x7faa08a4eb30, edit=0x7faa08a4ebc0)
at /home/zqy2023/dLSM/db/db_impl.cc:791
#8 0x000055d8bd46cffd in dLSM::DBImpl::CompactMemTable (this=0x55d8be288600) at /home/zqy2023/dLSM/db/db_impl.cc:997
#9 0x000055d8bd46da4c in dLSM::DBImpl::BackgroundFlush (this=0x55d8be288600, p=0x0) at /home/zqy2023/dLSM/db/db_impl.cc:1220
#10 0x000055d8bd46d902 in dLSM::DBImpl::BGWork_Flush (thread_arg=0x55d8be289c40) at /home/zqy2023/dLSM/db/db_impl.cc:1182
#11 0x000055d8bd4ce2c6 in std::_Function_handler<void (void*), void (*)(void*)>::_M_invoke(std::_Any_data const&, void*&&) (
__functor=..., __args#0=@0x7faa08a4ed20: 0x55d8be289c40) at /usr/include/c++/9/bits/std_function.h:300
#12 0x000055d8bd4cb9d9 in std::function<void (void*)>::operator()(void*) const (this=0x7faa08a4ed90, __args#0=0x55d8be289c40)
at /usr/include/c++/9/bits/std_function.h:688
#13 0x000055d8bd4c9fa4 in dLSM::ThreadPool::BGThread (this=0x55d8bd582dc0 <dLSM::Env::Default()::env_container+224>)
at /home/zqy2023/dLSM/./util/ThreadPool.h:74
#14 0x000055d8bd4d5282 in std::__invoke_impl<void, void (dLSM::ThreadPool::)(), dLSM::ThreadPool>

Hello, have you solved the problem?