LLNL/UnifyFS

Potential deadlock caused by concurrent sync calls

wangvsa opened this issue · 0 comments

Describe the problem you're observing

I'm observing TIMEOUT errors when staging in many files simultaneously.
It seems that concurrent unifyfs_sync() calls can cause a deadlock on the server side.
After some investigation, I found that the servers block in process_pending_sync in this case:

client A on server 0 --> write/sync file 1 --> owner is server 1
client B on server 1 --> write/sync file 2 --> owner is server 0

if ((ret == UNIFYFS_SUCCESS) && !is_owner) {
    /* send the combined list to the owner */
    ret = unifyfs_invoke_add_extents_rpc(gfid, total_extents,
                                         combined_extents);
}
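
To make the suspected circular wait concrete, here is a minimal sketch of the scenario as a wait-for graph. It is only a model and not UnifyFS code: `model_sync`, `has_wait_cycle`, and `NOT_WAITING` are hypothetical names, and the sketch assumes (unconfirmed) that a server blocked in the add_extents RPC cannot service an incoming add_extents request until its own call returns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NOT_WAITING (-1)

/* Hypothetical model, not UnifyFS code.
 * waits_on[s] is the rank that server s is blocked on, or NOT_WAITING. */

/* Model a sync handled by `server` for a file owned by `owner`.
 * ASSUMPTION: a non-owner blocks until the owner answers the RPC. */
static void model_sync(int *waits_on, int server, int owner)
{
    if (server != owner) {
        waits_on[server] = owner;
    }
}

/* Return true if following the wait-for edges from `start`
 * leads back to `start`, i.e. `start` is part of a circular wait. */
static bool has_wait_cycle(const int *waits_on, size_t n, int start)
{
    assert(start >= 0 && (size_t)start < n);
    int cur = start;
    /* A cycle through `start` has at most n edges. */
    for (size_t hops = 0; hops < n; hops++) {
        cur = waits_on[cur];
        if (cur == NOT_WAITING) {
            return false; /* chain ends at a server that is not blocked */
        }
        if (cur == start) {
            return true;  /* came back to where we started */
        }
    }
    return false;
}
```

With only client A syncing (server 0 waits on server 1), the chain ends at server 1 and there is no cycle. Once client B syncs as well (server 1 waits on server 0), both servers are part of a cycle, which would match the TIMEOUT if the assumption above holds.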

@MichaelBrim Is this the cause? Any idea how to fix this?