Failed to send SCSI registration to any LUN after mapping a large number of external LUNs.
Procedure:
Step 1: Map 1024 LUNs (16 paths) from an external storage array to the host, then run rescan-scsi-bus.sh to create 1024 disks.
Step 2: Manually issue a registration command to one of the LUNs. It fails with a timeout error:
mpathpersist -o -I -S 0x000000003320095c /dev/dm-117
But when I sent the registration command with sg_persist -o -I -S 0x000000003320095c /dev/dm-117, it succeeded.
According to the error log, mpathpersist times out when it sends the message that saves the prkey to multipathd. This happens when reservation_key is configured as follows:
defaults {
    path_checker tur
    no_path_retry 18
    path_grouping_policy group_by_prio
    prio const
    deferred_remove yes
    uid_attribute "ID_SERIAL"
    reassign_maps no
    failback immediate
    log_checker_err once
    reservation_key "file"    # this is the setting in question
}
Root Cause: The reply packet is not received within the fixed 4-second timeout, because multipathd takes more than 4 seconds to execute PARSE, which triggers vector lock contention with checkerloop.
#define DEFAULT_REPLY_TIMEOUT 4000
static int do_update_pr(char *alias, char *arg)
{
    ......
    condlog(2, "%s: pr message=%s", alias, str);
    if (send_packet(fd, str) != 0) {
        condlog(2, "%s: message=%s send error=%d", alias, str, errno);
        mpath_disconnect(fd);
        return -1;
    }
    ret = recv_packet(fd, &reply, DEFAULT_REPLY_TIMEOUT);
    if (ret < 0) {
        condlog(2, "%s: message=%s recv error=%d", alias, str, errno);
        ret = -1;
    }
    ......
}
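To illustrate why the client gives up, here is a minimal sketch of a bounded wait for a reply. It is not the libmpathcmd implementation: wait_for_reply is a hypothetical helper, and using poll() on a plain file descriptor is my assumption about how such a wait could be written.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of multipath-tools): wait until fd becomes
 * readable or timeout_ms elapses.
 * Returns 0 when data is ready, -1 on error or timeout.
 * On timeout, errno is set to ETIMEDOUT.
 */
static int wait_for_reply(int fd, unsigned int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ret = poll(&pfd, 1, (int)timeout_ms);

    if (ret == 0) {
        /* The peer did not answer in time, e.g. it is still busy. */
        errno = ETIMEDOUT;
        return -1;
    }
    return ret < 0 ? -1 : 0;
}
```

With a fixed 4000 ms budget, any server that needs longer than that to answer makes a wait like this fail, even though the server would eventually reply.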
Solution Suggestion: Change the client timeout from DEFAULT_REPLY_TIMEOUT to the uxsock_timeout value. That would be consistent with the server, and it makes more sense: considering the transmission delay, the client wait timeout should be longer than the server execution timeout. After that, uxsock_timeout in /etc/multipath.conf can be raised above the default value, for example to 10 seconds.
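A minimal sketch of the suggested selection logic, assuming the configured uxsock_timeout (in milliseconds) is available to the client and that 0 means "not set". pick_reply_timeout is a hypothetical helper, not an existing multipath-tools function.

```c
#define DEFAULT_REPLY_TIMEOUT 4000 /* ms, same value as in the excerpt above */

/*
 * Hypothetical helper: choose how long the client waits for a reply.
 * Assumption: uxsock_timeout_ms == 0 means the option is not configured,
 * in which case the built-in default is used.
 */
static unsigned int pick_reply_timeout(unsigned int uxsock_timeout_ms)
{
    if (uxsock_timeout_ms == 0)
        return DEFAULT_REPLY_TIMEOUT;
    return uxsock_timeout_ms;
}
```

The result would then be passed as the third argument to recv_packet() in place of the constant, so that raising uxsock_timeout to 10000 also gives the client 10 seconds to wait.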
I don't understand.
"multipathd spent more than 4 seconds to execute PARSE"
What does this mean? What do you mean by PARSE, and how is it possible that it took 4 seconds?
Can you fix this by simply increasing the timeout?
Btw, which multipath-tools version were you using?