opensvc/multipath-tools

Failed to send SCSI registration to any of the LUNs after mapping a large number of external LUNs.

Closed this issue · 1 comments

Procedure:

Step 1: 1024 LUNs (16 paths) from an external storage array are mapped to the host, then rescan-scsi-bus.sh is run to produce 1024 disks.
Step 2: Manually issue a registration command to one of the LUNs. It fails with a timeout error:
mpathpersist -o -I -S 0x000000003320095c /dev/dm-117
However, sending the same registration command with sg_persist -o -I -S 0x000000003320095c /dev/dm-117 succeeded.
According to the error log, mpathpersist times out while sending the "save prkey" message to multipathd when reservation_key is configured as follows:

defaults {
	path_checker            tur
	no_path_retry           18
	path_grouping_policy    group_by_prio
	prio                    const
	deferred_remove         yes
	uid_attribute           "ID_SERIAL"
	reassign_maps           no
	failback                immediate
	log_checker_err         once
	reservation_key         "file"  # this item
}

Root cause: The reply packet is not received within the fixed 4-second timeout, because multipathd spends more than 4 seconds executing PARSE, which triggers a vector lock collision with checkerloop.

#define DEFAULT_REPLY_TIMEOUT	4000
static int do_update_pr(char *alias, char *arg)
{
	......
	condlog (2, "%s: pr message=%s", alias, str);
	if (send_packet(fd, str) != 0) {
		condlog(2, "%s: message=%s send error=%d", alias, str, errno);
		mpath_disconnect(fd);
		return -1;
	}
	ret = recv_packet(fd, &reply, DEFAULT_REPLY_TIMEOUT);
	if (ret < 0) {
		condlog(2, "%s: message=%s recv error=%d", alias, str, errno);
		ret = -1;
	}
	......
}
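
To illustrate why the client gives up even though multipathd eventually answers, here is a minimal sketch of a bounded wait on a socket. It is not the multipath-tools implementation: wait_for_reply() is a hypothetical helper that only shows the general poll()-with-timeout pattern, and it assumes the reply arrives on a plain file descriptor.

```c
#include <errno.h>
#include <poll.h>

/*
 * Hypothetical helper (not part of multipath-tools): wait up to
 * timeout_ms milliseconds for fd to become readable.
 * Returns 0 if data is ready, -1 on error or timeout.
 * On timeout, errno is set to ETIMEDOUT.
 */
static int wait_for_reply(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret;

	/* Retry if a signal interrupts the wait. For simplicity the
	 * full timeout is used again after each interruption. */
	do {
		ret = poll(&pfd, 1, timeout_ms);
	} while (ret < 0 && errno == EINTR);

	if (ret == 0) {
		/* Nothing arrived in time: the peer may still be busy. */
		errno = ETIMEDOUT;
		return -1;
	}
	return ret < 0 ? -1 : 0;
}
```

With a pattern like this, a server that needs 5 seconds to answer always looks like a failure to a client that waits only 4000 ms, even though the request itself may still be completed on the server side.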

Suggested solution: Change the client timeout to the uxsock_timeout value instead of DEFAULT_REPLY_TIMEOUT. That would be consistent with the server and makes more sense: the client's wait timeout should be longer than the server's execution timeout, to allow for transmission delay. After that, uxsock_timeout in /etc/multipath.conf can be raised above its default value, for example to 10 seconds.
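
The suggestion could look roughly like the sketch below. This is only an illustration under assumptions: reply_timeout_ms() is a hypothetical helper, the configured value is assumed to be in milliseconds with 0 meaning "not set", and how mpathpersist would obtain uxsock_timeout from the configuration is not shown.

```c
#define DEFAULT_REPLY_TIMEOUT	4000	/* ms, as in the excerpt above */

/*
 * Hypothetical helper (not existing multipath-tools code): pick the
 * client-side reply timeout. Prefer the configured uxsock_timeout
 * (milliseconds); fall back to the fixed default when it is unset (0).
 */
static unsigned int reply_timeout_ms(unsigned int uxsock_timeout)
{
	return uxsock_timeout ? uxsock_timeout : DEFAULT_REPLY_TIMEOUT;
}
```

The result would then be passed to recv_packet() in place of DEFAULT_REPLY_TIMEOUT, so that raising uxsock_timeout to 10000 in /etc/multipath.conf also lengthens the client's wait.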

I don't understand.

> multipathd spent more than 4 seconds to execute PARSE

What does this mean? What do you mean by PARSE, and how is it possible that it took 4 seconds?
Can you fix this by simply increasing the timeout?

Btw which multipath-tools version were you using?