zrlio/disni

UNKNOWN, srcAddress /0.0.0.0:0

dacrespi opened this issue · 33 comments

I'm attempting to run crail-spark.
I'm set up as a container running spark with workers, and attempting to just access crail store.
Running either crail fs -ls -R / or, say, terasort, I hit the same error.

INFO disni: got event type + UNKNOWN, srcAddress /0.0.0.0:0, dstAddress /192.168.3.100:4420

I've set disni in crail to log DEBUG but I don't get any additional info.

It appears that DiSNI is attempting to set up a QP but is unable to determine the local rNIC's address. I have the container set up on a bridged network rather than the host network, which I'm guessing could be the issue. I do see all the RDMA devices from inside the container, however. I tried using the host network, but then Spark fails because workers cannot have unique hostnames when attached to the host network.

$ ibv_devices
device node GUID
------ ----------------
i40iw0 0cc47afc00ed0000
mlx5_2 98039b0300989ab6
mlx5_0 98039b0300989b0e
i40iw1 0cc47afc00ec0000
mlx5_3 98039b0300989ab7
mlx5_1 98039b0300989b0f

Snippet of the console output prior to the hang:

19/06/12 08:49:10 INFO crail: CrailHadoopFileSystem construction
19/06/12 08:49:10 INFO crail: creating singleton crail file system
19/06/12 08:49:10 INFO crail: crail.version 3101
19/06/12 08:49:10 INFO crail: crail.directorydepth 16
19/06/12 08:49:10 INFO crail: crail.tokenexpiration 10
19/06/12 08:49:10 INFO crail: crail.blocksize 1048576
19/06/12 08:49:10 INFO crail: crail.cachelimit 0
19/06/12 08:49:10 INFO crail: crail.cachepath /dev/hugepages/cache
19/06/12 08:49:10 INFO crail: crail.user crail
19/06/12 08:49:10 INFO crail: crail.shadowreplication 1
19/06/12 08:49:10 INFO crail: crail.debug true
19/06/12 08:49:10 INFO crail: crail.statistics true
19/06/12 08:49:10 INFO crail: crail.rpctimeout 1000
19/06/12 08:49:10 INFO crail: crail.datatimeout 1000
19/06/12 08:49:10 INFO crail: crail.buffersize 1048576
19/06/12 08:49:10 INFO crail: crail.slicesize 524288
19/06/12 08:49:10 INFO crail: crail.singleton true
19/06/12 08:49:10 INFO crail: crail.regionsize 1073741824
19/06/12 08:49:10 INFO crail: crail.directoryrecord 512
19/06/12 08:49:10 INFO crail: crail.directoryrandomize true
19/06/12 08:49:10 INFO crail: crail.cacheimpl org.apache.crail.memory.MappedBufferCache
19/06/12 08:49:10 INFO crail: crail.locationmap
19/06/12 08:49:10 INFO crail: crail.namenode.address crail://192.168.1.164:9060
19/06/12 08:49:10 INFO crail: crail.namenode.blockselection roundrobin
19/06/12 08:49:10 INFO crail: crail.namenode.fileblocks 16
19/06/12 08:49:10 INFO crail: crail.namenode.rpctype org.apache.crail.namenode.rpc.tcp.TcpNameNode
19/06/12 08:49:10 INFO crail: crail.namenode.log
19/06/12 08:49:10 INFO crail: crail.storage.types org.apache.crail.storage.nvmf.NvmfStorageTier
19/06/12 08:49:10 INFO crail: crail.storage.classes 2
19/06/12 08:49:10 INFO crail: crail.storage.rootclass 0
19/06/12 08:49:10 INFO crail: crail.storage.keepalive 2
19/06/12 08:49:10 INFO crail: buffer cache, allocationCount 0, bufferCount 1024
19/06/12 08:49:10 INFO crail: Initialize Nvmf storage client
19/06/12 08:49:10 INFO crail: crail.storage.nvmf.ip 192.168.3.100
19/06/12 08:49:10 INFO crail: crail.storage.nvmf.port 4420
19/06/12 08:49:10 INFO crail: crail.storage.nvmf.nqn nqn.2018-12.com.StorEdgeSystems:cntlr13
19/06/12 08:49:10 INFO crail: crail.storage.nvmf.hostnqn nqn.2014-08.org.nvmexpress:uuid:1b4e28ba-2fa1-11d2-883f-0016d3cca420
19/06/12 08:49:10 INFO crail: crail.storage.nvmf.allocationsize 1073741824
19/06/12 08:49:10 INFO crail: crail.storage.nvmf.queueSize 64
19/06/12 08:49:10 INFO narpc: new NaRPC server group v1.0, queueDepth 32, messageSize 512, nodealy true
19/06/12 08:49:10 INFO crail: crail.namenode.tcp.queueDepth 32
19/06/12 08:49:10 INFO crail: crail.namenode.tcp.messageSize 512
19/06/12 08:49:10 INFO crail: crail.namenode.tcp.cores 1
19/06/12 08:49:10 INFO crail: connected to namenode(s) /192.168.1.164:9060
19/06/12 08:49:10 INFO crail: CrailHadoopFileSystem fs initialization done..
19/06/12 08:49:10 INFO crail: lookupDirectory: path /
19/06/12 08:49:10 INFO crail: lookup: name /, success, fd 0
19/06/12 08:49:10 INFO crail: lookupDirectory: path /
19/06/12 08:49:10 INFO crail: lookup: name /, success, fd 0
19/06/12 08:49:10 INFO crail: getDirectoryList: /
19/06/12 08:49:10 INFO crail: CoreInputStream: open, path  /, fd 0, streamId 1, isDir true, readHint 0
19/06/12 08:49:10 INFO crail: Connecting to NVMf target at Transport address = /192.168.3.100:4420, subsystem NQN = nqn.2018-12.com.StorEdgeSystems:cntlr13
19/06/12 08:49:10 INFO disni: creating  RdmaProvider of type 'nat'
19/06/12 08:49:10 INFO disni: jverbs jni version 32
19/06/12 08:49:10 INFO disni: sock_addr_in size mismatch, jverbs size 28, native size 16
19/06/12 08:49:10 INFO disni: IbvRecvWR size match, jverbs size 32, native size 32
19/06/12 08:49:10 INFO disni: IbvSendWR size mismatch, jverbs size 72, native size 128
19/06/12 08:49:10 INFO disni: IbvWC size match, jverbs size 48, native size 48
19/06/12 08:49:10 INFO disni: IbvSge size match, jverbs size 16, native size 16
19/06/12 08:49:10 INFO disni: Remote addr offset match, jverbs size 40, native size 40
19/06/12 08:49:10 INFO disni: Rkey offset match, jverbs size 48, native size 48
19/06/12 08:49:10 INFO disni: createEventChannel, objId 140229751834160
19/06/12 08:49:10 INFO disni: launching cm processor, cmChannel 0
19/06/12 08:49:10 INFO disni: createId, id 140229751892832
19/06/12 08:49:10 INFO disni: new client endpoint, id 0, idPriv 0
19/06/12 08:49:10 INFO disni: resolveAddr, addres /192.168.3.100:4420
19/06/12 08:49:10 INFO disni: got event type + UNKNOWN, srcAddress /0.0.0.0:0, dstAddress /192.168.3.100:4420
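The failure signature in the last log line can be checked mechanically: a source address of 0.0.0.0:0 in the CM event means address resolution never bound to a local RDMA device. As an illustration (this class is not part of DiSNI), a small parser for that log line:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CmEventCheck {
    // Matches the disni log line, e.g.
    // "got event type + UNKNOWN, srcAddress /0.0.0.0:0, dstAddress /192.168.3.100:4420"
    private static final Pattern EVENT = Pattern.compile(
        "got event type \\+ (\\S+), srcAddress /([^,]+), dstAddress /(\\S+)");

    public static boolean srcUnresolved(String logLine) {
        Matcher m = EVENT.matcher(logLine);
        if (!m.find()) {
            throw new IllegalArgumentException("not a CM event line");
        }
        // An all-zero source address means no local RDMA device was resolved.
        return m.group(2).startsWith("0.0.0.0");
    }
}
```

Running this over the line above returns true, i.e. the local side of the connection was never resolved, which points at the bridged network rather than the target.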

Hi

Thanks for trying Crail. To narrow down the problem, let me ask some questions.

Is this an Infiniband or a RoCE network?

If I understand correctly, your Spark runs in a container with a bridged network and you see all RDMA devices.

Do you also run Crail in containers or do you run Crail natively on
physical hosts?

If you run Crail directly on physical hosts, does crail fs -ls / work, when
you execute it also on a physical host?

If you run an ib_send_bw test from the Spark container to the host where Crail runs, does it work? For example:

On the same node where the Crail namenode (the "server") runs, do the following:
ib_send_bw -R
and in the Spark container, run:
ib_send_bw -R <IP of the "server" above>
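Before the RDMA tests, it can also help to rule out plain IP reachability from the container. A hypothetical helper (not part of Crail or DiSNI) using an ordinary TCP connect:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class Reachable {
    // Plain TCP connect with a timeout. This only proves IP routing works
    // from the container; it says nothing about whether RDMA CM can
    // resolve a local rNIC for the same route.
    public static boolean tcpReachable(String host, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

If this succeeds against the namenode's IP and port but ib_send_bw -R fails, the problem is specific to RDMA CM in the bridged namespace.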

Please let me know the outcome.

Thanks
Adrian

ibv_rc_pingpong does not use RDMA connection management (which DiSNI is using) but manually changes the QP state and uses GID/LID to connect. Can you please try to run rping in your container?

Regards,
Jonas

Check out this article on the MLX homepage: https://community.mellanox.com/s/article/howto-create-docker-container-enabled-with-roce
If you are using RoCE with CM you have to use host network:

Due to RDMA-CM limitations, the container must use the host network name space

We know the pain with Spark and hostnames. There is a way to configure Spark with Yarn to use IPs, but it requires a Yarn configuration with a hard-coded IP on every host (very cumbersome). I'm not an expert at running Spark in containers, but it seems Spark has native support for containers; maybe Adrian can provide some more insight. Regarding the rNICs, we mostly run bare metal. When I ran in containers in the past, I used the host network. This might also work: https://community.mellanox.com/s/article/docker-roce-macvlan-networking-with-connectx4-connectx5

I didn't know about the SRIOV plugin, but again, before trying to run Crail please give rping a try. It is part of the RDMA CM examples.

Regards,
Jonas

Hi David

Going through the code a bit now, there isn't any logging or timeout in this part of the code (where the binding is attempted). Shouldn't there at least be a timeout? Java is a bit foreign to me, however, so don't take offence. :blush:

I did not write this part of the code, so I'm also not 100% sure whether timeouts are missing or not. However, most of the RDMA CM functions do have a timeout argument, so I assume timeouts are handled in the C code.

At line 198, what's being returned is a -1 (null), which causes idPriv to be null, and then the while loop at line 66 of RdmaCmProcessor.java never stops, thus looking like a hang. I haven't yet determined why the null is returned, but I'm hoping it's not related to the host network. I think this should be part of the memory allocated at line 188?
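The unbounded spin described above could be avoided with a deadline. A minimal sketch of the idea (hypothetical names; this is not the actual RdmaCmProcessor code):

```java
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class BoundedWait {
    // Poll until the supplier yields a non-null value or the deadline
    // passes, instead of spinning forever when the native lookup keeps
    // returning null.
    public static <T> T waitFor(Supplier<T> poll, long timeoutMs)
            throws TimeoutException, InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            T value = poll.get();
            if (value != null) {
                return value;
            }
            if (System.currentTimeMillis() >= deadline) {
                throw new TimeoutException("no CM id after " + timeoutMs + " ms");
            }
            Thread.sleep(10);
        }
    }
}
```

With a bound like this, the failure would surface as a TimeoutException instead of an apparent hang.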

Looks like a bug to me.

David, I can only repeat myself: why not try to run some application that uses CM, like rping, before trying to run DiSNI? This way you can rule out the possibility that bugs in DiSNI are causing the hang.

Regards,
Jonas

Can we debug the simpler case of RdmaReadServer/RdmaReadClient?

You mentioned you ran the read example. What error do you get when running it?

server:
java -cp disni-1.7-jar-with-dependencies.jar:disni-1.7-tests.jar com.ibm.disni.examples.ReadServer -a 10.100.0.1

client:
java -cp disni-1.7-jar-with-dependencies.jar:disni-1.7-tests.jar com.ibm.disni.examples.ReadClient -a 10.100.0.1 -p 1919

Also, please try to run rping as Jonas suggested and let us know the outcome.

Thanks

Ok. So we've narrowed it down to a raw RDMA/container problem. @asqasq Do you have further suggestions? I thought you had basic RDMA with CM working in containers, do you?

Hi David

It's good to hear that; it is basically the same as what I am seeing. This is why I suggested running 'ib_send_bw -R' with CM initially.

At least when I tried it, Mellanox stated that it won't work with RoCE networks without the host network flag (as I said earlier). We would also like to run without the host network, but apparently this is a known limitation (maybe a known bug).

At least we know now that it is not a DiSNI problem.

So far I don't have a solution to that. Let's see what Mellanox replies.

Regards
Adrian

David,

Just to clarify, the difference between ibv_rc_pingpong and rping is that the former connects manually via GID/LID (using ibv_modify_qp), whereas the latter uses librdmacm to connect. Essentially, the RDMA CM core tries to find the appropriate RDMA device by retrieving the MAC address, which obviously will not work in a bridged container. But I know namespace support was added to RDMA CM a while ago. According to this slide deck from last year, macvlan should work with CM (slide 13): http://qnib.org/data/isc2018/roce-containers.pdf

Jonas
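To make the MAC-based matching concrete: you can list the container's interfaces and their MAC addresses from Java and compare them against the rNIC ports shown by ibv_devinfo. This is an illustrative diagnostic, not part of DiSNI:

```java
import java.net.NetworkInterface;
import java.util.Collections;

public class ListMacs {
    // Print each interface name with its MAC address. On a bridged
    // container network, the veth interface's MAC will normally not match
    // any RDMA device, which is why RDMA CM fails to pick a local device.
    public static void main(String[] args) throws Exception {
        for (NetworkInterface nif
                : Collections.list(NetworkInterface.getNetworkInterfaces())) {
            byte[] mac = nif.getHardwareAddress();
            StringBuilder sb = new StringBuilder();
            if (mac != null) {
                for (byte b : mac) {
                    sb.append(String.format("%02x:", b));
                }
                sb.setLength(sb.length() - 1); // drop trailing colon
            }
            System.out.println(nif.getName() + " " + sb);
        }
    }
}
```

With macvlan, the container interface inherits a MAC on the physical segment, which is why CM can then find the device.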

Great. Keep us updated if it works.

Jonas

Good to hear! Looks like it could connect to the namenode but not to the datanode (File names are stored in directory files on datanodes). Can you make sure that the datanode is accessible from within the container?

Jonas

Nice! I just noticed that you use the NVMf storage tier. Be aware that blocksize and slicesize have to be multiples of the sector size of your SSD. Also, the directory record entry size has to equal the sector size, otherwise we cannot guarantee atomicity (the default is 512, IIRC).
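The constraints above can be expressed as a small check. A sketch assuming a 512-byte sector size (the actual value is device dependent; check your SSD):

```java
public class NvmfAlignment {
    // Sector size is device dependent; 512 bytes is assumed here for
    // illustration only.
    static final long SECTOR = 512;

    static boolean aligned(long blocksize, long slicesize, long dirRecord) {
        // blocksize and slicesize must be multiples of the sector size;
        // the directory record must equal the sector size so that a
        // record update maps to a single atomic sector write.
        return blocksize % SECTOR == 0
            && slicesize % SECTOR == 0
            && dirRecord == SECTOR;
    }
}
```

The values from the log above (crail.blocksize 1048576, crail.slicesize 524288, crail.directoryrecord 512) pass this check.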

Let us know if you have any more questions.

Jonas

The shuffle plugin is independent of broadcast. A few applications use broadcast extensively, such as SQL; otherwise you probably will not see a big difference.

Jonas

No problem. I will close this issue, feel free to open a new one or open a JIRA ticket here: https://issues.apache.org/jira/projects/CRAIL/issues if the problem is Crail related.

Regards,
Jonas

Hi David,

I do remember the HDFS adaptor on Crail not being closed properly in Spark runs using Crail as input or output. We should have looked into this a long time ago, I guess. I think the problem is that Spark can deal with multiple file system objects and keeps them cached, and somehow we appear to not catch the close trigger properly.

Would you mind re-posting this on the Crail mailing list, which is the right place to discuss these things (see http://crail.incubator.apache.org/community for the mailing list; the dev list is the right one)? Maybe someone can help there.

Cheers,
Patrick