DAIET RX Queue Map Error
Hi, I was able to compile PyTorch with Gloo and DAIET and am now trying to run the training, but I get the following error:
Configuration file daiet-server.cfg
** DAIET parameters **
Num updates: 256
Max num pending messages: 64
Worker port: 4000
PS port: 3030
Worker IP: 14.207.254.149
PS0: XX:XX:XX:XX:XX:XX 14.207.254.165
Num workers: 1
Number of threads: 4
CPU freq: 1.8 GHz
RX queue stats mapping error Unknown error -95
TX queue stats mapping error Unknown error -95
RX queue stats mapping error Unknown error -95
TX queue stats mapping error Unknown error -95
RX queue stats mapping error Unknown error -95
TX queue stats mapping error Unknown error -95
RX queue stats mapping error Unknown error -95
TX queue stats mapping error Unknown error -95
Link Up. Speed 200000 Mbps - Full-duplex
Cannot init mbuf pool: Cannot allocate memory
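For readers hitting the same log: DPDK-style APIs report failures as negative errno values, and on Linux errno 95 is EOPNOTSUPP ("Operation not supported"). The "Unknown error -95" text most likely appears because the negative value was passed directly to a strerror-style formatter. Given that increasing hugepages later resolved this issue, the fatal line appears to be the mbuf pool allocation failure rather than the stats mapping messages. The sketch below decodes such lines; the helper name `decode_dpdk_error` is hypothetical and not part of DAIET or DPDK.

```python
import os
import re

# Matches a trailing "error -95" style code at the end of a log line.
_ERR_RE = re.compile(r"error (-?\d+)\s*$")


def decode_dpdk_error(line):
    """Return (errno_number, message) for a line ending in 'error -N', else None.

    Hypothetical helper for illustration; not part of DAIET or DPDK.
    """
    match = _ERR_RE.search(line)
    if match is None:
        return None
    code = abs(int(match.group(1)))  # the log shows -errno, so take the magnitude
    return code, os.strerror(code)   # message text depends on the local C library


if __name__ == "__main__":
    print(decode_dpdk_error("RX queue stats mapping error Unknown error -95"))
```

The printed message comes from the local C library, so its exact wording can vary between systems.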
My config file, daiet-server.cfg, for reference:
[daiet]
num_workers = 1
# Block size
num_updates = 256
# Slots per core
max_num_pending_messages = 64
worker_port = 4000
ps_port = 3030
worker_ip = 14.207.254.149
ps_ips = 14.207.254.165
ps_macs = XX:XX:XX:XX:XX:XX
# Deprecated config, no need to change
sync_blocks = 100000000
[dpdk]
# Cores to bind to
cores = 10-13
prefix = daiet
# Extra EAL options
extra_eal_options = -w 0000:8e:00.0
# Port id
port_id = 0
# Pool and pool cache sizes
pool_size = 131072
pool_cache_size = 512
# Number of packets in a burst
burst_rx = 64
burst_tx = 32
# Bulk drain timer (microseconds)
bulk_drain_tx_us = 10
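As a rough way to relate `pool_size` to the "Cannot init mbuf pool: Cannot allocate memory" error, the sketch below reads `pool_size` from a trimmed copy of the `[dpdk]` section and estimates the memory one mbuf pool needs. The per-mbuf sizes are assumptions based on DPDK defaults (a 128-byte `struct rte_mbuf` and `RTE_MBUF_DEFAULT_BUF_SIZE` of 2048 bytes of data room plus 128 bytes of headroom); whether DAIET uses these defaults, and how many pools it creates, is not confirmed in this thread. The function names are hypothetical.

```python
import configparser

# Assumed per-mbuf sizes (DPDK defaults, NOT read from DAIET's source):
MBUF_STRUCT_BYTES = 128        # sizeof(struct rte_mbuf): two 64-byte cache lines
MBUF_DEFAULT_BUF_BYTES = 2176  # RTE_MBUF_DEFAULT_BUF_SIZE: 2048 data + 128 headroom

# Trimmed copy of the [dpdk] section from the config above.
CONFIG_TEXT = """
[dpdk]
cores = 10-13
pool_size = 131072
pool_cache_size = 512
"""


def pool_size_from_config(text):
    """Read pool_size from the [dpdk] section of an INI-style config string."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser.getint("dpdk", "pool_size")


def estimate_pool_mib(pool_size,
                      element_bytes=MBUF_STRUCT_BYTES + MBUF_DEFAULT_BUF_BYTES):
    """Lower-bound estimate of one mbuf pool's footprint in MiB.

    Ignores mempool object headers, alignment padding and per-core caches.
    """
    return pool_size * element_bytes / (1024 * 1024)


if __name__ == "__main__":
    size = pool_size_from_config(CONFIG_TEXT)
    print(f"{size} mbufs -> at least {estimate_pool_mib(size):.0f} MiB per pool")
```

Under these assumptions a single pool of 131072 mbufs needs roughly 288 MiB of hugepage memory before mempool overhead. If more than one pool is created, the total scales accordingly.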
Yes, I did follow that guide. Since it's a Mellanox interface, I did not bind it to the igb_uio driver. The DPDK example applications also seem to run perfectly on their own.
Did you set up hugepages? How much memory did you allocate to hugepages? Did you try increasing it?
Yes, I did, with:
echo 1024 > /sys/devices/system/node/node0/hugepages/hugepages-64kB/nr_hugepages
echo 1024 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
mkdir -p /mnt/huge; mount -t hugetlbfs pagesize=1GB /mnt/huge
I haven't tried increasing the size, as this seemed like enough. I will try now to rule it out.
Increasing to 4GB fixed it!!! Thanks!!!
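As a quick arithmetic check on the numbers in this exchange, the sketch below computes how much memory the earlier `nr_hugepages` settings reserved on NUMA node 0 and how many 2048 kB pages a 4 GiB reservation would take. That the 4 GB was reserved as 2048 kB pages is an assumption, since the thread does not say which page size was increased. The function names are hypothetical.

```python
def reserved_mib(num_pages, page_kb):
    """Memory reserved by num_pages hugepages of page_kb kilobytes, in MiB."""
    return num_pages * page_kb / 1024


def pages_needed(target_mib, page_kb):
    """Smallest number of page_kb-sized hugepages covering target_mib (rounds up)."""
    target_kb = target_mib * 1024
    return (target_kb + page_kb - 1) // page_kb


if __name__ == "__main__":
    # Original setup: 1024 pages each of 64 kB and 2048 kB on node0.
    small = reserved_mib(1024, 64)
    large = reserved_mib(1024, 2048)
    print(f"64 kB pages:   {small:.0f} MiB")
    print(f"2048 kB pages: {large:.0f} MiB")
    print(f"total:         {small + large:.0f} MiB")

    # Assumed fix: 4 GiB reserved as 2048 kB pages.
    print(f"pages for 4 GiB at 2048 kB: {pages_needed(4096, 2048)}")
```

The original commands reserved 64 MiB plus 2048 MiB on node0, and a 4 GiB reservation corresponds to 2048 pages of 2048 kB, for example by writing that count to the same `nr_hugepages` file.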
A small general question, though: is there any component in DAIET that ensures reliable delivery, and how does it know when to retransmit?
You can find that in Appendix A of our paper.