
Infiniband interface doesn't route IPoIB traffic

1 vote
1 answer
1606 views
I have blocks of hosts that I'm provisioning with Puppet in exactly the same way. They have identical hardware (same blade chassis) and are definitely connected in the same way, yet the interfaces on some of them don't behave the same as on others. These are all Infiniband interfaces, so I'm able to test them with commands like ibping and ibsysstat, which show that they have working UVERBS/RDMA connections. For example:

    master# ibsysstat 29
    sysstat ping succeeded

where the node with that LID, which isn't working quite right, has:

    node10# ibstat
    CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 1
        Firmware version: 2.11.1250
        Hardware version: 1
        Node GUID: 0x...
        System image GUID: 0x...
        Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 40
            Base lid: 29
            LMC: 0
            SM lid: 26
            Capability mask: 0x02594868
            Port GUID: 0x...
            Link layer: InfiniBand

But when I do a simple ping to the IPoIB IP address, it just sits there without connecting. Other commands like ibping are definitely passing traffic, and data shows up in the debug output when I add -d. When I watch the interface with tcpdump I can see the pings go out, but nothing comes in. Meanwhile, right next to it is a host with the same everything that works just fine.

The routing tables all look right to me too, and match those on hosts that work. On a host that doesn't work:

    default via 10.10.0.1 dev em1 proto dhcp metric 100
    10.10.0.0/24 dev em1 proto kernel scope link src 10.10.0.110 metric 100
    10.11.0.0/24 dev ib0 proto kernel scope link src 10.11.0.110
    169.254.0.0/16 dev ib0 scope link metric 1005

and on one that does:

    default via 10.10.0.1 dev em1 proto dhcp metric 100
    10.10.0.0/24 dev em1 proto kernel scope link src 10.10.0.108 metric 100
    10.11.0.0/24 dev ib0 proto kernel scope link src 10.11.0.108
    169.254.0.0/16 dev ib0 scope link metric 1004

The only difference, apart from each host's own address, is the metric on the last route, but that shouldn't matter.

Also of note, these hosts worked before they were reprovisioned, so I'm almost positive it's not hardware.
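The routing-table comparison above can be checked mechanically rather than by eye. The following is only a minimal sketch: the two tables are copied from the output quoted in this post, and the helper names `normalize` and `diff_routes` are hypothetical, invented for this illustration. It masks each host's own `src` address and then reports the routes that differ, with and without the metric taken into account.

```python
import re

# Routing tables copied from the post ("ip route" output on each host).
BROKEN = """\
default via 10.10.0.1 dev em1 proto dhcp metric 100
10.10.0.0/24 dev em1 proto kernel scope link src 10.10.0.110 metric 100
10.11.0.0/24 dev ib0 proto kernel scope link src 10.11.0.110
169.254.0.0/16 dev ib0 scope link metric 1005
"""

WORKING = """\
default via 10.10.0.1 dev em1 proto dhcp metric 100
10.10.0.0/24 dev em1 proto kernel scope link src 10.10.0.108 metric 100
10.11.0.0/24 dev ib0 proto kernel scope link src 10.11.0.108
169.254.0.0/16 dev ib0 scope link metric 1004
"""


def normalize(table, drop_metric=False):
    """Return the routes as a set, with host-specific fields masked."""
    routes = set()
    for line in table.strip().splitlines():
        # Every host has its own source address, so mask it out.
        line = re.sub(r"\bsrc \S+", "src <host>", line)
        if drop_metric:
            # Optionally ignore the route metric as well.
            line = re.sub(r"\s*\bmetric \d+", "", line)
        routes.add(line.strip())
    return routes


def diff_routes(a, b, drop_metric=False):
    """Routes that appear in exactly one of the two tables."""
    return normalize(a, drop_metric) ^ normalize(b, drop_metric)


if __name__ == "__main__":
    for route in sorted(diff_routes(BROKEN, WORKING)):
        print("differs:", route)
    remaining = diff_routes(BROKEN, WORKING, drop_metric=True)
    print("differences once metrics are ignored:", len(remaining))
```

With the tables quoted here, the only routes reported as differing are the two 169.254.0.0/16 link-local entries, and nothing remains once metrics are ignored, which is consistent with the routing tables not being the cause.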
I'm at a bit of a loss now, and any ideas would be appreciated.

Edit: Update with dmesg error

I found something in the dmesg output for the interface in question that appears only on the hosts that don't work:

    ib0: failed to modify QP to RTR: -22

Unfortunately this isn't very helpful, and searches turn up little that's related. Perhaps also worth noting: the hosts in question can ping the switch IP address, and the switch can ping the hosts on their associated IPs.
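On the -22 in the dmesg message: Linux kernel code conventionally reports failures as negative errno values, so the number can at least be decoded into a name. The sketch below uses only Python's standard `errno` and `os` modules; the helper name `describe_kernel_error` is hypothetical. It assumes the value in the log line is an ordinary negated errno, and it says nothing about which attribute of the QP transition was rejected.

```python
import errno
import os


def describe_kernel_error(code):
    """Map a negative kernel return code to (errno name, description)."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


if __name__ == "__main__":
    # -22 is the value from "ib0: failed to modify QP to RTR: -22".
    name, text = describe_kernel_error(-22)
    print(f"-22 -> {name}: {text}")
```

On Linux, errno 22 is EINVAL, so the message indicates that some argument to the queue pair state change was considered invalid, which points toward configuration rather than a dead link.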
Asked by geoffjay (123 rep)
Aug 16, 2018, 08:41 PM
Last activity: Aug 27, 2018, 03:57 PM