It's not best practice to run the keepalive over the ISL at all; I would avoid that in every design.
In newer releases I tend to use the mgmt interface for the keepalive, preserving switch ports.
The ACSX study guide states:
In a routed design with the keepalive running northbound, you should implement a full-mesh topology with cross-connections: each aggregation switch must have a physical connection to both cores.
Suppose you use a non-mesh topology where an aggregation switch has only a single connection to its core. If the link between agg and core fails, routing convergence must occur, impacting traffic; tweaking timers will help. I guess that's what you have done, and you've found a feasible solution.
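On AOS-CX, timer tweaking usually means tightening BFD. Something like the sketch below — values are illustrative only, and you should verify the exact keywords in the CLI reference for your platform and release:

! global BFD timers - faster failure detection at the cost of
! more control-plane traffic (illustrative values, not a recommendation)
bfd min-transmit-interval 300
bfd min-receive-interval 300
bfd detect-multiplier 3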
Thanks for posting!
------------------------------
Ole Morten Kårbø
ACEA ACSP
Netnordic Norway
------------------------------
Original Message:
Sent: Jun 15, 2023 06:58 PM
From: Campbell
Subject: VSX routed keep alive design learnings
I've recently had to lab up a new VSX design with routed VSX keepalives. This was on the 8360 running the 10.10 OS. The documentation for this setup is very sparse: only Appendix D in the 2020 1.3 version of the 'VSX Configuration Best Practices for Aruba CX' guide mentions it, and it misses a few key considerations. It took me a few days to figure out how to set it up correctly so that it didn't suffer extended user traffic outages under certain failure conditions.
My design situation was that I needed my VSX switches split across two different comms rooms. The 8360s are my site L3 core/agg switches, and each comms room has a WAN router. The WAN routers are independent of each other, and each 8360 connects only to its local WAN router, not to both. So it's a typical campus-type deployment scenario, but without the core switch pair that Appendix D talks about.
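Roughly, the topology looks like this (names are placeholders):

Comms Room 1                         Comms Room 2
+-----------+                        +-----------+
| WAN RTR 1 |                        | WAN RTR 2 |
+-----+-----+                        +-----+-----+
      | eBGP                               | eBGP
+-----+-----+      ISL + iBGP        +-----+-----+
| 8360 SW1  |========================| 8360 SW2  |
+-----------+   (routed keepalive)   +-----------+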
Rather than OSPF I needed to use BGP: eBGP to the WAN routers and iBGP between the VSX switches over the ISL link, with BFD configured for all peerings. I have multiple VRFs. Because the comms rooms share a common duct (so one duct failure could cut every direct link at once), I needed a routed VSX keepalive design using the switch loopbacks, so that under normal conditions keepalive packets route over the ISL but, if the ISL fails, they re-route via the WAN. That re-routing needs to happen within 3 seconds to avoid split brain.
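In AOS-CX terms, the peering side of that looks roughly like this on SW1. This is a single-VRF sketch with placeholder addresses and AS numbers (and I believe the BFD knob is 'fall-over bfd' per neighbor — check your release):

! iBGP to SW2 over the P2P VLAN, eBGP to the local WAN router,
! BFD on every peering
router bgp 65001
    neighbor 10.255.2.2 remote-as 65001
    neighbor 10.255.2.2 fall-over bfd
    neighbor 192.0.2.1 remote-as 65000
    neighbor 192.0.2.1 fall-over bfd
    address-family ipv4 unicast
        neighbor 10.255.2.2 activate
        neighbor 192.0.2.1 activate
        ! advertise the keepalive loopback so it stays reachable via the WAN
        network 10.255.0.1/32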
I found that failing the ISL links or powering off one of the VSX switches resulted in split brain and a user traffic outage of up to 7 seconds. I eventually tracked this down to the keepalive traffic not re-routing quickly enough, and to two mistakes in my design.

First, I changed the default 'bfd detect-multiplier' to 2, which brought the failover time down to around 3 seconds. But the root cause was that the point-to-point VLAN interface between the two switches that iBGP operates over was not going down when the ISL failed. That was because another LAG interface on the switches allowed all VLANs; that LAG stayed up, so my P2P VLAN 2 stayed up, and BGP had to time out before withdrawing the route to the remote keepalive loopback.
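The fix was to make sure the P2P VLAN is carried only on the ISL LAG, so the SVI goes down with the ISL. In sketch form (the LAG number is a placeholder):

! prune VLAN 2 from the non-ISL LAG so that losing the ISL
! takes interface vlan 2 down and iBGP drops immediately
interface lag 10
    no vlan trunk allowed 2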
With the configurations below I was able to get sub-second failover for loss of the ISL links or power-down of SW1. An upstream WAN router failure took approximately 1-2 seconds to recover and re-route.
I'm trying to get my design validated by Aruba TAC; however, that doesn't seem to be going very well, as my TAC engineer is a little out of his depth. So I thought I'd post it here to get feedback.
VSX Switch #1
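Stripped down to the pieces discussed above, with the same placeholder addressing (the BGP side is as sketched earlier; the real config has more to it):

! loopback used as keepalive source/destination
interface loopback 0
    ip address 10.255.0.1/32
! P2P SVI for iBGP - allowed only on the ISL LAG
vlan 2
interface vlan 2
    ip address 10.255.2.1/30
! lower BFD multiplier for faster failure detection
bfd detect-multiplier 2
! VSX with the keepalive routed between loopbacks
vsx
    inter-switch-link lag 256
    role primary
    keepalive peer 10.255.0.2 source 10.255.0.1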
VSX Switch #2
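The mirror image of SW1 — addresses swapped, secondary role:

interface loopback 0
    ip address 10.255.0.2/32
vlan 2
interface vlan 2
    ip address 10.255.2.2/30
bfd detect-multiplier 2
vsx
    inter-switch-link lag 256
    role secondary
    keepalive peer 10.255.0.1 source 10.255.0.2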