Problem:
Failover outages and pitfalls on IAP VPN deployments
Problem Statement & Scenario
When the primary broadband link fails, the SD-Branch tunnels go down and come back up on the backup link, and routing mostly recovers quickly. However, the IAP-VPN tunnel stays up on the primary VPNC (alongside the new tunnel on the backup VPNC) for 5 to 6 minutes. While it is still up on the primary, that VPNC continues to advertise a route to the L3 networks. These routes do not work, and we see an outage of 5 to 6 minutes.
Diagnostics:
•The IAP-VPN tunnel stays up on the primary VPNC (alongside the new tunnel on the backup) for 5 to 6 minutes. While it is still up on the primary, the VPNC continues to advertise a route to the L3 networks. These routes may not work, so an outage of 5 to 6 minutes is expected.
•With the default settings, the IAP needs 30 x 10 = 300 seconds to detect that the primary uplink is down and initiate the switch to the backup uplink. The primary uplink tunnel therefore shows as up on the VPNC for about 5 minutes after the primary uplink went down. Because the IAP is the tunnel initiator, the VPNC does not monitor the tunnel status and continues to send route advertisements during this period.
•failover-internet-pkt-lost-cnt 10 → The number of ICMP packets that may be lost before the AP decides it must switch to a different uplink connection
•failover-internet-pkt-send-freq 30 → ICMP packets are sent once every 30 seconds
•The IAP then forms an IPsec tunnel to the VPNC again and registers the branch, after which the OSPF/BGP route advertisements point to the new tunnel. This explains the outage of 5 to 6 minutes.
Solution
Tips & Tricks for IAP failover best practices config tweaks
Uplink & VLAN Default config UI
Root cause, using a VRRP deployment as the example
–On the gateway, the default timeout for a down IAP IPsec tunnel is 5 minutes. After this timeout, the tunnel is removed from the IPsec crypto table, auth notifies the IAP manager, and the IAP manager removes the datapath routes.
–VPNC1 is disconnected from the WAN link (but still connected to the core switch) and VRRP fails over to VPNC2.
–The IAP sets up a tunnel to VPNC2 about 20-30 seconds after the VRRP failover and registers with it, but VPNC1 may take 5 minutes to remove the routes and stop advertising them upstream.
–VPNC2 should be advertising routes for the IAP during this time, but the upstream routers may not update their route tables until VPNC1 stops advertising the routes.
–The relevant setting is “user-idle-timeout” in the “default-iap” VPN authentication profile.
*[mynode] (config) #aaa authentication vpn default-iap
*[mynode] (VPN Authentication Profile "default-iap") #
user-idle-timeout User idle timeout value. Valid range is 30-15300 seconds in multiples of 30 seconds
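To put rough numbers on the sequence above, here is a minimal Python sketch. It assumes a simplified model in which the stale-route window is the headend timeout minus the time the IAP needs to register with VPNC2. The function name and the model are illustrative only and are not part of any Aruba tool.

```python
def stale_route_window(idle_timeout_s: int, reregister_s: int) -> int:
    """Seconds during which VPNC1 still advertises routes after the IAP
    has already registered with VPNC2 (simplified, illustrative model)."""
    return max(0, idle_timeout_s - reregister_s)


# 5-minute default timeout, IAP registers with VPNC2 after about 30 seconds
print(stale_route_window(300, 30))  # 270

# Timeout lowered to 30 seconds, same registration time
print(stale_route_window(30, 30))   # 0
```

Under these assumptions, lowering the timeout shrinks the period in which both VPNCs advertise the same branch routes from several minutes to roughly nothing.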
Recommended Config Tweaks
IAP side
Default
–failover-internet-pkt-lost-cnt=10
–failover-internet-pkt-send-freq=30
–failover-internet-check-timeout=10 → This is the ICMP packet timeout in seconds
–Make sure preemption is enabled on failover.
Optimized settings
–failover-internet-pkt-lost-cnt=6
–failover-internet-pkt-send-freq=5
–failover-internet-check-timeout=4 → This is the ICMP packet timeout (default is 10 seconds)
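The effect of these values on uplink failure detection can be estimated with a short Python sketch. It assumes detection time is approximately the probe interval multiplied by the allowed loss count, as in the 30 x 10 = 300 calculation, and it ignores the per-probe check timeout. The function and dictionary names are hypothetical.

```python
def detection_seconds(send_freq_s: int, lost_cnt: int) -> int:
    """Approximate time for the IAP to declare the uplink down:
    one probe every send_freq_s seconds, lost_cnt probes lost."""
    return send_freq_s * lost_cnt


# Values taken from the default and optimized lists above
profiles = {
    "default": {"send_freq_s": 30, "lost_cnt": 10},
    "optimized": {"send_freq_s": 5, "lost_cnt": 6},
}

for name, params in profiles.items():
    print(f"{name}: ~{detection_seconds(**params)} s")
```

With these inputs the estimate drops from about 300 seconds to about 30 seconds. Real detection time may differ slightly depending on when the first probe is lost relative to the link failure.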
Controller side
*[mynode] (config) #aaa authentication vpn default-iap
*[mynode] (VPN Authentication Profile "default-iap") #
user-idle-timeout User idle timeout value. Valid range is 30-15300 seconds in multiples of 30 seconds
Set user-idle-timeout to 30 seconds.
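Because the CLI help states that only values from 30 to 15300 seconds in multiples of 30 are accepted, a candidate value can be sanity-checked before it goes into a config template. This is a minimal sketch and the helper name is hypothetical.

```python
def is_valid_user_idle_timeout(seconds: int) -> bool:
    """Check a value against the range given in the CLI help text:
    30-15300 seconds, in multiples of 30."""
    return 30 <= seconds <= 15300 and seconds % 30 == 0


print(is_valid_user_idle_timeout(30))   # True
print(is_valid_user_idle_timeout(45))   # False
```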
From 10.x onward, failover is handled better: it is hitless once a cluster is built and configured on the headend VPNC side.