Wireless Access

Frequent Contributor I

LMS Failover Process Debug Log Inquiry?

Good Afternoon,


I currently have a TAC case open, but I'm curious about my interpretation of the AP debug system-status output I've included below.

 

We currently have a Master-Local (7240s, 6.4.2.10) setup with all AP-225s terminated on the Local Controller (192.168.1.4) and the Master Controller (192.168.1.3) set as the Backup LMS. Each controller has dual 10G uplinks to our wireless VSS pair (VSS-1 and VSS-2) via separate port-channels. A few weeks ago, VSS-1 failed over to VSS-2, which brought down *one* of the 10G uplinks on each controller for about 5 minutes; however, the secondary uplink was operational the entire time. The end result was that about 60% of our access points failed over to the Master Controller (192.168.1.3), while the remaining 40% remained on the Local (192.168.1.4). I'm trying to understand the debug system-status output from the access points below, since it appears all the APs were preparing to switch to the Master. We currently have preemption turned off, with the default heartbeat threshold of 8 seconds.

 

*APs Remained on Local Controller*

2015-08-06 07:37:52 Switching to LMS 192.168.1.3: Missed heartbeats: Last Sequence Generated=9116 Sent=9116 Rcvd=9108. Last Ctrl message: KEEPALIVE len=45 dest=192.168.1.4 tries=1 seq=566
2015-08-06 07:38:05 New connection, Changing to LMS (192.168.1.4) [cur_lms_index: 0, event: REDUN_EVENT_TUNNEL_UP, cur_state: REDUN_STATE_TUNNEL_LMS, function: redun_tunnel_up(5267)]

 

Is the above signifying that the AP missed a couple of heartbeats and is preparing to switch to 192.168.1.3 if the threshold is hit? Or that the 8-second threshold has already been hit? I was interested in this message because, for the APs that remained on the Local Controller, the time difference between the "Switching to LMS" and "New connection" messages appeared to be about 20 seconds or less, whereas for the ones that failed over it seemed to be slightly longer – 23 to 32 seconds.
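For reference, here is the arithmetic I'm doing on those sequence numbers (a sketch of my own interpretation, not Aruba code):

```python
# My reading of the log line: Generated/Sent/Rcvd are heartbeat sequence
# counters, so Generated - Rcvd is the number of unacknowledged heartbeats.
def missed_heartbeats(generated: int, received: int) -> int:
    """Heartbeats the AP sent that the controller never acknowledged."""
    return generated - received

# Values from the "APs Remained on Local Controller" log line above:
print(missed_heartbeats(9116, 9108))  # 8 -> exactly the default
# 8-heartbeat threshold, which is why I can't tell whether the switch
# was merely being prepared or already underway.
```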

 

*APs Failed to Master Controller*
2015-08-06 07:37:52 Switching to LMS 192.168.1.3: Missed heartbeats: Last Sequence Generated=9115 Sent=9115 Rcvd=9107. Last Ctrl message: KEEPALIVE len=45 dest=192.168.1.4 tries=1 seq=568
2015-08-06 07:38:15 New connection, Changing to LMS (192.168.1.3) [cur_lms_index: 1, event: REDUN_EVENT_TUNNEL_UP, cur_state: REDUN_STATE_TUNNEL_LMS, function: redun_tunnel_up(5267)]

Guru Elite

Re: LMS Failover Process Debug Log Inquiry?

Do you also have HA configured?  Can you post your redundancy configuration and your topology?



Colin Joseph
Aruba Customer Engineering

Looking for an Answer? Search the Community Knowledge Base Here: Community Knowledge Base

Frequent Contributor I

Re: LMS Failover Process Debug Log Inquiry?

Hi Colin,

 

Thanks for your time. We do NOT have HA configured. Actually, while attending one of the Aruba training sessions by David Westcott, we discovered we didn't have the "Hot Standby" *Local to Master* (sorry if the terminology is incorrect) configuration we had believed we had. We had been looking into which redundancy option was best (VRDs/presentations/discussions), but there were a lot of opinions on which option was "best". After consulting with our Aruba rep, and since redundancy wasn't a real urgency yet (slow migration from Meru to Aruba campus-wide), we went with LMS/Backup LMS (easiest/quickest) to get something in place until we could properly evaluate our options and choose what's best for our fully deployed environment.

 

I believe this is the only "redundancy" configuration that is technically in place, from the AP System Profile:

ap system-profile "ISU Local"
mtu 1499
lms-ip 192.168.1.4
bkup-lms-ip 192.168.1.3
!
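In case it helps, my mental model of how that profile drives failover order (with preemption off) is roughly the following (a simplified sketch of my own, not ArubaOS code):

```python
# Simplified model: the AP walks the LMS list on heartbeat loss and,
# with preemption off, stays wherever it lands until the next failure.
LMS_LIST = ["192.168.1.4", "192.168.1.3"]  # lms-ip, bkup-lms-ip

def next_lms(current_index: int) -> int:
    """On heartbeat timeout, try the next LMS in the list (wrapping)."""
    return (current_index + 1) % len(LMS_LIST)

# An AP on the Local (index 0) that times out tries the Master next:
print(LMS_LIST[next_lms(0)])  # 192.168.1.3
```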

 

I've attached a basic topology of our Master/Local Controllers and their 10G (port-channeled) uplinks to our Wireless VSS pair. When we tested this failover, we had disabled Po60 (Te1/3/1 and Te2/3/2) and all the APs failed over correctly. What we did NOT account for was disabling only one of the member ports of Po60 (Te1/3/1), which is essentially what went down for those 5 minutes. We hadn't considered testing this since last year, when disabling one of the uplinks resulted in a continuous ping to the Local Controller taking slightly longer (2 ms) before returning to 1 ms.

 

Master Controller

  • 0/0/2 10G Uplink to VSS-2 (Port-Channel 61)
  • 0/0/3 10G Uplink to VSS-1 (Port-Channel 61)

Local Controller (550 APs terminated)

  • 0/0/2 10G Uplink to VSS-1 (Port-Channel 60)
  • 0/0/3 10G Uplink to VSS-2 (Port-Channel 60)

We have a maintenance window coming up in a couple of weeks, and we'll be testing another failover, this time by disabling just one of the member uplinks. TAC suggested a live debug session to try to reproduce why some APs failed over and some did not; he suspected network congestion at the time.

Guru Elite

Re: LMS Failover Process Debug Log Inquiry?

On the face of it, that configuration looks sound, but there could be other variables, like configuration, etc., on both sides that could potentially close a path, then re-open it, and create only a partial failure.

 

In my opinion, it is better to make any failure cause ALL devices to go over to controller #2 on the first equipment or physical failure. Anything you do in between could cause only a partial failure and introduce variability. With regard to controllers, you only want two situations: (1) access points on the primary controller, or (2) access points on the secondary controller. You don't want the potential for flaps or partial outages to create variability. The Aruba back-of-the-napkin math is one gigabit of ethernet connectivity for every 100 access points. Since you have about 500 on each, a single 10-gig connection should suffice for each, in separate MDFs, and provide the diversity you need as well.
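That rule of thumb works out as follows here (a quick sketch; the 550-AP count comes from earlier in the thread):

```python
# Back-of-the-napkin sizing: ~1 Gbps of uplink per 100 access points.
GBPS_PER_100_APS = 1.0

def required_uplink_gbps(ap_count: int) -> float:
    """Minimum uplink bandwidth suggested by the 1 Gbps / 100 APs rule."""
    return ap_count / 100 * GBPS_PER_100_APS

need = required_uplink_gbps(550)   # APs on the Local Controller
print(need)                        # 5.5 (Gbps)
print(need <= 10.0)                # True -> one 10G uplink has headroom
```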

 

The LMS-IP and Backup LMS-IP are designed so that the controllers don't have to be on the same subnet, so you could geographically separate them so that their infrastructure is not intertwined, providing a definitive and seamless failover. You could also use named VLANs on the Master and Local controllers so you don't have to drag VLANs across your campus; you can supply local VLANs to clients.

 

A "Hot Standby" configuration would potentially provide a quicker failover, because access points pre-build a tunnel to the backup or standby controller, but it is more complicated to configure, and the time saved between link down and link up might not be enough for most people to even call you to report the outage.

 

If you have a network outage, you want to be able to count on the state being only X or Y, not many other things. Anything more complicated just adds to your troubleshooting, on top of everything you already have to deal with.

 

That is 100% totally my opinion.



Colin Joseph
Aruba Customer Engineering

Looking for an Answer? Search the Community Knowledge Base Here: Community Knowledge Base

Frequent Contributor I

Re: LMS Failover Process Debug Log Inquiry?

Thanks for your insight, Colin. We'll need to take a good look at our redundancy options. The AP flapping has so far been minor (2-4 APs), but it is annoying and will only become more so in the future.

 

I did have another question. We only discovered the "large" failover occurrence by chance, when I noticed a significantly larger number of clients on the Master Controller vs. the Local Controller in AirWave. Is there any type of notification I could configure in AirWave to alert when an AP performs an LMS failover? When I asked TAC, he said only when we're running in debug mode.

 

Actually, I located the device trigger I was looking for in a previous post. This is such a helpful and resourceful community. :-) http://community.arubanetworks.com/t5/Unified-Wired-Wireless-Access/Trigger-email-for-AP-on-backup-lms/m-p/50244#M21096
