OK - got a strange one here that we are getting nowhere fast with - even with TAC.
We have 2 x 3400 controllers (Master and Local), with APs deployed to the Local as the primary LMS and the Master controller as the backup LMS.
Once, maybe twice a day (with no consistent pattern), we see all the APs swing to the Master controller, stay there for the hold-down period of 10 minutes, and then swing back.
We started debugging and found the following:
Using show ap debug system-status ap-name <AP name>, we got the following from this morning:
Rebootstrap Information
-----------------------
Date Time Reason (Latest 10)
--------------------------------------
2012-08-03 16:49:38 Switching to primary LMS 10.1.80.5
2012-08-06 10:25:01 Switching to LMS 10.1.80.3 (sapd_check_hbt)
2012-08-06 10:25:09 Broken tunnel
2012-08-06 10:25:14 Broken tunnel
2012-08-06 10:25:29 Broken tunnel
2012-08-06 10:25:39 Broken tunnel
2012-08-06 10:25:51 Broken tunnel
2012-08-06 10:36:58 Switching to primary LMS 10.1.80.5
2012-08-07 08:44:53 Switching to LMS 10.1.80.3 (sapd_check_hbt)
2012-08-07 08:56:09 Switching to primary LMS 10.1.80.5
Rebootstrap LMS
---------------
(none found)
------------
Crash Information
-----------------
(none found)
------------
Heartbeat Stats
---------------
Heartbeats Sent Heartbeats Received
--------------- -------------------
910467 902577
It looks like heartbeats failed for 8 consecutive tries and the AP then swapped to the Master.
Did a show log network on the local controller and found:
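For anyone wanting to sanity-check the numbers, here is a rough Python sketch that works out the overall heartbeat loss and how long the AP sat on the backup LMS each time. It uses only the values pasted above; the function names are made up for illustration and are not part of any Aruba tooling.

```python
from datetime import datetime

def heartbeat_loss(sent, received):
    """Return (missed count, percent missed) for the lifetime counters."""
    missed = sent - received
    return missed, 100.0 * missed / sent

def time_on_backup(events):
    """Pair each 'Switching to LMS' with the next 'Switching to primary LMS'."""
    fmt = "%Y-%m-%d %H:%M:%S"
    durations = []
    left_primary = None
    for stamp, reason in events:
        when = datetime.strptime(stamp, fmt)
        if reason.startswith("Switching to LMS"):
            left_primary = when
        elif reason.startswith("Switching to primary LMS") and left_primary:
            durations.append(when - left_primary)
            left_primary = None
    return durations

# Rebootstrap entries copied from the AP output above (Broken tunnel lines omitted).
events = [
    ("2012-08-06 10:25:01", "Switching to LMS 10.1.80.3 (sapd_check_hbt)"),
    ("2012-08-06 10:36:58", "Switching to primary LMS 10.1.80.5"),
    ("2012-08-07 08:44:53", "Switching to LMS 10.1.80.3 (sapd_check_hbt)"),
    ("2012-08-07 08:56:09", "Switching to primary LMS 10.1.80.5"),
]

missed, pct = heartbeat_loss(910467, 902577)
print(missed, round(pct, 2))
for d in time_on_backup(events):
    print(d)
```

The counters are lifetime totals, so the loss figure on its own does not show whether the misses came in bursts around the failovers or were spread out over time.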
Aug 7 08:44:51 :208006: <INFO> |fpapps| Changing the vlan 20 state to UP from DOWN
Aug 7 08:44:51 :208045: <DBUG> |fpapps| Received event 3 for Interface 320
Aug 7 08:44:51 :208043: <DBUG> |fpapps| Nim received event L7_UP for interface 320 linkState 3
Aug 7 08:44:51 :208004: <DBUG> |fpapps| Dot1q Change Call back is called 320 event L7_UP (3)
Aug 7 08:44:51 :208044: <DBUG> |fpapps| Nim Interface 320 state change notification, new state L7_FORWARDING
Aug 7 08:44:51 :208045: <DBUG> |fpapps| Received event 6 for Interface 320
Aug 7 08:44:51 :208043: <DBUG> |fpapps| Nim received event L7_FORWARDING for interface 320 linkState 3
Aug 7 08:44:51 :208004: <DBUG> |fpapps| Dot1q Change Call back is called 320 event L7_FORWARDING (6)
Aug 7 08:44:52 :204229: <DBUG> |pim| Received IP multicast interface VLAN VLAN Up message for VLAN 20
Aug 7 08:44:52 :208045: <DBUG> |fpapps| Received event 6 for Interface 320
Aug 7 08:44:52 :208043: <DBUG> |fpapps| Nim received event L7_FORWARDING for interface 320 linkState 3
Aug 7 08:44:52 :208004: <DBUG> |fpapps| Dot1q Change Call back is called 320 event L7_FORWARDING (6)
Aug 7 08:44:52 :204229: <DBUG> |pim| Received IP multicast interface VLAN VLAN Up message for VLAN 20
Aug 7 08:45:43 :208008: <INFO> |fpapps| No change in the Vlan Interface 200 state UP Vlan Interface has tunnels configured
Aug 7 08:45:46 :208007: <INFO> |fpapps| Vlan interface 20 state is DOWN
Aug 7 08:45:46 :208008: <INFO> |fpapps| No change in the Vlan Interface 200 state UP Vlan Interface has tunnels configured
Aug 7 08:47:50 :202085: <DBUG> |dhcpdwrap| No arp entry for ip address 192.168.207.160 eth1.200
And we find similar entries at the other times that the APs have swung.
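To line the VLAN state messages up against the AP rebootstrap times across several days of logs, something like the following Python sketch could pull out just the VLAN 20 transitions. The regular expressions only cover the two message formats visible in the paste above, and the helper name is hypothetical; other log formats would need their own patterns.

```python
import re

# Patterns for the two VLAN state messages seen in the log above only.
CHANGE = re.compile(
    r"^(\w+ +\d+ [\d:]+) .*Changing the vlan (\d+) state to (\w+) from (\w+)"
)
STATE = re.compile(r"^(\w+ +\d+ [\d:]+) .*Vlan interface (\d+) state is (\w+)")

def vlan_events(lines, vlan):
    """Return (timestamp, new state) tuples for one VLAN id, in log order."""
    out = []
    for line in lines:
        m = CHANGE.search(line)
        if m and m.group(2) == vlan:
            out.append((m.group(1), m.group(3)))
            continue
        m = STATE.search(line)
        if m and m.group(2) == vlan:
            out.append((m.group(1), m.group(3)))
    return out

# Three lines copied from the log above.
sample = [
    "Aug 7 08:44:51 :208006: <INFO> |fpapps| Changing the vlan 20 state to UP from DOWN",
    "Aug 7 08:45:43 :208008: <INFO> |fpapps| No change in the Vlan Interface 200 state UP Vlan Interface has tunnels configured",
    "Aug 7 08:45:46 :208007: <INFO> |fpapps| Vlan interface 20 state is DOWN",
]

for stamp, state in vlan_events(sample, "20"):
    print(stamp, state)
```

Running this over the full log for each day the APs swung would show whether the UP message always lands a second or two before the AP switch and the DOWN message about a minute after, as it did this morning.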
On the surface it appears that VLAN 20 on the local controller (IP 10.1.80.5, the primary LMS IP) is going down and the APs switch to the backup... but some things don't make sense:
1. Firstly, why is the VLAN going down? This is a VLAN assigned to a port and should not go down even with no clients connected, correct?
2. At 8:44am, when the APs rebootstrap, the network log shows the VLAN going from DOWN to UP, not UP to DOWN as would be expected (I think) if the VLAN went down. At 8:45 the VLAN is then reported as DOWN. This seems back to front to me. Can anyone shed any light on this?
3. If VLAN interface 20 did in fact go down for 8 seconds, would we not expect to see a timeout on the IP interface of 10.1.80.5? We pinged it constantly during this period (several hours before and several after) and did not lose a single packet.
Initially TAC suggested we have a congested network, but we are on term break: there are roughly 100 people on campus as opposed to 3000, and very little traffic. We have also recently upgraded the core switch that the controllers are connected to. NOTE: the issue was occurring both before and after the switch upgrade.
With ALL the APs switching at once, plus the network log, to me this points to the controller having an issue... but I am not sure where to dig next.
Anyone with any ideas?
Cheers
Wally
#3400