The majority of our campus APs failed over to our backup LMS IP (the master controller), and I'm having a difficult time understanding how this happened. The failover caused a large wireless outage, so I'm trying to find the root cause. The following message is logged for the APs that failed over:
Rebootstrap Information
-----------------------
Date Time Reason (Latest 10)
--------------------------------------
2013-03-08 10:22:47 Switching to LMS 10.X.X.9. Send failed in function sapd_check_hbt. Last Ctrl message: BW_REPORT len=150 dest=10.X.X.10 tries=1 seq=14549
TAC said this message indicates that the AP's heartbeats to the controller were missed, which triggered the failover. However, I can't find any sign of network problems in our core infrastructure or on the controller: no links flapping, all systems stayed up, no topology changes, and no interface errors. Having confirmed all of this, I don't see how the heartbeats could have been missed.
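One way to narrow this down is to check whether every AP switched in the same minute, which would point at a single event rather than scattered packet loss. Below is a minimal Python sketch that parses rebootstrap lines and counts failovers per minute. The line format is assumed from the single sample above and may differ on other firmware versions; the names `LINE_RE`, `parse_rebootstrap`, and `failovers_per_minute` are made up for illustration, and collecting the lines from each AP is left out.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: every rebootstrap line follows the layout of the one sample
# shown above. Lines that do not match are skipped, not guessed at.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"Switching to LMS (?P<new_lms>\S+)\. "
    r"Send failed in function (?P<func>\w+)\. "
    r"Last Ctrl message: (?P<msg>\w+) len=(?P<len>\d+) "
    r"dest=(?P<dest>\S+) tries=(?P<tries>\d+) seq=(?P<seq>\d+)"
)


def parse_rebootstrap(line):
    """Return the fields of one rebootstrap line as a dict, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = datetime.strptime(rec["ts"], "%Y-%m-%d %H:%M:%S")
    for key in ("len", "tries", "seq"):
        rec[key] = int(rec[key])
    return rec


def failovers_per_minute(lines):
    """Count matching failover lines, bucketed by minute (YYYY-MM-DD HH:MM)."""
    counts = Counter()
    for line in lines:
        rec = parse_rebootstrap(line)
        if rec is not None:
            counts[rec["ts"].strftime("%Y-%m-%d %H:%M")] += 1
    return counts
```

If nearly all APs land in the same one-minute bucket, it is worth comparing that timestamp against controller and switch logs for the same window. The `dest` field shows which controller the last control message was addressed to.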
Anyone have thoughts on this?