Evan,
Colin's right on here. If I read things correctly, the initial failure scenario is lost heartbeats from AP to controller, which causes the failover.
When the APs fail over to the master controller, they show the "D" (dirty) flag, which means they have not received
a complete configuration.
Reviewing your post, a few notes:
This message indicates the AP has missed heartbeats and is rebootstrapping:
Aug 6 08:56:16 :311004: <WARN> |AP RIDOA_AP65.11@158.123.114.148 sapd| Missed 25 heartbeats; rebootstrapping
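If it helps to see which APs are hit most often, here is a rough Python sketch that counts these rebootstrap messages per AP from a saved copy of the log. The function name and the regex are my own, written against the one line quoted above, so treat them as assumptions; AP names containing spaces would need a looser pattern.

```python
import re
from collections import Counter

# Pattern written against the single sapd line quoted above (an assumption,
# not an official log grammar): "|AP <name>@<ip> sapd| Missed <n> heartbeats".
MISSED = re.compile(
    r"\|AP (?P<name>\S+)@(?P<ip>\S+) sapd\| Missed (?P<count>\d+) heartbeats"
)

def rebootstraps_per_ap(lines):
    """Count 'Missed N heartbeats' messages per (AP name, AP IP)."""
    hits = Counter()
    for line in lines:
        match = MISSED.search(line)
        if match:
            hits[(match.group("name"), match.group("ip"))] += 1
    return hits

# Example use with a hypothetical saved log file:
# with open("controller.log") as fh:
#     for (name, ip), n in rebootstraps_per_ap(fh).most_common(10):
#         print(name, ip, n)
```

APs that cluster on one subnet or one uplink would point at where the intermittent loss is.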
It sounds like there is an intermittent connectivity issue between a number of APs and the controller(s)
to which they are currently connected.
These messages indicate frame size issues:
Aug 17 20:57:25 KERNEL: 0:<7>UDP: short packet: From 255.255.255.255:8211 1621/1517 to 129.2.139.140:8419
Aug 17 20:57:27 KERNEL: 0:<7>UDP: short packet: From 255.255.255.255:8211 1621/245 to 129.2.139.140:8419
Aug 17 20:57:37 KERNEL: 0:<7>UDP: short packet: From 0.0.0.0:8211 1621/245 to 10.200.200.2:8419
Aug 17 20:57:37 KERNEL: 0:<7>UDP: short packet: From 255.255.255.255:8211 1621/245 to 129.2.139.140:8419
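To put numbers on these, a small Python sketch can pull the two lengths out of each line. I am assuming the "1621/245" pair follows the stock Linux kernel format for this message, where the first number is the length claimed in the UDP header and the second is the number of bytes that actually arrived; the function and field names are made up for illustration.

```python
import re

# Assumed layout, taken from the lines above:
# "UDP: short packet: From <src>:<sport> <claimed>/<received> to <dst>:<dport>"
SHORT = re.compile(
    r"UDP: short packet: From (?P<src>[\d.]+):(?P<sport>\d+) "
    r"(?P<claimed>\d+)/(?P<received>\d+) to (?P<dst>[\d.]+):(?P<dport>\d+)"
)

def parse_short_packet(line):
    """Return the lengths and endpoints from one 'short packet' line, or None."""
    match = SHORT.search(line)
    if match is None:
        return None
    claimed = int(match.group("claimed"))
    received = int(match.group("received"))
    return {
        "src": match.group("src"),
        "dst": match.group("dst"),
        "claimed": claimed,
        "received": received,
        "missing": claimed - received,  # bytes that never arrived
    }
```

Under that assumption, the 1621/1517 line is 104 bytes short and the 1621/245 lines are 1376 bytes short, which is the kind of truncation worth chasing along the path.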
There is a crash present. I'd run "tar crash" on the controller CLI, which will retrieve any crash data present.
Aug 17 20:57:42 :303073: <ERRS> |nanny| Process /mswitch/bin/stm [pid 3077] died: got signal SIGSEGV
Aug 17 20:57:45 :399803: <ERRS> || An internal system error has occurred at file mon_client.c function mon_client_send_query line 114 error PAPI_Send to 8456 failed: Connection refused.
Aug 17 20:57:49 :303029: <ERRS> |nanny| Process /mswitch/bin/stm [pid 3077]: crash data saved in dir /flash/crash/process/8-17-2013@20-57-42/stm
Aug 17 20:57:50 KERNEL: 0:<7>UDP: short packet: From 255.255.255.255:8211 1621/245 to 129.2.139.140:8419
Aug 17 20:57:50 KERNEL: 0:<7>UDP: short packet: From 0.0.0.0:8211 1621/245 to 10.200.200.4:8419
Aug 17 20:57:50 KERNEL: 0:<7>UDP: short packet: From 255.255.255.255:8211 1621/245 to 129.2.139.140:8419
Aug 17 20:57:53 :303079: <ERRS> |nanny| Restarted process /mswitch/bin/stm, new pid 7139
Aug 17 20:57:53 :303025: <ERRS> |nanny| Found core file /tmp/core.3077.stm.A72xx_38532, 65011712 bytes, compressing...
These messages indicate a packet length issue. It sounds as if the MTU is limited between the AP and the controller, which may affect controller-to-AP configuration communications.
Aug 17 20:58:04 :304001: <ERRS> |stm| Unexpected stm (Station management) runtime error at handle_ap_statistics, 1019, Length mismatch expected 1527 received 1387 from AP with eth_mac 00:24:6c:c9:96:34, and phy_type is 1
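If there are many of these, grouping them by AP MAC shows whether the shortfall is the same everywhere (which would suggest one constrained link) or varies per AP. A minimal Python sketch, with a regex written against the single stm line above and hypothetical names throughout:

```python
import re
from collections import defaultdict

# Pattern assumed from the one stm message quoted above.
MISMATCH = re.compile(
    r"Length mismatch expected (?P<expected>\d+) received (?P<received>\d+) "
    r"from AP with eth_mac (?P<mac>[0-9a-fA-F:]{17})"
)

def shortfalls_by_ap(lines):
    """Map each AP MAC to the list of (expected - received) byte shortfalls."""
    by_mac = defaultdict(list)
    for line in lines:
        match = MISMATCH.search(line)
        if match:
            shortfall = int(match.group("expected")) - int(match.group("received"))
            by_mac[match.group("mac").lower()].append(shortfall)
    return dict(by_mac)
```

For the line above this gives a shortfall of 140 bytes (1527 expected, 1387 received) for 00:24:6c:c9:96:34.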
I'd suggest looking at the initial event causing the missed heartbeats, and also at the MTU along the path between AP and master controller.
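One way to check the path MTU is to send pings with the don't-fragment bit set and find the largest size that still gets a reply. Below is a rough Python sketch of that idea, assuming a Linux host on the AP's subnet with the iputils ping (its -M do option sets DF); the function names are my own and the 1472-byte ceiling assumes a standard 1500-byte Ethernet MTU. Since the link is already dropping heartbeats, a single lost probe can make the result look smaller than it is, so repeat the run before trusting it.

```python
import subprocess

IP_ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP header

def ping_df(host, payload):
    """Send one echo request with DF set (Linux iputils ping); True on a reply."""
    cmd = ["ping", "-M", "do", "-c", "1", "-W", "2", "-s", str(payload), host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

def largest_payload(probe, low=0, high=1472):
    """Binary-search the largest payload for which probe(payload) is True.

    Returns None if even the smallest payload fails (host unreachable).
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid       # mid fits, try larger
        else:
            high = mid - 1  # mid was dropped, try smaller
    return low

def path_mtu(host):
    """Estimate the IPv4 path MTU to host, or None if it does not answer."""
    best = largest_payload(lambda size: ping_df(host, size))
    return None if best is None else best + IP_ICMP_OVERHEAD
```

Calling path_mtu() with the master controller's IP from the AP side of the network gives a figure to compare against the packet sizes in the logs above.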