The #1 reason we see for controllers losing connectivity in the field is excessive traffic on their subnet (broadcast/multicast, for example). Controllers have a firewall that drops control-plane traffic when it exceeds certain rates. "show firewall | include Rate" shows the rates above which traffic is limited:
(Babarella) #show firewall | include Rate
Policy Action Rate Port
Rate limit CP untrusted ucast traffic Enabled 9765 pps
Rate limit CP untrusted mcast traffic Enabled 3906 pps
Rate limit CP trusted ucast traffic Enabled 65535 pps
Rate limit CP trusted mcast traffic Enabled 3906 pps
Rate limit CP route traffic Enabled 976 pps
Rate limit CP session mirror traffic Enabled 976 pps
Rate limit CP auth process traffic Enabled 976 pps
Rate limit CP vrrp traffic Enabled 512 pps
Rate limit CP ARP traffic Enabled 3906 pps
Rate limit CP L2 protocol/other traffic Enabled 1953 pps
Rate limit CP IKE traffic Disabled
"show datapath bwm" will tell you whether those limits have actually been exceeded (the "Policed" column). The Cont Id rows appear to correspond, in order, to the rate-limit entries above; the slightly higher rates are the auto-tuned values actually enforced:
(Babarella) #show datapath bwm
Datapath Bandwidth Management Table Entries
-------------------------------------------
Contract Types :
0 - CP Dos 1 - Configured contracts 2 - Internal contracts
------------------------------------------------
Flags: Q - No drop, P - No shape(Only Policed),
T - Auto tuned
--------------------------------------------------------------------
Rate: pps - Packets-per-second (256 byte packets), bps - Bits-per-second
--------------------------------------------------------------------
Cont Avail Queued/Pkts
Type Id Rate Policed Credits Bytes Flags CPU Status
---- ---- --------- ---------- ------- ------------ ------- ------- ----------
0 1 9792 pps 0 305 0/0 4 ALLOCATED
0 2 3936 pps 0 123 0/0 4 ALLOCATED
0 3 65536 pps 0 2048 0/0 4 ALLOCATED
0 4 3936 pps 0 123 0/0 4 ALLOCATED
0 5 992 pps 0 31 0/0 4 ALLOCATED
0 6 992 pps 0 31 0/0 4 ALLOCATED
0 7 992 pps 0 31 0/0 4 ALLOCATED
0 8 512 pps 0 16 0/0 4 ALLOCATED
0 9 3936 pps 0 123 0/0 4 ALLOCATED
0 10 1984 pps 0 62 0/0 4 ALLOCATED
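A nonzero, incrementing "Policed" counter on a CP DoS contract (Cont Type 0) means the controller is actively dropping that class of control-plane traffic. Run the command twice a few seconds apart and compare the counters. For example, a row like the following (the counter value here is hypothetical) would suggest drops on the 3936 pps contract, which by its rate appears to map to untrusted multicast:

(Babarella) #show datapath bwm
...
   0    2   3936 pps     184223     123  0/0                    4 ALLOCATED
...

If a Policed counter keeps climbing, identify the traffic class from the matching "show firewall" rate-limit entry and reduce that traffic at the source rather than raising the contract.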
Long story short:
- Make sure the management subnet of your controllers is not a large broadcast domain. In addition, avoid putting APs directly on your MD or MM management subnet; when broadcast and ARP traffic spikes, the MD or MM will protect itself by dropping useful traffic, such as the VRRP and ARP packets necessary to communicate with outside components
- If you can, make the management subnet of your MM different from the management subnet of your MDs, so that VRRP/broadcast traffic is kept on separate subnets and the same issue as above is avoided.
- Enable bcmc-optimization on all VLANs to drop unnecessary broadcast/multicast traffic so that the controller does not consume cycles attempting to process it
- Make sure that you only trunk VLANs from your switch to your controller that the controller will actually use. Traffic on unnecessary VLANs forces the controller to process it, and that traffic will be policed if it exceeds a threshold. Conversely, don't enable VLANs on an MD that will not be used by that MD, for the same reason: the controller will have to process traffic that it will never use.
- Do not add unnecessary physical redundancy. Don't feel the need to dual-connect controllers to different switches; you could introduce an inadvertent loop that will completely or partially blackhole your controller
- Do not place the MDs of a cluster physically far from each other. If the latency between two MDs in a cluster increases due to the distance between them, it can create a "split brain" situation where it is unclear which MD has control over which APs. This can easily happen when a lot of traffic is being generated and the MDs in a cluster are far apart. Avoid this design so that AP redundancy is predictable.
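To put the bcmc-optimization and VLAN-pruning advice above into practice, the configuration looks roughly like this. The VLAN IDs, port numbering, and prompts are illustrative only and will differ by hardware and AOS version; verify the exact syntax with "?" on your controller:

(Babarella) (config) #interface vlan 100
(Babarella) (config-subif)#bcmc-optimization
(Babarella) (config-subif)#exit
(Babarella) (config) #interface gigabitethernet 0/0/0
(Babarella) (config-if)#switchport mode trunk
(Babarella) (config-if)#switchport trunk allowed vlan 100,200

Prune the upstream switch's trunk to the same VLAN list so the controller never receives traffic for VLANs it does not serve.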
**All of these tips will not eliminate your MDs and MMs losing connectivity, but they will (1) decrease the likelihood and (2) make it easier to understand an issue when you encounter one**