I'm going to add some more information to this thread.
We have also been seeing RADIUS server uptime issues on the ArubaOS CLI. On all of our local controllers in EMEA, the server uptime keeps resetting. It now looks very likely that this is a client configuration issue. The default retry/timeout setting for each request in ArubaOS is 3x10. If no response from the RADIUS server (MS NPS, 2008) has been received after the third attempt, the controller marks the server as down, by default for 10 minutes. This happens regardless of whether other RADIUS traffic is passing: any single 3x10 failure will mark the server down. Additionally, if the same problem is seen on the fail-through server, the primary server is brought back into service. Consequently, when troubleshooting, the auth debug output can be full of:
Apr 4 13:42:49 authmgr[1710]: <124004> <DBUG> |authmgr| Auth server <servername>' response=2
Apr 4 13:42:49 authmgr[1710]: <124014> <NOTI> |authmgr| Taking Server <servername> out of service for 10 mins
Apr 4 13:42:51 authmgr[1710]: <124004> <DBUG> |authmgr| server=<servername>, ena=1, ins=0 (1)
Apr 4 13:43:00 authmgr[1710]: <124004> <DBUG> |authmgr| server=<servername>, ena=1, ins=0 (1)
/snip
Then:
Apr 4 13:43:24 authmgr[1710]: <124015> <NOTI> |authmgr| Bringing Server <servername> back in service.
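If you want to see how often this happens without reading the whole debug output, something like the following can pair up the "out of service" and "back in service" messages. This is only a rough sketch: the regular expressions are written against the two NOTI lines quoted above, the function names are my own, and the year is an assumption because the log lines do not carry one.

```python
import re
from datetime import datetime

# Patterns written against the two NOTI messages quoted above.
OUT_RE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) .*Taking Server (\S+) out of service for (\d+) mins"
)
BACK_RE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) .*Bringing Server (\S+) back in service"
)

def parse_ts(text, year=2000):
    # The log lines carry no year; 2000 is an arbitrary placeholder.
    return datetime.strptime(f"{year} {text}", "%Y %b %d %H:%M:%S")

def service_flaps(lines):
    """Return (server, dead_time_mins, seconds_actually_out) for each flap."""
    pending = {}  # server name -> (time taken out, announced dead time)
    flaps = []
    for line in lines:
        m = OUT_RE.search(line)
        if m:
            pending[m.group(2)] = (parse_ts(m.group(1)), int(m.group(3)))
            continue
        m = BACK_RE.search(line)
        if m and m.group(2) in pending:
            taken_out, dead_mins = pending.pop(m.group(2))
            seconds_out = int((parse_ts(m.group(1)) - taken_out).total_seconds())
            flaps.append((m.group(2), dead_mins, seconds_out))
    return flaps

SAMPLE = [
    "Apr 4 13:42:49 authmgr[1710]: <124014> <NOTI> |authmgr| Taking Server <servername> out of service for 10 mins",
    "Apr 4 13:43:24 authmgr[1710]: <124015> <NOTI> |authmgr| Bringing Server <servername> back in service.",
]

for server, dead_mins, seconds_out in service_flaps(SAMPLE):
    print(f"{server}: announced {dead_mins} mins, actually out for {seconds_out}s")
```

Run against the two lines above, it shows the server coming back after 35 seconds rather than the announced 10 minutes, which fits the primary being brought back early when the fail-through server has the same problem.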
It seems the cause is clients using EAP instead of PEAP.
What we see in the RADIUS logs for a failing request is:
Authentication Details:
Connection Request Policy Name: 1-Secure Wireless Connections Aruba
Network Policy Name: Secure Wireless Connections Aruba London
Authentication Provider: Windows
Authentication Server: <servername>
Authentication Type: EAP
EAP Type: -
Account Session Identifier: -
Reason Code: 1
Reason: An internal error occurred. Check the system event log for additional information.
A successful request receives:
Authentication Details:
Connection Request Policy Name: 1-Secure Wireless Connections Aruba
Network Policy Name: Secure Wireless Connections Aruba London
Authentication Provider: Windows
Authentication Server: <servername>
Authentication Type: PEAP
EAP Type: Microsoft: Secured password (EAP-MSCHAP v2)
Account Session Identifier: -
Quarantine Information:
Result: Full Access
Extended-Result: -
Session Identifier: -
Help URL: -
System Health Validator Result(s):
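To pick the failing requests out of a large export of these events, a small script can turn each "Authentication Details" block into key/value pairs and flag the ones that look like the failure above. This is a sketch under assumptions: it works on the event text pasted as plain "Field: value" lines, and both function names are made up for illustration, not part of any NPS tooling.

```python
def parse_nps_details(text):
    """Turn 'Field: value' lines from an NPS event into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as
        # "Microsoft: Secured password (EAP-MSCHAP v2)" stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def looks_like_failed_eap(fields):
    """True when the request was logged as EAP with no EAP type negotiated."""
    return (
        fields.get("Authentication Type") == "EAP"
        and fields.get("EAP Type") in ("-", "")
    )

FAILED = """Authentication Type: EAP
EAP Type: -
Reason Code: 1
Reason: An internal error occurred. Check the system event log for additional information."""

SUCCESS = """Authentication Type: PEAP
EAP Type: Microsoft: Secured password (EAP-MSCHAP v2)
Result: Full Access"""

for name, text in (("failed", FAILED), ("success", SUCCESS)):
    print(name, looks_like_failed_eap(parse_nps_details(text)))
```

The check is deliberately narrow: it only matches the pattern seen in this thread (Authentication Type EAP with an empty EAP Type), so other failure reasons would need their own test.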
We can see this in Wireshark: the RADIUS box constantly fails to reply to the controller. Increasing the retransmits will not solve this, because the server will still not reply.
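To make the retransmit point concrete, here is a toy model of the controller's retry loop. It is not ArubaOS code; it assumes that "3x10" means three attempts with a 10-second wait each, and the function and parameter names are invented for the illustration.

```python
def try_request(server_replies, attempts=3, timeout_s=10):
    """Model one RADIUS request: return (answered, seconds_waited).

    server_replies is a callable taking the attempt number and returning
    True if the server answers that attempt.
    """
    waited = 0
    for attempt in range(1, attempts + 1):
        if server_replies(attempt):
            return True, waited
        waited += timeout_s  # no reply, so the full timeout elapses
    return False, waited  # at this point the server would be marked down

def never_replies(attempt):
    # Models the NPS box silently dropping this client's request.
    return False

print(try_request(never_replies, attempts=3))
print(try_request(never_replies, attempts=10))
```

With a server that never answers this particular request, raising the attempt count only lengthens the wait before the server is marked down; the outcome is the same.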
Moral of the story: get a Wireshark capture and press the server admins to scrutinise the server logs.