Wireless Access


Access network design for branch, remote, outdoor, and campus locations with HPE Aruba Networking access points and mobility controllers.

Source of RADIUS timeouts?

  • 1.  Source of RADIUS timeouts?

    Posted Oct 22, 2012 09:30 AM

    Greetings-

     

    I'm trying to get to the bottom of a RADIUS issue with my Aruba deployment. I have two sites, and each site has a 3600 controller on the latest firmware. The controller at my primary site is a Master and the controller at the other site is a Local. Each site has a Server 2008 R2 machine using the built-in NPS for RADIUS. I'm using PEAP and terminating at the RADIUS servers. Using AD CS I made a self-signed certificate for each RADIUS server and pushed the certs out to all my domain clients. The sites have different subnets and traffic between them is routed. In a broad sense all this is working fine.

     

    I have both of the RADIUS servers I mentioned in a Server Group. The RADIUS server at my primary site is on top and Fail Through is not checked on the Server Group options.

     

    My problem is that the Master controller decides that the first RADIUS server is down and sends clients to the second one. If the users are lucky, the client prompts them to accept the second RADIUS server's certificate. If they're unlucky, they have to disjoin and rejoin the wireless network, at which point they're prompted to accept the second RADIUS server's certificate.

     

    I figured this was a layer 2 or 3 problem but I cannot seem to find an issue anywhere. The Master controller and the primary RADIUS server are on the same local subnet. They're even on the same switch. I do not see any errors on my switch. I don't see any timeout errors on my NPS RADIUS logs. I ran a continual ping from the RADIUS server to the local controller and didn't drop a single packet while things were working correctly AND while they weren't. The NPS server was brought up specifically for this wireless deployment and it is not doing anything else at the moment. Looking through the NPS logs it's serving/denying clients maybe every couple seconds so it can't be overloaded.

     

    #show aaa authentication-server radius statistics from 7:28A this morning:

     

    RADIUS Server Statistics
    ------------------------
    Statistics             Primary  Secondary
    ----------             -----    ------
    Accounting Requests    0        0
    Raw Requests           266      0
    PAP Requests           0        0
    CHAP Requests          0        0
    MS-CHAP Requests       0        0
    MS-CHAPv2 Requests     0        0
    Mismatch Response      0        0
    Bad Authenticator      0        0
    Access-Accept          23       0
    Access-Reject          0        0
    Accounting-Response    0        0
    Access-Challenge       243      0
    Unknown Response code  0        0
    Timeouts               0        0
    AvgRespTime (ms)       10       0
    Total Requests         266      0
    Total Responses        266      0
    Uptime (d:h:m)         0:0:30   0:0:30
    SEQ Total/Free         255/255  255/255

     

    #show aaa authentication-server radius statistics 2 minutes later:

    RADIUS Server Statistics
    ------------------------
    Statistics             Primary  Secondary
    ----------             -----    ------
    Accounting Requests    0        0
    Raw Requests           295      14
    PAP Requests           0        0
    CHAP Requests          0        0
    MS-CHAP Requests       0        0
    MS-CHAPv2 Requests     0        0
    Mismatch Response      0        0
    Bad Authenticator      0        0
    Access-Accept          23       1
    Access-Reject          0        0
    Accounting-Response    0        0
    Access-Challenge       271      13
    Unknown Response code  0        0
    Timeouts               4        0
    AvgRespTime (ms)       10       196
    Total Requests         295      14
    Total Responses        294      14
    Uptime (d:h:m)         0:0:0    0:0:32
    SEQ Total/Free         255/255  255/255
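
    Comparing two snapshots like the ones above by eye gets tedious once you are taking several a day. Here is a minimal Python sketch (not an Aruba tool) that parses the table and reports how much a counter moved per server. The function names are my own, and it assumes server names contain no spaces and that columns are separated by at least two spaces, as in the output above. The sample text is abbreviated from the two snapshots in this post.

```python
import re


def parse_radius_stats(text):
    """Parse 'show aaa authentication-server radius statistics' output.

    Returns {server_name: {statistic_name: value_as_string}}.
    Assumes server names have no spaces and columns are separated
    by two or more spaces.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    header_idx = next(i for i, ln in enumerate(lines) if ln.startswith("Statistics"))
    servers = lines[header_idx].split()[1:]
    stats = {server: {} for server in servers}
    for ln in lines[header_idx + 1:]:
        if set(ln) <= set("- "):
            continue  # skip the dashed separator row
        parts = re.split(r"\s{2,}", ln)
        if len(parts) != len(servers) + 1:
            continue  # not a data row we understand
        for server, value in zip(servers, parts[1:]):
            stats[server][parts[0]] = value
    return stats


def counter_delta(before, after, name):
    """Change in an integer counter between two parsed snapshots."""
    return {s: int(after[s][name]) - int(before[s][name]) for s in after if s in before}


BEFORE = """
Statistics             Primary  Secondary
----------             -----    ------
Timeouts               0        0
Total Requests         266      0
Total Responses        266      0
"""

AFTER = """
Statistics             Primary  Secondary
----------             -----    ------
Timeouts               4        0
Total Requests         295      14
Total Responses        294      14
"""

before = parse_radius_stats(BEFORE)
after = parse_radius_stats(AFTER)
timeout_delta = counter_delta(before, after, "Timeouts")
unanswered = {
    s: int(v["Total Requests"]) - int(v["Total Responses"]) for s, v in after.items()
}
print("Timeout increase:", timeout_delta)
print("Requests without a response:", unanswered)
```

    On these two snapshots the Primary gains 4 timeouts and has 1 request without a response, while the Secondary shows neither.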

     

    Approximately 41 minutes later the Master controller decided the Primary RADIUS server was back up and started sending traffic to it. This entire time the RADIUS server was continually pinging the Master controller without dropping a packet. As I mentioned above, the NPS logs don't show anything unusual right up to and during this event.

     

    Any ideas?

     

     

     

     


    #3600


  • 2.  RE: Source of RADIUS timeouts?

    Posted Oct 23, 2012 12:47 PM

    You can enable "logging level informational security subcat aaa". It will show you which requests are timing out.



  • 3.  RE: Source of RADIUS timeouts?

    Posted Oct 23, 2012 12:52 PM

    Thanks for your reply!

     

    I cleared out my RADIUS statistics with #clear aaa authentication-server radius statistics and issued logging level informational security subcat aaa. Now it's time to sit back and wait for it to happen again.

     

    I'll update this thread with my results.



  • 4.  RE: Source of RADIUS timeouts?

    Posted Oct 23, 2012 12:58 PM

    Once you see requests timing out in the logs, check the "show auth-tracebuf" output on the controller, and check your RADIUS server logs to see what is happening. Is the request even reaching the server? If so, is the server replying, or is it discarding the request?

     

    Let's follow this step first. If it doesn't help, then I would suggest taking internal packet captures on the controller on UDP port 1812 to see where the drop occurs. But let's do one step at a time.



  • 5.  RE: Source of RADIUS timeouts?

    Posted Oct 24, 2012 02:18 PM

    I did not have any problems at all yesterday. I cleared my counters at the beginning of the morning and all traffic was sent to my Primary RADIUS server for the entire day.

     

    Unfortunately today is a different story.

     

    I looked at show aaa authentication-server radius statistics this morning and noticed that both my Primary and Secondary RADIUS servers reported an uptime of 1h58m (it should have been at least a day).

     

    I took a look and sure enough requests started being directed at my Secondary RADIUS server around that timeframe.

     

    Checking the Server 2008R2 logs, I don't see any outward indication of network or NPS troubles. I've noticed a trend of about 6-8 requests per minute on my Primary RADIUS server - meaning that it would be unusual (but of course not impossible) for there to not be a request for a few minutes.

     

    From 11:47:29A -> 11:51:02A I do not see any RADIUS traffic in my logs on the Primary server.

    From 11:42:34A -> 11:51:59A I do see RADIUS traffic in my logs on the Secondary server. (and nothing outside that timeframe)
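
    As a rough sanity check on how unusual that silent window is, here is a small Python sketch. It assumes requests arrive independently at a steady average rate (a Poisson model, which is my simplification and not something I measured), and uses the 6-8 requests per minute and the timestamps quoted above. The function name is made up for illustration.

```python
import math
from datetime import datetime


def silence_probability(rate_per_minute, gap_seconds):
    """Chance of seeing no requests at all for gap_seconds,
    assuming Poisson arrivals at rate_per_minute (an assumption)."""
    return math.exp(-rate_per_minute * gap_seconds / 60.0)


# Silent window on the Primary server, taken from the log times above.
start = datetime.strptime("11:47:29", "%H:%M:%S")
end = datetime.strptime("11:51:02", "%H:%M:%S")
gap = (end - start).total_seconds()

for rate in (6, 8):
    print(f"{rate} req/min, {gap:.0f} s of silence: p = {silence_probability(rate, gap):.2e}")
```

    Under that assumption the chance of a natural gap this long is far below one in a million, which supports the idea that the controller stopped sending to the Primary rather than clients simply going quiet.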

     

    All the events immediately before and after these time periods look normal. I double checked my switch and the counters look good - the port for my 2008R2 RADIUS server, the port channel group for my Aruba controller, and the three ports that are a member of that group. All these ports and port channels are on the same physical switch.

     

    I ran a show auth-tracebuf like you suggested, but it only went back about 10 minutes or so, which did not reach back to when this issue happened.

     

    My take on this is that the controller is deciding that the primary RADIUS server is down and sending traffic to the secondary one (since I don't see any traffic on my Primary RADIUS server when this happens).



  • 6.  RE: Source of RADIUS timeouts?

    Posted Oct 24, 2012 04:03 PM

    Can you take the output of "show log security" and also the tech-support logs and attach them here?

     

    Here is how the controller decides on Timeout and Server Down situations:

    Whenever the controller doesn't get a response to a RADIUS request, it retries the packet after 5 seconds.

    If the controller doesn't get a response after 3 retries, and meanwhile no other parallel RADIUS request has been answered by the server, then it marks that server as down.
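
    The rule described above can be written out as a small Python model. This is only a sketch of the behaviour as stated in this post, not ArubaOS code. The class and function names are invented, and treating the give-up point as the original send plus 3 retries of 5 seconds each (20 seconds total) is my interpretation.

```python
from dataclasses import dataclass
from typing import Optional

TIMEOUT_S = 5    # per the post: retry after 5 seconds without a response
MAX_RETRIES = 3  # per the post: give up after 3 retries


@dataclass
class Request:
    sent_at: float                 # when the first copy was sent (seconds)
    answered_at: Optional[float]   # when a response arrived, or None if never


def give_up_time(req):
    """Moment the controller stops waiting: the original attempt plus
    MAX_RETRIES retries, each allowed TIMEOUT_S (an interpretation)."""
    return req.sent_at + TIMEOUT_S * (MAX_RETRIES + 1)


def server_marked_down(requests):
    """True if some request exhausts its retries while no other request
    to the same server was answered during that window."""
    for req in requests:
        deadline = give_up_time(req)
        if req.answered_at is not None and req.answered_at <= deadline:
            continue  # this request was answered in time
        others_answered = any(
            other is not req
            and other.answered_at is not None
            and req.sent_at <= other.answered_at <= deadline
            for other in requests
        )
        if not others_answered:
            return True
    return False


# One request that never gets a reply, with nothing else in flight.
print(server_marked_down([Request(0.0, None)]))
# Same lost request, but a parallel request is answered 3 seconds later.
print(server_marked_down([Request(0.0, None), Request(3.0, 3.01)]))
```

    The first call prints True and the second prints False: under this rule a single lost request only takes the server down when nothing else is being answered at the same time, so a quiet server is more exposed than a busy one.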

     

    So I think over here we should try to find out:

    1. Are the timeouts increasing or not?

    2. When timeouts increase or the server is marked out of service, what do you see on the RADIUS server? Did the requests reach it, did the server reply, etc.?

     



  • 7.  RE: Source of RADIUS timeouts?

    Posted Oct 24, 2012 04:15 PM

    Is there any way to sanitize the tech support output? I'm paging through it and it has my license keys and all the other goodies that I'm uncomfortable putting up on the Internet. If not, I can do it by hand, just might take me a while :)



  • 8.  RE: Source of RADIUS timeouts?

    Posted Oct 24, 2012 04:28 PM

    Haha... understood your concern. Hmm unfortunately there isn't an easy way to do so.

     

    Being in TAC, I always try to look through tech-support logs for any anomalies, but those are generally uploaded to the ticket. Now that we are on a public forum, I don't know what to do.

     

    Just send me multiple outputs of "show aaa authentication-server radius statistics". Take 5 outputs at intervals of 1 minute.

    Also send me the output of "show log security all".



  • 9.  RE: Source of RADIUS timeouts?

    Posted Oct 25, 2012 11:14 AM

    Thanks for the suggestion, I really appreciate you helping me out. Since the information you asked for is too large to paste into a message, I've attached it as a file. I trimmed the security log to just events for today since the problem came back this morning. As you'll see in the attached file, the master controller aaa logs show BOTH RADIUS servers went down at around 9:06AM this morning.

     

    Around that time this appears in the logs:

     

    Oct 25 09:04:12 :121004:  <WARN> |authmgr| |aaa| RADIUS server Primary--10.1.100.102-1812 timeout for client=70:56:81:b7:5f:a1 auth method 802.1x

     

    Since yesterday, I ran a continual ping from the Primary RADIUS server back to the master controller:

     

    Ping statistics for 10.1.100.112:
    Packets: Sent = 75789, Received = 75789, Lost = 0 (0% loss),
    Approximate round trip times in milli-seconds:
    Minimum = 0ms, Maximum = 185ms, Average = 0ms

     

    I'm getting more confident that it probably isn't a Layer 2/3 problem. With the primary RADIUS server being so lightly loaded, though, I don't know how it could possibly be 'too busy' to fulfill a request.

    Attachment(s)

    txt
    aruba scratch.txt   21 KB 1 version


  • 10.  RE: Source of RADIUS timeouts?

    Posted Oct 26, 2012 02:39 PM

    Here is what I see in the logs:

    It looks like the controller is not getting a reply back from the server and is therefore timing out. I would check whether the request reached the server and, if so, whether the server replied.

     

    Oct 25 09:04:12 :121004:  <WARN> |authmgr| |aaa| RADIUS server Primary--10.1.100.102-1812 timeout for client=70:56:81:b7:5f:a1 auth method 802.1x
    Oct 25 09:04:27 :132197:  <ERRS> |authmgr|  Maximum number of retries was attempted for station jtita 70:56:81:b7:5f:a1 d8:c7:c8:6e:71:a8, deauthenticating the station
    Oct 25 09:04:27 :132053:  <ERRS> |authmgr|  Dropping the radius packet for Station 70:56:81:b7:5f:a1 d8:c7:c8:6e:71:a8 doing 802.1x
    Oct 25 09:05:18 :121004:  <WARN> |authmgr| |aaa| RADIUS server Secondary--10.2.100.102-1812 timeout for client=70:56:81:b7:5f:a1 auth method 802.1x
    Oct 25 09:05:33 :132197:  <ERRS> |authmgr|  Maximum number of retries was attempted for station jtita 70:56:81:b7:5f:a1 d8:c7:c8:6e:71:a8, deauthenticating the station

    Oct 25 07:24:02 :121004:  <WARN> |authmgr| |aaa| RADIUS server Primary--10.1.100.102-1812 timeout for client=5c:0a:5b:45:16:26 auth method 802.1x
    Oct 25 07:24:17 :132197:  <ERRS> |authmgr|  Maximum number of retries was attempted for station mkinsella 5c:0a:5b:45:16:26 d8:c7:c8:6e:74:b0, deauthenticating the station
    Oct 25 07:24:17 :132053:  <ERRS> |authmgr|  Dropping the radius packet for Station 5c:0a:5b:45:16:26 d8:c7:c8:6e:78:90 doing 802.1x
    Oct 25 07:25:02 :132207:  <ERRS> |authmgr|  RADIUS reject for station mkinsella 5c:0a:5b:45:16:26 from server Secondary.
    Oct 25 07:25:02 :132053:  <ERRS> |authmgr|  Dropping the radius packet for Station 5c:0a:5b:45:16:26 d8:c7:c8:6e:75:80 doing 802.1x
    Oct 25 07:27:02 :121004:  <WARN> |authmgr| |aaa| RADIUS server Secondary--10.2.100.102-1812 timeout for client=5c:0a:5b:45:16:26 auth method 802.1x
    Oct 25 07:27:17 :132197:  <ERRS> |authmgr|  Maximum number of retries was attempted for station mkinsella 5c:0a:5b:45:16:26 d8:c7:c8:6e:75:88, deauthenticating the station
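
    To see at a glance which server each timeout was logged against, the 121004 lines above can be pulled out with a short script. This is a minimal Python sketch. The regular expression is written against the exact line format shown in this post and may need adjusting for other ArubaOS versions, and the function name is my own.

```python
import re
from collections import Counter

# Matches the ":121004:" RADIUS timeout warnings in the format shown above.
TIMEOUT_RE = re.compile(
    r"^(?P<when>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+:121004:.*?"
    r"RADIUS server (?P<server>\S+?)--(?P<ip>[\d.]+)-(?P<port>\d+) "
    r"timeout for client=(?P<mac>[0-9a-fA-F:]{17})"
)


def timeout_events(log_text):
    """Return one dict per RADIUS timeout line found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line.strip())
        if match:
            events.append(match.groupdict())
    return events


SAMPLE = """\
Oct 25 09:04:12 :121004:  <WARN> |authmgr| |aaa| RADIUS server Primary--10.1.100.102-1812 timeout for client=70:56:81:b7:5f:a1 auth method 802.1x
Oct 25 09:04:27 :132197:  <ERRS> |authmgr|  Maximum number of retries was attempted for station jtita 70:56:81:b7:5f:a1 d8:c7:c8:6e:71:a8, deauthenticating the station
Oct 25 09:05:18 :121004:  <WARN> |authmgr| |aaa| RADIUS server Secondary--10.2.100.102-1812 timeout for client=70:56:81:b7:5f:a1 auth method 802.1x
"""

events = timeout_events(SAMPLE)
per_server = Counter(event["server"] for event in events)
for event in events:
    print(event["when"], event["server"], event["ip"], event["mac"])
print(dict(per_server))
```

    Run against a full "show log security all" dump, the per-server counts and client MAC addresses make it easier to spot whether a handful of clients account for most of the timeouts.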


  • 11.  RE: Source of RADIUS timeouts?

    Posted Oct 26, 2012 03:17 PM

    The requests don't reach the server. On the server side there is no indication of an error - it just looks like the Aruba controller isn't sending traffic to it.

     

    The ping I've been running from this 'unresponsive' RADIUS server to the Aruba controller still hasn't dropped a single packet - up to 170,000 consecutive responses now. Obviously this doesn't rule out the RADIUS processes on the server being "too busy", but it does make me think there is not a problem at layer 2/3.

     

    I'm thinking at this point I should just open up a case with Aruba.



  • 12.  RE: Source of RADIUS timeouts?

    Posted Nov 06, 2012 02:38 PM

    Did you ever resolve your issue? It sounds like I have the same issue in my environment...



  • 13.  RE: Source of RADIUS timeouts?

    Posted Nov 07, 2012 03:35 PM

    I have not. My best guess is that it's some sort of Active Directory or NPS issue, although I cannot find any errors on my NPS server, or the AD Global Catalog server it's using.

     

    Luckily we're not fully on our wireless network; it's still in the testing phase, so I only have a couple hundred users on it. It does seem to be a daily occurrence though...and surely it'll get worse once I turn it loose for the thousands of people that want to use it.

     

    I've just had a couple emergencies come up and I haven't had much time to follow up on this issue. I have turned up my NPS logging and set it to write a new log file daily so it's easier to connect the events.

     

    I'm very interested in hearing about your setup - how we're different and how we're the same. Perhaps we can help each other :)



  • 14.  RE: Source of RADIUS timeouts?

    Posted Aug 19, 2013 10:40 AM

    Hi Guys

     

    Just wondered if anybody had a fix for this issue. I have the same errors in my logs and nothing appearing on my NPS and AD logs. 



  • 15.  RE: Source of RADIUS timeouts?

    EMPLOYEE
    Posted Aug 19, 2013 11:50 PM


    I would open a support case, Trav.  There are many reasons why this could happen.  Have support troubleshoot your specific issue.

     



  • 16.  RE: Source of RADIUS timeouts?

    Posted Aug 20, 2013 10:52 AM

    I never reached a satisfactory solution. I worked with Aruba Support and in the end we just ended up disabling the secondary NPS server to 'fix' the issue. It's definitely the CONTROLLER incorrectly marking the NPS as non-responsive. I've logged into the controller and tested the connection to the NPS server from the command line immediately after it went "down" and it works fine.

     

    With only *one* NPS server I don't notice any of the wacky client behavior I was seeing with two set up. Try removing your secondary. The controller still says my primary goes 'down', but when there is no secondary it keeps passing traffic to NPS and things work fine.



  • 17.  RE: Source of RADIUS timeouts?

    EMPLOYEE
    Posted Aug 20, 2013 11:00 AM


    trichmond,

     

    You probably know this, but when there is only a single server in a server group it will never be marked down.  The most conclusive way to troubleshoot this is to do a packet capture at the NPS server and see who is the last to answer.

     



  • 18.  RE: Source of RADIUS timeouts?

    Posted Aug 20, 2013 11:32 AM

    Thanks for the reply. Yes, my 'resolution' is counting on the fact that a single RADIUS server is never marked as down. If that behavior ever changes, I'm in trouble!

     

    I did work with support to send in packet captures that proved (in my mind) that it was the controller's fault and not NPS. It wasn't even sending anything to the secondary server before marking it as down.

     

    In my environment the client behavior that causes this issue is related to Apple devices. They are doing SOMETHING that the controller doesn't like and tricking the controller into thinking there is a RADIUS issue.

     

    I wish I could be of more help. I'm writing this hoping that it might get someone on the right track. If you're reading this and think you might have the same issue, drop down to a single RADIUS server and see if the experience on the CLIENT end gets better (the controller will still complain about the server going 'down' as long as you have iDevices).



  • 19.  RE: Source of RADIUS timeouts?

    EMPLOYEE
    Posted Aug 20, 2013 11:34 AM

    Did you have EAP offload on at the time?

    Did you have fail through enabled?

     



  • 20.  RE: Source of RADIUS timeouts?

    Posted Aug 20, 2013 12:14 PM

    To be completely honest I did the bulk of my troubleshooting a year ago and decided to drop down to the single RADIUS server. After working up the chain at Aruba support and not getting a resolution, I kind of crossed it off my list and haven't thought about it since.

     

    I can't give you an accurate answer. I just don't remember. I do remember fooling with those settings with support on the phone but can't remember the specifics. Sorry...I do appreciate the help.

     

    The fact that I only saw this problem with Apple clients and not Android/Windows/OS X clients stands out in my mind.

     

    I would encourage anyone else reading this to contact Aruba support. It's possible they've addressed the issue or have a different work around. I'm happy with my single NPS server at the moment. 



  • 21.  RE: Source of RADIUS timeouts?

    Posted Aug 20, 2013 12:40 PM

    Sorry to bump this up, but I have been looking at a similar, nay, identical problem. I have a TAC case (#1440376) and am still waiting for a period when I see consecutive drops and can capture it with Wireshark. I also have security debugging enabled. My issue is that it only affects two RADIUS servers for one auth group, while other servers have been up ever since the controller was last rebooted.

     

    I too can find no network issues and the server team report nothing amiss with the radius server (Tokyo to Hong Kong), the MPLS network is fine, it never drops a single packet and sits at 50ms response time....always.

     

    My only workaround is to set "Auth Server dead time" to 0. Then all users are happy, as they are not being bumped through to a RADIUS server that does not authenticate them.

     

    This is really very frustrating, and the cause has been impossible to locate. My only working theory is that, because the LAN is sending out hundreds of machine auth requests (which won't work, as machine auth is not used), the server is blacklisting the NAS IP and controller.

     

    I am still hunting for an answer.



  • 22.  RE: Source of RADIUS timeouts?

    EMPLOYEE
    Posted Aug 20, 2013 12:54 PM

    I would try upgrading to 6.1.3.9 or 6.2.1.2.  In the release notes, there is a fixed bug, 76484:

     

    Symptom: RADIUS authentication failed in networks that had different Maximum Transmission Units (MTUs). To fix this issue, the socket options are updated to allow the controller to send RADIUS requests to the RADIUS server when EAP termination is enabled.
    Scenario: The RADIUS authentication failed when the MTU value in the network between the controller and RADIUS server was different. This issue was observed in controllers running ArubaOS 6.1.3.x



  • 23.  RE: Source of RADIUS timeouts?

    Posted Aug 20, 2013 01:09 PM

    Thanks for the reply. Currently it is running: ArubaOS (MODEL: Aruba3400), Version 6.1.2.7



  • 24.  RE: Source of RADIUS timeouts?

    EMPLOYEE
    Posted Aug 20, 2013 01:23 PM

    You might want to open a case and reference that bug.  Not sure if that specific issue exists on your version of code.

     



  • 25.  RE: Source of RADIUS timeouts?

    Posted Aug 21, 2013 06:56 AM

    Thanks, I informed the TAC regarding the above.



  • 26.  RE: Source of RADIUS timeouts?

    Posted Aug 21, 2013 08:04 AM

    I have just checked all the APAC local controllers and all have the same problem. The only controller that does not is the standby controller in Tokyo, but this sends no traffic. Even the local controller in Hong Kong where the radius server is located has the same problem, but the Hong Kong Master, which is the region Master does not. Sites include:

     

    Tokyo

    Hong Kong – Same site as Radius

    Manila

    Mumbai

    Sydney

    Taiwan

     

    Seems like a bug to me.



  • 27.  RE: Source of RADIUS timeouts?

    EMPLOYEE
    Posted Aug 21, 2013 08:07 AM

    Well, let us see what support says.  Since you have the problem, any information that you can give them to help them determine what is going on will help you get to the bottom of this.



  • 28.  RE: Source of RADIUS timeouts?

    Posted Mar 07, 2014 06:15 AM

    The issue was due to the first RADIUS frame being sent from the controller with the DF (don't fragment) bit set, so the frame was dropped and the controller would still wait for a reply. None was received, so it would then mark the server as down for 10 minutes, which I think is the default dead timer. The workaround was to set the auth dead time to 0 - Case # 1440376.
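
    To make the DF-bit failure mode concrete, here is a tiny Python sketch of the arithmetic. The header sizes are the standard IPv4 (without options) and UDP values. The packet size and MTU figures in the example are made-up illustrations, not measurements from my network, and the function names are my own.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
UDP_HEADER = 8    # bytes


def fits_without_fragmentation(radius_bytes, path_mtu):
    """True if a RADIUS packet of radius_bytes fits in one IP datagram."""
    return radius_bytes + UDP_HEADER + IPV4_HEADER <= path_mtu


def outcome(radius_bytes, path_mtu, df_set):
    """What happens to the datagram at the narrowest hop on the path."""
    if fits_without_fragmentation(radius_bytes, path_mtu):
        return "delivered"
    # Too big for the path: with DF set the hop must drop it,
    # otherwise it is allowed to fragment it.
    return "dropped" if df_set else "fragmented"


# Hypothetical sizes: a 1480-byte RADIUS packet carrying a large EAP
# fragment, and a smaller 1400-byte one, over a 1500-byte path MTU.
print(outcome(1480, 1500, df_set=True))
print(outcome(1480, 1500, df_set=False))
print(outcome(1400, 1500, df_set=True))
```

    These print "dropped", "fragmented" and "delivered" in that order. The first case is the one that bites: the request never reaches the RADIUS server, so the server logs show nothing at all while the controller counts a timeout.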

     

    Since we have upgraded to 6.3.1.2, I now see "AvgRspTm" from a local on the same LAN as the radius server running at "-844826803" - I don't understand this. I am getting in touch with Aruba Support again and trying to raise a cross referenced case. I have been told this morning that the radius server CPU was running very high, that is being looked into now.



  • 29.  RE: Source of RADIUS timeouts?

    Posted Mar 10, 2014 12:44 PM

    Regarding the negative AvgRspTm values, we already have a bug raised with the engineering team: Bug# 89169.

     

    A patch request has been raised for the issue to be addressed in 6.3 stream

     

    <tac engineer>@arubanetworks.com



  • 30.  RE: Source of RADIUS timeouts?
    Best Answer

    Posted Apr 07, 2014 06:46 AM

    I'm going to add some more information to this thread.

     

    We have also been seeing RADIUS server uptime issues on the ArubaOS CLI. For all of our local controllers in EMEA, the uptime keeps resetting. It is now very likely that this is a client configuration issue. The default TTL on each request in ArubaOS is 3x10. If no response is received from the RADIUS server (MS NPS, 2008) after the third attempt, the controller marks the server as down, by default for 10 minutes. This happens regardless of whether other RADIUS traffic is passing: any single 3x10 failure will mark the server down. Additionally, if the same problem is seen on the fail-through server, the primary server is brought back into service. Consequently, when troubleshooting, the auth debug output can be full of:

     

     

    Apr  4 13:42:49  authmgr[1710]: <124004> <DBUG> |authmgr|  Auth server <servername>' response=2

    Apr  4 13:42:49  authmgr[1710]: <124014> <NOTI> |authmgr|  Taking Server <servername> out of service for 10 mins

     

    Apr  4 13:42:51  authmgr[1710]: <124004> <DBUG> |authmgr|   server=<servername>, ena=1, ins=0 (1)

    Apr  4 13:43:00  authmgr[1710]: <124004> <DBUG> |authmgr|   server=<servername>, ena=1, ins=0 (1)

    /snip

     

    Then:

     

    Apr  4 13:43:24  authmgr[1710]: <124015> <NOTI> |authmgr|  Bringing Server <servername> back in service.
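
    The gap between the "out of service" and "back in service" messages can be measured with a short script. This is a minimal Python sketch matched to the two NOTI lines shown above (codes 124014 and 124015). The function names are mine, and the year is an assumption that has to be supplied because the log lines do not carry one.

```python
import re
from datetime import datetime

# Matches the out-of-service (124014) and back-in-service (124015) notices.
EVENT_RE = re.compile(
    r"^(?P<when>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+authmgr\[\d+\]:\s+"
    r"<(?P<code>124014|124015)>.*?Server (?P<server>\S+) (?:out of|back in) service"
)


def parse_when(stamp, year):
    """Turn 'Apr  4 13:42:49' into a datetime; the year is supplied by the caller."""
    normalized = " ".join(stamp.split())
    return datetime.strptime(f"{year} {normalized}", "%Y %b %d %H:%M:%S")


def outage_durations(log_text, year):
    """Return [(server, seconds_out_of_service), ...] in log order."""
    went_down = {}
    durations = []
    for line in log_text.splitlines():
        match = EVENT_RE.search(line.strip())
        if not match:
            continue
        server = match.group("server")
        when = parse_when(match.group("when"), year)
        if match.group("code") == "124014":
            went_down[server] = when
        elif server in went_down:
            durations.append((server, (when - went_down.pop(server)).total_seconds()))
    return durations


SAMPLE = """\
Apr  4 13:42:49  authmgr[1710]: <124014> <NOTI> |authmgr|  Taking Server <servername> out of service for 10 mins
Apr  4 13:42:51  authmgr[1710]: <124004> <DBUG> |authmgr|   server=<servername>, ena=1, ins=0 (1)
Apr  4 13:43:24  authmgr[1710]: <124015> <NOTI> |authmgr|  Bringing Server <servername> back in service.
"""

durations = outage_durations(SAMPLE, year=2014)  # year assumed from the post date
print(durations)
```

    For the excerpt above this gives 35 seconds, far short of the announced 10 minutes, which fits the behaviour described earlier where the server is brought back early once the fail-through server hits the same problem.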

     

    It seems the cause of this is clients using EAP instead of PEAP.

     

    What we see on the Radius logs are:

     

    Authentication Details:

                    Connection Request Policy Name:           1-Secure Wireless Connections Aruba

                    Network Policy Name:                   Secure Wireless Connections Aruba London

                    Authentication Provider:                              Windows

                    Authentication Server:                 <servername>

                    Authentication Type:                     EAP

                    EAP Type:                                            -

                    Account Session Identifier:                          -

                    Reason Code:                                    1

                    Reason: An internal error occurred. Check the system event log for additional information.

                                         

    A successful request receives:

     

    Authentication Details:

                Connection Request Policy Name:          1-Secure Wireless Connections Aruba

                Network Policy Name:                     Secure Wireless Connections Aruba London

                Authentication Provider:                 Windows

                Authentication Server:                    <servername>

                Authentication Type:                       PEAP

                EAP Type:                             Microsoft: Secured password (EAP-MSCHAP v2)

                Account Session Identifier:                        -

     

    Quarantine Information:

                Result:                                               Full Access

                Extended-Result:                             -

                Session Identifier:                            -

                Help URL:                             -

                System Health Validator Result(s):         

     

    We can see this in Wireshark: the RADIUS box constantly fails to reply to the controller. Increasing the retransmits will not solve this, as the server will still not reply.

     

    Moral of the story here - Get a Wireshark capture and press server admins to scrutinise the server logs.