10-22-2012 06:29 AM
I'm trying to get the bottom of a RADIUS issue with my Aruba deployment. I have two sites and each site has a 3600 controller on the latest firmware. The controller at my primary site is a Master and the other controller at the other site is a Local. Each site has a Server 2008R2 using the built-in NPS for RADIUS. I'm using PEAP and terminating at the RADIUS servers. Using ADCS I made a self-signed certificate for each RADIUS server and pushed out the cert to all my domain clients. The sites have different subnets and traffic between them is routed. In a broad sense all this is working fine.
I have both of the RADIUS servers I mentioned in a Server Group. The RADIUS server at my primary site is on top and Fail Through is not checked on the Server Group options.
My problem is that the Master controller decides that the first RADIUS server is down and sends clients to the second one. Then if they're lucky the client will prompt them to accept the second RADIUS' servers certificate - if they're unlucky they have to disjoin & rejoin the wireless network, at which point they're prompted to accept the second servers' RADIUS servers certificate.
I figured this was a layer 2 or 3 problem but I cannot seem to find an issue anywhere. The Master controller and the primary RADIUS server are on the same local subnet. They're even on the same switch. I do not see any errors on my switch. I don't see any timeout errors on my NPS RADIUS logs. I ran a continual ping from the RADIUS server to the local controller and didn't drop a single packet while things were working correctly AND while they weren't. The NPS server was brought up specifically for this wireless deployment and it is not doing anything else at the moment. Looking through the NPS logs it's serving/denying clients maybe every couple seconds so it can't be overloaded.
#show aaa authentication-server radius statistics from 7:28A this morning:
RADIUS Server Statistics ------------------------ Statistics Primary Secondary ---------- ----- ------ Accounting Requests 0 0 Raw Requests 266 0 PAP Requests 0 0 CHAP Requests 0 0 MS-CHAP Requests 0 0 MS-CHAPv2 Requests 0 0 Mismatch Response 0 0 Bad Authenticator 0 0 Access-Accept 23 0 Access-Reject 0 0 Accounting-Response 0 0 Access-Challenge 243 0 Unknown Response code 0 0 Timeouts 0 0 AvgRespTime (ms) 10 0 Total Requests 266 0 Total Responses 266 0 Uptime (d:h:m) 0:0:30 0:0:30 SEQ Total/Free 255/255 255/255
#show aaa authentication-server radius statistics 2 minutes later:
RADIUS Server Statistics ------------------------ Statistics Primary Secondary ---------- ----- ------ Accounting Requests 0 0 Raw Requests 295 14 PAP Requests 0 0 CHAP Requests 0 0 MS-CHAP Requests 0 0 MS-CHAPv2 Requests 0 0 Mismatch Response 0 0 Bad Authenticator 0 0 Access-Accept 23 1 Access-Reject 0 0 Accounting-Response 0 0 Access-Challenge 271 13 Unknown Response code 0 0 Timeouts 4 0 AvgRespTime (ms) 10 196 Total Requests 295 14 Total Responses 294 14 Uptime (d:h:m) 0:0:0 0:0:32 SEQ Total/Free 255/255 255/255
Approximately 41 minutes later the Master controller decided the Primary RADIUS server was back up and started sending traffic to it. This entire time the RADIUS server was continually pinging the Master controller without dropping a packet. As I mentioned above, the NPS logs don't show anything unusual right up to and during this event.
Solved! Go to Solution.
10-23-2012 09:51 AM
Thanks for your reply!
I cleared out my RADIUS statistics with #clear aaa authentication-server radius statistics and issued logging level informational security subcat aaa. Now time to sit back and wait for it to happen again :smileyindifferent:
I'll update this thread with my results.
10-23-2012 09:57 AM
Once you see requests timed out in the logs, check "show auth tracebuf" output on the controller as well as check your RADIUS server logs and see what is happening. Is request even reaching to the server? If so, is server replying? or is discarding the request?
Let's follow this step. If this doesn't help than I would suggest you to take controller internal packet captures on udp port 1812 to see where is the drop. But let's do one step at a time.
10-24-2012 11:18 AM
I did not have any problems at all yesterday. I cleared my counters at the beginning of the morning and all traffic was sent to my Primary RADIUS server for the entire day.
Unfortunately today is a different story.
I looked at show aaa authentication-server radius statistics this morning and noticed both my Primary and Secondary RADIUS server reported an uptime of 1h58m (which it should have been at least a day)
I took a look and sure enough requests started being directed at my Secondary RADIUS server around that timeframe.
Checking the Server 2008R2 logs, I don't see any outward indication of network or NPS troubles. I've noticed a trend of about 6-8 requests per minute on my Primary RADIUS server - meaning that it would be unusual (but of course not impossible) for there to not be a request for a few minutes.
From 11:47:29A -> 11:51:02A I do not see any RADIUS traffic in my logs on the Primary server.
From 11:42:34A -> 11:51:59A I do see RADIUS traffic in my logs on the Secondary server. (and nothing outside that timeframe)
All the events immediately before and after these time periods look normal. I double checked my switch and the counters look good - the port for my 2008R2 RADIUS server, the port channel group for my Aruba controller, and the three ports that are a member of that group. All these ports and port channels are on the same physical switch.
I ran a show auth-tracebuf like you suggested, but it only went back about 10 minutes or so - well outside the timeframe this issue happened.
My take on this is that the controller is deciding that the primary RADIUS server is down and sending traffic to the secondary one (since I don't see any traffic on my Primary RADIUS server when this happens).
10-24-2012 01:02 PM
Can you take the output of "show log security" and also tech-support logs and attache here?
Here is how controller decide Timeout and Server Down situations:
Whenever controller doesn't get response for radius request, after 5 seconds it retries the packet.
If controller doesn't get response after 3 retries and meanwhile no other parallel radius request has been answered by the server, than it will mark that server as down.
So I think over here we should try to find out:
1. Whether timeouts are increasing or not?
2. when timeouts increase or server is marked out of service, what do you see on the radius server? whether requestst reached there, if server replied.. etc...
10-24-2012 01:14 PM - edited 10-24-2012 01:15 PM
Is there any way to sanitize the tech support output? I'm paging through it and it has my license keys and all the other goodies that I'm uncomfortable putting up on the Internet. If not, I can do it by hand, just might take me a while :)
10-24-2012 01:28 PM
Haha... understood your concern. Hmm unfortunately there isn't an easy way to do so.
Being in TAC I always try to look into tech-support logs for any anomalies, but that is generally uploaded on the ticket. Now we are on a public forum, I don't know what to do. :smileysad:
Just send me the multiple output of "show aaa authetication-server radius statistics". Take 5 output on at interval of 1 min.
send me the output of "show log security all".
10-25-2012 08:14 AM
Thanks for the suggestion, I really appreciate you helping me out. Since the information you asked for is too large to paste into a message, I've attached it as a file. I trimmed the security log to just events for today since the problem came back this morning. As you'll see in the attached file, the master controller aaa logs show BOTH RADIUS servers went down at around 9:06AM this morning.
Around that time this appears in the logs:
Oct 25 09:04:12 :121004: <WARN> |authmgr| |aaa| RADIUS server Primary--10.1.100.102-1812 timeout for client=70:56:81:b7:5f:a1 auth method 802.1x
Since yesterday, I ran a continual ping from the Primary RADIUS server back to the master controller:
Ping statistics for 10.1.100.112:
Packets: Sent = 75789, Received = 75789, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum = 185ms, Average = 0ms
I'm getting more confident that it probably isn't a Layer2/3 problem. With the primary RADIUS server being so unloaded though I don't how how it could possibly be 'too busy' to fullfill a request.
10-26-2012 11:38 AM
Here is what I see in the logs:
It looks that controller is not getting reply back from the server and thus timing out. I would check whether request reached to the server or not and if so whether server replied or not.
Oct 25 09:04:12 :121004: <WARN> |authmgr| |aaa| RADIUS server Primary--10.1.100.102-1812 timeout for client=70:56:81:b7:5f:a1 auth method 802.1x Oct 25 09:04:27 :132197: <ERRS> |authmgr| Maximum number of retries was attempted for station jtita 70:56:81:b7:5f:a1 d8:c7:c8:6e:71:a8, deauthenticating the station Oct 25 09:04:27 :132053: <ERRS> |authmgr| Dropping the radius packet for Station 70:56:81:b7:5f:a1 d8:c7:c8:6e:71:a8 doing 802.1x Oct 25 09:05:18 :121004: <WARN> |authmgr| |aaa| RADIUS server Secondary--10.2.100.102-1812 timeout for client=70:56:81:b7:5f:a1 auth method 802.1x Oct 25 09:05:33 :132197: <ERRS> |authmgr| Maximum number of retries was attempted for station jtita 70:56:81:b7:5f:a1 d8:c7:c8:6e:71:a8, deauthenticating the station
Oct 25 07:24:02 :121004: <WARN> |authmgr| |aaa| RADIUS server Primary--10.1.100.102-1812 timeout for client=5c:0a:5b:45:16:26 auth method 802.1x Oct 25 07:24:17 :132197: <ERRS> |authmgr| Maximum number of retries was attempted for station mkinsella 5c:0a:5b:45:16:26 d8:c7:c8:6e:74:b0, deauthenticating the station Oct 25 07:24:17 :132053: <ERRS> |authmgr| Dropping the radius packet for Station 5c:0a:5b:45:16:26 d8:c7:c8:6e:78:90 doing 802.1x Oct 25 07:25:02 :132207: <ERRS> |authmgr| RADIUS reject for station mkinsella 5c:0a:5b:45:16:26 from server Secondary. Oct 25 07:25:02 :132053: <ERRS> |authmgr| Dropping the radius packet for Station 5c:0a:5b:45:16:26 d8:c7:c8:6e:75:80 doing 802.1x Oct 25 07:27:02 :121004: <WARN> |authmgr| |aaa| RADIUS server Secondary--10.2.100.102-1812 timeout for client=5c:0a:5b:45:16:26 auth method 802.1x Oct 25 07:27:17 :132197: <ERRS> |authmgr| Maximum number of retries was attempted for station mkinsella 5c:0a:5b:45:16:26 d8:c7:c8:6e:75:88, deauthenticating the station