Well, the show ap dtls allowed-aps list only contains 40 of the site's 59 APs, so the reason for all my roaming errors is probably right there.
How the HE*** can Aruba ship software so poor that it essentially renders roaming useless on sites? As I understand you guys, this is universal for the 10.5.x and 10.6.x firmware... Unbelievable.
Original Message:
Sent: Aug 13, 2024 09:13 AM
From: Mflowers@beta.team
Subject: Lots of client roaming error events in Central
You are not the first person to have roaming issues/PMK issues. There are quite a few people that have said something about it on the forums.
https://community.arubanetworks.com/discussion/aruba-central-controllerless-environment-is-not-working
^Here is a thread from someone who had tons of roaming issues. They were on 10.5.x, which is really bad with roaming. They had some other issues too, but we got on a call and I helped them out.
"What I do know from troubleshooting roaming in the past and DTLS is there is a shared set of APs with a DTLS neighbour allow list. If the messages are outside of that allow list, they are dropped. This means groups of APs within earshot of each other do get KMS updates for that group, but not others outside of earshot."
Yeah, I have seen the same thing. I think Aruba Central sends a "neighbor" list to the APs, and those are the APs the client could potentially roam to. The issue with this is that a user could close their laptop (or the device goes to sleep), walk to another part of the building, and then have problems because roaming fails.
The PMK-R0 key exists on the original AP and a PMK-R1 gets synced to the other APs. If there is no PMK-R1 on the AP, it will try to contact the PMK-R0 holder (the original AP) over PAPI. Since Aruba Central sends a "neighbor allow list", not all APs in the cluster can communicate with each other, so some APs never get PMK-R1 keys and also cannot reach the PMK-R0 holder.
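The behavior described above can be sketched as a toy model. This is only an illustration of the description in this post, not Aruba's implementation; the function name `roam_outcome`, the AP names and the data layout are all made up for the example.

```python
def roam_outcome(target_ap, r0kh_ap, r1_synced_to, allow_list):
    """Return how a roam to target_ap would resolve in this toy model.

    r1_synced_to: set of APs that already hold a PMK-R1 for the client.
    allow_list:   dict mapping each AP to the set of APs it may talk to over DTLS.
    """
    if target_ap == r0kh_ap or target_ap in r1_synced_to:
        return "fast roam"
    # No PMK-R1 locally: the target AP has to reach the R0 key holder.
    if r0kh_ap in allow_list.get(target_ap, set()):
        return "fast roam after fetching PMK-R1 from R0KH"
    return "R0KH unreachable, full 802.1X authentication"


# Hypothetical site: AP1 and AP2 hear each other, AP9 is at the far end of the building.
allow_list = {"AP1": {"AP2"}, "AP2": {"AP1"}, "AP9": {"AP8"}}
print(roam_outcome("AP2", "AP1", {"AP2"}, allow_list))
print(roam_outcome("AP9", "AP1", {"AP2"}, allow_list))
```

In this model the second call is the "closed the laptop and walked across the building" case: AP9 has no PMK-R1 and AP1 is not on its allow list.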
You can see the cluster connections on the APs with this command:
show cluster-security connections
In 10.4.1.3 (what I am currently running) I can see all APs are in the "show ap dtls allowed-aps". One of my sites has around 80 APs and they are all in the dtls allowed-aps list.
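To check whether every AP at a site actually shows up in that list, a small comparison like the one below might help. It is a sketch only: the output format of "show ap dtls allowed-aps" is not shown in this thread, so the code simply pulls every IPv4 address out of whatever text is pasted in. The function name, the inventory and the sample text are invented for illustration.

```python
import re

# Matches anything shaped like an IPv4 address; no attempt to validate octet ranges.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def missing_from_allow_list(site_ap_ips, allowed_aps_output):
    """Return site AP IPs that never appear in the pasted CLI output, sorted."""
    allowed = set(IPV4.findall(allowed_aps_output))
    return sorted(set(site_ap_ips) - allowed)


# Made-up inventory and made-up output text, purely for illustration.
site_aps = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
pasted = """
ap-a  10.0.0.11
ap-b  10.0.0.13
"""
print(missing_from_allow_list(site_aps, pasted))
```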
I can run "show cluster-security connections" and I see around 70 connections on one of the APs. For any peer connection I do not see, I can run the commands below to get it to communicate with the cluster. I think this is where a lot of the PMK/roaming issues come from in other versions: the AP tries to connect to other APs in the cluster but cannot. You can check "show log papi-handler" to see if there are any DTLS failures. If there are, check the Aruba Central logs and see whether they occur around the same time as the "PMK-R0 holder is unreachable" errors.
S40-AP-51-F2# show cluster-security connections | i 10.xxx.xxx.157
3f7f83eb 3fbf790e connected R 10.xxx.xxx.151[4434] 10.xxx.xxx.157[4434] 81957 71634 26m:11s 02m:00s 07h:16m:49s
S40-AP-51-F2# show cluster-security connections | i 10.xxx.xxx.158
S40-AP-51-F2# dtls test 10.xxx.xxx.158
S40-AP-51-F2# dtls test-ephemeral a8:5b:f7:xx:xx:xx 10.xxx.xxx.158
S40-AP-51-F2# show cluster-security connections | i 10.xxx.xxx.158
3f7f8400 5ffd5fdb connected I 10.xxx.xxx.151[4434] 10.xxx.xxx.158[4434] 174 174 01m:16s 58m:45s 07h:38m:46s
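For the log comparison suggested above (DTLS failures on the AP versus R0KH unreachable events in Central), a rough way to line the two up is sketched here. It assumes the timestamps have already been extracted from both logs by hand or by another script; the function name, the 30 second window and the sample times are arbitrary choices for the example.

```python
from datetime import datetime, timedelta


def correlated(dtls_failures, r0kh_errors, window=timedelta(seconds=30)):
    """Return the R0KH-unreachable timestamps that have a DTLS failure within `window`."""
    return [e for e in r0kh_errors
            if any(abs(e - f) <= window for f in dtls_failures)]


# Made-up timestamps for illustration only.
dtls = [datetime(2024, 8, 5, 14, 40, 10)]
r0kh = [datetime(2024, 8, 5, 14, 40, 15), datetime(2024, 8, 5, 16, 0, 0)]
print(correlated(dtls, r0kh))
```

If most R0KH errors come back from this with a nearby DTLS failure, that would support the allow-list theory; if almost none do, the cause is probably elsewhere.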
Before I upgraded to 10.4.1.x I used a script to add all of my APs to the DTLS neighbor list. I ran that every day.
# Install-Module -Name Posh-SSH -Force

# Load Posh-SSH module
Import-Module Posh-SSH

# Define the devices, commands, and credentials
$devices = @(
    "10.xxx.xxx.101",
    "10.xxx.xxx.102"
)
$arubapass = ConvertTo-SecureString "PASSWORD" -AsPlainText -Force
$arcred = New-Object System.Management.Automation.PSCredential ("USERNAME", $arubapass)
$commands = @(
    "dtls add-neigh a8:5b:f7:xx:xx:xx 10.xxx.xxx.101",
    "dtls add-neigh a8:5b:f7:xx:xx:xx 10.xxx.xxx.102",
    "dtls test 10.xxx.xxx.101",
    "dtls test 10.xxx.xxx.102"
)

# Iterate through each device
foreach ($device in $devices) {
    # Establish SSH session
    $session = New-SSHSession -ComputerName $device -Credential $arcred -AcceptKey -Force
    # Open the shell stream on the session just created instead of assuming index 0
    $SSHStream = New-SSHShellStream -SessionId $session.SessionId -BufferSize 9999
    Start-Sleep -s 2
    foreach ($command in $commands) {
        $SSHStream.WriteLine($command)
        Start-Sleep -s 1
        $SSHStream.Read()
    }
    # Close the session
    Get-SSHSession | Remove-SSHSession
}
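With more than a handful of APs, hand-writing the $commands list gets tedious. The sketch below only builds the command strings (it does not open any SSH sessions) using the same "dtls add-neigh <mac> <ip>" and "dtls test <ip>" forms as the script above. The function name and the addresses are placeholders, and unlike the script above it skips the AP's own entry, which is a choice made for this example.

```python
def neighbor_commands(aps):
    """Map each AP IP to the commands that add and test every other AP.

    aps: dict of AP IP -> AP MAC address.
    """
    plan = {}
    for ip in aps:
        # Every AP except the one the commands will be run on.
        others = [(mac, peer) for peer, mac in aps.items() if peer != ip]
        plan[ip] = ([f"dtls add-neigh {mac} {peer}" for mac, peer in others]
                    + [f"dtls test {peer}" for _, peer in others])
    return plan


# Placeholder addresses, not real devices.
aps = {"10.0.0.101": "aa:bb:cc:00:00:01", "10.0.0.102": "aa:bb:cc:00:00:02"}
for ip, commands in neighbor_commands(aps).items():
    print(ip, commands)
```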
Original Message:
Sent: Aug 10, 2024 12:45 AM
From: Cranbrook School
Subject: Lots of client roaming error events in Central
I'm interested in this thread too. I've also had problems with roaming since AOS10. In my case we run OKC only, because 11r caused many older clients (mostly Apple) to fail to connect.
Never had problems with roaming on AOS8 with controllers. It has only been an issue since AOS10 with gateways.
Thanks for the link regarding KMS operation. Very useful.
What I do know from troubleshooting roaming in the past and DTLS is there is a shared set of APs with a DTLS neighbour allow list. If the messages are outside of that allow list, they are dropped. This means groups of APs within earshot of each other do get KMS updates for that group, but not others outside of earshot.
However, I still see most roaming causing full dot1x auths, even between radios on the same APs.
Once I get past some of the other hurdles we have, I'll be taking another stab at this problem, as we pretty much have no seamless roaming in our network.
I opened two TAC cases last year about it, but they took so long and I was so busy that I just couldn't keep working on them.
Interested to see how this thread goes... :)
Original Message:
Sent: Aug 09, 2024 01:31 PM
From: Mflowers@beta.team
Subject: Lots of client roaming error events in Central
"opening a TAC case and waiting for it to escalate to serious levels kinda tests my patience"
I wish this statement wasn't so very true. Working with TAC is painful.
I have dealt with a lot of roaming issues on AOS10. There have been tons of updates to roaming in different versions, but I have found 10.4.1.2 to be the version that finally fixed our roaming issues. I am currently running 10.4.1.3 and roaming is working most of the time. I still get a few issues here and there, but nothing major.
The way the PMK cache works in AOS10 is different from how it used to work. Before, it was synced to the controller or the IAP virtual controller, but in AOS10 the PMK is synced to the cloud and then back down to the neighboring APs.
https://www.arubanetworks.com/techdocs/central/2.5.8/content/aos10x/aos10x-overview/aos10-kms-workflow.htm
I think the neighboring APs were synced to the APs by this (Someone correct me here if I am wrong - this is just my guess based on hours of troubleshooting):
show ap dtls provisioned-neighlist
https://www.arubanetworks.com/techdocs/CLI-Bank/Content/aos10/showap-dtls-pn.htm
You are getting "Reason: AP is resource constrained" - This is telling me the AP is getting overloaded somehow. I am not sure why but I have a few theories:
1. Aruba doesn't properly let the client know the PMKID is invalid, and the client just keeps trying to connect and fails repeatedly.
2. There is no central PMK cache holder (Controller/IAP VC) and the PMK cache is synced to all APs. This could be eating up a lot of resources.
^These are all guesses. I haven't checked packet captures for PMK issues in months, but back in 10.4.0.x/10.5.x this was true.
I would be curious to see the output of these commands:
show cluster-security peers (how many peers do you have?)
show cluster-security stats (run this multiple times and see how fast the numbers are increasing)
show cpu
show memory
show ap pmkcache (what is your PMK cache count?)
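For the "run this multiple times" step, the growth rate can be worked out from two snapshots taken a known number of seconds apart. This is a minimal sketch: the counter names and values below are invented, since the real output of "show cluster-security stats" is not shown in this thread, and the function name is hypothetical.

```python
def counter_rates(before, after, seconds):
    """Per-second increase of every counter present in both snapshots."""
    return {name: (after[name] - before[name]) / seconds
            for name in before if name in after}


# Counter names and values are invented; use whatever the AP prints.
first = {"tx_msgs": 81957, "rx_msgs": 71634}
second = {"tx_msgs": 82557, "rx_msgs": 71934}
print(counter_rates(first, second, seconds=60))
```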
In your AP group -> Config -> Radios -> radio profile -> What is your ARM/WIDS Override set to?
In your AP group -> Config -> Security -> Wireless IDS/IPS -> What is your detection levels set to?
Last question - Is there a reason that you are on 10.6.x? It might be worth looking into downgrading to 10.4.1.3, as I fought with PMK issues for a long time until 10.4.1.x.
Our setup:
Aruba CX switches - 4100i (PLC switches) /6300M (IDF/Access switches - Stacked) / 8325 (Core/AGG switches - VSX)
Aruba APs - Mostly AP-650s with a few AP-577s. All APs are using LAG to two switches.
Aruba Controllers - 9240 x 4 and 9012 x 2
Using Clearpass for all authentication (wired/wireless).
7 Sites and about 200 Aruba devices in Central
Original Message:
Sent: Aug 05, 2024 02:50 PM
From: Keyser
Subject: Lots of client roaming error events in Central
Not yet no - was hoping someone in here could help as opening a TAC case and waiting for it to escalate to serious levels kinda tests my patience 😂
Original Message:
Sent: Aug 05, 2024 11:28 AM
From: schmelzle
Subject: Lots of client roaming error events in Central
I assume you have a TAC case open?
Original Message:
Sent: Aug 05, 2024 09:57 AM
From: Tue Madsen
Subject: Lots of client roaming error events in Central
Client limit is 128 in the SSID profile, but there are generally only between 2 and 8 connected clients per AP, including the APs where these errors are being logged.
EDIT: Maybe I should mention that the "AP is resource constrained" error in particular comes in what are almost "storms". A single client can be responsible for perhaps 200 of these loops of entries (two loops shown below) within 10 minutes on one access point. A small example of a pattern that can happen hundreds of times within minutes:
Aug 05, 2024, 14:40:15:627,"API120-001","AP","Client 802.11 De-authentication from Client","De-authentication sent from client xx:xx:xx:xx:93:12 to BSSID xx:xx:xx:xx:9e:f2 on channel 11 of AP hostname API120-001 Reason: AP is resource constrained"
Aug 05, 2024, 14:40:15:619,"API120-001","AP","Client PMK/OKC Key Add","Operation ADD for key cache entry with sequence number 54321 and TTL 28800 seconds"
Aug 05, 2024, 14:40:15:603,"API120-001","AP","Client Radius Accounting Start","Radius Accounting start initiated from client xx:xx:xx:xx:93:12 associated to BSSID xx:xx:xx:xx:9e:f2 on channel 11 of AP hostname API120-001 to Radius Server 10.1.100.102"
Aug 05, 2024, 14:40:15:602,"API120-001","AP","Client Role Assigned","Role IBC WiFi assigned to client xx:xx:xx:xx:93:12 associated to BSSID xx:xx:xx:xx:9e:f2 on channel 11 of AP hostname API120-001"
Aug 05, 2024, 14:40:15:602,"API120-001","AP","Client Roaming","Roam probe sent by API120-001 for Client xx:xx:xx:xx:93:12"
Aug 05, 2024, 14:40:15:601,"API120-001","AP","Client 802.1x Radius Accept","802.1x Radius Accept received from Server 10.1.100.102 for client xx:xx:xx:xx:93:12 associated to BSSID MAC xx:xx:xx:xx:9e:f2 on channel 11 of AP hostname API120-001 "
Aug 05, 2024, 14:40:15:601,"API120-001","AP","Client EAP Success","EAP success to client xx:xx:xx:xx:93:12 associated to BSSID xx:xx:xx:xx:9e:f2 on channel 11 of AP hostname API120-001"
Aug 05, 2024, 14:40:15:306,"API120-001","AP","Client 802.11 Association Success","802.11 Association success to client xx:xx:xx:xx:93:12 from BSSID xx:xx:xx:xx:9e:f2 on channel 11 of AP hostname API120-001"
Aug 05, 2024, 14:40:15:305,"API120-001","AP","Client 802.11R Association Request","802.11r Association request from client xx:xx:xx:xx:93:12 to BSSID xx:xx:xx:xx:9e:f2 on channel 11 of AP hostname API120-001"
Aug 05, 2024, 14:40:15:290,"API120-001","AP","Client 802.11 Authentication Success","802.11 Authentication success to client xx:xx:xx:xx:93:12 from BSSID xx:xx:xx:xx:9e:f2 on channel 11 of AP hostname API120-001"
Aug 05, 2024, 14:40:15:289,"API120-001","AP","Client 802.11 Authentication Request","802.11 Authentication request from client xx:xx:xx:xx:93:12 to BSSID xx:xx:xx:xx:9e:f2 on channel 11 of AP hostname API120-001"
Aug 05, 2024, 14:40:15:039,"API120-001","AP","Client Radius Accounting Stop","Radius Accounting stop initiated from client xx:xx:xx:xx:93:12 associated to BSSID xx:xx:xx:xx:9e:f2 on channel 11 of AP hostname API120-001 to Radius Server 10.1.100.102"
Aug 05, 2024, 14:40:15:038,"API120-001","AP","Client Onboarding Event","Client Onboarding Event"
Aug 05, 2024, 14:40:15:037,"API120-001","AP","Client 802.11 De-authentication from Client","De-authentication sent from client xx:xx:xx:xx:93:12 to BSSID xx:xx:xx:xx:9e:f2 on channel 11 of AP hostname API120-001 Reason: AP is resource constrained"
Aug 05, 2024, 14:40:15:028,"API120-001","AP","Client PMK/OKC Key Add","Operation ADD for key cache entry with sequence number 54319 and TTL 28800 seconds"
Aug 05, 2024, 14:40:15:021,"API120-001","AP","Client Radius Accounting Start","Radius Accounting start initiated from client xx:xx:xx:xx:93:12 associated to BSSID xx:xx:xx:xx:9e:f2 on channel 11 of AP hostname API120-001 to Radius Server 10.1.100.102"
Aug 05, 2024, 14:40:15:020,"API120-001","AP","Client Roaming","Roam probe sent by API120-001 for Client xx:xx:xx:xx:93:12"
Aug 05, 2024, 14:40:15:019,"API120-001","AP","Client EAP Success","EAP success to client xx:xx:xx:xx:93:12 associated to BSSID xx:xx:xx:xx:9e:f2 on channel 11 of AP hostname API120-001"
Aug 05, 2024, 14:40:15:019,"API120-001","AP","Client Role Assigned","Role IBC WiFi assigned to client xx:xx:xx:xx:93:12 associated to BSSID xx:xx:xx:xx:9e:f2 on channel 11 of AP hostname API120-001"
Aug 05, 2024, 14:40:15:018,"API120-001","AP","Client 802.1x Radius Accept","802.1x Radius Accept received from Server 10.1.100.102 for client xx:xx:xx:xx:93:12 associated to BSSID MAC xx:xx:xx:xx:9e:f2 on channel 11 of AP hostname API120-001 "
Aug 05, 2024, 14:40:14:748,"API120-001","AP","Client 802.11 Association Success","802.11 Association success to client xx:xx:xx:xx:93:12 from BSSID xx:xx:xx:xx:9e:f2 on channel 11 of AP hostname API120-001"
Aug 05, 2024, 14:40:14:747,"API120-001","AP","Client 802.11R Association Request","802.11r Association request from client xx:xx:xx:xx:93:12 to BSSID xx:xx:xx:xx:9e:f2 on channel 11 of AP hostname API120-001"
Aug 05, 2024, 14:40:14:729,"API120-001","AP","Client 802.11 Authentication Success","802.11 Authentication success to client xx:xx:xx:xx:93:12 from BSSID xx:xx:xx:xx:9e:f2 on channel 11 of AP hostname API120-001"
Aug 05, 2024, 14:40:14:728,"API120-001","AP","Client 802.11 Authentication Request","802.11 Authentication request from client xx:xx:xx:xx:93:12 to BSSID xx:xx:xx:xx:9e:f2 on channel 11 of AP hostname API120-001"
Aug 05, 2024, 14:40:14:491,"API120-001","AP","Client Radius Accounting Stop","Radius Accounting stop initiated from client xx:xx:xx:xx:93:12 associated to BSSID xx:xx:xx:xx:9e:f2 on channel 11 of AP hostname API120-001 to Radius Server 10.1.100.102"
Aug 05, 2024, 14:40:14:490,"API120-001","AP","Client Onboarding Event","Client Onboarding Event"
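To put a number on these storms, an export in the format above can be counted per AP and client. The sketch below assumes every event is on one line laid out exactly like the excerpt (timestamp, then four quoted fields); the function names are invented, and the two sample lines are copied from the excerpt.

```python
import re
from collections import Counter
from datetime import datetime

# Timestamp (which itself contains commas), then four double-quoted fields.
LINE = re.compile(
    r'^(?P<ts>\w{3} \d{2}, \d{4}, \d{2}:\d{2}:\d{2}:\d{3}),'
    r'"(?P<ap>[^"]*)","(?P<kind>[^"]*)","(?P<event>[^"]*)","(?P<msg>.*)"$'
)
CLIENT = re.compile(r"client (\S+)")


def parse(line):
    """Split one exported event line into a dict, or return None if it does not match."""
    m = LINE.match(line.strip())
    if not m:
        return None
    row = m.groupdict()
    row["ts"] = datetime.strptime(row["ts"], "%b %d, %Y, %H:%M:%S:%f")
    return row


def constrained_deauths(lines):
    """Count 'AP is resource constrained' de-authentications per (AP, client)."""
    counts = Counter()
    for row in filter(None, map(parse, lines)):
        if row["msg"].endswith("Reason: AP is resource constrained"):
            client = CLIENT.search(row["msg"])
            counts[(row["ap"], client.group(1) if client else "unknown")] += 1
    return counts


sample = [
    'Aug 05, 2024, 14:40:15:627,"API120-001","AP","Client 802.11 De-authentication from Client","De-authentication sent from client xx:xx:xx:xx:93:12 to BSSID xx:xx:xx:xx:9e:f2 on channel 11 of AP hostname API120-001 Reason: AP is resource constrained"',
    'Aug 05, 2024, 14:40:15:619,"API120-001","AP","Client PMK/OKC Key Add","Operation ADD for key cache entry with sequence number 54321 and TTL 28800 seconds"',
]
print(constrained_deauths(sample))
```

Since the MAC addresses in an export are normally not masked, the counter keys would separate individual clients, which should make it easy to see whether one device is driving the whole storm.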
Original Message:
Sent: Aug 05, 2024 09:51 AM
From: schmelzle
Subject: Lots of client roaming error events in Central
How many clients per AP? What is max clients set to in the SSID profile?
Original Message:
Sent: Aug 05, 2024 09:17 AM
From: Tue Madsen
Subject: Lots of client roaming error events in Central
Hi
We're running AOS 10.6.0.2, Central managed, on AP-635s with 7205 WLAN gateways for clients, and things seem to be working except for roaming, which is very sluggish.
We have a TON of client roaming error events in Central like these:
- Onboarding failed for client xx:xx:xx:xx:xx:xx in Authentication/Association phase to BSSID yy:yy:yy:yy:yy:yy on channel 6 of AP hostname AP231. Reason: Pairwise master key (PMK-R0) key holder (R0KH) unreachable
- Onboarding failed for client xx:xx:xx:xx:xx:xx in Authentication/Association phase to BSSID yy:yy:yy:yy:yy:yy on channel 6 of AP hostname AP444. Reason: Invalid fast transition element (FTE)
- Onboarding failed for client xx:xx:xx:xx:xx:xx in Authentication/Association phase to BSSID yy:yy:yy:yy:yy:yy on channel 128- of AP hostname AP737. Reason: Association request rejected temporarily; try again later
- Onboarding failed for client xx:xx:xx:xx:xx:xx in Authentication/Association phase to BSSID yy:yy:yy:yy:yy:yy on channel 128- of AP hostname AP208. Reason: Invalid pairwise master key identifier (PMKID)
- Onboarding failed for client xx:xx:xx:xx:xx:xx in Deauthentication/Disassociation phase to BSSID yy:yy:yy:yy:yy:yy on channel 11 of AP hostname AP120. Reason: AP is resource constrained
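With thousands of these per day, it may help to see which reason dominates before digging into any single one. The sketch below tallies whatever follows "Reason:" in each message. The function name is made up and the sample messages are shortened stand-ins for the events listed above.

```python
import re
from collections import Counter

# Everything after "Reason:" up to the end of the message, minus trailing whitespace.
REASON = re.compile(r"Reason: (.+?)\s*$")


def tally_reasons(messages):
    """Count how often each 'Reason: ...' suffix appears in the event messages."""
    counts = Counter()
    for message in messages:
        m = REASON.search(message)
        if m:
            counts[m.group(1)] += 1
    return counts


# Shortened stand-ins for the Central events; only the reason text matters here.
events = [
    "Onboarding failed for client A. Reason: Invalid fast transition element (FTE)",
    "Onboarding failed for client B. Reason: AP is resource constrained",
    "Onboarding failed for client B. Reason: AP is resource constrained ",
    "Client Onboarding Event",
]
for reason, count in tally_reasons(events).most_common():
    print(count, reason)
```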
We have enabled OKC and 802.11r & k on the WPA3-Enterprise CCM-128 SSID, with WPA3 Transition enabled.
Is AOS10 still not mature for general production or is there some possible misconfiguration that can cause these thousands and thousands of errors a day (for about 300 clients)?