I have a very odd issue & I'm posting it to the community to see if I can get any useful suggestions.
I'm in the process of migrating from 5 x 3600 controllers running AOS 6.4.3.6 onto 2 x 7220 controllers, also running AOS 6.4.3.6. As such, I'm taking the opportunity to streamline my configs & get rid of redundant or irrelevant configurations. I'm almost there; however, I've run into a problem.
I have a case open w/ TAC to review my configs & help troubleshoot.
The problem is... After about 30 minutes of connectivity, the 2 mobile devices I test with seem to drop their connections. A laptop running Windows 7 does not experience the same issue.
Testing involves clients connecting to a configured RAP (AP-105). The RAP's config was previously running on the old set of 3600 controllers - I know that isn't the issue. Why a RAP? We remotely run the networking for a small college & use the RAP to test new configs & the general state of networking from our office.
Clients (thus far) are an Apple iPad Mini 2 running iOS 9.3.5 & a Nexus 7 (2013) running Android 6.0.1. I'll try to expand the test clients to include Windows 10 & Mac OS X; however, as mentioned above, a Win7 client does not seem to experience the same issue.
Clients successfully associate, authenticate to an 802.1X network, & get on the proper network. We use ClearPass as our RADIUS server w/ an AD authentication backend.
Once clients drop their wifi connection, I notice that they're still in the user table (show user) & appear to still be associated (show ap assoc), which is contrary to the client experience.
Furthermore, once the Android client drops its connection, I see an increased number of successful RADIUS authentications on the ClearPass server (anywhere from 15 to 30 seconds apart) for a bit, yet the client still thinks it's not connected. This behavior continues until I forcefully put the client to sleep; only then does it stop creating RADIUS auth entries.
My iOS client also disconnects some 30 - 45 minutes after successfully joining the network, but it does not continue to attempt RADIUS authentications after it has dropped from the network. The iOS device appears to reconnect successfully after it's been put to sleep for some untracked amount of time.
Both clients are awake & actively performing an iperf test when they drop their connections. I had initially observed these drops while the devices were idle & thought them strange enough that I started streaming YouTube videos to see if they still dropped when receiving traffic - they did. TAC instructed me to make the client generate traffic such that I don't trigger idle timeout, in the event that my buffers are full enough to cause such a thing... I'm dubious.
I've configured these clients with user-debug logging & have provided the logs to TAC for further analysis. I'm focusing on my Android test for the time being - for obvious reasons.
The only concrete thing I've been able to get out of it is a dreaded Unspecified Failure:
Sep 7 18:49:27 :522296: <DBUG> |authmgr| Auth GSM : USER_STA delete event for user d8:50:e6:8a:ff:a8 age 0 deauth_reason 1
Sep 7 18:49:27 :522036: <INFO> |authmgr| MAC=d8:50:e6:8a:ff:a8 Station DN: BSSID=00:24:6c:b4:ac:b2 ESSID=test-cabrini-eduroam VLAN=646 AP-name=00:24:6c:c3:4a:cb
Sep 7 18:49:27 :522234: <DBUG> |authmgr| Setting idle timer for user d8:50:e6:8a:ff:a8 to 300 seconds (idle timeout: 300 ageout: 0).
Sep 7 18:49:27 :501000: <DBUG> |stm| Station d8:50:e6:8a:ff:a8: Clearing state
Sep 7 18:49:27 :501102: <NOTI> |AP 00:24:6c:c3:4a:cb@1.1.1.2 stm| Disassoc from sta: d8:50:e6:8a:ff:a8: AP 1.1.1.2-00:24:6c:b4:ac:b2-00:24:6c:c3:4a:cb Reason Unspecified Failure
Sep 7 18:49:27 :501000: <DBUG> |AP 00:24:6c:c3:4a:cb@1.1.1.2 stm| Station d8:50:e6:8a:ff:a8: Clearing state
Sep 7 18:49:28 :501109: <NOTI> |AP 00:24:6c:c3:4a:cb@1.1.1.2 stm| Auth request: d8:50:e6:8a:ff:a8: AP 1.1.1.2-00:24:6c:b4:ac:b2-00:24:6c:c3:4a:cb auth_alg 0
Sep 7 18:49:28 :501093: <NOTI> |AP 00:24:6c:c3:4a:cb@1.1.1.2 stm| Auth success: d8:50:e6:8a:ff:a8: AP 1.1.1.2-00:24:6c:b4:ac:b2-00:24:6c:c3:4a:cb
Sep 7 18:49:28 :501100: <NOTI> |stm| Assoc success @ 18:49:28.539679: d8:50:e6:8a:ff:a8: AP 1.1.1.2-00:24:6c:b4:ac:b2-00:24:6c:c3:4a:cb
Sep 7 18:49:28 :522295: <DBUG> |authmgr| Auth GSM : USER_STA event 0 for user d8:50:e6:8a:ff:a8
Sep 7 18:49:28 :522035: <INFO> |authmgr| MAC=d8:50:e6:8a:ff:a8 Station UP: BSSID=00:24:6c:b4:ac:b2 ESSID=test-cabrini-eduroam VLAN=646 AP-name=00:24:6c:c3:4a:cb
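For anyone who wants to sift a longer capture the same way, here is a minimal Python sketch that splits lines like the ones above into fields & pulls out the events for one client MAC. The line layout is only inferred from this excerpt, the idea that a source starting with "AP " means the line was logged by the AP rather than the controller is my assumption, & the function names (parse_line, events_for) are made up for illustration.

```python
import re

# Assumed layout, inferred only from the excerpt above:
#   <timestamp> :<message id>: <LEVEL> |<source>| <text>
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r":(?P<msg_id>\d+):\s+"
    r"<(?P<level>\w+)>\s+"
    r"\|(?P<source>[^|]+)\|\s+"
    r"(?P<text>.*)$"
)
REASON_RE = re.compile(r"deauth_reason (\d+)")


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    event = m.groupdict()
    source = event["source"].strip()
    # Assumption: sources that begin with "AP " were logged by the AP itself.
    event["origin"] = "ap" if source.startswith("AP ") else "controller"
    # The process name (authmgr, stm) is the last token of the source field.
    event["process"] = source.split()[-1]
    reason = REASON_RE.search(event["text"])
    event["deauth_reason"] = int(reason.group(1)) if reason else None
    return event


def events_for(lines, mac):
    """Return parsed events, in input order, whose text mentions the given MAC."""
    mac = mac.lower()
    parsed = (parse_line(raw) for raw in lines)
    return [e for e in parsed if e and mac in e["text"].lower()]


if __name__ == "__main__":
    sample = [
        "Sep 7 18:49:27 :522296: <DBUG> |authmgr| Auth GSM : USER_STA delete "
        "event for user d8:50:e6:8a:ff:a8 age 0 deauth_reason 1",
        "Sep 7 18:49:27 :501000: <DBUG> |stm| Station d8:50:e6:8a:ff:a8: Clearing state",
    ]
    for e in events_for(sample, "d8:50:e6:8a:ff:a8"):
        print(e["ts"], e["origin"], e["process"], e["msg_id"], e["deauth_reason"])
```

Sorting the output by origin & process makes it easier to see whether the delete event on the controller or the disassoc on the AP shows up first in a given capture.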
I've been working w/ TAC now for about a week.
I believe I've ruled out the Android client by connecting it to a different cluster / network (that's run locally) w/ a similar 802.1X network config. When the client is connected to this different network, it does not fail after 30 minutes of activity & I'm able to complete an iperf test.
I believe I've ruled out the RAP by migrating it back to the old 3600 controllers. In this setup, the APs & the Wifi configs are closer to what I'd ultimately like to move towards since they are the genesis of the configs that now exist on the new 7220s. When the Android client connects to the network running on the old 3600s, it does not fail & is able to complete its iperf test.
During these tests the CPPM server has remained constant. Only in the local cluster test did the service classification differ.
Now my questions are...
Has anyone experienced anything similar?
Does anyone have ANY suggestions about where I should go digging?
I started asking myself what would cause a client's connection to fail - and thus began looking wherever there might be a timeout variable set (aaa timers & aaa auth profiles), but they all looked normal.
I haven't done an exhaustive scrub of my configs, but I'm fairly certain I've looked over the most obvious profiles & configurations. This is, after all, a new controller environment. Just about all of the network configuration (AAA server-groups, AAA auth profiles, SSID profiles, etc.) was cleaned & optimized (either using ASE recommendations or best practices), & I don't believe I've strayed too much from the original configs on the 3600s, just a few optimizations here & there.
Since I'm only testing w/ this RAP, I haven't had the need to expand any RF profiles, & the RAP's configs on the new controllers are VERY similar to those on the old controllers, on which this issue does not occur.
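Since the same RAP works on the old controllers, one way to make the config scrub more systematic is to diff the saved configs from both environments & look only at the timer-related lines. Below is a rough Python sketch using the standard difflib module. The keyword list, the function name & the file names in the usage comment are all hypothetical, so adjust them to whatever your exported configs actually contain.

```python
import difflib

# Assumed substrings worth flagging; extend to match your own configs.
KEYWORDS = ("timeout", "timer", "reauth", "interval")


def timer_related_changes(old_text, new_text, keywords=KEYWORDS):
    """Return added/removed config lines that mention any of the keywords.

    Lines prefixed "- " exist only in the old config, "+ " only in the new one.
    """
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    changes = []
    for line in difflib.ndiff(old_lines, new_lines):
        # ndiff also emits "  " (unchanged) and "? " (hint) lines; skip those.
        if not line.startswith(("- ", "+ ")):
            continue
        if any(k in line.lower() for k in keywords):
            changes.append(line)
    return changes


# Usage sketch (hypothetical file names):
#   with open("old-3600.cfg") as f_old, open("new-7220.cfg") as f_new:
#       for change in timer_related_changes(f_old.read(), f_new.read()):
#           print(change)
```

This only catches lines that differ textually, so a profile that was renamed during cleanup will show up as a remove plus an add even if its values are unchanged.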
In any case, thanks for any suggestions.
TIA,