Wireless Access


Access network design for branch, remote, outdoor, and campus locations with HPE Aruba Networking access points and mobility controllers.

Network performance issues when users switch APs

  • 1.  Network performance issues when users switch APs

    Posted Oct 24, 2017 12:35 PM

    We seem to be having some major network performance issues on our campus when classes switch at the top of the hour.  

     

    10am-3pm on the hour mark, ping times to physical interfaces on all the APs and controllers will climb to >500ms with sometimes significant packet loss.  Of course, this severely impacts wireless performance, most notably in failed authentications.  In about 15 minutes, things calm down again.  

     

    The thing is, I can't find any indication of a performance issue on the controller or the APs. Everything looks fine from that perspective, but our monitoring software is freaking out.  

     

    I've been able to isolate it to an Aruba problem. We have numerous other devices on the subnets involved, and they never show any indication of a problem.  Additionally, our hot spare controller does not exhibit this problem.

     

    Any ideas where to start looking for the problem?



  • 2.  RE: Network performance issues when users switch APs

    EMPLOYEE
    Posted Oct 24, 2017 01:41 PM

    If you have access points in hallways, the most likely cause is that the many users who can see those access points are all attaching to them.  If you have AirWave, I would look at the number of users on each access point at the times you have problems and see whether that number spikes.



  • 3.  RE: Network performance issues when users switch APs

    Posted Oct 24, 2017 04:25 PM

    Colin, thanks for the quick response. The OP and I work together.

     

    We generally avoid APs in hallways unless the professional survey we had done says to put an AP in a hallway, which is rare.

     

    We hadn't had this problem before this year, and we have added perhaps 10 or 15 access points in that time. Between last year and this year we made a handful of changes recommended by an ACE engineer and also upgraded from AOS 6.4 to AOS 6.5 to support AP-300 series APs and tunneled node on the new Aruba/HPE switches.

     

    Recommended changes are:

    • Reduce ARM EIRP to 9 for 2.4GHz and 15 for 5GHz
      • Used to be 6 to 12 for 2.4GHz and 12 to 18 for 5GHz
    • Set wireless A and G beacon rates to 12.
      • This was previously not set on any SSID.
    • Set max clients per SSID to 64
      • Used to be 150 for our most-used SSID, no limit on our second most-used SSID
    • For all SSIDs, set A and G basic rates to 12 and 24, and A and G transmit rates to 12, 24, 36, 48, 54
      • Most-used SSID used to have 12, 18, 24 for A basic rates, 12, 18, 24, 36, 48, 54 for A transmit rates, 9, 11, 12 for G basic rates, and 9, 11, 12, 18, 24, 36, 48, 54 for G transmit rates.
      • Second most-used SSID used to have no limit on A basic and transmit rates, G basic rates of 2, 5, 6, 9, 11, 12, and G transmit rates of 2, 5, 6, 9, 11, 12, 18, 24, 36, 48, 54.

    I'm debating rolling back the power level changes, the max client associations change, and the basic and transmit rate changes one by one to see what effect they have on the environment.

     

    "Most used SSID" has about 5,000 clients right now and uses WPA2-Enterprise, "second most used SSID" has about 1,000 clients right now and is open/MAC auth, generally for things not smart enough to support WPA2-Enterprise.

     

    Our SE also suggested setting up AP fast failover as a mechanism to mitigate the impact of these issues.

     

    We also seem to have this issue the most when class gets out and people begin moving around the campus a lot, although it persists from about 10am every day to about 2:30pm or 3pm every day.

     

    We also have a TAC case open, and our SE is involved in the case as well. We're just wondering if the community has any other input to share.

     

    Thanks for your help again, Colin!



  • 4.  RE: Network performance issues when users switch APs

    EMPLOYEE
    Posted Oct 24, 2017 04:37 PM

    The key to understanding this specific problem is:

     

    - Set a metric to understand where you are in the problem

    - Make changes and consult your metric to see how you are doing.

     

    In this situation, the key metric is RF utilization.  If you have AirWave measuring RF utilization over time, make no more than a single change per day and measure the RF utilization (1) at midnight and (2) when you have the problem (the change in class period).  If you are going in the right direction, RF utilization at midnight should be the same or lower, and it should not be greater at the change in period.
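
    As a rough illustration of the check described above, here is a minimal Python sketch. It assumes you have already read the two RF utilization percentages off your monitoring graphs by hand; the names UtilizationSample, change_helped, and review_changes are hypothetical and not part of AirWave or any Aruba tool. Each day is compared against the previous day's readings, which is one reasonable interpretation of "one change per day".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UtilizationSample:
    """RF utilization (percent) at the two checkpoints: midnight and class change."""
    midnight: float
    class_change: float


def change_helped(before: UtilizationSample, after: UtilizationSample) -> bool:
    """True if utilization is the same or lower at midnight and not greater at class change."""
    return (after.midnight <= before.midnight
            and after.class_change <= before.class_change)


def review_changes(baseline, daily):
    """Compare each day's sample with the previous day's (one change per day).

    `daily` is a list of (change_label, UtilizationSample) pairs in date order.
    Returns a list of (change_label, helped) pairs.
    """
    results = []
    previous = baseline
    for label, sample in daily:
        results.append((label, change_helped(previous, sample)))
        previous = sample
    return results


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    baseline = UtilizationSample(midnight=12.0, class_change=70.0)
    days = [
        ("roll back EIRP change", UtilizationSample(11.0, 65.0)),
        ("roll back max clients", UtilizationSample(11.0, 72.0)),
    ]
    for label, helped in review_changes(baseline, days):
        print(f"{label}: {'improved or unchanged' if helped else 'worse'}")
```

    The point of keeping it this simple is that only one variable changes per day, so a "worse" result can be pinned on a single change.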

     

    Since you have already made quite a few changes, I will only offer the bit on how to measure how you are doing.  The worst thing would be for someone to come in afterward with a bunch of suggestions, muddy the waters, and make things worse...



  • 5.  RE: Network performance issues when users switch APs
    Best Answer

    Posted Oct 25, 2017 12:04 AM

    If the quote below is correct, then this issue seems unrelated to any of the RF parameters you changed.

     

    "10am-3pm on the hour mark, ping times to physical interfaces on all the APs and controllers will climb to >500ms with sometimes significant packet loss.  Of course, this severely impacts wireless performance, most notably in failed authentications.  In about 15 minutes, things calm down again"

     

    Controller ping time climbing when measured from a wired monitoring location is a very bad thing, and IMHO unrelated to any of the changes you listed.

     

    Bandwidth contracts, datapath CPU, and broadcast/flooding storms start to come to mind.  Perhaps broadcast storms are ruled out by other things you mentioned (other hosts on the subnet not having a problem), but they're worth considering. During the issue, look at the datapath CPU utilization ("show datapath util") and at the output of "show datapath frame".

     

    As for the APs, try to run "show datapath frame ap-name <the ap>" at the time of the issue, also "show . You may try to run "show ap debug system-status", but beware that it could lock up your CLI for an extended time, as it takes a long time to time out if the AP is not responsive.

     

    Things to look for in the datapath frame output are "flood frames" increasing rapidly and "allocated frames" holding at a very high number (10k+).  Feel free to post the output back here if you like.
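
    To make the two warning signs concrete, here is a small Python sketch that compares two counter snapshots taken some seconds apart. It assumes you have already copied the counter values out of the CLI output by hand into dictionaries; the key names "flood_frames" and "allocated_frames", the function name, and the flood-rate threshold are all hypothetical. Only the 10k figure for allocated frames comes from the post above.

```python
ALLOCATED_FRAMES_LIMIT = 10_000  # "10k+" from the post above
FLOOD_RATE_LIMIT = 1_000.0       # frames per second; an arbitrary, assumed threshold


def check_datapath_counters(earlier, later, interval_seconds):
    """Return warning strings for two manually recorded counter snapshots.

    `earlier` and `later` are dicts with hypothetical keys "flood_frames"
    and "allocated_frames"; `interval_seconds` is the time between them.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")

    warnings = []

    # "Flood frames" increasing rapidly: look at the rate of change.
    flood_rate = (later["flood_frames"] - earlier["flood_frames"]) / interval_seconds
    if flood_rate > FLOOD_RATE_LIMIT:
        warnings.append(f"flood frames rising at {flood_rate:.0f}/s")

    # "Allocated frames" holding high: above the limit in both snapshots.
    if (earlier["allocated_frames"] >= ALLOCATED_FRAMES_LIMIT
            and later["allocated_frames"] >= ALLOCATED_FRAMES_LIMIT):
        warnings.append("allocated frames holding above 10k")

    return warnings


if __name__ == "__main__":
    # Made-up snapshots taken 60 seconds apart.
    first = {"flood_frames": 1_000, "allocated_frames": 12_000}
    second = {"flood_frames": 200_000, "allocated_frames": 11_500}
    for warning in check_datapath_counters(first, second, 60):
        print(warning)
```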



  • 6.  RE: Network performance issues when users switch APs

    Posted Oct 26, 2017 08:52 AM

    Thanks dugem2016! 

     

    Datapath CPU utilization is indeed the metric we're maxing out. During high latency periods, the CPU usage on one of the CPUs climbs to and stays at 100%.  
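
    For anyone wanting to track the same symptom over time, here is a minimal Python sketch that flags any CPU whose utilization stays pegged across several consecutive samples. It assumes the per-CPU percentages have been collected by some other means; the function name, the CPU labels, the 99% threshold, and the three-sample window are all hypothetical choices, not values from TAC or the controller.

```python
def pegged_cpus(samples_by_cpu, threshold=99.0, min_consecutive=3):
    """Return IDs of CPUs with at least `min_consecutive` samples in a row at or above `threshold`.

    `samples_by_cpu` maps a CPU label to a list of utilization percentages
    in time order. Thresholds here are assumed defaults, not vendor guidance.
    """
    pegged = []
    for cpu_id, samples in samples_by_cpu.items():
        run = 0  # length of the current streak of high samples
        for value in samples:
            run = run + 1 if value >= threshold else 0
            if run >= min_consecutive:
                pegged.append(cpu_id)
                break
    return pegged


if __name__ == "__main__":
    # Made-up samples: one CPU climbs to 100% and stays there, one only spikes.
    samples = {
        "cpu8": [40, 100, 100, 100, 100],
        "cpu9": [35, 60, 100, 55, 100],
    }
    print(pegged_cpus(samples))
```

    Requiring a streak rather than a single high reading separates "climbs to and stays at 100%" from a brief spike.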

     

    We're working with TAC to identify the cause of this CPU usage.