Wireless Access


Access network design for branch, remote, outdoor, and campus locations with HPE Aruba Networking access points and mobility controllers.

CPU load at 99%

  • 1.  CPU load at 99%

    MVP
    Posted Jul 31, 2017 11:07 AM

    So I have a customer whose controller is being monitored by IMC.

    In the space of about a week we've now had two warnings about a CPU load of 99%... scary!

     

    When I log in, however, the controller shows 80+ percent idle time at the same moment that IMC hands out warnings of >90% CPU usage.

     

    Apparently IMC is monitoring 32 "CPUs"; evidently these are the individual cores. Both times we had >90% warnings it seems to be related to a single core (#27).

    [attached screenshot: cpu27.png]

     

    So it appears we have a single core on our 7220 controller that is 'regularly' (as in exactly one week between the two events) going up to 99% usage.

    The only process with any notable CPU usage at the time was 'stm' (only ~20% though).

    The per-CPU cpuload looks fine too; all CPUs had >80% idle time during this 'high load' event.
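
    To cross-check the per-core numbers outside of IMC, one option is to poll them over SNMP yourself. The following is only a rough sketch: it assumes IMC is reading the standard HOST-RESOURCES-MIB hrProcessorLoad table and that the net-snmp tools are installed; if IMC uses a vendor MIB instead, swap in that OID. The host, community string and 90% threshold are placeholders.

        import subprocess

        # Assumptions: net-snmp's snmpwalk is installed, SNMPv2c read access to the
        # controller, and per-core load exposed via the standard hrProcessorLoad table.
        CONTROLLER = "10.0.0.1"                          # placeholder controller IP
        COMMUNITY = "public"                             # placeholder read community
        HR_PROCESSOR_LOAD = "1.3.6.1.2.1.25.3.3.1.2"     # HOST-RESOURCES-MIB::hrProcessorLoad

        out = subprocess.run(
            ["snmpwalk", "-v2c", "-c", COMMUNITY, "-On", CONTROLLER, HR_PROCESSOR_LOAD],
            capture_output=True, text=True, check=True,
        ).stdout

        # Each line looks like: .1.3.6.1.2.1.25.3.3.1.2.<index> = INTEGER: <percent>
        for line in out.splitlines():
            oid, _, value = line.partition(" = ")
            core = oid.rsplit(".", 1)[-1]
            load = int(value.split()[-1])
            flag = "  <-- over the IMC threshold" if load >= 90 else ""
            print(f"core {core}: {load}%{flag}")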

     

    So... is there any way to further investigate what is going on with that single core? Or is this simply normal behaviour, and should our IMC guy monitor the overall CPU instead of individual cores?

    Any insight is appreciated here.



  • 2.  RE: CPU load at 99%

    EMPLOYEE
    Posted Jul 31, 2017 11:21 AM

    There are some processes within AOS that are single-threaded, so one process can saturate a single core (the CLI is one). Unless it's something systemic where multiple cores are hitting 100%, I wouldn't worry. 
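
    If you want to see that effect in isolation, here's a tiny illustration (nothing Aruba-specific; just psutil on any multi-core Linux box): one busy thread pins roughly one core while the average across cores still looks mostly idle, which is exactly the gap between IMC's per-core view and the controller's overall idle figure.

        import threading, time
        import psutil

        # One CPU-bound thread keeps roughly one core busy.
        def spin():
            while True:
                pass

        threading.Thread(target=spin, daemon=True).start()
        time.sleep(2)  # let the load settle

        per_core = psutil.cpu_percent(interval=1, percpu=True)
        print("per-core:", per_core)                        # one core near 100%
        print("average :", sum(per_core) / len(per_core))   # overall still looks mostly idle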



  • 3.  RE: CPU load at 99%

    Posted Aug 05, 2017 07:42 AM

    Hi Koen,

    Debugging a single core is hard. High CPU on an individual core (anything above core 7 is not related to the CLI/ArubaOS itself) generally comes down to what a single user might be up to, or a single user being the recipient of something coming from the network. Examples that can cause this are a single user flooding broadcast packets out, heavy subnet scanning, or a user that has ARP-spoofed the default gateway and is now sinking all the traffic.

     

    Having a core at 100% for any sustained amount of time will usually impact others using that core (latency, AAA delays, packet loss). There is no easy way to determine which tunnels/users are on that core, so you need to make some deductions from various datapath debug commands to narrow down the type of traffic that may be the cause (and then, hopefully, deduce where it's coming from).

     

    Here are some example commands to use during the issue, which can also be run outside the issue to establish a baseline. Note that most of these are captured in a tech-support log, but it's also important to see the value deltas when running them during the issue. Please assess the risk/output volume before running these (e.g. 'show datapath user' is voluminous).
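
    If you would rather script the collection than copy/paste from the CLI, something along these lines can work. It's only a sketch and assumes SSH access plus netmiko's 'aruba_os' device type; the credentials are placeholders and the command list below should be trimmed to what you actually need.

        import time
        from netmiko import ConnectHandler

        # Placeholder connection details - adjust to your environment.
        controller = {
            "device_type": "aruba_os",   # netmiko's ArubaOS driver (assumption - verify on your netmiko version)
            "host": "10.0.0.1",
            "username": "admin",
            "password": "secret",
        }

        COMMANDS = [
            "show datapath utilization",
            "show datapath frame verbose",
            "show datapath frame 27",
            "show datapath error counters",
            "show datapath message-queue counters",
            "show datapath maintenance counters",
        ]

        with ConnectHandler(**controller) as conn:
            for sample in (1, 2):
                stamp = time.strftime("%Y%m%d-%H%M%S")
                for cmd in COMMANDS:
                    output = conn.send_command(cmd)
                    fname = f"{stamp}_sample{sample}_{cmd.replace(' ', '_')}.txt"
                    with open(fname, "w") as fh:
                        fh.write(output)
                if sample == 1:
                    time.sleep(5)   # the 5-second gap between samples, for the deltas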

     

    > Find the CPU core: is it holding 99% utilisation across all three time buckets? Are any of the CPUs in the range 8-11 also showing an uptick during the issue?

    show datapath utilization

     

    > Run twice with 5 seconds in between to collect a delta. Look for excessive drops, flood frames, anything that looks bad.

    show datapath frame verbose
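
    Eyeballing the two captures works, but a quick way to surface what changed is to diff the counters programmatically. This is a hedged sketch that simply assumes counter lines end in an integer (roughly what the frame/error/message-queue/maintenance counter outputs below look like); anything it can't parse it skips.

        import re
        import sys

        # Pull "counter name   <integer>" style lines from two saved captures and
        # print the counters that increased between them, largest delta first.
        LINE = re.compile(r"^(.*?)[\s:]+(\d+)\s*$")

        def counters(path):
            vals = {}
            with open(path) as fh:
                for line in fh:
                    m = LINE.match(line.rstrip())
                    if m:
                        vals[m.group(1).strip()] = int(m.group(2))
            return vals

        def main(before_file, after_file):
            before, after = counters(before_file), counters(after_file)
            for name in sorted(after, key=lambda n: after[n] - before.get(n, 0), reverse=True):
                delta = after[name] - before.get(name, 0)
                if delta > 0:
                    print(f"{delta:>12}  {name}")

        if __name__ == "__main__":
            main(sys.argv[1], sys.argv[2])

    Usage would be something like: python delta.py sample1_show_datapath_frame_verbose.txt sample2_show_datapath_frame_verbose.txt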

     

     

    > Collect some stats from the affected CPU (e.g. 27); run at least twice with 5 seconds in between. Look at 'Allocated Frames' vs. 'Max Allocated Frames'. Ignore the 'discard' stats; check for high rates of flood frames during the issue.
    show datapath frame 27

     

    > Collect the errors (mostly a subset of the previous output).

    show datapath error counters

     

    > If there is an uptick on CPUs 8-11, see whether any of the opcodes in the following output are increasing more rapidly than usual. Collect this twice, 5 seconds apart; it can help identify a particular type of traffic that may be causing the high CPU.

    show datapath message-queue counters

     

    > Depending on whether you have BWM contracts (which in themselves shouldn't cause high CPU), it's always a good idea to collect this twice, 5 seconds apart, to ascertain how BWM is holding up.

    show datapath maintenance counters

     

    > Look for any users consuming a massive number of sessions (indicative of port scanning etc.). Warning: very voluminous output.

    show datapath user

    show datapath user counters

    show datapath user 27
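
    Since the output is so large, a throwaway parse of a saved capture can rank users by session count for you. This is only a sketch: it assumes one row per user with the IP in the first column and a numeric Sessions column; column positions vary by AOS version, so check the header and adjust SESSIONS_COL.

        import re
        import sys

        # Rank rows from a saved 'show datapath user' capture by their Sessions column
        # to spot heavy session consumers (possible port scanners).
        SESSIONS_COL = 6          # assumption: 7th whitespace-separated field is "Sessions"
        IP = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")

        rows = []
        with open(sys.argv[1]) as fh:
            for line in fh:
                fields = line.split()
                if (len(fields) > SESSIONS_COL and IP.match(fields[0])
                        and fields[SESSIONS_COL].isdigit()):
                    rows.append((int(fields[SESSIONS_COL]), fields[0]))

        for sessions, ip in sorted(rows, reverse=True)[:20]:
            print(f"{sessions:>8}  {ip}")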

     

    > Also unfortunately a lengthy output. If all of the above fail to give a clue, you may need to narrow down the AP that hosts the user causing the high CPU. It's a bit manual, but run the following command a few times and try to visually identify which tunnel is carrying a large amount of traffic. In a high-CPU case, one of them usually sticks out. 

    show datapath tunnel verbose

     

    If you find one that sticks out, you can identify its BSSID, then dump the user table and datapath user output and filter against that BSSID to narrow down who is on that AP. Later, when the issue happens again, you can repeat this and check whether the same user(s) are back. You can also inspect 'show datapath session table <ip of user>' to see what they are doing, or dump the whole 'show datapath session table' and filter it against the tunnel ID.
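
    None of that filtering needs anything clever; plain substring matching against the saved captures gets you there. A rough sketch follows - the BSSID, tunnel ID and file names are placeholders for whatever you captured.

        # Grep-style filtering of saved captures to cross-reference BSSID -> users -> sessions.
        # All values below are placeholders.
        BSSID = "00:1a:1e:00:00:00"      # hypothetical BSSID of the suspect AP
        TUNNEL_ID = "0x10a4"             # hypothetical tunnel id from 'show datapath tunnel verbose'

        def grep(path, needle):
            with open(path) as fh:
                return [line.rstrip() for line in fh if needle.lower() in line.lower()]

        # Who is on that BSSID?
        for line in grep("show_user-table.txt", BSSID):
            print(line)

        # What are those sessions doing? (filter the full session table by tunnel id)
        for line in grep("show_datapath_session_table.txt", TUNNEL_ID):
            print(line)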

     

    Hope that's useful.