CPU load at 99%

So I have a customer whose controller is being monitored by IMC.

Within about a week we've now had two warnings about a CPU load of 99%... scary!

 

When I log in, however, the controller shows 80+ percent idle time at the same moment that IMC hands out warnings of >90% CPU usage.

 

Apparently IMC is monitoring 32 "CPUs"; evidently these are the individual cores. Both times we had >90% warnings, it seems related to a single core (#27).

[Attachment: cpu27.png]

 

So it appears a single core of our 7220 controller is 'regularly' (as in exactly one week between the two events) spiking to 99% usage.

The only process showing any CPU usage at that time was 'stm' (only ~20%, though).

The controller's own per-CPU load looks fine too: all CPUs had >80% idle time during this 'high load' event.

 

So... is there any way to further investigate what is going on with that single core? Or is this simply normal behaviour, and should our IMC guy monitor CPUs instead of cores?

Any insight is appreciated here.

Koen (ACMX #351 | ACDX #547 | ACCP)

-- Found something helpful, important, or cool? Click the Kudos Star in a post.
-- Problem Solved? Click "Accept as Solution" in a post.

Re: CPU load at 99%

There are some processes within AOS that are single-threaded, so they can saturate a single core (the CLI is one). Unless it's something systemic where multiple cores are hitting 100%, I wouldn't worry.

Jerrod Howard
Sr. Technical Marketing Engineer

Re: CPU load at 99%

Hi Koen,

Debugging a single core is hard. High CPU on a single core (anything above core 7 is not related to the CLI/ArubaOS itself) generally comes down to what a single user might be up to, or to a single user being the recipient of something coming from the network. Examples that can cause this are a single user flooding broadcast packets, heavy subnet scanning, or a user who has ARP-spoofed the default gateway and is now sinking all the traffic.

 

Having a core at 100% for any sustained amount of time will usually impact others using that core (latency, AAA delays, packet loss). There is no easy way to determine which tunnels/users land on a given core, so you need to make some deductions from various datapath debug commands to narrow down the type of traffic that may be the cause (and then, hopefully, deduce where it's coming from).

 

Here are some example commands to use during the issue; they can also be run outside the issue to establish a baseline. Note that most of these are included in a tech-support log, but it's also important to see the value deltas when running them during the issue. Please weigh the risk/output volume before running these (e.g. 'show datapath user' is voluminous).
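If you want to capture the paired snapshots without juggling terminal sessions, a short script can run each command twice with the 5-second gap and save the outputs for later comparison. The Python sketch below is only a starting point under some assumptions: it presumes SSH access to the controller, that netmiko's "aruba_os" device type works for your platform, and the hostname/credentials are placeholders.

import time
from datetime import datetime
from netmiko import ConnectHandler

# Commands from the walkthrough below; trim or extend as needed.
COMMANDS = [
    "show datapath utilization",
    "show datapath frame verbose",
    "show datapath frame 27",              # the affected core in this case
    "show datapath error counters",
    "show datapath message-queue counters",
    "show datapath maintenance counters",
]

conn = ConnectHandler(
    device_type="aruba_os",                # assumption: netmiko's Aruba controller driver
    host="controller.example.net",         # placeholder hostname
    username="admin",                      # placeholder credentials
    password="***",
)

for cmd in COMMANDS:
    for snapshot in (1, 2):                # two samples so counter deltas are visible
        stamp = datetime.now().strftime("%H%M%S")
        output = conn.send_command(cmd)
        fname = f"{cmd.replace(' ', '_')}_{snapshot}_{stamp}.txt"
        with open(fname, "w") as fh:       # one file per command per snapshot
            fh.write(output)
        if snapshot == 1:
            time.sleep(5)                  # the 5-second gap between samples

conn.disconnect()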

 

> Find the CPU core: is it holding 99% utilization across all three time buckets? Are any of the CPUs in the range 8-11 also showing an uptick during the issue?

show datapath utilization

 

> Run twice with 5 seconds in between to collect a delta. Look for excessive drops, flood frames, or anything else that looks bad.

show datapath frame verbose

 

 

> Collect some stats from the affected CPU (e.g. 27); run at least twice with 5 seconds in between. Look at 'Allocated Frames' vs. 'max Allocated Frames'. Ignore the 'discard' stats, but check for high rates of flood frames during the issue.
show datapath frame 27

 

> Collect the errors (mostly a subset of the previous output).

show datapath error counters

 

> If there is an uptick on CPUs 8-11, see whether any of the opcodes in the following output are increasing more rapidly than usual. Collect this twice, 5 seconds apart; it can help identify the particular type of traffic that may be causing the high CPU.

show datapath message-queue counters

 

> If you have BWM contracts (which in themselves shouldn't cause high CPU), it's always a good idea to collect this twice, 5 seconds apart, to ascertain how BWM is holding up.

show datapath maintenance counters

 

> Look for any users consuming a massive number of session counters (indicative of port scanning etc.). Warning: very voluminous output.

show datapath user

show datapath user counters

show datapath user 27

 

> Also, unfortunately, a lengthy output. If all of the above fail to give a clue, you may need to narrow down the AP that hosts the user causing the high CPU. It's a bit manual: run the following command a few times and try to visually identify which tunnel is carrying a large amount of traffic. During the high-CPU event, one of them usually sticks out.

show datapath tunnel verbose

 

If you find a tunnel that sticks out, you can find its BSSID, then dump the user table and the datapath user output and filter against that BSSID to narrow down who is on the AP. Later, when the issue happens again, you can repeat the exercise and compare whether the same user(s) are back. You can also inspect 'show datapath session table <ip of user>' to see what they are doing, or dump the whole 'show datapath session table' and filter it against the tunnel ID.
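If you'd rather not eyeball the tunnel table, you could diff two saved captures of 'show datapath tunnel verbose' and rank tunnels by how much their counters grew. The sketch below is a rough, hypothetical helper: it assumes the output is a whitespace-separated table whose first field identifies the tunnel row, and it crudely sums every purely numeric field per row, so treat it as a starting point rather than a real parser (the column layout differs across AOS versions).

import sys

def counter_totals(path):
    # Crude: for each table row, key on the first field and total all
    # purely numeric fields. Good enough to spot large deltas.
    rows = {}
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            nums = [int(f) for f in fields if f.isdigit()]
            if fields and nums:
                rows[fields[0]] = sum(nums)
    return rows

before = counter_totals(sys.argv[1])
after = counter_totals(sys.argv[2])

# Rank tunnels by counter growth between the two captures.
deltas = {key: after[key] - before[key] for key in after if key in before}
for tunnel, growth in sorted(deltas.items(), key=lambda kv: -kv[1])[:10]:
    print(f"tunnel {tunnel}: counters grew by {growth}")

Run it as 'python tunnel_delta.py before.txt after.txt' against two captures taken a few seconds apart; whichever tunnel tops the list is your first candidate.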

 

Hope that's useful.