Airheads Community

5. RE: Datapath CPU 21 - Utilization above threshold

EMPLOYEE

chulcher

Posted Mar 10, 2024 12:46 PM

For something that widespread, the traffic would almost have to be multicast. Do you have broadcast/multicast controls enabled on your WLANs? What about the interfaces?

------------------------------
Carson Hulcher, ACEX#110
------------------------------

10. RE: Datapath CPU 21 - Utilization above threshold

n.millward

Posted Mar 15, 2024 08:42 AM

Had another CPU spike across all clusters yesterday, starting at 14:10. Caught it in real time as I was watching the upstream port-channel of a controller.

TAC also advised enabling bcmc on all VLANs except the mgmt VLAN (as Carson did). Started doing that whilst the problem was 'live' yesterday.

Applied bcmc to all client VLANs on controller 4 via the GUI, in two batches of changes. Controller 4 CPU 17 dropped back to normal when the second batch was applied.

Applied bcmc to client VLANs on controller 5 via the GUI, one VLAN at a time starting with the VLANs in the second batch as applied to controller 4. The controller 4 CPU 17 dropped back to normal when the change was applied to VLAN 2511.

Moved on to controller 6 (4,5,6 is one cluster) - applied bcmc via the GUI to VLAN 2511 first, expecting to see the CPU drop – there was no change. Applied bcmc to all client VLANs on controller 6. The CPU stayed at 90%. We restarted controller 6, and the CPU went straight back to 90% after booting, and remained at that level.

Have now applied bcmc to all controllers (10) via the folder level hierarchical config. The CPU on the remaining controllers has remained at 90%. bcmc had no effect.

Collected a port mirror pcap of all in/out traffic from one controller during the CPU spike: very few retransmits observed (no worse than when the CPU is normal).

At 22:20 in the evening, CPU 17 dropped back to normal.

Spoke with TAC again this morning (CPU currently fine).
TAC want:

show processes sort-by cpu
show processes sort-by memory
show cp-bwcontracts
show datapath message-queue counters
show datapath debug dma counters
show datapath maintenance counters
show firewall
show cpuload current x 4
show datapath utilization x 4
show datapath bwm table x 4
show datapath cp-bwm table x 4

+ pcap filtered for tcp retrans

+ uplink images from monitoring server.
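Since the outputs have to be grabbed while the spike is live, it may help to have the collection scripted in advance. The following is a minimal sketch, not a tested tool: `run` is a hypothetical callable that sends one CLI command over whatever session you already have (SSH library, console wrapper) and returns its text output. The command strings are the ones TAC listed; the 7.5 second spacing is simply 30 seconds divided by four rounds, which is my reading of "four times in 30 seconds".

```python
import time
from typing import Callable, List, Tuple

# Commands from the TAC request; these are collected once.
ONCE = [
    "show processes sort-by cpu",
    "show processes sort-by memory",
    "show cp-bwcontracts",
    "show datapath message-queue counters",
    "show datapath debug dma counters",
    "show datapath maintenance counters",
    "show firewall",
]

# Commands TAC marked "x 4"; these are collected in every round.
REPEATED = [
    "show cpuload current",
    "show datapath utilization",
    "show datapath bwm table",
    "show datapath cp-bwm table",
]


def collect(
    run: Callable[[str], str],
    repeats: int = 4,
    window_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[int, str, str]]:
    """Return (round, command, output) tuples; round 0 holds the one-off commands.

    `run` is an assumed helper supplied by the caller, not part of any
    controller API.
    """
    results = [(0, command, run(command)) for command in ONCE]
    gap = window_s / repeats  # pause between rounds so all rounds fit in the window
    for round_no in range(1, repeats + 1):
        for command in REPEATED:
            results.append((round_no, command, run(command)))
        if round_no < repeats:
            sleep(gap)
    return results


if __name__ == "__main__":
    # Dry run with a dummy runner and no real sleeping, just to show the order.
    for round_no, command, _ in collect(lambda cmd: "", sleep=lambda s: None):
        print(round_no, command)
```

Passing `sleep` in as a parameter is only there so the sequencing can be checked without waiting 30 seconds.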

Struggling to find any cause on the network presently. The common link is the aggregation router to which all controllers connect. All VLANs, including the mgmt VLAN, are delivered to the controllers down the port channel. We pondered breaking that model up and separating the mgmt VLAN from the client VLANs to see if that would shed any light. But I'm not entirely sure how disruptive that is going to be.

Waiting game again now. Will post back after the next event.

------------------------------
Nathan
------------------------------

Original Message:
Sent: Mar 13, 2024 11:06 AM
From: chulcher
Subject: Datapath CPU 21 - Utilization above threshold

If you don't need to support a multicast application on the wireless network then I'd highly recommend looking at the BC/MC controls.

------------------------------
Carson Hulcher, ACEX#110

Original Message:
Sent: Mar 13, 2024 10:18 AM
From: n.millward
Subject: Datapath CPU 21 - Utilization above threshold

Thanks for the replies.
MM is 8.10.0.10 LSR.
Clusters are now 8.10.0.8 LSR as we rolled back because this CPU problem first occurred on the .10 version. Still happens on the .8 version though.
I provided TAC with show tech-support output while the problem was occurring. They had a look at that and requested the following output, to be collected while the problem is happening, so that's not much use until we hit the wall again!

"I have checked the logs, and I have noticed that we could see policed frames hitting the Datapath bwm table. If possible, when the issue happened at that time, please collect the below outputs four times in 30 seconds and send them to me.
Show cpuload current
show datapath utilization
show datapath bwm table
show datapath cp-bwm table
show firewall"

What we see on our RADIUS servers when the problem kicks off is that the ordered and pretty well balanced incoming RADIUS traffic from the controllers goes all scruffy until the CPU problem sorts itself out, and then the orderliness returns. In the clip below, the RADIUS traffic goes wrong at the vertical line, which corresponds with the CPU problem and client auth difficulty. Normal service is not resumed in this image; it just got quieter overnight.

Carson mentioned mcast and bcast traffic. This is what mcast looks like during the time of the issues we had over the last weekend (it's a 7 day chart so you can see what normal is)
Skewed to one interface in the aggregated link?

bcmc-optimization is off, default.
Upstream I've got to query other engineers for a lift there, but from what I can pull together there is no suppression. I'd like to see something in the Huawei switch logs to indicate some kind of storm.
Still looking. Things are OK presently. Will post back if I hear more from TAC.

------------------------------
Nathan

Original Message:
Sent: Mar 11, 2024 09:06 AM
From: leo.ma
Subject: Datapath CPU 21 - Utilization above threshold

My AC 7005 at home sometimes has similar alarms, but it does not affect the network. Your version is too old. I think you should contact TAC again after the upgrade. I look forward to your follow-up.

------------------------------
If my post was useful, please Accept Solution and Give Kudos.

leo ma

ACMX

Original Message:
Sent: May 25, 2016 02:53 PM
From: mharing
Subject: Datapath CPU 21 - Utilization above threshold

Good afternoon,

Looking for some insight here. I already have a TAC case open, but was hoping someone on here might have experienced something like this before and could provide some information. We are seeing numerous CPU spikes on a daily basis, all coming from the fpapps process. I know fpapps is used for L2/L3 connectivity to the controller, but I don't know what could be causing these constant spikes. They happen at random intervals and always lasted exactly 60 seconds until recently, when we had a few that lasted 180 seconds. I'm hoping someone can help; below are the most recent logs:

May 25 12:33:55 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:12%).
May 25 13:11:58 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:97%).
May 25 13:12:58 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:17%).
May 25 13:16:58 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:96%).
May 25 13:17:58 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:14%).
May 25 13:18:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:92%).
May 25 13:19:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:9%).
May 25 13:21:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:99%).
May 25 13:24:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:13%).
May 25 13:26:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:99%).
May 25 13:29:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:9%).
May 25 13:34:00 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:99%).
May 25 13:35:00 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:9%).
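The 60 and 180 second durations can be checked directly from the log lines above by pairing each "exceeded" event with the next "dropped below" event. This is a rough sketch that assumes the syslog layout shown here; the year is not in the log, so the `year` parameter (defaulting to 2016, the year of this post) is only there to let the timestamps parse.

```python
import re
from datetime import datetime

# The log lines quoted in the post above.
LOG = """
May 25 12:33:55 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:12%).
May 25 13:11:58 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:97%).
May 25 13:12:58 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:17%).
May 25 13:16:58 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:96%).
May 25 13:17:58 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:14%).
May 25 13:18:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:92%).
May 25 13:19:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:9%).
May 25 13:21:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:99%).
May 25 13:24:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:13%).
May 25 13:26:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:99%).
May 25 13:29:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:9%).
May 25 13:34:00 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:99%).
May 25 13:35:00 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:9%).
"""

LINE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) .*Resource '(?P<res>[^']+)' has "
    r"(?P<event>exceeded|dropped below) "
)


def spike_durations(lines, resource="Datapath CPU 21", year=2016):
    """Return the length in seconds of each exceeded -> dropped-below pair."""
    durations = []
    started = None
    for line in lines:
        match = LINE.match(line)
        if not match or match.group("res") != resource:
            continue
        # Prefix the assumed year so the timestamp parses unambiguously.
        stamp = datetime.strptime(f"{year} {match.group('ts')}", "%Y %b %d %H:%M:%S")
        if match.group("event") == "exceeded":
            started = stamp
        elif started is not None:
            durations.append(int((stamp - started).total_seconds()))
            started = None
    return durations


print(spike_durations(LOG.splitlines()))  # [60, 60, 60, 180, 180, 60]
```

The first line is a "dropped below" with no matching start, so it is skipped. Note that the durations are multiples of 60 seconds, which may say as much about how often the threshold is polled as about the spikes themselves.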

Controller is a 7220 with AOS 6.3.1.16, handling about 512 APs and around 3700 clients. This appears to have been happening for at least a month before it was noticed. It is not currently affecting clients or APs, but we don't want to get to that point.

Thanks.

11. RE: Datapath CPU 21 - Utilization above threshold

n.millward

Posted Apr 04, 2024 11:33 AM

I can bring this thread that I hijacked to a close now.

Carson posted earlier that it is a good idea to apply bcmc to all client VLANs, and it would appear that this change has sorted it.
We did have another max cpu event across all controllers, and it looks like that was caused because I missed bcmc off a couple of VLANs. Once applied, and the cpu had calmed down (not immediate, it took an aggregation switch reboot to force a drop of all traffic early one morning) we've not seen the problem since.
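For anyone wanting to avoid the same "missed a couple of VLANs" mistake, a quick audit of a saved config is one option. This is a minimal sketch under assumptions: it expects `interface vlan <id>` stanzas with an indented `bcmc-optimization` line and a `!` terminator, which is how I would expect the running config to look, but check it against your own output. VLAN 2511 is from this thread; VLANs 100 and 2512 and the sample text are made up for illustration.

```python
import re

# Hypothetical config excerpt; the stanza layout is an assumption.
SAMPLE = """
interface vlan 100
    ip address 10.0.100.2 255.255.255.0
!
interface vlan 2511
    bcmc-optimization
!
interface vlan 2512
!
"""


def vlans_missing_bcmc(config_text, exclude=frozenset()):
    """Return sorted VLAN ids whose stanza has no bcmc-optimization line.

    `exclude` is for VLANs that are deliberately left alone, e.g. mgmt.
    """
    stanzas = {}  # VLAN id -> True once bcmc-optimization is seen
    current = None
    for raw in config_text.splitlines():
        line = raw.strip()
        match = re.fullmatch(r"interface vlan (\d+)", line)
        if match:
            current = int(match.group(1))
            stanzas.setdefault(current, False)
        elif line == "!" or (line and not raw[0].isspace()):
            current = None  # "!" or any other top-level line ends the stanza
        elif current is not None and line == "bcmc-optimization":
            stanzas[current] = True
    return sorted(v for v, ok in stanzas.items() if not ok and v not in exclude)


# Treating VLAN 100 as the (hypothetical) mgmt VLAN to be skipped:
print(vlans_missing_bcmc(SAMPLE, exclude={100}))  # [2512]
```

Running this against the config of every controller after a change would flag any client VLAN that was skipped.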

No idea how we managed to run the infrastructure without bcmc for 3 and a half years without an incident, and then suddenly got repeatedly hammered by it from Feb 20th this year. I can't get my head round that. Can't understand why it's not on as default either, nor why the integrator who set us up in the very first instance didn't set it to on. This is certainly the kind of experience you don't forget in a hurry, and enabling it is something I would now simply do as a matter of course in any new installation.
Over and out.

------------------------------
Nathan
------------------------------
