"show datapath bwm" at intervals will tell you what traffic is being policed, so you can get an idea what is being overrun.
"show datapath debug dma" at intervals will let you know if the control plane or Network Processor are being overrun, and at what times.
Original Message:
Sent: Jun 05, 2024 08:44 AM
From: mharing
Subject: Datapath CPU 21 - Utilization above threshold
Ironic, as I started this thread many years ago and I am now running into the SAME issue as you are in the 8.10 AOS version. We have a cluster of 4, only 2 seem to be affected by it, but almost every evening between 2-3AM, we get CPU utilization (different CPUs this time) hitting 100%. The 2 affected are on the same core network, all of our client VLANs are L2 between the 4 mobility gateways, including the management VLAN as well.
This started approx 3 weeks ago out of the blue and so far has been isolated to just this location and these 2 of 4 gateways. We initially found odd IPs in the user-table that don't belong, and thought possibly a loop. Modified our validuser ACL to allow ONLY the correct client subnets at that site, which did not resolve it. I've also enabled BCMC Optimization on a couple of client VLANs, not all, but plan to do the rest now. Finally, I am in the process of enabling "Enforce DHCP" on our Wi-Fi networks to prevent someone from purposefully or accidentally duplicating IPs (especially our default gateway).
TAC requested I provide the same details, but unfortunately my physical location in relation to these devices and our topology makes it slightly more difficult (fiber-only uplinks, so a SPAN is a little more challenging without the right SFPs).
I think I'll plan to start deploying BCMC optimization to remaining client VLANs and see if that helps settle this issue down. That cluster is running 8.10.0.6, Conductor is on 8.10.0.12 and we will be upgrading the cluster to match shortly as well.
Thanks for the info & your findings!
------------------------------
Michael Haring
Original Message:
Sent: Apr 04, 2024 11:32 AM
From: n.millward
Subject: Datapath CPU 21 - Utilization above threshold
I can bring this thread that I hijacked to a close now.
As Carson posted earlier about it being a good idea to apply bcmc to all client VLANs, it would appear that his change has sorted it.
We did have another max cpu event across all controllers, and it looks like that was caused because I missed bcmc off a couple of VLANs. Once applied, and the cpu had calmed down (not immediate, it took an aggregation switch reboot to force a drop of all traffic early one morning) we've not seen the problem since.
No idea how we managed to run the infrastructure without bcmc for 3 and a half years without an incident, and then suddenly got repeatedly hammered by it from Feb 20th this year. I can't get my head round that. Can't understand why it's not on by default either, nor why the integrator who set us up in the very first instance didn't turn it on. This is certainly the kind of experience you don't forget in a hurry, and so I would simply do it as a matter of course in any new installations.
Over and out.
------------------------------
Nathan
Original Message:
Sent: Mar 15, 2024 08:42 AM
From: n.millward
Subject: Datapath CPU 21 - Utilization above threshold
Had another cpu spike across all clusters yesterday, starting at 14:10. Caught it in real time as I was watching the upstream port-channel of a controller.
TAC also advised enabling bcmc on all VLANs except mgmt VLAN (as Carson). Started doing that whilst the problem was 'live' yesterday.
Applied bcmc to all client VLANs on controller 4 via the GUI, in two batches of changes. Controller 4 CPU 17 dropped back to normal when the second batch was applied.
Applied bcmc to client VLANs on controller 5 via the GUI, one VLAN at a time starting with the VLANs in the second batch as applied to controller 4. The controller 4 CPU 17 dropped back to normal when the change was applied to VLAN 2511.
Moved on to controller 6 (4,5,6 is one cluster) - applied bcmc via the GUI to VLAN 2511 first, expecting to see the CPU drop – there was no change. Applied bcmc to all client VLANs on controller 6. The CPU stayed at 90%. We restarted controller 6, and the CPU went straight back to 90% after booting, and remained at that level.
Have now applied bcmc to all controllers (10) via the folder level hierarchical config. The CPU on the remaining controllers has remained at 90%. bcmc had no effect.
Collected a port mirror pcap of all in/out traffic from one controller during the cpu spike- very few retransmits observed (no worse than when the cpu is normal).
At 22:20 in the evening, CPU 17 dropped back to normal.
Spoke with TAC again this morning (cpu currently fine)
TAC want:
show processes sort-by cpu
show processes sort-by memory
show cp-bwcontracts
show datapath message-queue counters
show datapath debug dma counters
show datapath maintenance counters
show firewall
show cpuload current x 4
show datapath utilization x 4
show datapath bwm table x 4
show datapath cp-bwm table x 4
+ pcap filtered for tcp retrans
+ uplink images from monitoring server.
Struggling to find any cause on the network presently. The common link is the aggregation router to which all controllers connect. All VLANs, including mgmt VLAN are delivered to the controllers down the port channel. We pondered breaking that model up and separating the mgmt VLAN from the client VLANs to see if that would shed any light. But I'm not entirely sure how disruptive that is going to be.
Waiting game again now. Will post back after the next event.
------------------------------
Nathan
Original Message:
Sent: Mar 13, 2024 11:06 AM
From: chulcher
Subject: Datapath CPU 21 - Utilization above threshold
If you don't need to support a multicast application on the wireless network then I'd highly recommend looking at the BC/MC controls.
------------------------------
Carson Hulcher, ACEX#110
Original Message:
Sent: Mar 13, 2024 10:18 AM
From: n.millward
Subject: Datapath CPU 21 - Utilization above threshold
Thanks for the replies.
MM is 8.10.0.10 LSR.
Clusters are now 8.10.0.8 LSR as we rolled back because this CPU problem first occurred on the .10 version. Still happens on the .8 version though.
I provided TAC with show tech-support output while the problem was occurring. They had a look at that and requested the following output...when the problem is happening, so that's not much use until we hit the wall again!
"I have checked the logs, and I have noticed that we could see policed frames hitting the Datapath bwm table. If possible, when the issue happened at that time, please collect the below outputs four times in 30 seconds and send them to me.
Show cpuload current
show datapath utilization
show datapath bwm table
show datapath cp-bwm table
show firewall"
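Since the spikes can start and end without warning, it may help to have the collection loop ready in advance. Below is a minimal Python sketch of "four rounds inside 30 seconds" for the commands TAC listed. The `run_command` callable is a placeholder I've assumed for however you reach the controller CLI (SSH library, console wrapper, etc.); it is not an Aruba API, and the timing ignores how long each command takes to return.

```python
import time

# Commands quoted from the TAC request above.
TAC_COMMANDS = [
    "show cpuload current",
    "show datapath utilization",
    "show datapath bwm table",
    "show datapath cp-bwm table",
    "show firewall",
]


def collect_rounds(run_command, commands=TAC_COMMANDS, rounds=4,
                   window_seconds=30.0, sleep=time.sleep):
    """Run every command `rounds` times, spreading the rounds over the window.

    run_command: hypothetical callable taking a CLI string and returning
                 its output as text (supply your own transport).
    Returns a list with one dict per round, mapping command -> output.
    """
    # Gap chosen so the first and last rounds start window_seconds apart.
    gap = window_seconds / (rounds - 1) if rounds > 1 else 0.0
    results = []
    for i in range(rounds):
        results.append({cmd: run_command(cmd) for cmd in commands})
        if i < rounds - 1:
            sleep(gap)
    return results
```

Each round is kept as its own dict so the four samples can be saved to separate files or compared side by side before sending them on.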
What we see on our RADIUS servers when the problem kicks off is that the ordered, pretty well-balanced incoming RADIUS traffic from the controllers goes all scruffy until the CPU problem sorts itself out, and then the orderliness returns. In the clip below, the RADIUS traffic goes wrong at the vertical line, which corresponds with the CPU problem and client auth difficulty. Normal service is not resumed in this image; it just got quieter overnight.
Carson mentioned mcast and bcast traffic. This is what mcast looks like during the time of the issues we had over the last weekend (it's a 7 day chart so you can see what normal is)
Skewed to one interface in the aggregated link?
bcmc-optimization is off, default.
Upstream I've got to query other engineers for a lift there, but from what I can pull together there is no suppression. I'd like to see something in the Huawei switch logs to indicate some kind of storm.
Still looking. Things are OK presently. Will post back if I hear more from TAC.
------------------------------
Nathan
Original Message:
Sent: Mar 11, 2024 09:06 AM
From: leo.ma
Subject: Datapath CPU 21 - Utilization above threshold
My AC 7005 at home sometimes has similar alarms, but it does not affect the network. Your version is too old. I think you should contact TAC again after the upgrade. I look forward to your follow-up.
------------------------------
If my post was useful, please Accept Solution and Give Kudos.
leo ma
ACMX
Original Message:
Sent: May 25, 2016 02:53 PM
From: mharing
Subject: Datapath CPU 21 - Utilization above threshold
Good afternoon,
Looking for some insight here. I already have a TAC case opened, but was hoping someone on here might have experienced something like this before and could provide some information. We are seeing numerous CPU spikes on a daily basis, all reported by the fpapps process. I know fpapps is used for L2/L3 connectivity to the controller, but don't know what could be causing these constant spikes. They happen at random intervals but have always lasted exactly 60 seconds, except that recently we had a few that lasted 180 seconds. I'm hoping someone can help; below are the most recent logs:
May 25 12:33:55 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:12%).
May 25 13:11:58 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:97%).
May 25 13:12:58 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:17%).
May 25 13:16:58 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:96%).
May 25 13:17:58 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:14%).
May 25 13:18:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:92%).
May 25 13:19:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:9%).
May 25 13:21:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:99%).
May 25 13:24:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:13%).
May 25 13:26:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:99%).
May 25 13:29:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:9%).
May 25 13:34:00 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:99%).
May 25 13:35:00 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:9%).
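To check the "always exactly 60 seconds" pattern across a longer log, the exceeded/dropped pairs can be matched up automatically. This is a rough Python sketch that assumes the lines look exactly like the excerpt above; the `year` argument is my own addition because the log timestamps carry no year.

```python
import re
from datetime import datetime

# Matches the fpapps threshold lines as they appear in the excerpt above.
LINE_RE = re.compile(
    r"(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) .*"
    r"Resource '(?P<res>[^']+)' has (?P<event>exceeded|dropped below) "
    r"(?P<thr>\d+)% threshold \(actual:(?P<actual>\d+)%\)"
)


def spike_durations(lines, year=2016):
    """Pair each 'exceeded' line with the next 'dropped below' line for the
    same resource.

    Returns a list of (resource, start_time, duration_seconds, actual_pct),
    where actual_pct is the value reported on the 'exceeded' line.
    """
    open_spikes = {}  # resource -> (start time, reported utilization)
    spikes = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not a threshold line
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        res = m["res"]
        if m["event"] == "exceeded":
            open_spikes[res] = (ts, int(m["actual"]))
        elif res in open_spikes:
            start, pct = open_spikes.pop(res)
            spikes.append((res, start, int((ts - start).total_seconds()), pct))
    return spikes
```

A "dropped below" line with no earlier "exceeded" line (like the first line of the excerpt) is skipped. Run against the excerpt above, this should report four 60-second spikes and two 180-second ones.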
Controller is a 7220 with AOS 6.3.1.16, handling about 512 APs and around 3700 clients. This appears to have been happening for at least a month before it was noticed. It is not currently affecting clients or APs, but we don't want to get to that point.
Thanks.