Airheads Community

Access network design for branch, remote, outdoor, and campus locations with HPE Aruba Networking access points and mobility controllers.

Datapath CPU 21 - Utilization above threshold

1. Datapath CPU 21 - Utilization above threshold

5 Kudos
MVP

mharing
Posted May 25, 2016 02:54 PM

Good afternoon,

Looking for some insight here. I already have a TAC case open, but was hoping someone on here might have experienced something like this before and could provide some information. We are seeing numerous CPU spikes on a daily basis, all reported by the fpapps process. I know fpapps is used for L2/L3 connectivity to the controller, but I don't know what could be causing these constant spikes. They happen at random intervals and always lasted exactly 60 seconds until recently, when we had a few that lasted 180 seconds. I'm hoping someone can help. Below are the most recent logs:

May 25 12:33:55 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:12%).
May 25 13:11:58 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:97%).
May 25 13:12:58 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:17%).
May 25 13:16:58 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:96%).
May 25 13:17:58 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:14%).
May 25 13:18:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:92%).
May 25 13:19:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:9%).
May 25 13:21:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:99%).
May 25 13:24:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:13%).
May 25 13:26:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:99%).
May 25 13:29:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:9%).
May 25 13:34:00 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:99%).
May 25 13:35:00 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:9%).
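
The 60-second and 180-second durations can be read straight off the log by pairing each "exceeded" line with the next "dropped below" line. The following is a minimal Python sketch of that pairing, not a supported tool: the regular expression is an assumption based only on the sample lines above, and the function name `spike_durations` and the `year` parameter are hypothetical.

```python
import re
from datetime import datetime

# Matches the fpapps threshold messages quoted above; the pattern is an
# assumption derived only from these sample lines.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) .*"
    r"has (?P<event>exceeded|dropped below) \d+% threshold"
)


def spike_durations(lines, year=2016):
    """Return the length in seconds of each exceeded -> dropped-below pair."""
    durations = []
    started = None
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        # Syslog timestamps carry no year, so one is supplied explicitly.
        ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
        if m.group("event") == "exceeded":
            started = ts
        elif started is not None:
            durations.append(int((ts - started).total_seconds()))
            started = None
    return durations
```

Feeding it the lines above should show four 60-second spikes and two 180-second spikes; the first "dropped below" line is ignored because no matching "exceeded" line precedes it.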

The controller is a 7220 running AOS 6.3.1.16, handling about 512 APs and around 3700 clients. This appears to have been happening for at least a month before it was noticed. It is not currently affecting clients or APs, but we don't want to get to that point.

Thanks.
2. RE: Datapath CPU 21 - Utilization above threshold

0 Kudos
n.millward
Posted Mar 09, 2024 04:11 AM

Hello, I don't suppose you have any recollection of how this datapath problem was rectified, do you? Similar problem here, only the CPU stays above 90%.

Thanks.

------------------------------
Nathan
------------------------------

3. RE: Datapath CPU 21 - Utilization above threshold

0 Kudos
EMPLOYEE

chulcher
Posted Mar 09, 2024 09:27 AM

Either a bug or you've got a lot of traffic going through the controller with the same source and destination.

Your best bet is going to be opening a case with TAC.

------------------------------
Carson Hulcher, ACEX#110
------------------------------

4. RE: Datapath CPU 21 - Utilization above threshold

0 Kudos
n.millward
Posted Mar 10, 2024 04:32 AM

Thanks Carson.
For reference: 3 clusters, 10 controllers, all doing the same thing at exactly the same time, from Friday around 4pm until 2am this morning (Sunday), and then back to normal.
How would I spot traffic with the same source:destination on the controller? And what would do that?
I do have show tech-support from one cluster yesterday when the cpu was red-lining, and again just now when it's back down to normal. I opened a case with our partner yesterday morning.

Thanks.

Nathan.

------------------------------
Nathan
------------------------------

5. RE: Datapath CPU 21 - Utilization above threshold

0 Kudos
EMPLOYEE

chulcher
Posted Mar 10, 2024 12:46 PM

For something that widespread, the traffic would almost have to be multicast. Do you have broadcast/multicast controls enabled on your WLANs? What about the interfaces?
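
One way to look for the patterns mentioned in this thread (heavy traffic between a single source and destination, or a lot of broadcast/multicast) is to tally flow records offline. The sketch below is only an illustration under stated assumptions: the `flows` input is a hypothetical list of `(src, dst, packets)` tuples, for example extracted from a packet capture, and the helper names `top_pairs` and `is_bcmc` are not part of any Aruba tool.

```python
import ipaddress
from collections import Counter

# The IPv4 limited-broadcast address; multicast is detected via ipaddress.
BROADCAST = ipaddress.ip_address("255.255.255.255")


def top_pairs(flows, n=5):
    """Sum packet counts per (src, dst) pair and return the n busiest pairs."""
    totals = Counter()
    for src, dst, packets in flows:
        totals[(src, dst)] += packets
    return totals.most_common(n)


def is_bcmc(dst):
    """True when dst is an IP multicast or limited-broadcast address."""
    addr = ipaddress.ip_address(dst)
    return addr.is_multicast or addr == BROADCAST
```

A pair that dominates the `top_pairs` result, or a busy destination for which `is_bcmc` returns true, would be a reasonable place to start looking.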

------------------------------
Carson Hulcher, ACEX#110
------------------------------

6. RE: Datapath CPU 21 - Utilization above threshold

0 Kudos
MVP

bosborne
Posted Mar 11, 2024 08:15 AM

Unless you are running AOS 6.3.x.x, I doubt any solution would apply to your case anyway. AOS 8.x.x was a complete greenfield rewrite.

------------------------------
Bruce Osborne ACCP ACMP
Liberty University

The views expressed here are my personal views and not those of my employer
------------------------------

7. RE: Datapath CPU 21 - Utilization above threshold

0 Kudos
leo.ma
Posted Mar 11, 2024 09:06 AM

My 7005 controller at home sometimes raises similar alarms, but they do not affect the network. Your version is too old. I think you should contact TAC again after the upgrade. I look forward to your follow-up.

------------------------------
If my post was useful, please Accept Solution and Give Kudos.

leo ma

ACMX
------------------------------

8. RE: Datapath CPU 21 - Utilization above threshold

0 Kudos
n.millward
Posted Mar 13, 2024 10:19 AM

Thanks for the replies.
MM is 8.10.0.10 LSR.
Clusters are now 8.10.0.8 LSR as we rolled back because this CPU problem first occurred on the .10 version. Still happens on the .8 version though.
I provided TAC with show tech-support output captured while the problem was occurring. They had a look at that and requested the following output, to be collected while the problem is happening, so that's not much use until we hit the wall again!

"I have checked the logs, and I have noticed that we could see policed frames hitting the Datapath bwm table. If possible, when the issue happened at that time, please collect the below outputs four times in 30 seconds and send them to me.
Show cpuload current
show datapath utilization
show datapath bwm table
show datapath cp-bwm table
show firewall"
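
Collecting five outputs four times within 30 seconds is awkward to do by hand in the middle of an incident, so it can help to script the loop in advance. The sketch below shows only the timing logic and makes several assumptions: `run_command` is a hypothetical callable that takes a command string and returns its output (for example, a thin wrapper around whatever SSH session you already use), and spacing the four rounds evenly across the 30-second window is one interpretation of TAC's request.

```python
import time

# Commands quoted from the TAC request above.
COMMANDS = [
    "show cpuload current",
    "show datapath utilization",
    "show datapath bwm table",
    "show datapath cp-bwm table",
    "show firewall",
]


def collect_rounds(run_command, rounds=4, window_s=30.0, sleep=time.sleep):
    """Run every command `rounds` times, spread evenly across `window_s`.

    `run_command` is a hypothetical callable (command string -> output
    string). Returns a list of (round_number, command, output) tuples.
    """
    interval = window_s / (rounds - 1) if rounds > 1 else 0.0
    results = []
    for n in range(rounds):
        if n:
            # Wait between rounds, but not before the first one.
            sleep(interval)
        for cmd in COMMANDS:
            results.append((n + 1, cmd, run_command(cmd)))
    return results
```

The `sleep` parameter is injectable so the loop can be dry-run without waiting; with the defaults, the rounds start roughly 0, 10, 20 and 30 seconds after the first command, plus however long the commands themselves take.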

What we see on our RADIUS servers when the problem kicks off is that ordered, and pretty well balanced incoming RADIUS traffic from the controllers, goes all scruffy, until the CPU problem sorts itself out and then the orderliness returns. In the clip below, the RADIUS traffic goes wrong at the vertical line, which corresponds with the CPU problem and client auth difficulty. Normal service is not resumed in this image, it just got quieter over night.

Carson mentioned mcast and bcast traffic. This is what mcast looks like during the time of the issues we had over the last weekend (it's a 7 day chart so you can see what normal is)
Skewed to one interface in the aggregated link?

bcmc-optimization is off, default.
Upstream, I've got to ask other engineers for a lift there, but from what I can pull together there is no suppression. I'd like to see something in the Huawei switch logs to indicate some kind of storm.
Still looking. Things are OK at present. Will post back if I hear more from TAC.

------------------------------
Nathan
------------------------------

9. RE: Datapath CPU 21 - Utilization above threshold

0 Kudos
EMPLOYEE

chulcher
Posted Mar 13, 2024 11:07 AM

If you don't need to support a multicast application on the wireless network then I'd highly recommend looking at the BC/MC controls.

------------------------------
Carson Hulcher, ACEX#110
------------------------------

10. RE: Datapath CPU 21 - Utilization above threshold

0 Kudos
n.millward
Posted Mar 15, 2024 08:42 AM

Had another CPU spike across all clusters yesterday, starting at 14:10. Caught it in real time as I was watching the upstream port-channel of a controller.

TAC also advised enabling bcmc on all VLANs except the mgmt VLAN (as Carson did). I started doing that whilst the problem was 'live' yesterday.

Applied bcmc to all client VLANs on controller 4 via the GUI, in two batches of changes. Controller 4 CPU 17 dropped back to normal when the second batch was applied.

Applied bcmc to client VLANs on controller 5 via the GUI, one VLAN at a time, starting with the VLANs in the second batch as applied to controller 4. The controller 5 CPU 17 dropped back to normal when the change was applied to VLAN 2511.

Moved on to controller 6 (4,5,6 is one cluster) - applied bcmc via the GUI to VLAN 2511 first, expecting to see the CPU drop – there was no change. Applied bcmc to all client VLANs on controller 6. The CPU stayed at 90%. We restarted controller 6, and the CPU went straight back to 90% after booting, and remained at that level.

Have now applied bcmc to all controllers (10) via the folder level hierarchical config. The CPU on the remaining controllers has remained at 90%. bcmc had no effect.

Collected a port mirror pcap of all in/out traffic from one controller during the CPU spike: very few retransmits observed (no worse than when the CPU is normal).

At 22:20 in the evening, CPU 17 dropped back to normal.

Spoke with TAC again this morning (cpu currently fine)
TAC want:

show processes sort-by cpu
show processes sort-by memory
show cp-bwcontracts
show datapath message-queue counters
show datapath debug dma counters
show datapath maintenance counters
show firewall
show cpuload current x 4
show datapath utilization x 4
show datapath bwm table x 4
show datapath cp-bwm table x 4

+ pcap filtered for tcp retrans

+ uplink images from monitoring server.

Struggling to find any cause on the network presently. The common link is the aggregation router to which all controllers connect. All VLANs, including mgmt VLAN are delivered to the controllers down the port channel. We pondered breaking that model up and separating the mgmt VLAN from the client VLANs to see if that would shed any light. But I'm not entirely sure how disruptive that is going to be.

Waiting game again now. Will post back after the next event.

------------------------------
Nathan
------------------------------

11. RE: Datapath CPU 21 - Utilization above threshold

0 Kudos
n.millward
Posted 21 days ago

I can bring this thread that I hijacked to a close now.

As Carson posted earlier about it being a good idea to apply bcmc to all client VLANs, it would appear that his change has sorted it.
We did have another max cpu event across all controllers, and it looks like that was caused because I missed bcmc off a couple of VLANs. Once applied, and the cpu had calmed down (not immediate, it took an aggregation switch reboot to force a drop of all traffic early one morning) we've not seen the problem since.

No idea how we managed to run the infrastructure without bcmc for three and a half years without an incident, and then suddenly got repeatedly hammered by it from Feb 20th this year. I can't get my head round that. I can't understand why it's not on by default either, nor why the integrator who set us up in the very first instance didn't turn it on. This is certainly the kind of experience you don't forget in a hurry, so I would simply enable it as a matter of course in any new installation.
Over and out.

------------------------------
Nathan
------------------------------

12. RE: Datapath CPU 21 - Utilization above threshold

0 Kudos
EMPLOYEE

chulcher
Posted 21 days ago

Glad to hear you resolved the issue. I'd look at other changes in the network or services used, see if there was a change to enable multicast on the wired side or some multicast application was implemented. If you've got any kind of application/traffic monitoring in place, that might be helpful.

------------------------------
Carson Hulcher, ACEX#110
------------------------------

13. RE: Datapath CPU 21 - Utilization above threshold

0 Kudos
MVP

bosborne
Posted 21 days ago

We were involved with the initial implementation of BCMC optimization due to our use of multicast IPTV at that time. Perhaps there have been some changes in Apple devices that result in more Bonjour-related multicast traffic.

I would suspect BCMC technically is not in the Wi-Fi standard so would not be the default, for certification purposes.

------------------------------
Bruce Osborne ACCP ACMP
Liberty University

The views expressed here are my personal views and not those of my employer
------------------------------
