Wireless Access


Access network design for branch, remote, outdoor, and campus locations with HPE Aruba Networking access points and mobility controllers.

Datapath CPU 21 - Utilization above threshold

  • 1.  Datapath CPU 21 - Utilization above threshold

    MVP
    Posted May 25, 2016 02:54 PM

    Good afternoon,

     

    Looking for some insight here. I already have a TAC case open, but was hoping someone on here might have experienced something like this before and could provide some information. We are seeing numerous CPU spikes on a daily basis, all coming from the fpapps process. I know fpapps is used for L2/L3 connectivity to the controller, but I don't know what could be causing these constant spikes. They happen at random intervals. Until recently they always lasted exactly 60 seconds, but we have now had a few that lasted 180 seconds. I'm hoping someone can help. Below are the most recent logs:

     

    May 25 12:33:55 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:12%).
    May 25 13:11:58 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:97%).
    May 25 13:12:58 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:17%).
    May 25 13:16:58 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:96%).
    May 25 13:17:58 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:14%).
    May 25 13:18:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:92%).
    May 25 13:19:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:9%).
    May 25 13:21:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:99%).
    May 25 13:24:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:13%).
    May 25 13:26:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:99%).
    May 25 13:29:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:9%).
    May 25 13:34:00 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:99%).
    May 25 13:35:00 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:9%).
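
    The spike durations can be read off by pairing each "exceeded" line with the next "dropped below" line. A minimal Python sketch, assuming the log format shown above, an English locale for month names, and the year 2016 (the log lines carry no year). The names spike_durations and SAMPLE are illustrative only:

```python
import re
from datetime import datetime

# Matches the threshold messages shown in the log excerpt above.
LOG_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) .*"
    r"Resource '(?P<res>[^']+)' has (?P<event>exceeded|dropped below) "
    r"(?P<thr>\d+)% threshold \(actual:(?P<actual>\d+)%\)"
)

# Short excerpt copied from the logs in this post.
SAMPLE = """\
May 25 12:33:55 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:12%).
May 25 13:11:58 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:97%).
May 25 13:12:58 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:17%).
May 25 13:21:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has exceeded 70% threshold (actual:99%).
May 25 13:24:59 :399838: <WARN> |fpapps| Resource 'Datapath CPU 21' has dropped below 70% threshold (actual:13%).
"""


def spike_durations(lines, year=2016):
    """Return (resource, start, duration_seconds, reported_percent) per spike."""
    started = {}  # resource name -> (start time, utilization reported at start)
    spikes = []
    for line in lines:
        m = LOG_RE.search(line.strip())
        if not m:
            continue
        # The year is an assumption; it is not present in the log line.
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        res = m["res"]
        if m["event"] == "exceeded":
            started[res] = (ts, int(m["actual"]))
        elif res in started:
            # A "dropped below" line with no earlier "exceeded" line is ignored.
            start, percent = started.pop(res)
            spikes.append((res, start, int((ts - start).total_seconds()), percent))
    return spikes


if __name__ == "__main__":
    for res, start, seconds, percent in spike_durations(SAMPLE.splitlines()):
        print(f"{res}: {start:%H:%M:%S} lasted {seconds}s (reported {percent}%)")
```

    Run against the full log above, this kind of pairing makes the 60-second and 180-second durations easy to tabulate over a longer period.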

     

     

    The controller is a 7220 running AOS 6.3.1.16, handling about 512 APs and around 3700 clients. This appears to have been happening for at least a month before it was noticed. It is not currently affecting clients or APs, but we don't want to get to that point.

     

    Thanks.



  • 2.  RE: Datapath CPU 21 - Utilization above threshold

    Posted Mar 09, 2024 04:11 AM

    Hello, I don't suppose you have any recollection of how this datapath problem was rectified, do you? We have a similar problem here, only the CPU stays above 90%.

    Thanks.



    ------------------------------
    Nathan
    ------------------------------



  • 3.  RE: Datapath CPU 21 - Utilization above threshold

    EMPLOYEE
    Posted Mar 09, 2024 09:27 AM

    Either a bug or you've got a lot of traffic going through the controller with the same source and destination.

    Your best bet is going to be opening a case with TAC.



    ------------------------------
    Carson Hulcher, ACEX#110
    ------------------------------



  • 4.  RE: Datapath CPU 21 - Utilization above threshold

    Posted Mar 10, 2024 04:32 AM

    Thanks Carson.
    For reference: 3 clusters, 10 controllers, all doing the same thing at exactly the same time, from Friday around 4pm until 2am this morning (Sunday), and then back to normal.
    How would I spot traffic with the same source:destination on the controller? And what would cause that?
    I do have show tech-support from one cluster yesterday when the cpu was red-lining, and again just now when it's back down to normal. I opened a case with our partner yesterday morning.

    Thanks.

    Nathan.



    ------------------------------
    Nathan
    ------------------------------



  • 5.  RE: Datapath CPU 21 - Utilization above threshold

    EMPLOYEE
    Posted Mar 10, 2024 12:46 PM

    For something that widespread, the traffic would almost have to be multicast.  Do you have broadcast/multicast controls enabled on your WLANs?  What about the interfaces?
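
    One offline way to look for this is to summarize flow records exported from a capture or monitoring tool. A minimal Python sketch, assuming the records are available as (source IP, destination IP, packet count) tuples. The record layout, the function name summarize_flows, and the addresses used are all hypothetical. It only recognizes the limited broadcast address, since subnet broadcasts need the prefix length:

```python
import ipaddress
from collections import Counter

LIMITED_BROADCAST = ipaddress.ip_address("255.255.255.255")


def summarize_flows(flows, top=5):
    """Summarize (src_ip, dst_ip, packets) records.

    Returns (packets per traffic kind, top broadcast/multicast destinations,
    top source/destination pairs).
    """
    by_kind = Counter()
    bcmc_dst = Counter()
    pairs = Counter()
    for src, dst, packets in flows:
        addr = ipaddress.ip_address(dst)
        if addr.is_multicast:
            kind = "multicast"
        elif addr == LIMITED_BROADCAST:
            kind = "broadcast"
        else:
            kind = "unicast"
        by_kind[kind] += packets
        pairs[(src, dst)] += packets
        if kind != "unicast":
            bcmc_dst[dst] += packets
    return by_kind, bcmc_dst.most_common(top), pairs.most_common(top)


if __name__ == "__main__":
    # Hypothetical records for illustration only.
    flows = [
        ("10.1.1.10", "224.0.0.251", 500),
        ("10.1.1.11", "239.255.255.250", 200),
        ("10.1.1.10", "10.2.2.2", 900),
        ("10.1.1.12", "255.255.255.255", 50),
        ("10.1.1.10", "224.0.0.251", 100),
    ]
    by_kind, top_bcmc, top_pairs = summarize_flows(flows)
    print(dict(by_kind))
    print(top_bcmc)
    print(top_pairs)
```

    A single source/destination pair or a handful of multicast groups dominating the totals during a spike would be the thing to look for.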



    ------------------------------
    Carson Hulcher, ACEX#110
    ------------------------------



  • 6.  RE: Datapath CPU 21 - Utilization above threshold

    MVP
    Posted Mar 11, 2024 08:15 AM

    Unless you are running AOS 6.3.x.x, I doubt any solution would apply to your case anyway. AOS 8.x.x was a complete greenfield rewrite.



    ------------------------------
    Bruce Osborne ACCP ACMP
    Liberty University

    The views expressed here are my personal views and not those of my employer
    ------------------------------



  • 7.  RE: Datapath CPU 21 - Utilization above threshold

    Posted Mar 11, 2024 09:06 AM

    My AC 7005 at home sometimes has similar alarms, but they do not affect the network. Your version is too old. I think you should contact TAC again after the upgrade. I look forward to your follow-up.



    ------------------------------
    If my post was useful, please Accept Solution and Give Kudos.

    leo ma

    ACMX
    ------------------------------



  • 8.  RE: Datapath CPU 21 - Utilization above threshold

    Posted Mar 13, 2024 10:19 AM

    Thanks for the replies.
    MM is 8.10.0.10 LSR.
    Clusters are now 8.10.0.8 LSR as we rolled back because this CPU problem first occurred on the .10 version. Still happens on the .8 version though.
    I provided TAC with show tech-support output taken while the problem was occurring. They had a look at that and requested the following output, collected when the problem is happening, so that's not much use until we hit the wall again!

    "I have checked the logs, and I have noticed that we could see policed frames hitting the Datapath bwm table. If possible, when the issue happened at that time, please collect the below outputs four times in 30 seconds and send them to me.
    Show cpuload current
    show datapath utilization
    show datapath bwm table
    show datapath cp-bwm table
    show firewall"
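
    Collecting that by hand during an incident is fiddly, so the rounds can be scripted. A minimal Python sketch, assuming some callable run_command that sends one CLI command to the controller and returns its output (for example a wrapper around an existing SSH session). That callable, the function name collect_rounds, and the even 10-second spacing are assumptions, not anything TAC specified:

```python
import time

# Commands quoted from the TAC request above.
TAC_COMMANDS = [
    "show cpuload current",
    "show datapath utilization",
    "show datapath bwm table",
    "show datapath cp-bwm table",
    "show firewall",
]


def collect_rounds(run_command, commands=TAC_COMMANDS, rounds=4,
                   window_s=30, sleep=time.sleep):
    """Run every command once per round, spacing rounds evenly over window_s.

    run_command is a hypothetical callable: command string in, output text out.
    Returns a list of (round_number, command, output) tuples.
    """
    interval = window_s / (rounds - 1) if rounds > 1 else 0
    results = []
    for round_no in range(1, rounds + 1):
        for command in commands:
            results.append((round_no, command, run_command(command)))
        if round_no < rounds:
            sleep(interval)  # wait before starting the next round
    return results


if __name__ == "__main__":
    # Stand-in for a real session: echoes the command instead of running it.
    demo = collect_rounds(lambda cmd: f"<output of {cmd}>", sleep=lambda s: None)
    print(len(demo), "outputs collected")
```

    The time each command takes to run is not subtracted from the interval, so the real window will be slightly longer than 30 seconds.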

    What we see on our RADIUS servers when the problem kicks off is that the ordered and pretty well balanced incoming RADIUS traffic from the controllers goes all scruffy until the CPU problem sorts itself out, and then the orderliness returns. In the clip below, the RADIUS traffic goes wrong at the vertical line, which corresponds with the CPU problem and client auth difficulty. Normal service is not resumed in this image; it just got quieter overnight.


    Carson mentioned mcast and bcast traffic. This is what mcast looks like during the issues we had over the last weekend (it's a 7-day chart, so you can see what normal is).
    Skewed to one interface in the aggregated link?

    bcmc-optimization is off, default.
    Upstream, I've got to query other engineers for a lift there, but from what I can pull together there is no suppression. I'd like to see something in the Huawei switch logs to indicate some kind of storm.
    Keeping on looking. Things are OK presently. Will post back if I hear more from TAC.




    ------------------------------
    Nathan
    ------------------------------



  • 9.  RE: Datapath CPU 21 - Utilization above threshold

    EMPLOYEE
    Posted Mar 13, 2024 11:07 AM

    If you don't need to support a multicast application on the wireless network then I'd highly recommend looking at the BC/MC controls.



    ------------------------------
    Carson Hulcher, ACEX#110
    ------------------------------



  • 10.  RE: Datapath CPU 21 - Utilization above threshold

    Posted Mar 15, 2024 08:42 AM

    Had another CPU spike across all clusters yesterday, starting at 14:10. Caught it in real time as I was watching the upstream port-channel of a controller.

    TAC also advised enabling bcmc on all VLANs except the mgmt VLAN (as Carson suggested). Started doing that whilst the problem was 'live' yesterday.

    Applied bcmc to all client VLANs on controller 4 via the GUI, in two batches of changes. Controller 4 CPU 17 dropped back to normal when the second batch was applied.

    Applied bcmc to client VLANs on controller 5 via the GUI, one VLAN at a time, starting with the VLANs in the second batch as applied to controller 4. The controller 5 CPU 17 dropped back to normal when the change was applied to VLAN 2511.

    Moved on to controller 6 (4,5,6 is one cluster) - applied bcmc via the GUI to VLAN 2511 first, expecting to see the CPU drop – there was no change. Applied bcmc to all client VLANs on controller 6. The CPU stayed at 90%. We restarted controller 6, and the CPU went straight back to 90% after booting, and remained at that level.

    Have now applied bcmc to all controllers (10) via the folder level hierarchical config. The CPU on the remaining controllers has remained at 90%. bcmc had no effect.

    Collected a port mirror pcap of all in/out traffic from one controller during the CPU spike. Very few retransmits were observed (no worse than when the CPU is normal).

    At 22:20 in the evening, CPU 17 dropped back to normal.

    Spoke with TAC again this morning (cpu currently fine)
    TAC want:

    show processes sort-by cpu
    show processes sort-by memory
    show cp-bwcontracts
    show datapath message-queue counters
    show datapath debug dma counters
    show datapath maintenance counters
    show firewall
    show cpuload current x 4
    show datapath utilization x 4
    show datapath bwm table x 4
    show datapath cp-bwm table x 4

    + pcap filtered for tcp retrans

    + uplink images from monitoring server.

    Struggling to find any cause on the network presently. The common link is the aggregation router to which all controllers connect. All VLANs, including mgmt VLAN are delivered to the controllers down the port channel. We pondered breaking that model up and separating the mgmt VLAN from the client VLANs to see if that would shed any light. But I'm not entirely sure how disruptive that is going to be.

    Waiting game again now. Will post back after the next event.



    ------------------------------
    Nathan
    ------------------------------



  • 11.  RE: Datapath CPU 21 - Utilization above threshold

    Posted 21 days ago

    I can bring this thread that I hi-jacked to a close now.

    As Carson posted earlier, it is a good idea to apply bcmc to all client VLANs, and it would appear that this change has sorted it.
    We did have another max cpu event across all controllers, and it looks like that was caused because I missed bcmc off a couple of VLANs. Once applied, and the cpu had calmed down (not immediate, it took an aggregation switch reboot to force a drop of all traffic early one morning) we've not seen the problem since.
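
    A quick audit of a saved configuration would have caught the VLANs I missed. A minimal Python sketch, assuming the configuration text has one "interface vlan <id>" stanza per VLAN, ended by a "!" line, with a "bcmc-optimization" line inside when the feature is on. That layout is an assumption to check against your own config, and the function name vlans_missing_bcmc, VLAN 2512, and the addresses are hypothetical:

```python
import re


def vlans_missing_bcmc(config_text, exclude=()):
    """Return VLAN IDs whose 'interface vlan' stanza has no bcmc-optimization line.

    exclude lists VLAN IDs to skip, such as the mgmt VLAN.
    """
    missing = []
    vlan = None       # VLAN ID of the stanza being read, if any
    enabled = False   # whether bcmc-optimization was seen in that stanza

    def close_stanza():
        if vlan is not None and not enabled and vlan not in exclude:
            missing.append(vlan)

    for raw in config_text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"interface vlan (\d+)", line)
        if header:
            close_stanza()
            vlan, enabled = int(header.group(1)), False
        elif line == "!":
            close_stanza()
            vlan = None
        elif vlan is not None and line == "bcmc-optimization":
            enabled = True
    close_stanza()  # handle a final stanza with no closing "!"
    return missing


if __name__ == "__main__":
    # Hypothetical configuration fragment for illustration only.
    sample = """\
interface vlan 10
    ip address 10.0.10.2 255.255.255.0
!
interface vlan 2511
    bcmc-optimization
!
interface vlan 2512
    description clients
!
"""
    print(vlans_missing_bcmc(sample, exclude=(10,)))
```

    With the mgmt VLAN passed in exclude, anything the function returns is a client VLAN that still needs the setting.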

    No idea how we managed to run the infrastructure without bcmc for 3 and a half years without an incident, and then suddenly got repeatedly hammered by it from Feb 20th this year. I can't get my head round that. I can't understand why it's not on by default either, nor why the integrator who set us up in the very first instance didn't turn it on. This is certainly the kind of experience you don't forget in a hurry, so I would simply enable it as a matter of course in any new installation.
    Over and out.



    ------------------------------
    Nathan
    ------------------------------



  • 12.  RE: Datapath CPU 21 - Utilization above threshold

    EMPLOYEE
    Posted 21 days ago

    Glad to hear you resolved the issue.  I'd look at other changes in the network or services used, see if there was a change to enable multicast on the wired side or some multicast application was implemented.  If you've got any kind of application/traffic monitoring in place, that might be helpful.



    ------------------------------
    Carson Hulcher, ACEX#110
    ------------------------------



  • 13.  RE: Datapath CPU 21 - Utilization above threshold

    MVP
    Posted 21 days ago

    We were involved with the initial implementation of BCMC optimization due to our use of multicast IPTV at that time. Perhaps there have been some changes in Apple devices that generate more Bonjour-related multicast traffic.

    I would suspect BCMC technically is not in the Wi-Fi standard so would not be the default, for certification purposes.



    ------------------------------
    Bruce Osborne ACCP ACMP
    Liberty University

    The views expressed here are my personal views and not those of my employer
    ------------------------------