Wireless Access

Occasional Contributor I

MM becomes unresponsive - processes stuck in PROCESS_NOT_RESPONDING_CRITICAL - reproduces on 8.5/8.6

tl;dr - VMM becomes unresponsive after a short period, and must be periodically rebooted.

The MM is running on Proxmox VE 6.1 (KVM). We've allocated 3 vCPU cores, 6 GB of RAM, and two virtual disks of 6 GB and 8 GB.
We have three controllers (MDs), all Aruba 7030 hardware appliances.
The Wi-Fi setup isn't complex: a few WLANs on WPA2-Personal, and one WLAN on WPA2-Enterprise using the internal auth server on the MM. We have enabled DPI as well as WebCC (although I believe the issue reproduced without WebCC too).
The issue reproduces on both AOS 8.6.0.0 and AOS 8.5.0.5.
After a while (a few hours to a day or so), the MM becomes unresponsive: the Web UI simply hangs when you try to access it, and the CLI hangs on commands. Several processes end up in state PROCESS_NOT_RESPONDING or PROCESS_NOT_RESPONDING_CRITICAL:

(ArubaMM-VA_EA_F5_7F) [mynode] #show process monitor statistics
 
Process Monitoring Action: Log Message
 
 
Process Monitor Statistics
--------------------------
Name State Restarts Allowed Restarts Timeout Value Timeout Chances Time Started
---- ----- ---------------- -------- ------------- --------------- ------------
/mswitch/bin/cmdexec PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:12 2020
/mswitch/bin/dbstart_pgsql PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:12 2020
/mswitch/bin/dbstart_mongo PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:13 2020
/mswitch/bin/ctrlmgmt PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:13 2020
/mswitch/bin/packet_filter PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:14 2020
/mswitch/bin/cryptoPOST PROCESS_RUNNING 8 0 240 3 Sun Jan 12 23:09:14 2020
/mswitch/bin/gsmmgr PROCESS_RUNNING 8 0 240 3 Sun Jan 12 23:09:14 2020
/mswitch/bin/pubsub PROCESS_RUNNING 8 0 240 3 Sun Jan 12 23:09:14 2020
/mswitch/bin/cfgm PROCESS_RUNNING 8 0 240 3 Sun Jan 12 23:09:14 2020
/mswitch/bin/rng-mgr PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:15 2020
/mswitch/bin/certmgr PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:15 2020
/mswitch/bin/cfgdist PROCESS_RUNNING 8 0 240 3 Sun Jan 12 23:09:15 2020
/mswitch/bin/syslogdwrap PROCESS_RUNNING 8 0 240 3 Sun Jan 12 23:09:15 2020
/mswitch/bin/aaa PROCESS_RUNNING 8 0 240 3 Sun Jan 12 23:09:15 2020
/mswitch/bin/fpapps PROCESS_RUNNING 0 0 240 3 Sun Jan 12 23:09:16 2020
/mswitch/bin/lagm PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:16 2020
/mswitch/bin/pim PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:16 2020
/mswitch/bin/licensemgr PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:16 2020
/mswitch/bin/isakmpd PROCESS_RUNNING 8 0 240 3 Sun Jan 12 23:09:16 2020
/mswitch/bin/profmgr PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:17 2020
/mswitch/bin/msghh PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:17 2020
/mswitch/bin/auth PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:17 2020
/mswitch/bin/appRF PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:17 2020
/mswitch/bin/stm PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:18 2020
/mswitch/bin/amon_sender PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:18 2020
/mswitch/bin/amon_recvr PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:18 2020
/mswitch/bin/wms PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:18 2020
/mswitch/bin/udbserver PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:18 2020
/mswitch/bin/dhcpdwrap PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:19 2020
/mswitch/bin/rsyncwrap PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:19 2020
/mswitch/bin/radvdwrap PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:19 2020
/mswitch/bin/mobileip PROCESS_RUNNING 8 0 240 3 Sun Jan 12 23:09:19 2020
/mswitch/bin/phonehome PROCESS_RUNNING 8 0 240 3 Sun Jan 12 23:09:19 2020
/mswitch/bin/hwMon PROCESS_RUNNING 8 0 240 3 Sun Jan 12 23:09:20 2020
/mswitch/bin/snmpd PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:20 2020
/mswitch/bin/trapd PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:20 2020
/mswitch/bin/ntpwrap PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:20 2020
/mswitch/bin/dbsync PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:21 2020
/mswitch/bin/slb PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:21 2020
/mswitch/bin/resolvwrap PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:21 2020
/mswitch/bin/resolv_hostname PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:21 2020
/mswitch/bin/cts PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:22 2020
/mswitch/bin/httpd_wrap PROCESS_RUNNING 8 0 240 3 Sun Jan 12 23:09:22 2020
/mswitch/bin/fw_visibility PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:23 2020
/mswitch/bin/ctamon PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:23 2020
/mswitch/bin/utild PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:23 2020
/mswitch/bin/ospf PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:24 2020
/mswitch/bin/lldpd PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:24 2020
/mswitch/bin/util_proc PROCESS_RUNNING 8 0 240 3 Sun Jan 12 23:09:24 2020
/mswitch/bin/cpsec PROCESS_NOT_RESPONDING 8 0 240 3 Sun Jan 12 23:09:24 2020
/mswitch/bin/spectrum PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:25 2020
/mswitch/bin/iapmgr PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:25 2020
/mswitch/bin/mdns PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:25 2020
/mswitch/bin/arm PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:25 2020
/mswitch/bin/mcell PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:26 2020
/mswitch/bin/dds PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:26 2020
/mswitch/bin/ipstm PROCESS_RUNNING 8 0 240 3 Sun Jan 12 23:09:26 2020
/mswitch/bin/ha_mgr PROCESS_RUNNING 8 0 240 3 Sun Jan 12 23:09:26 2020
/mswitch/bin/ucm PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:26 2020
/mswitch/bin/web_cc PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:27 2020
/mswitch/bin/cert_dwnld PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:27 2020
/mswitch/bin/sc_rep_mgr PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:27 2020
/mswitch/bin/redisdbstart PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:27 2020
/mswitch/bin/nbapistart PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:27 2020
/mswitch/bin/nbapi_helper PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:28 2020
/mswitch/bin/mon_serv PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:28 2020
/mswitch/bin/mon_serv_fwv PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:28 2020
/mswitch/bin/cluster_upg_mgr PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:28 2020
/mswitch/bin/ofc_cli_agent PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:28 2020
/mswitch/bin/splunk_zmq PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:29 2020
/mswitch/bin/topology PROCESS_NOT_RESPONDING_CRITICAL - 0 240 3 Sun Jan 12 23:09:29 2020
/mswitch/bin/topology_discovery PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:29 2020
/mswitch/bin/routing_switch PROCESS_NOT_RESPONDING_CRITICAL - 0 240 3 Sun Jan 12 23:09:29 2020
/mswitch/bin/flow_manager PROCESS_NOT_RESPONDING_CRITICAL - 0 240 3 Sun Jan 12 23:09:29 2020
/mswitch/bin/packetin_dispatcher PROCESS_NOT_RESPONDING_CRITICAL - 0 240 3 Sun Jan 12 23:09:30 2020
/mswitch/bin/event_dispatcher PROCESS_NOT_RESPONDING_CRITICAL - 0 240 3 Sun Jan 12 23:09:30 2020
/mswitch/bin/switch_manager PROCESS_NOT_RESPONDING_CRITICAL - 0 240 3 Sun Jan 12 23:09:30 2020
/mswitch/bin/upgrademgr PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:30 2020
/mswitch/bin/dpagent PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:30 2020
/mswitch/bin/mcellsolverstart PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:31 2020
/mswitch/bin/airmatch_recv PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:31 2020
/mswitch/bin/bocmgr PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:31 2020
/mswitch/bin/hcm PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:31 2020
/mswitch/bin/upl_sync_mgr PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:31 2020
/mswitch/bin/vrrp PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:32 2020
/mswitch/bin/ble_relay PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:32 2020
/mswitch/bin/dot1x1 PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:32 2020
/mswitch/bin/im_helper PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:32 2020
/mswitch/bin/user_visibility PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:32 2020
/mswitch/bin/impystart PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:33 2020
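Listings like the one above are long enough that the stuck entries are easy to miss. A short script can filter the pasted `show process monitor statistics` output down to just the processes that are not in PROCESS_RUNNING state (a sketch, assuming the row format shown above; the sample data is copied from the paste):

```python
# Filter `show process monitor statistics` output for processes that are
# not in PROCESS_RUNNING state. Data rows start with the process path,
# followed by the state, e.g. "/mswitch/bin/cpsec PROCESS_NOT_RESPONDING ...".

def stuck_processes(output: str) -> list:
    stuck = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].startswith("/mswitch/bin/"):
            name, state = parts[0], parts[1]
            if state != "PROCESS_RUNNING":
                stuck.append((name, state))
    return stuck

if __name__ == "__main__":
    sample = """\
/mswitch/bin/cpsec PROCESS_NOT_RESPONDING 8 0 240 3 Sun Jan 12 23:09:24 2020
/mswitch/bin/spectrum PROCESS_RUNNING - 0 240 3 Sun Jan 12 23:09:25 2020
/mswitch/bin/topology PROCESS_NOT_RESPONDING_CRITICAL - 0 240 3 Sun Jan 12 23:09:29 2020"""
    for name, state in stuck_processes(sample):
        print(f"{name}: {state}")
```

Run against the full paste, this reduces the table to the handful of unresponsive processes (cpsec, topology, routing_switch, flow_manager, packetin_dispatcher, event_dispatcher, switch_manager).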

We have collected tech-support logs before it reaches the hanging state, but we can't collect them during the hang: the Web UI is frozen, and even "show tech-support" from the CLI stalls partway through.
Aruba TAC seems stumped, and it's been a couple of weeks - it's driving us bonkers.

 

MVP Guru

Re: MM becomes unresponsive - processes stuck in PROCESS_NOT_RESPONDING_CRITICAL - reproduces on 8.5

Are you sure the underlying hardware is functioning correctly? And are you sure you haven't oversubscribed the KVM host? The virtual Mobility Master should have dedicated resources assigned to it, not shared with other workloads.

 

These messages suggest to me a resource shortage coming from the KVM host. When you see these non-responding processes, what does resource monitoring show (CPU, memory, disk I/O, I/O wait), both inside the VM and on the KVM host?

 

Did you follow the guidelines from the ArubaOS Virtual Appliance Installation Guide section on KVM, such as using network interfaces of the virtio type?
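For reference, since you're on Proxmox: the NIC model is set per VM in /etc/pve/qemu-server/&lt;vmid&gt;.conf (or via the GUI under Hardware > Network Device > Model: "VirtIO (paravirtualized)"). A virtio interface line looks roughly like this - the MAC address and bridge name here are illustrative placeholders, not values from your setup:

```
net0: virtio=DE:AD:BE:EF:00:01,bridge=vmbr0
```

If the line says e1000 or rtl8139 instead of virtio, the VM is using an emulated NIC, which is noticeably slower and not what the installation guide recommends.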

 

Do you have the capacity to assign more resources to the VMM, to see whether the issue is a resource shortage or lies somewhere else?

--
If you have urgent issues, please contact your Aruba partner or Aruba TAC (click for contact details).