Network Management



How are you monitoring Airwave?

  • 1.  How are you monitoring Airwave?

    Posted Mar 26, 2015 09:44 AM

    I've got triggers set up for disk utilization on my primary and backup Airwave, and a trigger for loss of contact with the primary set up on my backup Airwave. I've also set up net-snmp to watch for load, the httpd and postmaster processes, and some significant log messages (see below). I've got Nagios checking the web UI and postgres. It still doesn't seem like enough, though: I find out about most of my Airwave problems through user reports.

     

    What is everyone else doing to make sure that Airwave is working reliably?  If you've got snmpd running on your Airwave, what is in your snmpd.conf?

     

    Here's my snmpd.conf:

    # monitor max/min process counts
    proc httpd 250 1
    proc postmaster 150 1
    
    # kb free threshold for /
    disk / 50000000
    # 20 percent free across the board
    includeAllDisks 20
    # load avg 1min 5min 15min
    load 30 30 30
    
    # watch for significant log messages
    logmatch pgsql-failure /var/log/pgsql 60 remaining connection slots are reserved for non-replication superuser connections
    logmatch rapids-failure /var/log/amp_events 60 No data has been processed
    logmatch segfaults /var/log/messages 60 segfault
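    
    If you poll the AMP from another NMS, the checks above surface as error flags under UCD-SNMP-MIB, so an external poller can alert on them directly. A minimal sketch (the hostname and community string are placeholders, and the exact logMatch column names can vary by net-snmp version):
    
    # read back the proc/disk/load/logmatch checks defined in snmpd.conf above
    snmpwalk -v2c -c public amp.example.com UCD-SNMP-MIB::prErrorFlag    # httpd / postmaster process checks
    snmpwalk -v2c -c public amp.example.com UCD-SNMP-MIB::dskErrorFlag   # disk and includeAllDisks thresholds
    snmpwalk -v2c -c public amp.example.com UCD-SNMP-MIB::laErrorFlag    # load average thresholds
    snmpwalk -v2c -c public amp.example.com UCD-SNMP-MIB::logMatch       # logmatch hit counters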

     



  • 2.  RE: How are you monitoring Airwave?

    EMPLOYEE
    Posted Mar 26, 2015 05:56 PM

    What are the typical outage reports that come in from your users? Is the backup set up as a failover, or is it two active AMPs monitoring the same devices?

     

    If you can help us understand what the outages are, we can help pinpoint ways to generate alerts based on activity.



  • 3.  RE: How are you monitoring Airwave?

    Posted Mar 27, 2015 09:14 AM

    That's the thing. I can easily enough come up with a check for an issue I've already had; that's how I got the set of log checks I have right now. I'm looking for a general approach to monitoring an Airwave host, making sure it is in good health, and hopefully catching problems I haven't encountered before. I'm thinking of everything from performance issues causing reports to be delayed, to AOS snmp bugs causing loss of polling data for a controller, to running into the 32-bit size limit on some out-of-control cache files. None of these triggered any sort of alarm or warning within Airwave, but they did cause symptoms that the users noticed (reports arriving a day late, or big gaps in data).

     

    I'm mostly thinking about things like monitoring Airwave system performance stats, which won't indicate what the problem is, just that something needs to be investigated. For instance, if any of v2_gets, v2_walks, or v2_oids is 0 for more than 2 poll cycles on my primary Airwave, something is wrong. I'm just looking for suggestions of the best things to monitor to get a sense of the health of Airwave.



  • 4.  RE: How are you monitoring Airwave?

    EMPLOYEE
    Posted Mar 31, 2015 12:28 PM

    Good point.  I'll see if I can pass this along through our systems test group to see if they have any suggestions.



  • 5.  RE: How are you monitoring Airwave?

    EMPLOYEE
    Posted Mar 31, 2015 04:35 PM

    I brought this up with the Product Managers; most likely we're looking at a few different feature requests. Asking for additional ways to monitor and alert is very broad. Can you narrow it down to, say, a list of the top 5 items you'd like to see?



  • 6.  RE: How are you monitoring Airwave?

    Posted Mar 31, 2015 06:03 PM

    Rob,

     

    I'd like to turn it around: What should we be looking at? What things have filled up, run out, or gone haywire that we can monitor and alert on so we aren't surprised?



  • 7.  RE: How are you monitoring Airwave?

    EMPLOYEE
    Posted Apr 01, 2015 12:01 PM

    From my viewpoint, I'm typically looking at the System -> Performance graphs, in both the snapshot and over-time views. If I see a steady increase or an anomalous spike, I investigate based on which graph the issue appears in; sometimes they align with events like upgrades or adding numerous devices. I also take a frequent look at System -> Status. Anything red I investigate.

     

    I also look through some of the logs periodically:

    low_level_service_watcher, service_watcher, httpd/error_log, async_logger_client, ap_list_cacher, and pgsql are my typical visits in the CLI.

     

    Extra things I look at:

    daemons

    top -c

    iotop (you may need to yum install this package to use it)

    dstat (you may need to yum install this package to use it)

     

    Of course, my usage is pretty different from typical usage since I'm usually looking for issues where I'm applying test load modules.  But these are the general things I'm looking at.
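    
    A rough sketch of that kind of CLI sweep, for anyone following along. The log paths below are assumptions based on the file names mentioned above, so adjust them to wherever they live on your AMP:
    
    # one-off health sweep, run as root on the AMP (paths are assumptions)
    tail -n 50 /var/log/pgsql                  # recent postgres messages
    tail -n 50 /var/log/httpd/error_log        # web UI errors
    top -b -n 1 -c | head -n 25                # snapshot of the busiest processes
    iotop -b -n 1 | head -n 25                 # per-process disk I/O (yum install iotop)
    dstat 5 6                                  # cpu/disk/net/paging stats at 5-second intervals (yum install dstat)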



  • 8.  RE: How are you monitoring Airwave?

    Posted Apr 01, 2015 03:10 PM

    Your load testing setup is actually not far from what I'm looking for at the moment. I've got Nagios watching that the web UI is up and running, the backup AMP is watching the primary AMP, and our other NMS tools are making sure that everything is responding to ICMP and SNMP. I've got disk, memory, and CPU monitoring through snmp, plus the handful of log messages that I've learned about the hard way. I feel like the general up/down state of the host and basic host performance is covered.
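    
    (For reference, the web UI piece can be as simple as a standard check_http service; the hostname and timing thresholds below are placeholders:)
    
    # hypothetical Nagios check for the AMP web UI over HTTPS
    check_http -H amp.example.com --ssl -u / -w 5 -c 10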

     

    The big gap is log messages that I haven't learned about, soft service failures, and data loss that occurs without stopping any AMP processes.  For instance, we've seen an issue where an snmp bug in AOS gets snmpd on the controller stuck in a loop, leading to a loss of stats data from the controller that lasts between 3 and 24 hours.  The only way I found out about that was a user complaining that their report was coming up empty.  That seems like something that Airwave should be able to detect and alert on.  We've also seen an issue more recently where something happened on the Airwave side, and SnmpV2Fetcher was just not sending out any requests for several hours.  The system performance graphs showed zero gets, walks, or OIDs, and the database rates plummeted.  Again, that should be detectable, but there's nothing in place to catch it.

     

    For data loss, in the short term, I've put together a shell script that pulls average data out of /var/airwave/rrd/system_snmp and /var/airwave/rrd/system_db_rates and, if the average is either unknown or less than some threshold, generates an alarm to me. If I were writing a spec for a trigger, I'd basically want at least that, only as a trigger rather than a 5-minute shell script. Even better would be something that watches for abnormal values (say, more than one standard deviation from normal) rather than a fixed threshold.
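    
    In the meantime, here is a minimal sketch of that kind of check. The RRD path, data-source column, threshold, and mail address are assumptions, so compare against rrdtool info on your own files under /var/airwave/rrd/ before using it:
    
    #!/bin/bash
    # rrd-based data-loss check; path, threshold, and mail address are assumptions
    RRD=/var/airwave/rrd/system_snmp/v2_gets.rrd
    THRESHOLD=1
    # average the first data source over the last 15 minutes ("nan" rows are skipped)
    avg=$(rrdtool fetch "$RRD" AVERAGE --start -900 | \
          awk 'NR > 2 && $2 !~ /nan/ {sum += $2; n++} END {if (n) printf "%f", sum / n; else print "unknown"}')
    if [ "$avg" = "unknown" ] || awk -v a="$avg" -v t="$THRESHOLD" 'BEGIN {exit !(a < t)}'; then
        echo "AMP polling looks stalled: 15-minute average is $avg" | mail -s "Airwave data-loss check" you@example.com
    fi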

     

    For the fatal errors in logs and soft service failures, the big problem is that I'm only aware of the failure modes I personally have seen.  Even just a document that lists some critical errors that may show up and probably need attention would be a start.  Based on reading through /var/logs, there are some logs where seeing "Error" isn't that big a deal, and others where it could indicate a pretty significant problem.  I know that this is really a terrible ER, which is why I was hoping other AMP users might have suggestions based on issues they've had.



  • 9.  RE: How are you monitoring Airwave?

    Posted May 01, 2015 01:34 PM

    I've got another thing to add, based on the latest silent failure.  I've had occasional problems where polling is happening and postgres read/writes are happening, but the async logger client queue grows and no data is being recorded in rrds.  I'm still working on the root cause with support, but that's the sort of case that I would like Airwave to be able to detect and alert me to.  I just happened to log into Airwave to verify an AP's hardware type and noticed the data gap.  If I hadn't had to check on the AP, the problem might have continued into the weekend.