Your load testing setup is actually not far from what I'm looking for at the moment. I've got Nagios checking that the web UI is up and running, the backup AMP watching the primary AMP, and our other NMS tools making sure that everything is responding to ICMP and SNMP. I've got disk, memory, and CPU monitoring through SNMP, plus checks for the handful of log messages that I've learned about the hard way. I feel like the general up/down state of the host and basic host performance are covered.
The big gaps are log messages that I haven't learned about, soft service failures, and data loss that occurs without stopping any AMP processes. For instance, we've seen an issue where an SNMP bug in AOS gets snmpd on the controller stuck in a loop, leading to a loss of stats data from the controller that lasts between 3 and 24 hours. The only way I found out about that was a user complaining that their report was coming up empty. That seems like something that Airwave should be able to detect and alert on. More recently, we've also seen an issue where something happened on the Airwave side and SnmpV2Fetcher simply stopped sending requests for several hours. The system performance graphs showed zero gets, walks, or OIDs, and the database rates plummeted. Again, that should be detectable, but there's nothing in place to catch it.
For data loss, in the short term, I've put together a shell script that pulls average values out of /var/airwave/rrd/system_snmp and /var/airwave/rrd/system_db_rates, and if an average is either unknown or below some threshold, it sends an alarm to me. If I were writing a spec for a trigger, I'd basically want at least that, only as a built-in trigger and not a 5-minute shell script. Even better would be something that watches for abnormal values (say, more than one standard deviation from normal) rather than a fixed value.
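To make that a bit more concrete, here's a rough sketch of the check logic in Python rather than shell. It is not what my script does line for line and not anything Airwave provides: the function name check_rate and its parameters (floor, max_sigma) are made up for illustration, and it assumes the samples have already been pulled out of the RRD files into plain lists, with None or NaN standing in for unknown values.

```python
import math
from statistics import mean, pstdev


def _known(samples):
    """Drop unknown samples (None or NaN) from a list of readings."""
    return [v for v in samples if v is not None and not math.isnan(v)]


def check_rate(recent, baseline, floor, max_sigma=1.0):
    """Return a list of alarm strings for one metric (empty list means OK).

    recent    -- samples from the window being checked (hypothetical input)
    baseline  -- older samples that define "normal" for this metric
    floor     -- fixed minimum acceptable average
    max_sigma -- how many standard deviations from normal to tolerate
    """
    known = _known(recent)
    if not known:
        # Every recent sample was unknown, which is the data-loss case.
        return ["recent average is unknown (no data points)"]

    avg = mean(known)
    alarms = []

    # Fixed-threshold check.
    if avg < floor:
        alarms.append(f"recent average {avg:.1f} is below floor {floor:.1f}")

    # Deviation check against the baseline's mean and standard deviation.
    base = _known(baseline)
    if len(base) >= 2:
        mu, sigma = mean(base), pstdev(base)
        if sigma > 0 and abs(avg - mu) > max_sigma * sigma:
            alarms.append(
                f"recent average {avg:.1f} is more than {max_sigma} "
                f"standard deviation(s) from normal ({mu:.1f})"
            )
    return alarms


if __name__ == "__main__":
    # A fetcher that went silent: rates drop to zero against a normal of ~100.
    for line in check_rate([0, 0, 0], [100, 110, 90, 100], floor=10):
        print(line)
```

One standard deviation is probably too twitchy for a real alert on a noisy metric, so max_sigma is left as a knob, and the baseline would need to come from a period known to be healthy.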
For the fatal errors in logs and soft service failures, the big problem is that I'm only aware of the failure modes I have personally seen. Even just a document that lists the critical errors that may show up and probably need attention would be a start. Based on reading through /var/logs, there are some logs where seeing "Error" isn't that big a deal, and others where it could indicate a pretty significant problem. I know that this is really a terrible ER, which is why I was hoping other AMP users might have suggestions based on issues they've had.