Applications (Outlook, Calendar, Exchange, Certificate Services, Apache, Tomcat, etc)
HTTP Application Monitoring
Expected content is returned
Acceptable response time (10 seconds to load a web page is not okay)
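The two HTTP checks above (expected content and acceptable response time) could be sketched roughly as follows. This is only an illustration using the Python standard library; the function names and the 5-second default limit are assumptions, not part of any particular monitoring product.

```python
import time
import urllib.request


def evaluate_response(body, elapsed_seconds, expected_text, max_seconds):
    """Judge an already-fetched page: returns (ok, reason)."""
    if expected_text not in body:
        return False, "expected content missing"
    if elapsed_seconds > max_seconds:
        return False, f"slow response: {elapsed_seconds:.1f}s > {max_seconds:.1f}s"
    return True, "ok"


def check_url(url, expected_text, max_seconds=5.0):
    """Fetch the page, time the whole download, then evaluate it.

    The socket timeout is set to twice the limit (an arbitrary choice) so
    that a slow page is still reported as slow rather than as a failure.
    """
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=max_seconds * 2) as response:
        body = response.read().decode("utf-8", errors="replace")
    elapsed = time.monotonic() - start
    return evaluate_response(body, elapsed, expected_text, max_seconds)
```

Keeping the evaluation separate from the fetch means the pass/fail rules can be exercised without a live web server.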
Simulate a Windows client application, e.g. click on an icon to launch Word, enter some text, save the document to a drive, close Word, and verify that the whole process worked.
Service level testing
e.g. a web application requires a web server, DNS, LDAP, etc. If the DNS server fails, then so will the web application.
Allow for cluster testing (e.g. 1 web server out of a cluster of 5 fails, notify about the web server outage, but not the web service outage)
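As a rough sketch of the cluster requirement above, the alerting logic might separate member outages from the service outage like this. The function name, the alert strings, and the `min_healthy` parameter are hypothetical choices for illustration.

```python
def cluster_alerts(service, member_up, min_healthy=1):
    """Return alert messages for a clustered service.

    member_up maps each member name to True (up) or False (down).
    Every failed member is reported, but the service itself is only
    reported as down when fewer than min_healthy members remain up.
    """
    alerts = [
        f"member outage: {name}"
        for name, up in sorted(member_up.items())
        if not up
    ]
    healthy = sum(1 for up in member_up.values() if up)
    if healthy < min_healthy:
        alerts.append(f"service outage: {service}")
    return alerts
```

With five web servers and one failure, only the member outage is raised; the service outage appears once the healthy count drops below the configured minimum.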
Network File shares
SAN Monitoring
Citrix Servers and Services
Printers
Printer errors e.g. low toner
Print Queues
SNMP Devices
Hardware (e.g. Dell DRAC, Sun Solaris), monitored both via the hardware management card and via OS software.
UPS
Other environmental inputs (temperature, humidity, etc)
Must be able to assign multiple IP addresses to each device and test each IP address individually if needed.
Minimal impact on service being monitored
Minimal effort to monitor (and manage) clients (remote devices)
Do not require upgrades to existing infrastructure (e.g. a device should not have to run the latest version of its software before it can be monitored)
Ability for remote monitoring servers to report to a central server
Dependency aware (if a core router fails, do not send 100 alarms for devices behind it)
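One possible way to implement dependency awareness is to record each device's parent and alert only on failed devices that have no failed ancestor. The sketch below assumes a simple single-parent topology; the `parents` mapping and the device names are made up for the example.

```python
def root_cause_alerts(parents, down):
    """Return the failed devices that should actually raise an alarm.

    parents maps a device to the device it depends on (None for the top).
    down is the set of devices currently failing their tests.
    A device is suppressed when any device above it is also down.
    """
    def has_down_ancestor(device):
        seen = set()  # guards against accidental loops in the mapping
        current = parents.get(device)
        while current is not None and current not in seen:
            if current in down:
                return True
            seen.add(current)
            current = parents.get(current)
        return False

    return sorted(d for d in down if not has_down_ancestor(d))
```

If the core router and everything behind it are down, only the core router is reported; a server that fails on its own is still reported normally.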
Allow for scheduled downtime (disable a test in the future)
Require authorisation
Require a reason to be displayed
Allow for regular maintenance windows (e.g. an application is restarted every Sunday night, so do not send out alarms during that window)
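The downtime requirements above (authorisation, a displayed reason, and recurring windows) might be modelled along these lines. This is a minimal sketch that assumes naive local datetimes and windows no longer than a week; the class and field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass
class WeeklyWindow:
    weekday: int          # 0 = Monday ... 6 = Sunday
    start: time           # local start time of the window
    duration: timedelta   # assumed to be at most one week
    reason: str           # shown to operators while alarms are suppressed
    authorised_by: str    # who approved the downtime

    def __post_init__(self):
        # Enforce the "require authorisation" and "require a reason" rules.
        if not self.reason.strip() or not self.authorised_by.strip():
            raise ValueError("a maintenance window needs a reason and an authoriser")

    def covers(self, moment):
        """True if moment falls inside this week's or last week's occurrence."""
        days_back = (moment.weekday() - self.weekday) % 7
        latest_start_date = moment.date() - timedelta(days=days_back)
        for weeks_back in (0, 1):
            window_start = datetime.combine(
                latest_start_date - timedelta(weeks=weeks_back), self.start
            )
            if window_start <= moment < window_start + self.duration:
                return True
        return False


def should_alert(moment, windows):
    """Alarms are sent only when no maintenance window covers the moment."""
    return not any(window.covers(moment) for window in windows)
```

A window starting Sunday at 23:00 and lasting two hours suppresses alarms until 01:00 on Monday, which covers a restart that runs past midnight.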
Ability to delegate testing to other devices (e.g. a tiered management structure)
Audit history in the monitoring system (date a server was added, when monitoring was disabled and why, etc.)
The system must be able to self-monitor
Be able to monitor 1000+ devices
Allow variable polling (some tests every 5 mins, some tests every 1 min)
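Variable polling can be illustrated with a small priority queue that always runs the test that is due next. The sketch below only computes the order in which tests would fire; the function name and the test names are assumptions, and every interval is assumed to be a positive number of seconds.

```python
import heapq


def poll_schedule(intervals, horizon):
    """Return (due_seconds, test_name) pairs up to and including horizon.

    intervals maps each test name to its polling interval in seconds.
    Intervals must be positive, otherwise the loop would never finish.
    """
    queue = [(interval, name) for name, interval in intervals.items()]
    heapq.heapify(queue)
    schedule = []
    while queue and queue[0][0] <= horizon:
        due, name = heapq.heappop(queue)
        schedule.append((due, name))
        # Re-queue the test for its next run.
        heapq.heappush(queue, (due + intervals[name], name))
    return schedule
```

With a 1-minute ping test and a 5-minute disk test, the first five minutes contain five ping runs and one disk run.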
Highly Reliable
Redundancy (if your main monitoring server fails, have a second server on standby)
Apply default thresholds to groups of devices, and allow "one off" exceptions to these thresholds. For example, all file systems must be less than 90% full, but on serverX /opt may be up to 94% full, since it is currently at 93% and is not expected to change.
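A simple way to express group defaults with one-off exceptions is a default table plus an override table that is consulted first. The sketch below mirrors the serverX example; the group name `servers` and the table layout are assumptions made for illustration.

```python
# Default "percent full" limit for each device group (group name is hypothetical).
DEFAULT_THRESHOLDS = {"servers": 90.0}

# One-off exceptions keyed by (host, mount point).
OVERRIDES = {("serverX", "/opt"): 94.0}


def threshold_for(group, host, mount):
    """Use the per-file-system override if one exists, else the group default."""
    return OVERRIDES.get((host, mount), DEFAULT_THRESHOLDS[group])


def is_over_threshold(group, host, mount, percent_used):
    """A file system must stay below its limit, so reaching it counts as a breach."""
    return percent_used >= threshold_for(group, host, mount)
```

Here /opt on serverX at 93% does not alarm, while any other file system at 93% does.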