System Monitoring with Xymon/Other Docs/FAQ/Generic Monitoring System Features

Requirements for a Monitoring System

Alerts

  • Send (Email/SMS/etc)
  • Acknowledge (display who is working on the issue)
  • Delay
  • Send to certain groups/individuals
  • Escalation Path
  • Ability to set severity levels for each service test (e.g. disk on a production server vs disk on a development server)
    • Different actions for different levels, e.g.
      • Level 1 (disk 95% full) alert Help Desk
      • Level 2 (disk 98% full) alert IT team
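
The two disk levels above can be sketched as a small routing table. This is only an illustration, not part of any particular monitoring product: the `Level` class, `DISK_LEVELS` and `route_alert` are hypothetical names, and the thresholds and recipients are taken from the example bullets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    name: str
    threshold: float  # percent full at which this level triggers
    notify: str       # recipient group (hypothetical label)


# Ordered from most to least severe so that the first match wins.
DISK_LEVELS = [
    Level("Level 2", 98.0, "IT team"),
    Level("Level 1", 95.0, "Help Desk"),
]


def route_alert(percent_full, levels=DISK_LEVELS):
    """Return the most severe Level reached, or None if no threshold is met."""
    for level in levels:
        if percent_full >= level.threshold:
            return level
    return None
```

For example, `route_alert(96.0)` selects the Help Desk level, while `route_alert(99.0)` selects the IT team level. Keeping the table ordered by severity means a single reading never triggers both actions.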

Display

  • Include or integrate with a real-time display system (with colours: Red, Yellow, Green, Purple, White and Blue)
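    • Red: critical
    • Yellow: warning
    • Green: OK
    • White: no data available (status unknown)
    • Purple: no report received (status is stale)
    • Blue: test disabled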
  • Display a time of last check
  • Show a high-level "summary" of status, e.g. group Unix boxes together and show if any have issues
  • Ability to customise the display. e.g. summary page for IT helpdesk, Unix page for Unix admins, Network page for Networking Team.
  • Ability to restrict access to the monitoring system (we do not want the general community to see everything monitored)
  • Ability to search for a host

Monitor

  • Microsoft Windows: Windows NT, Windows XP, Windows Vista.
    • Be able to process Windows event logs and performance monitoring data
  • UNIX: Solaris, AIX, HP-UX, IRIX, Linux, Mac OS X, Tru64.
  • Services (DNS/FTP/SMTP/LDAP/etc)
  • Applications (Outlook, Calendar, Exchange, Certificate Services, Apache, Tomcat, etc)
    • HTTP Application Monitoring
      • Expected Content returned
      • Acceptable response time (10 seconds to load a web page is not okay)
    • Simulate a Windows client application, e.g. click on an icon to launch Word, enter some text, save the document to a drive, close Word, and ensure the whole process worked.
  • Service level testing
    • e.g. a web application requires a web server, DNS, LDAP, etc. If the DNS server fails, then so will the web application.
  • Allow for cluster testing (e.g. 1 web server out of a cluster of 5 fails, notify about the web server outage, but not the web service outage)
  • Network File shares
  • SAN Monitoring
  • Citrix Servers and Services
  • Printers
    • Printer errors e.g. low toner
    • Print Queues
  • SNMP Devices
  • Hardware (e.g. Dell DRAC, Sun Solaris), both via hardware card and OS software.
  • UPS
  • Other environmental inputs (temperature, humidity, etc)
  • Nightly backup
    • Warn if backups take longer than expected
    • Alarm if some backups fail
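
The cluster-testing requirement above (report a failed node without declaring the whole service down) could be evaluated roughly as follows. This is a minimal sketch: `cluster_status`, its `min_healthy` parameter and the state names are assumptions made for illustration.

```python
def cluster_status(node_up, min_healthy=1):
    """Classify a cluster from per-node health.

    node_up maps a node name to True (up) or False (down).
    Returns (service_state, failed_nodes).
    """
    failed = sorted(name for name, up in node_up.items() if not up)
    healthy = len(node_up) - len(failed)
    if healthy < min_healthy:
        service = "service outage"   # too few nodes left to serve requests
    elif failed:
        service = "degraded"         # notify about the nodes, not the service
    else:
        service = "ok"
    return service, failed
```

With five web servers and one of them down, the function returns `"degraded"` together with the name of the failed node, so the alert can refer to the server rather than the web service.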

Networking

  • Provide integration with CiscoWorks, or have similar functionality
  • WAN links, LAN links, VLANs, etc
    • Verify link is up
    • Verify Bandwidth is not saturated
  • Cisco/Networking hardware
    • CPU load
    • Environmental e.g. Power supplies, temperature alarms, etc
  • Ability to interact with probes (break down traffic to type and size)
  • Capture and track changes to hardware configurations

OS Monitoring

  • Disk
  • Memory
  • Processes
  • Response time
  • CPU Load
  • Hardware failures
  • OS alerts (system event logs and syslog)

Database monitoring

  • Oracle
  • MySQL
  • MSSQL
  • Ingres

File Monitoring

  • File growth, whether a file exists, etc.

Customise

  • Easy to extend/Customise your own tests (API to integrate with)

Trending

  • Alert on trends, e.g. 10% growth over 1 month might be OK, but 10% growth over 2 hours is not.
  • Provide trending for network bandwidth usage or any data collected
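
The trend example above amounts to comparing a growth rate, rather than an absolute value, against a limit. A minimal sketch, assuming two samples and a hypothetical limit of 1% growth per day (`growth_percent_per_day`, `trend_alert` and `max_pct_per_day` are illustrative names, not part of any product):

```python
from datetime import timedelta


def growth_percent_per_day(old_value, new_value, elapsed):
    """Return growth between two samples, normalised to percent per day."""
    if old_value <= 0:
        raise ValueError("old_value must be positive")
    if elapsed <= timedelta(0):
        raise ValueError("elapsed must be positive")
    growth_pct = (new_value - old_value) / old_value * 100.0
    days = elapsed.total_seconds() / 86400.0
    return growth_pct / days


def trend_alert(old_value, new_value, elapsed, max_pct_per_day=1.0):
    """Return True if the value is growing faster than the allowed rate."""
    return growth_percent_per_day(old_value, new_value, elapsed) > max_pct_per_day
```

Under this limit, 10% growth over 30 days (about 0.33% per day) does not alert, while the same 10% over 2 hours (120% per day) does.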

Integration

  • Integrate with a helpdesk/Trouble Ticket system
    • Automatically Submit Tickets
    • Automatically Update existing Tickets
  • Integrate with (or include) an Asset management system
    • Display serial number, manufacturer, warranty periods, history of repairs/replacement, etc
  • Integrate with other monitoring systems, e.g. CiscoWorks, Oracle Enterprise Manager, HP, Compaq Insight Manager, etc.
  • Integrate with Microsoft Operations Manager (MOM), or offer functionality similar to that available in MOM

Agents

  • Locally installed agent to collect data (and temporarily store data locally)
  • Ability of central polling server to contact agent to get gathered data
  • Local agent has ability to send data to polling server
  • Ability to remotely update agents

Misc

  • History retention
  • Provide reports
  • Must be able to assign multiple IP addresses to each device and test each IP address individually if needed.
  • Minimal impact on service being monitored
  • Minimal effort to monitor (and manage) clients (remote devices)
    • Do not require upgrades to existing infrastructure (e.g. must run latest version of software before it can be monitored)
  • Ability for remote monitoring servers to report to a central server
  • Dependency aware (if a core router fails, do not send 100 alarms for devices behind it)
  • Allow for scheduled downtime (disable a test in the future)
    • Require authorisation
    • Require a reason to be displayed
  • Allow for regular maintenance windows (e.g. an application is restarted every Sunday night; do not send out alarms)
  • Ability to delegate testing to other devices (e.g. a tiered management structure)
  • Audit history in the monitoring system (date a server was added, when monitoring was disabled and why, etc.)
  • The system must be able to self-monitor
  • Be able to monitor 1000+ devices
  • Allow variable polling (some tests every 5 mins, some tests every 1 min)
  • Highly Reliable
  • Redundancy (if your main monitoring server fails, have a second server on standby)
  • Apply default thresholds to groups of devices, and allow "one-off" exceptions to these thresholds, e.g. all file systems must be less than 90% full, but on serverX /opt must be less than 94% full, since it is currently at 93% and is not expected to change.
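
The default-plus-exception rule in the last item can be sketched as a lookup with a fallback. The names `DEFAULT_MAX_PERCENT`, `OVERRIDES`, `threshold_for` and `is_over_threshold` are hypothetical; the values come from the serverX example.

```python
DEFAULT_MAX_PERCENT = 90.0

# One-off exceptions, keyed by (host, file system).
OVERRIDES = {
    ("serverX", "/opt"): 94.0,
}


def threshold_for(host, filesystem, overrides=OVERRIDES, default=DEFAULT_MAX_PERCENT):
    """Return the exception for this host and file system, else the default."""
    return overrides.get((host, filesystem), default)


def is_over_threshold(host, filesystem, percent_full):
    """Return True if usage has reached the applicable threshold."""
    return percent_full >= threshold_for(host, filesystem)
```

With these values, /opt on serverX at 93% is accepted, while any other file system at 93% is flagged. Keeping exceptions in a separate table leaves the group default untouched when a single host needs special treatment.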