System Monitoring with Xymon/Other Docs/HOWTO/Custom Monitoring Scripts for Hobbit

Preamble

Disclaimer

I make no claims about the correctness or fitness for purpose of anything in this document, from the spelling to the technical knowledge. Anything in this document is to be used at your own risk, including the instruction to think. Any and all disclaimers apply. Any damages arising from use of this document or the instructions contained therein, in electronic, printed or any other form, are your problem, not mine. Reading, downloading or touching any part of this document denotes acceptance of this disclaimer. If you disagree with this, then stop reading this document at once. (Anyway, a lawsuit against me makes no sense. I have no money.)

Copyright

This document is copyrighted. I hereby grant any Hobbit user free use and distribution rights on this document or parts thereof on condition that:

  1. Hobbit remains an open source project
  2. You are not charging for this document

Scope

This document is intended for use by systems administrators looking to extend the scope of the Hobbit/BB monitoring tools with custom scripts.

Assumptions

  • Assumptions being the mother of all stuff-ups, let me tell you right off the bat exactly what I assume about you.
  • You are a systems administrator, and probably want to make your life a little easier. This means you have permission to write any scripts and install them on the system, and that you have a basic clue about Unix, and the stuff you want to monitor.
  • You understand a programming or scripting language of some sort.
  • You have at the very least sufficient understanding of Hobbit and the Big Brother or Hobbit client to get some basic results and a graph or two displayed, and have already done so. Put differently, I assume your Hobbit installation is already working!
  • You use Unix, or a variant thereof.

Although this document is written with the assumption that you know nothing, I am assuming you are not completely bloody clueless. I find that many manuals or how-to docs assume a little knowledge, but invariably the little they assume you know, is the very bit you don't know. I am trying to cover all the bases.

What this document does not do

I am not going to teach you how to write scripts, get data from your database, tell you why your script doesn't work or act as your helpdesk. I believe in the 80-20 rule. If you take up more than 20 seconds of my day, you better be prepared to pay me 80% of my salary. I am not going to tell you why your Hobbit installation doesn't work either. Subscribe to the Hobbit mailing list. That's why it's there.

Conventions

I use Unix (Solaris & Linux) and most of the commands and references will be based on Unix. I am also an Oracle DBA, so I will use database examples a lot. I use Korn shell for any programming examples. I am not going to get into a theological debate about why Korn and not Bash or Perl or whatever. I use Korn because I know it far better than I know any other script or programming language. End of debate! I suggest you do the same, whatever your language of choice may be. Unix commands will be in courier font, as will output. If you got this as a text document, and it's all one font, and you can't tell the difference between the main document and Unix commands, you really shouldn't be considering writing a custom script.

STOP! - THINK!

Before creating a custom script, sit down, and think carefully about what you want to monitor, what you want to graph, and why. For example, assume you were considering monitoring a log file for certain messages. What would you graph? You could monitor CD-ROM use or capacity, but to what purpose? If there's a CD in the drive, it's there because you put it there. After all, you are the administrator, and your servers are in an access-controlled environment. If you want to monitor it, it needs to be of use to you. No point filling your screen with meaningless dots. If that's what you want, it can be easily fudged - see below.

OK, now you know what you want to monitor, why, and have decided what you will do with the data, I still do not suggest we jump in boots and all and start writing a custom script. Next stop is the Hobbit man pages. Check that you are not reinventing the wheel. Larry Wall once said "Historically speaking, the presence of wheels in Unix has never precluded their reinvention." He also said, "The three chief virtues of a programmer are: Laziness, Impatience and Hubris." I subscribe to the laziness virtue. Reinventing the wheel is bad, and quite often, the previous guy did a better job than you can, or will have time for.

Still can't find what you want? Next stop, http://www.deadcat.net. These are all Big Brother add-ons, but most will integrate seamlessly into Hobbit, and the rest can be made to work with a shoehorn or a hammer. (Size of hammer undefined.) I am still not prepared to reinvent the wheel.

At about this point, if you still haven't found what you want, you are probably going to have to invent your own designer wheel.

Before you start coding

Time to think again.

Color

Hobbit can only report a limited number of states, reflected by the status colours, and you only have control over 3 of them, viz. red, yellow and green.

Before you start, consider, and write down what colour any monitored conditions will return.

For instance, program core dumps generated in a user directory on the development system do not really warrant a red status. Since it's something we can expect developers to do from time to time, why are we even monitoring it? If it happened on a production system in a user directory, it might warrant a yellow. If it happened in the working directory of your main financial system, that might warrant a red status.

Content

Now consider what you want your message to look like. Sending back the entire contents of /var/adm/messages is a waste of time and bandwidth. Sending back tail /var/adm/messages makes more sense. Isolating the offending message(s) is probably an even better idea. Your message that appears on the Hobbit screen (That's what you see when you click on the face or the blob) needs to be detailed enough to be of use, but also brief enough that you can see at a glance what's happening. Too much is just as bad as too little, especially if you are going to abdicate the task of monitoring to operations.

Graphing

Are you going to generate a graph? Does a graph make sense? We can only graph quantities over time. Graphing the results of the messages test makes no sense. Monitoring CPU temperature is good. Hard disk utilization, database table space, CPU utilization, these are all good things to graph. Messages are not good. Cores are not good. Error or core counts could be good, but if you are getting that many errors or cores per 5 minutes, I think you have issues more important than creating a new Hobbit script.

So you want a graph. That's fine, but what are you going to graph? If possible, it's probably better to reduce your figures to a percentage. It just makes life easier for the graphing tool, and makes it easy to put everything on the same graph. Let's come back to my database example. I have one table space that's over 50Gb in size, and using about 35Gb, and another that's only 700Mb in size, and using 300Mb. On the same graph, they would look idiotic, and even if I could see what their values were, I would still have to remember the total size of each table space to determine if space was becoming an issue or not. Viewed as a percentage, both can appear on the same graph, and I know that over 90% is a potential problem, no matter how big or small the table space is. (If you didn't understand the last paragraph because you don't know databases, read it again, but substitute "disk" for "table space". It's a similar concept.)
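The arithmetic behind that is trivial, but a minimal sketch may make it concrete. The function name pct_used is my own invention, and the figures are the two table spaces above converted to Mb. Integer shell arithmetic truncates rather than rounds, which is usually close enough for a status page.

```shell
#!/bin/sh
# Sketch only: pct_used is a hypothetical helper, not part of Hobbit or BB.
# Usage: pct_used USED TOTAL   (both whole numbers in the same unit)
pct_used() {
    used=$1
    total=$2
    # Guard against division by zero for an empty or unknown total
    if [ "$total" -le 0 ]; then
        echo 0
        return 1
    fi
    # Integer arithmetic: the result is truncated, not rounded
    echo $(( $used * 100 / $total ))
}

# The two table spaces from the text, expressed in Mb
echo "big:   $(pct_used 35840 51200)%"
echo "small: $(pct_used 300 700)%"
```

Both results now fit on one 0 to 100 scale, whatever the size of the table space behind them.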

Data visibility

Are you going to make the information you use for the graph visible to the monitor? Big Brother and Hobbit allow you to send two types of information to the Hobbit server. They are status info, which is displayed when you click on the face or blobby icon, and data, which is never displayed, but could be used to gather information. If the information is particularly complex, it's probably best not to display it.

Methods

Now that you know what you want, where you want it and how you want to show it, consider how you are going to get it. Monitoring disk space is easy. On most versions of Unix, df -k will tell you all about disk space, and on some versions even give you a percent figure. (Refer to the Unix Rosetta Stone for details about your Unix version.)
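As a rough sketch of the disk space case, the following pulls the percent figure out of df -k style output. The helper name fs_percent is hypothetical, and it assumes the POSIX column layout (capacity in the fifth column, mount point in the sixth). Not every df supports the -P flag used here, so check your man page first.

```shell
#!/bin/sh
# Sketch only: fs_percent is a hypothetical helper name.
# Reads "df -k"-style output on stdin and prints "mountpoint percent"
# for each file system, assuming the POSIX layout where the fifth
# column is the capacity (e.g. "42%") and the sixth the mount point.
fs_percent() {
    awk 'NR > 1 && NF >= 6 {
        pct = $5
        sub(/%/, "", pct)      # strip the percent sign
        print $6, pct
    }'
}

# On a live system (-P asks for the one-line-per-file-system format):
df -kP | fs_percent
```

Keeping the parsing in a function that reads standard input means you can test it against a saved copy of the df output before letting it loose on a live system.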

Getting the same information from a database is not always quite so simple, and often requires some fairly nifty SQL. Now we come to the choice of development language. As mentioned above, I use ksh (AKA Korn Shell), but there's nothing stopping me from writing it all in SQL, or Perl, or even in compiled C, except for the fact that I don't speak C, and I know just enough Perl to be dangerous.

Your choice of language has to take into account a few things.

  1. You need to be able to call external binary files.
  2. You need to be able to get the information you need.
  3. It needs to be executable so BB (or Hobbit client) can call it.
  4. It would need to interpret environment variables.

These points explained

  1. Unless we can call the bb program, we are not going to get very far. In most shell scripts, it's trivial. In SQL and Perl, it's slightly less trivial, but still very simple. In other languages, it could be a little more tricky.
  2. If from within my program written in <Insert language here>, I have no method of interrogating the database to determine table space utilization, there's not much point in using that language. From within ksh, I can call an SQL interpreter, and pass it an SQL script. Perl has libraries to talk directly to the database. As long as you can get the information you need, it's a good choice.
  3. If I need to explain this, step away from the computer.
  4. Most of the variables you will need are defined in $BBHOME/etc/bbdef.sh. If your script can't read this, and interpret the values in it, you may have a problem.
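
For point 4, one way to find out early that the environment is incomplete is to check the variables before doing anything else. This is a minimal sketch: check_bb_env is a hypothetical helper of mine, and the list of names is simply the set of variables the example script in this document relies on.

```shell
#!/bin/sh
# Sketch only: check_bb_env is a hypothetical helper.
# Returns non-zero and names the culprit if any variable the script
# relies on is empty or unset.
check_bb_env() {
    for name in BB BBDISP BBTMP MACHINE; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "Variable $name is not set" >&2
            return 1
        fi
    done
    return 0
}

# Typical use near the top of a test script:
#   . $BBHOME/etc/bbdef.sh
#   check_bb_env || exit 1
```

A complaint naming the missing variable in the log beats a test that silently never shows up.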


Let's start coding

No matter what system or language you use, your script will probably have 4 sections.

  1. Define everything
  2. Collect data
  3. Massage/manipulate data
  4. Define colour and send

Defining everything could be as simple as sourcing the $BBHOME/etc/bbdef.sh file, but it may be a little more complicated. Those of you familiar with Oracle know there are a few further environment variables that need to be defined before I can run an SQL script. However you define everything, remember to document or comment everything. You will probably be the sucker having to maintain this.

Once you have all your variables and environment defined, do what you need to do to collect the data. It will probably be wise to send the data to a temp file. Collecting the data could be a single command, it could be a complex program that runs and interrogates multiple systems and databases. How you collect what you need is beyond the scope of this document but if, like me, you are lazy and looking for a shortcut, use Google. You will be amazed what a Google for "Oracle Tablespace script" will give you.

Now we have the data, we need to massage it into something usable. If it's going to appear on the screen, make it look good. Some of it might be data only, and will not appear on screen. Dump that to a separate file. It might also be easier to add the status colour decision logic at this point. It's your script, it's your call.


In the end, we have one, or maybe 2 files: $BBTMP/data.out.$$ and $BBTMP/status.out.$$. (If this looks too cryptic, I don't suggest you use ksh to write your script.)
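
For those who do find it cryptic, here is a minimal sketch of the two temp files. The variable names STATUSFILE and DATAFILE and the /tmp fallback are assumptions for illustration only. The trap is an optional extra that tidies up for you.

```shell
#!/bin/sh
# Sketch only. $$ expands to the process ID of the running script, so
# two copies running at once will not trample each other's files.
BBTMP=${BBTMP:-/tmp}            # normally set by bbdef.sh; /tmp is a fallback assumption
STATUSFILE=$BBTMP/status.out.$$
DATAFILE=$BBTMP/data.out.$$

# Remove both files when the script exits
trap 'rm -f "$STATUSFILE" "$DATAFILE"' EXIT

date > "$STATUSFILE"            # what will be shown on the status page
: > "$DATAFILE"                 # data only, never displayed
```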

Now we send the output to Hobbit. The command is simple. Check your bb man pages. You can hard code it if you want, or use the defined variables, it really makes no difference, unless you want to make life easier in the future. (Hard coding normally means headaches later.)

Let's dissect a basic script. This will report my table space levels.

#!/usr/bin/ksh
# Define this to be a Korn shell script.

. $BBHOME/etc/bbdef.sh           # Set up the BB environment
. /export/home/oracle/.profile   # Set up the Oracle environment
export TEST=ora                  # The test name
export COLOUR=green              # The default colour
# Note to all Americans. This is how I spell colour - no arguments.
export OUTFILE=$BBTMP/outfile.$$ # My temp file
# We are not using data, only status for this one, so we only have
# one temp file.
# Up to here we have defined the environment
date > $OUTFILE                  # Create my temp file

# Now we collect the data
sqlplus -s orauser/passwd @/export/home/oracle/tables >> $OUTFILE
# Do not worry about the above syntax. If you understand Oracle, it
# will make sense. If not, assume it is a df -k on table spaces.
# Replace it with df -k >> $OUTFILE 
# It is doing the equivalent on the database

# The sql file we just used formats the data for us, so all we 
# need to do is check for the status. The values that interest me 
# are in the sixth column, as percentage.
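# NR > 1 skips the date stamp on the first line, and the pattern
# skips anything in column six that is not a whole number.
awk 'NR > 1 && $6 ~ /^[0-9]+$/ { print $6 }' $OUTFILE | while read VAL
do
   if [ $VAL -gt 95 ]
   then
      COLOUR=red
      break
   fi
   [ $VAL -gt 85 ] && COLOUR=yellow
done
# The logic here is very simple. We start with green, and get 
# worse. Once we hit red, it is not getting any worse, so we exit
# the loop.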
# A more sophisticated script could do a lot more, but we are
# doing a basic script.
# Now we send the data

$BB $BBDISP "status $MACHINE.$TEST $COLOUR `cat $OUTFILE`"

# BB, BBDISP and MACHINE should be defined in bbdef.sh
# This will send the contents of $OUTFILE to the Hobbit server
# It will also ensure that the colour is set correctly
# to whatever appears in $COLOUR
# $BB = the Big Brother (or Hobbit) client program
# $BBDISP = the hostname or IP address of your Hobbit server
# $MACHINE = the name of the machine being monitored
# $TEST = Defined above
# $COLOUR = Colour of the status on the Hobbit server

rm $OUTFILE
# Clean up our temp files
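
If your SQL output carries the percentage with a trailing % sign, or in a different column, the whole colour decision can be left to awk. The following is a minimal sketch, not the script I run. It assumes lines like "Tablespace siebprod:TOOLS totals 150.0Mb and is 66% used." where the percentage is the seventh field. The name worst_colour is hypothetical, and the thresholds are the same 85 and 95 used above.

```shell
#!/bin/sh
# Sketch only: worst_colour is a hypothetical helper. It reads status
# lines on stdin and prints the worst colour found, using the same
# 85/95 thresholds as the script above. It assumes the percentage is
# the seventh field and carries a trailing % sign.
worst_colour() {
    awk '
        BEGIN { colour = "green" }
        $1 == "Tablespace" {
            pct = $7
            sub(/%/, "", pct)
            if (pct + 0 > 95) { colour = "red" }
            else if (pct + 0 > 85 && colour != "red") { colour = "yellow" }
        }
        END { print colour }
    '
}

# Example with two lines in the assumed format
printf '%s\n' \
    'Tablespace siebprod:TOOLS totals 150.0Mb and is 66% used.' \
    'Tablespace siebprod:SYSTEM totals 500.0Mb and is 69% used.' | worst_colour
```

In the main script this could be used as COLOUR=`worst_colour < $OUTFILE`. Lines that do not start with "Tablespace", such as the date stamp, are ignored.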

Add your script to Hobbit

First off, copy your script to the $BBHOME/ext/ directory. We called this one bb-ora_ts.sh. It's a script, so make it executable. chmod 755 will do it, but that depends on your security levels.

Now edit $BBHOME/etc/bb-bbexttab

This file will be of the form

servername:   : testname1 testname2 testname3

If you are on a server called wallaby, and you wanted to add the test we showed above, add or edit the line

wallaby:  : testname1 testname2 testname3 bb-ora_ts.sh

Save and exit, then restart the bb client. You can restart with any of these commands.

/etc/rc3.d/S99bb restart

or

su - bb
./runbb.sh restart 

or depending on your setup

su - bb
./bb/runbb.sh restart

Results

Give it a few minutes. You might want to go get a cup of coffee.

Eventually, this should cause a new column to appear on your Hobbit main page. The column title should be ora (because that was the value in the script) and if we click on the status icon, we should see something like this.

 Wed Jun  8 15:35:46 WST 2005
 Tablespace siebprod:XDB totals 20.0Mb and is 0% used.
 Tablespace siebprod:WHS_INDEX totals 1440.0Mb and is 33% used.
 Tablespace siebprod:WHS_DATA totals 1440.0Mb and is 37% used.
 Tablespace siebprod:USERS totals 100.0Mb and is 1% used.
 Tablespace siebprod:UNDOTBS1 totals 5000.0Mb and is 30% used.
 Tablespace siebprod:TOOLS totals 150.0Mb and is 66% used.
 Tablespace siebprod:SYSTEM totals 500.0Mb and is 69% used.
 Tablespace siebprod:SIEB_INDEX totals 52240.0Mb and is 68% used.
 Tablespace siebprod:SIEB_DATA totals 45120.0Mb and is 56% used.
 Tablespace siebprod:LOADER_DATA totals 500.0Mb and is 0% used.

A bit bland, but a good start. Because whatever is in $OUTFILE is going to appear on a web page, it is possible to add HTML tags to $OUTFILE. Download the bb-iostat from deadcat. The test identifies itself as vmio on your Hobbit screen, but the script is pretty funky, and does some fun things to the display, including inserting status icons, underlines and a few font manipulations.

One of the best ways to improve your programming, is to see what others have done. Download a few scripts from deadcat, and look at how they were written.

I use a slightly more complex version of the above example script for my Hobbit, and I end up with this.

Wed Jun 8 16:00:07 WST 2005 Oracle test on "siebprod": OK

   =============== Oracle Instance Check ================
     Instances specified in ORACLE_SIDS (siebprod) match those found in /export/home/bb/bb/ext/ora9tab

   =================== siebprod Check  ===================
     Database siebprod UP processes:  pmon smon lgwr dbw0 ckpt reco
     Paranoid test: Database siebprod is up. 

   ============= siebprod Tablespace Check ================
     Tablespace siebprod:XDB totals 20.0Mb and is 0% used.
     Tablespace siebprod:WHS_INDEX totals 1440.0Mb and is 33% used.
     Tablespace siebprod:WHS_DATA totals 1440.0Mb and is 37% used.
     Tablespace siebprod:USERS totals 100.0Mb and is 1% used.
     Tablespace siebprod:UNDOTBS1 totals 5000.0Mb and is 31% used.
     Tablespace siebprod:TOOLS totals 150.0Mb and is 66% used.
     Tablespace siebprod:SYSTEM totals 500.0Mb and is 69% used.
     Tablespace siebprod:SIEB_INDEX totals 52240.0Mb and is 68% used.
     Tablespace siebprod:SIEB_DATA totals 45120.0Mb and is 56% used.
     Tablespace siebprod:LOADER_DATA totals 500.0Mb and is 0% used.

A few icons to draw attention, a few underlines, some bold text. Anything to make the cause of the issue obvious.
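
As one small example of that, the sketch below wraps any table space line over a given limit in bold tags before it goes into the status message. It is an illustration only: highlight_full is a hypothetical helper, the FULL table space is made up, and awk -v needs nawk or a POSIX awk.

```shell
#!/bin/sh
# Sketch only: highlight_full is a hypothetical helper. Lines whose
# seventh field (e.g. "91%") exceeds the limit are wrapped in <b> tags;
# everything else passes through untouched.
highlight_full() {
    limit=$1
    awk -v limit="$limit" '
        $1 == "Tablespace" {
            pct = $7
            sub(/%/, "", pct)
            if (pct + 0 > limit + 0) { print "<b>" $0 "</b>"; next }
        }
        { print }
    '
}

# Example: only the second, made-up line is over the limit of 85
printf '%s\n' \
    'Tablespace siebprod:TOOLS totals 150.0Mb and is 66% used.' \
    'Tablespace siebprod:FULL totals 150.0Mb and is 91% used.' | highlight_full 85
```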

What happened?

Every 5 minutes, or however often your Big Brother client is set to run, it will run every script listed in the bb-bbexttab file. The script runs, and then calls the Big Brother message agent, which sends the data to the Hobbit server to be displayed. (A bit of an over-simplification, but it's good enough.) If you used data instead of status when you called $BB

 $BB $BBDISP "data $MACHINE.$TEST $COLOUR `cat $OUTFILE`"

nothing will display, but we can still use the data, normally for graphing. To ensure your script works, change the data to status, and make sure Hobbit is actually receiving what you are supposed to be sending.

It's not working!!

So you went and made a cup of coffee, sat patiently drinking it, even burning your lip once, and still nothing appears on your Hobbit main screen. Here are a few things to check.

  1. Are any other tests coming through? It could be a network issue
  2. Are you sure Hobbit is set up and working correctly? Check the conn test.
  3. Did you use status when you called the BB client?
  4. Check the $BBHOME/BBOUT log file. Could be some interesting messages in there.
  5. Remember your script is running as user bb. Does bb have adequate permission to do what you want, like execute the test script?
  6. Make sure you are calling the Big Brother client correctly. The easiest way to do that is to put an echo in front of the $BB line, and a redirect after it, so the line changes from

$BB $BBDISP "status $MACHINE.$TEST $COLOUR `cat $OUTFILE`"

to

echo $BB $BBDISP "status $MACHINE.$TEST $COLOUR `cat $OUTFILE`" >/tmp/out

Wait a while (or you can use this to test your script on the command line without having to wait or modify your script after success: [1]), and then check /tmp/out. If it's empty, check your variable definitions.

  1. Check all paths are correct.
  2. Comment out the rm $OUTFILE line.
  3. Check that it's being written to.
  4. If you are still getting nowhere, consult some of the wise members on the mailing list.
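
The echo trick from the checklist above can also be built into the script, so you don't have to edit it back and forth. This is only a sketch under my own assumptions: send_status and DRYRUN are hypothetical names, not anything BB or Hobbit knows about, and the stand-in values exist purely so the example runs on its own.

```shell
#!/bin/sh
# Sketch only: send_status and DRYRUN are hypothetical names, not part
# of BB or Hobbit. With DRYRUN set, the command is printed, not run.
send_status() {
    msg="status $MACHINE.$TEST $COLOUR `cat "$OUTFILE"`"
    if [ -n "${DRYRUN:-}" ]; then
        printf '%s\n' "$BB $BBDISP \"$msg\""
    else
        $BB $BBDISP "$msg"
    fi
}

# Stand-in values so the sketch runs on its own; in a real script
# these come from bbdef.sh and from the script itself.
BB=${BB:-/bin/echo} BBDISP=${BBDISP:-127.0.0.1} MACHINE=${MACHINE:-wallaby}
TEST=ora COLOUR=green OUTFILE=${TMPDIR:-/tmp}/dryrun.$$
date > "$OUTFILE"
DRYRUN=1
send_status
rm -f "$OUTFILE"
```

If the printed line shows gaps where $BB, $BBDISP or $MACHINE should be, you have found your problem.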

That's it

If you get the expected display on your Hobbit monitor, then congratulations! You have successfully added your own custom script. I would recommend testing it very carefully. Make sure that colours change and that the display updates. Now that it works, you can start getting fancy and adding inline HTML and so forth, to make it look really cool.

What about the graph?

Graphing is a topic for another document, and maybe another author. (I need to see if I have time)

Tidying up

Your new test has a column name, but if you click on the column name, you get an error. If you click on "conn" on the main page, it tells you what the conn test does. To make your new test do the same, edit /usr/lib/hobbit/server/etc/columndoc.csv on your Hobbit server. Add the following.

Testname;Test description;

For my test, I would add

ora;The <b>ora</b> test gives more information about the table space in the database;

Exactly what you put in here is your business. It's a descriptive field only, so you don't even need to update it.
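
If you would rather script the change than edit the file by hand, something along these lines should do. It is a sketch only: add_columndoc is a hypothetical helper, and it works on a scratch file here so nothing on a live server gets touched.

```shell
#!/bin/sh
# Sketch only: add_columndoc is a hypothetical helper.
# Usage: add_columndoc FILE TESTNAME "Description"
# Appends "TESTNAME;Description;" unless a line for TESTNAME exists.
add_columndoc() {
    file=$1
    name=$2
    desc=$3
    if grep "^$name;" "$file" >/dev/null 2>&1; then
        echo "$name is already documented in $file"
        return 0
    fi
    printf '%s;%s;\n' "$name" "$desc" >> "$file"
}

# Try it on a scratch file rather than the live columndoc.csv
SCRATCH=${TMPDIR:-/tmp}/columndoc.$$
: > "$SCRATCH"
add_columndoc "$SCRATCH" ora 'The <b>ora</b> test gives more information about the table space in the database'
cat "$SCRATCH"
rm -f "$SCRATCH"
```

Running it a second time for the same test name leaves the file alone, so it is safe to call from an install script.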


Written by Vernon Everett and sent to the Hobbitmon Mailing list