Web Interface for CollectD

See http://www.collectd.org/

CollectD has a lot going for it, IMHO, especially compared to Ganglia (see my appeal for help on Ganglia). It costs little to run and is very easy to set up. On the other hand, you cannot mix statistics from 32- and 64-bit versions of Linux, and the stock web interface is dismal, to say the least: it's a mess to set up, does almost zero error checking, and leaves you to guess what needs to be installed and where things must live. I once spent the better part of a day getting the web interface to come up, and then it was nothing like what I wanted.

That was a couple of years ago, and I had hoped someone would create a web interface closer to what I wanted. No joy - so I've built just enough for my own needs. Maybe what I've done will be close to what you want, or at least a decent start. Here's the story.

I manage a set of clusters for the CSG. I don't really want collectd to do much more than show me trends of what is happening. I want to look at a week's or a month's data and see "how full" the cluster is, whether the network is too busy, and whether the nodes are paging too much. I don't want paper reports to pore over - just the big picture. Collectd data is just what I want; viewing the data is another issue.

CollectD Setup

In our MOSIX cluster we pretend there are two types of nodes - clients and gateways. Gateways are the nodes users log in to and from which they launch tasks on the cluster. The gateways send the processes to the clients, where they do most of their execution. For our very CPU-intensive world, this model works remarkably well.

We've recently upgraded to Release 4 of Collectd, which both simplified and complicated things: the config files changed, many types changed names, and the paths to the data moved. Most of the details of the change are in the files themselves, so this web page did not change much, except for the Release 4 config files we use. The first is for the ordinary (client) nodes:

#   Almost all nodes use this - just collect local stats and pass them elsewhere
#   Gotta have these
LoadPlugin network
LoadPlugin syslog
#   Collect stuff on these
LoadPlugin df
LoadPlugin cpu
LoadPlugin memory
LoadPlugin processes
LoadPlugin swap
LoadPlugin interface

#   Use syslog or logfile so you know what the heck is going on
<Plugin syslog>
	LogLevel info
</Plugin>

<Plugin network>
	Server "192.168.1.7" "12345"
	TimeToLive "128"
	Forward false
	CacheFlush 1800
</Plugin>

<Plugin df>
	MountPoint "/"
	MountPoint "/home"
	MountPoint "/home0"
	MountPoint "/home1"
	MountPoint "/home2"
</Plugin>

#   Do all interfaces except local
<Plugin "interface">
  Interface "lo"
  IgnoreSelected true
</Plugin>

That covers the ordinary nodes. The single listener node uses this config instead:

#   Only one node collects data on itself AND collects data from all others
BaseDir     "/data/collectd"
#   Gotta have these
LoadPlugin network
LoadPlugin syslog
LoadPlugin rrdtool
#   Collect stuff on these
LoadPlugin df
LoadPlugin cpu
LoadPlugin memory
LoadPlugin processes
LoadPlugin swap
LoadPlugin interface

#   Use syslog or logfile so you know what the heck is going on
<Plugin syslog>
	LogLevel info
</Plugin>

<Plugin network>
	Server "192.168.1.7" "12345"
	Listen "192.168.1.7" "12345"
	TimeToLive "128"
	Forward false
	CacheFlush 1800
</Plugin>

<Plugin df>
	MountPoint "/"
	MountPoint "/home"
	MountPoint "/home0"
	MountPoint "/home1"
	MountPoint "/home2"
</Plugin>

#   Do all interfaces except local
<Plugin "interface">
  Interface "lo"
  IgnoreSelected true
</Plugin>

<Plugin rrdtool>
	DataDir "/data/collectd"
</Plugin>

The collectd listener could be a machine that is not in the cluster, whose sole task is to collect data from the other machines (and itself), or it could be one of the cluster's own nodes. If you have multiple architectures, you will need more than one listener (that was true with CollectD Release 3; perhaps it has changed). It's easy to get confused about this: every node collects data on what's going on inside itself and forwards it to the machine named by Server in the network plugin, and each machine should send its data to a listener of the same architecture (I think). You might even want more than one machine collecting data; doing that is easy.
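
To check that the listener is actually receiving data, one quick test (a sketch, assuming GNU find and the BaseDir used above) is to list the hosts whose RRD files have been updated recently:

#!/bin/bash
#   Show which hosts have sent data in the last 10 minutes.
#   /data/collectd matches the BaseDir in the listener config above.
find /data/collectd -name '*.rrd' -mmin -10 |
  cut -d/ -f4 | sort -u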

Making Images

The web site which actually displays my data is not part of the cluster, so one problem I have is getting my collectd data to where it can be displayed. At one point I was regularly copying the data from the server to the web machine. This worked until I added my first 64-bit machine, at which point the web server could no longer produce graphs from the data - RRD files are not portable across architectures. I solved this by generating the graphs on the machine where collectd collects the data, using a little crontab job like the following. Later, another crontab script copies the PNG files to the web server. This means there's more copying than is strictly necessary, but at least it works.

#!/bin/bash
#
#   Create images for the collectd web site from the raw collectd data
#   This is necessary because the web site does not have the
#   software installed to create the images on the fly.
#
me=`basename $0`
h=`hostname`
time='day'
e=`date +%s`
s=`expr $e - 86400`
o=$time
tint=$time
datadir='/data/collectd'
pngdir='/data/phpimages'
clusterload='/data/tpg/etc/clusterload.pl'

#   Figure out groups based on host we run on
groups='MACC DATA'
if [ "$h" = "frodo" ]; then
  groups='DC'
fi
#   If these types are not the exact name used by collectd (e.g. traffic0),
#   you will want a symlink in the defs directory to the correct def file.
types='load memory swap traffic0 traffic1 df-home df-home0 df-home1 df-home2 df-tmp'

#-----------------------------------------------------------------
#   Sort out options and parameters
#-----------------------------------------------------------------
outfile=''
while [ -n "$(echo "$1" | grep '^-')" ]; do
  case $1 in
    -h )
      echo "Usage: $me [-options]"
      echo "Create images for the collectd web site."
      echo ""
      echo "Options:"
      echo "  -day   - create daily files"
      echo "  -week  - create weekly files"
      echo "  -month - create monthly files"
      echo "  -year  - create yearly files"
      echo "  -create name   - set name of output file (specify before -range)"
      echo "  -range yyyymmdd yyyymmdd  - create files from date to date"
      exit 1
      ;;
    -day )
      time='day'
      e=`date +%s`
      s=`expr $e - 86400`
      o=$time
      tint=$time
      ;;
    -week )
      time='week'
      e=`date +%s`
      s=`expr $e - 604800`
      o=$time
      tint=$time
      ;;
    -month )
      time='month'
      e=`date +%s`
      s=`expr $e - 2592000`
      o=$time
      tint=$time
      ;;
    -year )
      time='year'
      e=`date +%s`
      s=`expr $e - 31536000`
      o=$time
      tint=$time
      ;;
    -create )
      outfile=$2
      shift
      o=$outfile
      ;;
    -range )
      time="$2-$3$outfile"
      e=$3
      s=$2
      epre=`perl -e 'print substr($ARGV[0],0,6);' $e`
      spre=`perl -e 'print substr($ARGV[0],0,6);' $s`
      if [ "$epre" = "$spre" ]; then
        tint='month'
      else
        tint='year'
      fi
      shift
      shift
      ;;
    * )
      print "$me Invalid option '$1'"
      exit 1
  esac
  shift
done

for g in $groups; do
  for t in $types; do
    php makeimages.php quiet=1 htmlmsg=0 group=$g time=$time type=$t
  done
  tint="-interval $tint"
  #    Add -verbose below
  $clusterload -start $s -end $e -outfileave $pngdir/$g.ave.$o.png -outfiletotal $pngdir/$g.total.$o.png $tint $datadir/*/load/load.rrd
done
exit

The script shown above, make-images.sh, relies on having the PHP command-line interface installed so it can run PHP scripts from the shell. The makeimages.php script is a heavily modified version of the script provided by the collectd folks. It does piles of checking for missing things, and when things fail, it generates long, elaborate error messages so you have a chance to actually figure out what the heck is going on. You'll need it.
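
The copy step mentioned earlier doesn't need to be fancy; a one-line rsync in a cron job will do. A sketch (the script name, the host "webhost", and the destination path are all made up - use your own):

#!/bin/bash
#   copy-images.sh - push the finished PNG files to the web server.
#   "webhost" and the destination directory are placeholders.
rsync -a /data/phpimages/ webhost:/var/www/html/collectd/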

In the summer of 2008 I added another step, invoking clusterload.pl, a Perl script I wrote, to come up with a pseudo-load for the entire cluster. Two numbers are plotted - the average load per node and the total load (the sum of the loads on all nodes). The script above generates two such cluster-load images per group for each time period (e.g. MACC.ave.week.png and MACC.total.week.png). These images are then referenced in the HTML and PHP scripts used to display information about the cluster.
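
I won't reproduce clusterload.pl here, but if you're curious where the two numbers come from, this little sketch computes the same kind of total and average from the most recent short-term load in each node's load.rrd. It assumes the usual collectd load RRD layout and rrdtool in your PATH - it is not the real script:

#!/bin/bash
#   Sketch of the cluster pseudo-load idea - NOT the real clusterload.pl.
#   Grab the latest short-term load from every node and sum them up.
datadir='/data/collectd'
for rrd in $datadir/*/load/load.rrd; do
  #   Data lines look like "1215471600: 1.2345e+00 ..."; skip headers and nan
  rrdtool fetch "$rrd" AVERAGE -s -600 |
    awk '$1 ~ /^[0-9]+:$/ && $2 !~ /nan/ { last = $2 }
         END { if (last != "") print last }'
done | awk '{ sum += $1; n++ }
       END { if (n) printf "total load %.2f over %d nodes, average %.2f\n", sum, n, sum/n }'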

I'm not going to explain all the details of the scripts - that's left as an exercise for the reader. You will want to modify these scripts for your own site, but at least you won't have to work nearly as hard at it as I did the first time.

Ask for release 4 (collectd-rel4.zip) or release 3 (collectd-rel3.zip). All the files should be in one directory (see the shell script above) where they can be invoked. Run your version of this script to generate the PNG images. After it completes, you may want to run another script you have created to copy the newly created images to somewhere your web server can display them. You'll have to construct a few extra scripts to drive this, as well as set up the proper crontab entries.
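
On the listener, the crontab might look something like this (a sketch - the times, the /data/tpg/bin paths, and copy-images.sh are all placeholders for whatever you set up):

#   Regenerate the daily images every hour, longer ranges once a day
15 * * * *   /data/tpg/bin/make-images.sh -day
30 1 * * *   /data/tpg/bin/make-images.sh -week
45 2 * * *   /data/tpg/bin/make-images.sh -month
#   Then push the results to the web server
50 * * * *   /data/tpg/bin/copy-images.sh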

Sorry, I don't know the dependencies any more. If you try this, let me know what dependencies you find and I'll come back and document them here.

Showing the CollectD Data

Now you have a pile of images in your web space, and displaying them is straightforward. All the files used to create the images are available - just ask - and feel free to swipe them and change them for your needs. Likewise, feel free to crib the HTML files you see so you can build your own pages from collectd images, and if anyone is interested, I'd be glad to share the PHP files that generate the pages you see above.
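
If you'd rather not write the HTML by hand, even something as crude as this will get the pictures on screen (a sketch - the web root is a placeholder, and the file names must match what make-images.sh produced):

#!/bin/bash
#   Throw every daily image onto one bare-bones index page.
imgdir='/var/www/html/collectd'     # placeholder web root
page=$imgdir/index.html
echo '<html><head><title>Cluster Trends</title></head><body>' > $page
for png in $imgdir/*.day.png; do
  name=`basename $png`
  echo "<p><img src=\"$name\" alt=\"$name\"></p>" >> $page
done
echo '</body></html>' >> $page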

Good luck and if you come up with something you like better for your cluster or your server machines, I'd be pleased to hear about it.