Intro
I’ve been searching long for a monitoring solution that will fit my needs. Tested the usual suspects like Nagios or Zabbix and I wasn’t impressed. At some point friend of mine recommended OMD – The Open Monitoring Distribution, initial tests were promising so I went with full scale deployment. Thanks Lysy!
After more than a year spent with it I can certainly say I loved it – elegant, flexible and simple solution. Currently monitoring around 100 machines and the whole thingy happily runs on relatively lightweight (2 vCores, 2GB RAM) Debian based VM.
Another useful feature comes with bundled Nagvis – it allows for pictures of the racks to be used as backgrounds of “maps”, to visualise and quickly identify problematic server:
Adding new host/checks is as simple as:
- edit a single file ~/etc/check_mk/main.mk
- re-generate configuration and restart services with cmk -O
- that’s it
At the core of OMD lies Check_MK combined with Icinga – what I especially liked about it is how easy it is to define and run local checks. Think about it as custom scripts that do some stuff and return output to monitoring server. Only your imagination is a limit here, among the things I monitor with local checks are:
- available tapes in tape library (warning when 2 tapes and critical when only 0 free tapes left)
- detailed status of RAID controller (pulling details with omreport, hpacucli or MegaCli64)
- status of license servers
- Grid Engine queue status
- ThinLinc server status
- and more…
Adding local check is easy, assuming check_mk agent is installed simply drop your script inside /usr/lib/check_mk_agent/local/ – during the next run check_mk will execute it and pass the result to the server. Doesn’t matter what scripting language you use, as long as output is a text string matching check_mk expectations.
Testing is also easy, simply telnet from server to port 6556 of your client and you’ll find local check results at the bottom of the status returned.
Anyway, this is my online notepad so there we go:
Adding client:
Preferably automated with Puppet. But for standalone hosts, manual installation, depending on OS can be triggered
wget http://monitoring.server/service/debian.sh && bash debian.sh or wget http://monitoring.server/service/rh.sh && bash rh.sh
I know, blindly executing bash scripts as root, yay! Here goes debian.sh:
wget http://monitoring.server/service/check-mk-agent_1.2.4p5-2_all.deb wget http://monitoring.server/service/check-mk-agent-logwatch_1.2.4p5-2_all.deb apt-get install xinetd snmpd -y dpkg -i check-mk-agent_1.2.4p5-2_all.deb check-mk-agent-logwatch_1.2.4p5-2_all.deb wget http://monitoring.server/service/snmpd.conf -O /etc/snmp/snmpd.conf wget http://monitoring.server/service/snmpdDebian -O /etc/default/snmpd wget http://monitoring.server/service/check_mk -O /etc/xinetd.d/check_mk service snmpd restart
and rh.sh has
wget http://monitoring.server/service/check_mk-agent-1.2.4p5-1.noarch.rpm wget http://monitoring.server/service/check_mk-agent-logwatch-1.2.4p5-1.noarch.rpm yum install xinetd net-snmp -y rpm -Uvh check_mk-agent-1.2.4p5-1.noarch.rpm check_mk-agent-logwatch-1.2.4p5-1.noarch.rpm wget http://monitoring.server/service/snmpd.conf -O /etc/snmp/snmpd.conf wget http://monitoring.server/service/snmpd -O /etc/sysconfig/snmpd wget http://monitoring.server/service/check_mk -O /etc/xinetd.d/check_mk service snmpd restart chkconfig snmpd on echo "Add this line to /etc/sysconfig/iptables before REJECT" echo "-A INPUT -s 10.100.10.5 -j ACCEPT" echo "and then" echo "service iptables restart"
My copy of /etc/xinetd.d/check_mk has “only_from = monitoring.server” line in it so we do have some sort of protection.
Modify local firewall ingress rules too, in order to accept check_mk 6556/udp traffic from management server only.
Defining client on server:
ssh monitoring.server sudo su - mysite vim ~/etc/check_mk/main.mk
See documentation for more details on how to fill this file.
Local checks, starting with Monitoring Available Bacula Tapes
Cronjob gathers information about tape library status twice a day:
crontab -l 0 8,16 * * * /usr/local/bin/baculaTapes.sh
Actual cron script:
#!/bin/bash echo "status slots Storage=Autochanger"| bconsole > /var/tmp/bacula.out
Now, we use check_mk agent capability to run local checks with the following script, vim /usr/lib/check_mk_agent/local/bacula_tapes.sh
#!/bin/bash ERRORTAPES=`grep Error /var/tmp/bacula.out|wc -l` APPENDTAPES=`egrep 'Append|Purged' /var/tmp/bacula.out|wc -l` FULLTAPES=`grep Full /var/tmp/bacula.out|wc -l` if [ $APPENDTAPES -lt 2 ] ; then status=2 statustxt=CRITICAL elif [ $APPENDTAPES -lt 3 ] || [ $ERRORTAPES -gt 0 ] ; then status=1 statustxt=WARNING else status=0 statustxt=OK fi echo "$status Backup_AvailableTapes APPENDTAPES=$APPENDTAPES;3;1;25; $statustxt - $APPENDTAPES Usable Tapes, $FULLTAPES Full Tapes, $ERRORTAPES Error Tapes"
Ubuntu security patches
On some server we don’t enable automated patching (especially Ubuntu!) but we want to know if there are security patches available in order to apply them in controlled manner. We rely on update-notifier-common so make sure it’s installed.
vim /usr/lib/check_mk_agent/local/ububtu-security-updates.sh
#!/bin/bash SECUPD=`/usr/lib/update-notifier/update-motd-updates-available|tail -n2|head -n1|cut -d" " -f1` SECTEXT=`/usr/lib/update-notifier/update-motd-updates-available |xargs` if [ $SECUPD -gt 2 ] ; then status=2 statustxt=CRITICAL elif [ $SECUPD -gt 1 ] ; then status=2 statustxt=WARNING else status=0 statustxt=OK fi echo "$status UBUNTU_SECURITY PACKAGES=$SECUPD;1;5 $statustxt - $SECTEXT"
SAMBA status
/usr/lib/check_mk_agent/local/samba-test.sh
#!/bin/bash SESSIONS=`smbstatus -b -d 0|egrep '^[0-9]'|wc -l|tr -d ' '` if [ $SESSIONS -gt 200 ] ; then status=2 statustxt=CRITICAL else status=0 statustxt=OK fi echo "$status SAMBA USERS=$SESSIONS;200;300 $statustxt - $SESSIONS user sessions, `smbstatus -d0 |grep version`"
FreeNX
/usr/lib/check_mk_agent/local/FreeNXusers.sh
#!/bin/bash SESSIONS=`nxserver --list|grep abc|wc -l` STRING=`nxserver --status|tail -n2|head -n1` STRINGOK="NX> 110 NX Server is running" # STRINGOK="110 NX Server is running" if [ "$STRING" == "$STRINGOK" ]; then status=0 statustxt=OK else status=2 statustxt=CRITICAL fi echo "$status FreeNX SESSIONS=$SESSIONS;25;30 $statustxt - $STRING, $SESSIONS users currently logged in"
SunRAY
/usr/lib/check_mk_agent/local/SunRayUsers.sh
#!/bin/bash CARDS=`/opt/SUNWut/sbin/utsession -p|egrep 'Payflex'|wc -l` SESSIONS=`/opt/SUNWut/sbin/utsession -p|egrep 'Payflex|pseudo'|wc -l` if [ $SESSIONS -lt 1 ] ; then status=2 statustxt=CRITICAL elif [ $SESSIONS -lt 3 ] ; then status=1 statustxt=WARNING else status=0 statustxt=OK fi echo "$status SunRays USERS=$CARDS;28;30 $statustxt - $SESSIONS Sunrays connected, $CARDS users currently logged in"
ThinLinc server
/usr/lib/check_mk_agent/local/ThinLinc.sh
#!/bin/bash SESSIONS=`who|wc -l` ThinLinc=`tail -n1 /var/log/thinlinc-user-licenses` SERVICES=`for i in vsmserver vsmagent; do service $i status;done|grep running|wc -l` if [ $SERVICES -lt 2 ] ; then status=2 statustxt=CRITICAL else status=0 statustxt=OK fi echo "$status ThinLinc USERS=$SESSIONS;28;30 $statustxt - $SESSIONS sessions, $ThinLinc"
Grid Engine queue status
Dirty script to check number of online workers. First root outputs status of the queue (so check_mk doesn’t need access to SGE environment) to temp file then check_mk local check scans output temp file.
vim /usr/local/bin/sge.sh
#!/bin/bash . /etc/profile.d/sge.sh /apps/sge/bin/lx24-amd64/qstat -f -u "*" > /tmp/sge.status
Trigger every 5 minutes
crontab -l */5 * * * * /usr/local/bin/sge.sh
Actual local check /usr/lib/check_mk_agent/local/sge.sh
#!/bin/bash OFFLINE=`cat /tmp/sge.status|grep au|wc -l` ONLINE=`cat /tmp/sge.status|grep BIP|wc -l` if [ "$OFFLINE" == 0 ]; then status=0 statustxt=OK else status=2 statustxt=CRITICAL fi echo "$status GridEngine OFFLINE=$OFFLINE;1;2 $statustxt - $ONLINE out of 3 SGE workers online"
Dell PERC RAID status
cronjob
*/15 * * * * /usr/local/bin/raid.sh
where /usr/local/bin/raid.sh
#!/bin/bash omreport storage vdisk controller=0 > /var/tmp/raid.txt omreport storage pdisk controller=0 >> /var/tmp/raid.txt
and actual check_mk local check /usr/lib/check_mk_agent/local/raid.sh has:
#!/bin/bash ONLINEDISKS=`grep Online /var/tmp/raid.txt|wc -l` VDISKSTATUS=`head -n10 /var/tmp/raid.txt|grep Status|awk -F" " '{print $3}'` if [ $ONLINEDISKS -lt 3 ] ; then status=2 statustxt=CRITICAL else status=0 statustxt=OK fi echo "$status PERC_H310_Mini_Status ONLINEDISKS=$ONLINEDISKS;3;1;25; $statustxt – $ONLINEDISKS Online Disks in 3x RAID5. Vdisk status $VDISKSTATUS"
HP Smart Array P400
vim /usr/local/bin/raid.sh
#!/bin/bash /usr/sbin/hpacucli ctrl slot=0 physicaldrive all show status > /var/tmp/raid-p.txt /usr/sbin/hpacucli ctrl slot=0 logicaldrive all show status > /var/tmp/raid-v.txt
crontab -l
*/5 * * * * /usr/local/bin/p400.sh
vim /usr/lib/check_mk_agent/local/raid.sh
#!/bin/bash ONLINEDISKS=`grep physicaldrive /var/tmp/raid-p.txt|wc -l` HOTSPARE=`grep spare /var/tmp/raid-p.txt|wc -l` VDISKSTATUS=`grep logicaldrive /var/tmp/raid-v.txt` if [ $ONLINEDISKS -lt 12 ] ; then status=2 statustxt=CRITICAL else status=0 statustxt=OK fi # varname=value;warn;crit;min;max, while the values ;warn;crit;min;max are optional values. # varname=value;warn;crit echo "$status SmartArrayP400 ONLINEDISKS=$ONLINEDISKS;11;10 $statustxt - $ONLINEDISKS Online disks: $HOTSPARE HotSpare, Vdisk status:$VDISKSTATUS"
Supermicro LSI Mega RAID SAS 9240-8i
cron
*/5 * * * * /usr/local/bin/raid.sh
vim /usr/local/bin/raid.sh
#!/bin/bash /usr/local/bin/MegaCli64 -PDList -aALL|grep "Firmware state"|wc -l > /var/tmp/raid-p.txt /usr/local/bin/MegaCli64 -PDList -aALL|grep "Firmware state"|grep Hotspare|wc -l > /var/tmp/raid-hs.txt /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL > /var/tmp/raid-v.txt
vim /usr/lib/check_mk_agent/local/raid.sh
#!/bin/bash ONLINEDISKS=`cat /var/tmp/raid-p.txt` HOTSPARE=`cat /var/tmp/raid-hs.txt` VDISKSTATUS=`cat /var/tmp/raid-v.txt|grep State|xargs` if [ $ONLINEDISKS -lt 12 ] ; then status=2 statustxt=CRITICAL else status=0 statustxt=OK fi # varname=value;warn;crit;min;max, while the values ;warn;crit;min;max are optional values. # varname=value;warn;crit echo "$status LSIMegaRAID9240-8i ONLINEDISKS=$ONLINEDISKS;13;12 $statustxt - $ONLINEDISKS Online disks: $HOTSPARE HotSpare, Vdisks: $VDISKSTATUS"
Bonus, monitoring ESXi with OMD
Not really a local check but took me a while to figure out hot to pull status from standalone ESXi server so adding for reference. Add user on ESXi and grant R/O access. Test from CLI to make sure we’ve got it right
/omd/sites/mysite/share/check_mk/agents/special/agent_vsphere -D --debug -u 'xxxxxxxxxx' -s 'xxxxxxxxx' -i hostsystem,virtualmachine,datastore,counters --timeout 5 esxhost01.domain
Add to etc/check_mk/main.mk:
#-- Custom checks for ESXi servers --- datasource_programs.append(( "/omd/sites/mysite/share/check_mk/agents/special/agent_vsphere -u 'xxxxxxxxx' -s 'xxxxxxxxxxxx' " "-i hostsystem,virtualmachine,datastore,counters --direct " "--hostname '<HOST>' --timeout 5 <IP>", [ "esxhost01.domain" ] ))
Bonus 2: Monitoring execution of Bacula backup jobs
That mechanism pulls job status from MySQL database and flags up job that hasn’t been successfully executed during the last say 25h (for daily jobs).
File /opt/omd/sites/prod/etc/nagios/conf.d/templates.cfg has the following generic service defined:
define service{ name bacula-prod-backup-generic-service host_name prod-backup service_description check-production-backup-jobs check_command check_bacula_at_prod-backup servicegroups backup-servers active_checks_enabled 1 ; Active service checks are enabled passive_checks_enabled 1 ; Passive service checks are enabled/accepted parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems) obsess_over_service 1 ; We should obsess over this service (if necessary) check_freshness 0 ; Default is to NOT check service 'freshness' notifications_enabled 1 ; Service notifications are enabled event_handler_enabled 1 ; Service event handler is enabled failure_prediction_enabled 1 ; Failure prediction is enabled process_perf_data 1 ; Process performance data retain_status_information 1 ; Retain status information across program restarts retain_nonstatus_information 1 ; Retain non-status information across program restarts notification_interval 0 ; Only send notifications on status change by default. is_volatile 0 check_period 24x7 normal_check_interval 5 retry_check_interval 1 max_check_attempts 4 notification_period 24x7 notification_options w,u,c,r # contact_groups admins register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE! }
File /opt/omd/sites/prod/etc/nagios/conf.d/bacula_prod-backup.cfg has one service per job defined
define service { use bacula-prod-backup-generic-service service_description Backup: Studies2010-1 check_command check_bacula_at_prod-backup!744!1!1!Studies2010-1 } define service { use bacula-prod-backup-generic-service service_description Backup: Studies2010-2 check_command check_bacula_at_prod-backup!744!1!1!Studies2010-2 } define service { use bacula-prod-backup-generic-service service_description Backup: Studies2010-3 check_command check_bacula_at_prod-backup!744!1!1!Studies2010-3 } and so on.
and file /opt/omd/sites/prod/etc/nagios/conf.d/commands.cfg has the following command defined
define command { command_name check_bacula_at_prod-backup command_line $USER2$/check_bacula_prod-backup.pl -H '$ARG1$' -w '$ARG2$' -c '$ARG3$' -j '$ARG4$' }
Finally, Perl script /omd/sites/prod/local/lib/nagios/plugins/check_bacula_prod-backup.pl has bacula database connection details embedded (sql user with R/O privileges is fine). I got a copy of check_bacula.pl version 0.0.3 from Bacula website, submitted by Julian Hein NETWAYS GmbH and modified by Silver Salonen – good job gentlemen.
Bonus 3, monitoring NAS4FREE with OMD
Click here.
Thanks
And the final note, many thanks to Mathias Kettner and his team for providing this fantastic tool under GPL license. Well done chaps.