Jun 24 2015
 

Intro

I had been searching for a long time for a monitoring solution that would fit my needs. I tested the usual suspects like Nagios and Zabbix and wasn’t impressed. At some point a friend of mine recommended OMD (The Open Monitoring Distribution); initial tests were promising, so I went for a full-scale deployment. Thanks Lysy!

insta-04

After more than a year with it I can certainly say I love it: an elegant, flexible and simple solution. It currently monitors around 100 machines and the whole thingy happily runs on a relatively lightweight (2 vCores, 2GB RAM) Debian-based VM.

Another useful feature comes with the bundled NagVis: it lets you use pictures of the racks as backgrounds of “maps”, to visualise the setup and quickly identify a problematic server:

nagvis_datacentre

 

Adding new host/checks is as simple as:

  • edit a single file ~/etc/check_mk/main.mk
  • re-generate configuration and restart services with cmk -O
  • that’s it

At the core of OMD lies Check_MK combined with Icinga. What I especially like about it is how easy it is to define and run local checks. Think of them as custom scripts that do some work and return their output to the monitoring server. Only your imagination is the limit here; among the things I monitor with local checks are:

  • available tapes in tape library (warning when only 2 usable tapes are left, critical when fewer than 2)
  • detailed status of RAID controller (pulling details with omreport, hpacucli or MegaCli64)
  • status of license servers
  • Grid Engine queue status
  • ThinLinc server status
  • and more…

 

Adding a local check is easy: assuming the check_mk agent is installed, simply drop your script into /usr/lib/check_mk_agent/local/. On its next run the agent will execute it and pass the result to the server. It doesn’t matter which scripting language you use, as long as the output is a line of text in the format check_mk expects: status code, service name, performance data and a free-text message.
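To make that output format concrete, here is a minimal sketch of a helper that prints one result line. The name local_check_line is made up for illustration and is not part of check_mk; the layout (status, service name, performance data, text) is the one used by every script further down.

```shell
#!/bin/bash
# Sketch: print one local-check result line in the layout used throughout this post:
#   <status> <service_name> <perfdata> <free text>
# local_check_line is a hypothetical helper, not something shipped with check_mk.
local_check_line() {
    status=$1; name=$2; perfdata=$3; shift 3
    case "$status" in
        0) label=OK ;;
        1) label=WARNING ;;
        2) label=CRITICAL ;;
        *) label=UNKNOWN; status=3 ;;   # anything else is reported as UNKNOWN
    esac
    # The service name must not contain spaces; "-" stands for "no performance data".
    echo "$status $name ${perfdata:--} $label - $*"
}
```

A call such as local_check_line 1 Backup_AvailableTapes "APPENDTAPES=2;3;1" "2 usable tapes" would produce a WARNING line for the tape check described later.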

Testing is also easy, simply telnet from server to port 6556 of your client and you’ll find local check results at the bottom of the status returned.
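If scrolling to the bottom of the dump gets tedious, the local section can be filtered out. This is only a sketch: it assumes the agent marks its sections with <<<name>>> headers and puts local check results under a header starting with <<<local, and extract_local_section is a made-up name.

```shell
#!/bin/bash
# Sketch: print only the local-check lines from a full agent dump read on stdin.
# Assumes section headers look like <<<name>>> and local checks sit under <<<local...>>>.
extract_local_section() {
    awk '
        /^<<<local/ { in_local = 1; next }   # start of the local section
        /^<<</      { in_local = 0 }         # any other header ends it
        in_local    { print }
    '
}
```

Something like nc client.example 6556 | extract_local_section (hypothetical host name, nc instead of telnet) should then show just the lines produced by your own scripts.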

Anyway, this is my online notepad so there we go:

 


 

Adding client:

Preferably automated with Puppet. For standalone hosts, manual installation can be triggered with one of the following, depending on the OS:

wget http://monitoring.server/service/debian.sh && bash debian.sh
or
wget http://monitoring.server/service/rh.sh && bash rh.sh

 

I know, blindly executing bash scripts as root, yay! Here goes debian.sh:

wget http://monitoring.server/service/check-mk-agent_1.2.4p5-2_all.deb
wget http://monitoring.server/service/check-mk-agent-logwatch_1.2.4p5-2_all.deb
apt-get install xinetd snmpd -y
dpkg -i check-mk-agent_1.2.4p5-2_all.deb check-mk-agent-logwatch_1.2.4p5-2_all.deb
wget http://monitoring.server/service/snmpd.conf -O /etc/snmp/snmpd.conf
wget http://monitoring.server/service/snmpdDebian -O /etc/default/snmpd
wget http://monitoring.server/service/check_mk -O /etc/xinetd.d/check_mk
service snmpd restart

 

and rh.sh has

wget http://monitoring.server/service/check_mk-agent-1.2.4p5-1.noarch.rpm
wget http://monitoring.server/service/check_mk-agent-logwatch-1.2.4p5-1.noarch.rpm
yum install xinetd net-snmp -y
rpm -Uvh check_mk-agent-1.2.4p5-1.noarch.rpm check_mk-agent-logwatch-1.2.4p5-1.noarch.rpm
wget http://monitoring.server/service/snmpd.conf -O /etc/snmp/snmpd.conf
wget http://monitoring.server/service/snmpd -O /etc/sysconfig/snmpd
wget http://monitoring.server/service/check_mk -O /etc/xinetd.d/check_mk
service snmpd restart
chkconfig snmpd on
echo "Add this line to /etc/sysconfig/iptables before REJECT"
echo "-A INPUT -s 10.100.10.5 -j ACCEPT"
echo "and then"
echo "service iptables restart"

 

My copy of /etc/xinetd.d/check_mk has “only_from = monitoring.server” line in it so we do have some sort of protection.

Modify the local firewall ingress rules too, so that check_mk traffic on 6556/tcp is accepted from the monitoring server only.
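The rule echoed by rh.sh accepts everything from the monitoring server. A narrower variant, limited to the agent port, could look like the sketch below. It assumes the agent listens on TCP port 6556 (the same port used for the telnet test earlier); agent_rule is a hypothetical helper that only prints the line to paste into /etc/sysconfig/iptables.

```shell
#!/bin/bash
# Sketch: print an iptables rule that accepts check_mk agent traffic (TCP port 6556)
# from a single monitoring server. agent_rule is a hypothetical helper.
agent_rule() {
    # $1 = address of the monitoring server
    echo "-A INPUT -p tcp -s $1 --dport 6556 -j ACCEPT"
}
```

As with the original rule, the line has to go before the REJECT rule, followed by service iptables restart.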


 

Defining client on server:

ssh monitoring.server
sudo su - mysite
vim ~/etc/check_mk/main.mk

See documentation for more details on how to fill this file.

 


 

Local checks, starting with Monitoring Available Bacula Tapes

Cronjob gathers information about tape library status twice a day:

crontab -l
0 8,16 * * * /usr/local/bin/baculaTapes.sh

 

Actual cron script:

#!/bin/bash
echo "status slots Storage=Autochanger"| bconsole > /var/tmp/bacula.out

 

Now we use the check_mk agent’s ability to run local checks, with the following script: vim /usr/lib/check_mk_agent/local/bacula_tapes.sh

 

#!/bin/bash
ERRORTAPES=`grep Error /var/tmp/bacula.out|wc -l`
APPENDTAPES=`egrep 'Append|Purged' /var/tmp/bacula.out|wc -l`
FULLTAPES=`grep Full /var/tmp/bacula.out|wc -l`

if [ $APPENDTAPES -lt 2 ] ; then
status=2
statustxt=CRITICAL
elif [ $APPENDTAPES -lt 3 ] || [ $ERRORTAPES -gt 0 ] ; then
status=1
statustxt=WARNING
else
status=0
statustxt=OK
fi

echo "$status Backup_AvailableTapes APPENDTAPES=$APPENDTAPES;3;1;25; $statustxt - $APPENDTAPES Usable Tapes, $FULLTAPES Full Tapes, $ERRORTAPES Error Tapes"
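One weakness of this approach is that the check keeps reporting the last known state if the cronjob stops running. A minimal sketch of a freshness guard follows; cache_is_fresh is a made-up helper and stat -c %Y assumes GNU coreutils.

```shell
#!/bin/bash
# Sketch: succeed only if a cache file exists and was modified within max_age seconds.
# cache_is_fresh is a hypothetical helper; stat -c %Y assumes GNU coreutils.
cache_is_fresh() {
    file=$1; max_age=$2
    [ -f "$file" ] || return 1
    now=$(date +%s)
    mtime=$(stat -c %Y "$file") || return 1
    [ $((now - mtime)) -le "$max_age" ]
}
```

With the cronjob running at 8 and 16, the longest gap between runs is 16 hours, so the local check could call cache_is_fresh /var/tmp/bacula.out 61200 (17 hours) and report status 3 (UNKNOWN) when it fails.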

Ubuntu security patches

On some servers we don’t enable automated patching (especially Ubuntu!), but we want to know when security patches are available so we can apply them in a controlled manner. We rely on update-notifier-common, so make sure it’s installed.

vim /usr/lib/check_mk_agent/local/ubuntu-security-updates.sh

#!/bin/bash
SECUPD=`/usr/lib/update-notifier/update-motd-updates-available|tail -n2|head -n1|cut -d" " -f1`
SECTEXT=`/usr/lib/update-notifier/update-motd-updates-available |xargs`

if [ $SECUPD -gt 2 ] ; then
        status=2
        statustxt=CRITICAL
elif [ $SECUPD -gt 1 ] ; then
        status=1
        statustxt=WARNING
else
        status=0
        statustxt=OK
fi


echo "$status UBUNTU_SECURITY PACKAGES=$SECUPD;1;5 $statustxt - $SECTEXT"
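The tail/head/cut chain breaks if the helper prints nothing or changes its layout, and an empty $SECUPD makes the numeric test fail with a syntax error. A slightly more defensive parse, as a sketch: it assumes the output contains a line such as “5 updates are security updates.” (the exact wording may differ between releases), and security_update_count is a made-up name.

```shell
#!/bin/bash
# Sketch: read the update-notifier text on stdin and print the number at the start
# of the first line mentioning "security"; print 0 if there is no such line.
# The sample wording is an assumption, check it against your own output.
security_update_count() {
    awk '
        /security/ { print $1 + 0; found = 1; exit }   # "+ 0" forces a numeric value
        END        { if (!found) print 0 }
    '
}
```

In the check this would replace the SECUPD assignment, e.g. SECUPD=`/usr/lib/update-notifier/update-motd-updates-available | security_update_count`.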

 


SAMBA status

/usr/lib/check_mk_agent/local/samba-test.sh

#!/bin/bash
SESSIONS=`smbstatus -b -d 0|egrep '^[0-9]'|wc -l|tr -d ' '`
if [ $SESSIONS -gt 200 ] ; then
        status=2
        statustxt=CRITICAL
else
        status=0
        statustxt=OK
fi


echo "$status SAMBA USERS=$SESSIONS;200;300 $statustxt - $SESSIONS user sessions, `smbstatus -d0 |grep version`"

 


 

FreeNX

/usr/lib/check_mk_agent/local/FreeNXusers.sh

#!/bin/bash
SESSIONS=`nxserver --list|grep abc|wc -l`
STRING=`nxserver --status|tail -n2|head -n1`
STRINGOK="NX> 110 NX Server is running"
# STRINGOK="110 NX Server is running"
if [ "$STRING" == "$STRINGOK" ];
then
status=0
statustxt=OK
else
status=2
statustxt=CRITICAL
fi
echo "$status FreeNX SESSIONS=$SESSIONS;25;30 $statustxt - $STRING, $SESSIONS users currently logged in"

 

SunRAY

 
/usr/lib/check_mk_agent/local/SunRayUsers.sh

#!/bin/bash
CARDS=`/opt/SUNWut/sbin/utsession -p|egrep 'Payflex'|wc -l`
SESSIONS=`/opt/SUNWut/sbin/utsession -p|egrep 'Payflex|pseudo'|wc -l`

if [ $SESSIONS -lt 1 ] ; then
status=2
statustxt=CRITICAL
elif [ $SESSIONS -lt 3 ] ; then
status=1
statustxt=WARNING
else
status=0
statustxt=OK
fi
echo "$status SunRays USERS=$CARDS;28;30 $statustxt - $SESSIONS Sunrays connected, $CARDS users currently logged in"

 


ThinLinc server

/usr/lib/check_mk_agent/local/ThinLinc.sh

#!/bin/bash
SESSIONS=`who|wc -l`
ThinLinc=`tail -n1 /var/log/thinlinc-user-licenses`
SERVICES=`for i in vsmserver vsmagent; do service $i status;done|grep running|wc -l`

if [ $SERVICES -lt 2 ] ; then
status=2
statustxt=CRITICAL
else
status=0
statustxt=OK
fi

echo "$status ThinLinc USERS=$SESSIONS;28;30 $statustxt - $SESSIONS sessions, $ThinLinc"

 


 

Grid Engine queue status

 

A dirty script to check the number of online workers. First a root cronjob writes the status of the queue to a temp file (so check_mk doesn’t need access to the SGE environment), then the check_mk local check scans that temp file.

vim /usr/local/bin/sge.sh

#!/bin/bash
. /etc/profile.d/sge.sh
/apps/sge/bin/lx24-amd64/qstat -f -u "*" > /tmp/sge.status

Trigger every 5 minutes

crontab -l
*/5 * * * * /usr/local/bin/sge.sh
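Because the cronjob and the agent run independently, the local check can occasionally read a half-written status file. One way around that, sketched here with a made-up helper called write_atomically, is to write to a temporary file in the same directory and rename it into place. The mktemp template form assumes GNU mktemp.

```shell
#!/bin/bash
# Sketch: run a command and replace the target file with its output only if the
# command succeeds. write_atomically is a hypothetical helper.
write_atomically() {
    target=$1; shift
    # Temp file lives next to the target so the final mv is a rename on the same filesystem.
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    if "$@" > "$tmp"; then
        chmod 644 "$tmp"          # mktemp creates the file with mode 600
        mv -f "$tmp" "$target"
    else
        rm -f "$tmp"              # keep the previous status file on failure
        return 1
    fi
}
```

In /usr/local/bin/sge.sh the qstat line would then become write_atomically /tmp/sge.status /apps/sge/bin/lx24-amd64/qstat -f -u "*". The same trick applies to the RAID cron scripts below.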

Actual local check /usr/lib/check_mk_agent/local/sge.sh

#!/bin/bash
OFFLINE=`cat /tmp/sge.status|grep au|wc -l`
ONLINE=`cat /tmp/sge.status|grep BIP|wc -l`

if [ "$OFFLINE" == 0 ];
then
status=0
statustxt=OK
else
status=2
statustxt=CRITICAL
fi

echo "$status GridEngine OFFLINE=$OFFLINE;1;2 $statustxt - $ONLINE out of 3 SGE workers online"

 


 

Dell PERC RAID status

cronjob

*/15 * * * * /usr/local/bin/raid.sh

where /usr/local/bin/raid.sh

 

#!/bin/bash
omreport storage vdisk controller=0  > /var/tmp/raid.txt
omreport storage pdisk controller=0  >> /var/tmp/raid.txt

and actual check_mk local check /usr/lib/check_mk_agent/local/raid.sh has:

#!/bin/bash
ONLINEDISKS=`grep Online /var/tmp/raid.txt|wc -l`
VDISKSTATUS=`head -n10 /var/tmp/raid.txt|grep Status|awk -F" " '{print $3}'`

if [ $ONLINEDISKS -lt 3 ] ; then
status=2
statustxt=CRITICAL
else
status=0
statustxt=OK
fi

echo "$status PERC_H310_Mini_Status ONLINEDISKS=$ONLINEDISKS;3;1;25; $statustxt - $ONLINEDISKS Online Disks in 3x RAID5. Vdisk status $VDISKSTATUS"

 

HP Smart Array P400

vim /usr/local/bin/raid.sh

#!/bin/bash
/usr/sbin/hpacucli ctrl slot=0 physicaldrive all show status > /var/tmp/raid-p.txt
/usr/sbin/hpacucli ctrl slot=0 logicaldrive all show status > /var/tmp/raid-v.txt

crontab -l

*/5 * * * * /usr/local/bin/raid.sh

vim /usr/lib/check_mk_agent/local/raid.sh

#!/bin/bash
ONLINEDISKS=`grep physicaldrive /var/tmp/raid-p.txt|wc -l`
HOTSPARE=`grep spare /var/tmp/raid-p.txt|wc -l`
VDISKSTATUS=`grep logicaldrive /var/tmp/raid-v.txt`

if [ $ONLINEDISKS -lt 12 ] ; then
status=2
statustxt=CRITICAL
else
status=0
statustxt=OK
fi

# varname=value;warn;crit;min;max, while the values ;warn;crit;min;max are optional values.
# varname=value;warn;crit

echo "$status SmartArrayP400 ONLINEDISKS=$ONLINEDISKS;11;10 $statustxt - $ONLINEDISKS Online disks: $HOTSPARE HotSpare, Vdisk status:$VDISKSTATUS"
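In the script above the status is decided by one threshold (-lt 12) while the performance data advertises another pair (11;10). A sketch of deriving both from the same warn/crit values follows; disk_check_line is a made-up helper and “at or below the threshold” is just one possible convention.

```shell
#!/bin/bash
# Sketch: derive status and perfdata from the same thresholds so they cannot drift apart.
# Usage: disk_check_line <service_name> <online_disks> <warn> <crit>
# disk_check_line is a hypothetical helper; values at or below a threshold trigger it.
disk_check_line() {
    name=$1; value=$2; warn=$3; crit=$4
    if [ "$value" -le "$crit" ]; then
        status=2; label=CRITICAL
    elif [ "$value" -le "$warn" ]; then
        status=1; label=WARNING
    else
        status=0; label=OK
    fi
    # perfdata layout: varname=value;warn;crit
    echo "$status $name ONLINEDISKS=$value;$warn;$crit $label - $value online disks"
}
```

For example, disk_check_line SmartArrayP400 "$ONLINEDISKS" 11 10 would warn at 11 disks and go critical at 10 or fewer, matching what the graph thresholds show.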

 


 

Supermicro LSI Mega RAID SAS 9240-8i

cron

*/5 * * * * /usr/local/bin/raid.sh

vim /usr/local/bin/raid.sh

#!/bin/bash
/usr/local/bin/MegaCli64 -PDList -aALL|grep "Firmware state"|wc -l > /var/tmp/raid-p.txt
/usr/local/bin/MegaCli64 -PDList -aALL|grep "Firmware state"|grep Hotspare|wc -l > /var/tmp/raid-hs.txt
/usr/local/bin/MegaCli64 -LDInfo -Lall -aALL > /var/tmp/raid-v.txt

vim /usr/lib/check_mk_agent/local/raid.sh

#!/bin/bash
ONLINEDISKS=`cat /var/tmp/raid-p.txt`
HOTSPARE=`cat /var/tmp/raid-hs.txt`
VDISKSTATUS=`cat /var/tmp/raid-v.txt|grep State|xargs`

if [ $ONLINEDISKS -lt 12 ] ; then
status=2
statustxt=CRITICAL
else
status=0
statustxt=OK
fi

# varname=value;warn;crit;min;max, while the values ;warn;crit;min;max are optional values.
# varname=value;warn;crit

echo "$status LSIMegaRAID9240-8i ONLINEDISKS=$ONLINEDISKS;13;12 $statustxt - $ONLINEDISKS Online disks: $HOTSPARE HotSpare, Vdisks: $VDISKSTATUS"

 


 

Bonus, monitoring ESXi with OMD

Not really a local check, but it took me a while to figure out how to pull status from a standalone ESXi server, so I’m adding it for reference. Add a user on ESXi and grant it R/O access. Test from the CLI to make sure we’ve got it right:

/omd/sites/mysite/share/check_mk/agents/special/agent_vsphere -D --debug -u 'xxxxxxxxxx' -s 'xxxxxxxxx' -i hostsystem,virtualmachine,datastore,counters --timeout 5 esxhost01.domain

Add to etc/check_mk/main.mk:

#-- Custom checks for ESXi servers ---
datasource_programs.append((
"/omd/sites/mysite/share/check_mk/agents/special/agent_vsphere -u 'xxxxxxxxx' -s 'xxxxxxxxxxxx' "
"-i hostsystem,virtualmachine,datastore,counters --direct "
"--hostname '<HOST>' --timeout 5 <IP>", [ "esxhost01.domain" ]
))

 

Bonus 2: Monitoring execution of Bacula backup jobs

This mechanism pulls job status from the MySQL database and flags up any job that hasn’t been successfully executed within a given window, say the last 25h for daily jobs.

File /opt/omd/sites/prod/etc/nagios/conf.d/templates.cfg has the following generic service defined:

define service{
 name bacula-prod-backup-generic-service 
 host_name prod-backup
 service_description check-production-backup-jobs
 check_command check_bacula_at_prod-backup
 servicegroups backup-servers
 active_checks_enabled 1 ; Active service checks are enabled
 passive_checks_enabled 1 ; Passive service checks are enabled/accepted
 parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
 obsess_over_service 1 ; We should obsess over this service (if necessary)
 check_freshness 0 ; Default is to NOT check service 'freshness'
 notifications_enabled 1 ; Service notifications are enabled
 event_handler_enabled 1 ; Service event handler is enabled
 failure_prediction_enabled 1 ; Failure prediction is enabled
 process_perf_data 1 ; Process performance data
 retain_status_information 1 ; Retain status information across program restarts
 retain_nonstatus_information 1 ; Retain non-status information across program restarts
 notification_interval 0 ; Only send notifications on status change by default.
 is_volatile 0
 check_period 24x7
 normal_check_interval 5
 retry_check_interval 1
 max_check_attempts 4
 notification_period 24x7
 notification_options w,u,c,r
 # contact_groups admins
 register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
 }

File /opt/omd/sites/prod/etc/nagios/conf.d/bacula_prod-backup.cfg has one service per job defined

define service {
use bacula-prod-backup-generic-service 
service_description Backup: Studies2010-1
check_command check_bacula_at_prod-backup!744!1!1!Studies2010-1
}
define service {
use bacula-prod-backup-generic-service 
service_description Backup: Studies2010-2
check_command check_bacula_at_prod-backup!744!1!1!Studies2010-2
}
define service {
use bacula-prod-backup-generic-service 
service_description Backup: Studies2010-3
check_command check_bacula_at_prod-backup!744!1!1!Studies2010-3
}
and so on.

and file /opt/omd/sites/prod/etc/nagios/conf.d/commands.cfg has the following command defined

define command {
 command_name check_bacula_at_prod-backup
 command_line $USER2$/check_bacula_prod-backup.pl -H '$ARG1$' -w '$ARG2$' -c '$ARG3$' -j '$ARG4$'
}

Finally, Perl script /omd/sites/prod/local/lib/nagios/plugins/check_bacula_prod-backup.pl has bacula database connection details embedded (sql user with R/O privileges is fine). I got a copy of check_bacula.pl version 0.0.3 from Bacula website, submitted by Julian Hein NETWAYS GmbH and modified by Silver Salonen – good job gentlemen.


Bonus 3, monitoring NAS4FREE with OMD

Click here.

 

Thanks

And a final note: many thanks to Mathias Kettner and his team for providing this fantastic tool under the GPL license. Well done chaps.


Jan 05 2015
 

I’ve been experiencing problems with a Dell server running CentOS 6 and Bacula 5.2, hooked up to a Quantum Scalar i40 tape library with two LTO5 drives. The server has two HBAs: the first is used for the server disks (PERC-310mini) and the second is an LSI SAS2008 with an external SAS port connected to the tape library. More info about this setup in this post.

Problem: basically, after each server reboot autochanger device was missing.

After spending endless hours I ended up with a workaround. It’s not ideal; well, to be honest it’s a dirty hack, so if there is a better way of doing this I would very much appreciate you dropping a quick comment!

So if you can’t find the tape library changer under CentOS 6 with a Quantum Scalar i40, then read on…

But first, random picture from my library, it seems like she’s showing to her pal a funny cat picture on her phone.

insta-20

Background: the autochanger is managed via one of the drives. This is called the Control Path and is set once via the autochanger web interface.

From time to time Quantum Scalar i40 autochanger is not getting detected after server reboot. In order to detect it we need to rescan SCSI bus.

Let’s say the tape drives are on controller 0, channel 0, with ID 0 LUN 0 and ID 1 LUN 0:

# lsscsi
[0:0:0:0]    tape    HP       Ultrium 5-SCSI   Z64Z  /dev/st0 
[0:0:1:0]    tape    HP       Ultrium 5-SCSI   Z64Z  /dev/st1 
[1:0:32:0]   enclosu DP       BP12G+           1.00  -       
[1:2:0:0]    disk    DELL     PERC H310        2.12  /dev/sda

in which case the changer (aka Control Path) can be found on LUN 1 of one of the drives, but for some reason the OS does not detect it! This is the bit that puzzles me. I suspect it is because my Host Bus Adapters get different IDs after a reboot, i.e. sometimes the PERC is detected as host 0 and sometimes the SAS2008 gets it; the quoted example shows the latter case.

# echo 0 0 1 >  /sys/class/scsi_host/host0/scan
# echo 0 1 1 >  /sys/class/scsi_host/host0/scan

# lsscsi
[0:0:0:0]    tape    HP       Ultrium 5-SCSI   Z64Z  /dev/st0 
[0:0:1:0]    tape    HP       Ultrium 5-SCSI   Z64Z  /dev/st1 
[0:0:1:1]    mediumx QUANTUM  Scalar i40-i80   153G  /dev/sch0
[1:0:32:0]   enclosu DP       BP12G+           1.00  -       
[1:2:0:0]    disk    DELL     PERC H310        2.12  /dev/sda 

Solution: aka dirty hack. Upon reboot we grep the logs for the SCSI IDs of the tape drives and use them to rescan the bus at each drive’s SCSI ID, but with LUN 1 instead of LUN 0. This could be turned into an init script that runs just before bacula-sd starts, I guess…

This one liner will generate commands we need:

grep tape /var/log/messages*|cut -d" " -f7|awk -F: '{print "echo "$2" "$3" "1 " > /sys/class/scsi_host/host"$1"/scan"}'
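The same commands can be generated from lsscsi output instead of the logs, which avoids depending on the log line layout. This is a sketch only: changer_rescan_cmds is a made-up name, and it assumes the lsscsi layout shown above, with [host:channel:id:lun] in the first column and the device type in the second.

```shell
#!/bin/bash
# Sketch: read lsscsi-style lines on stdin and, for every tape drive at [H:C:I:L],
# print a command that scans LUN 1 on the same host, channel and id.
# changer_rescan_cmds is a hypothetical helper; it only prints, it does not rescan.
changer_rescan_cmds() {
    awk '$2 == "tape" {
        gsub(/\[|\]/, "", $1)      # strip the square brackets
        split($1, a, ":")          # a[1]=host a[2]=channel a[3]=id a[4]=lun
        printf "echo \"%s %s 1\" > /sys/class/scsi_host/host%s/scan\n", a[2], a[3], a[1]
    }'
}
```

Running lsscsi | changer_rescan_cmds prints the lines for review; once they look right they can be piped to sh as root.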

Double-check those lines and run them.

Restart bacula storage daemon

service bacula-sd restart

Useful commands:

cat /proc/scsi/sg/device_hdr /proc/scsi/sg/devices
host	chan	id	lun	type	opens	qdepth	busy	online
0	0	0	0	1	1	254	0	1
0	0	1	0	1	1	254	0	1
1	0	32	0	13	1	256	0	1
1	2	0	0	0	1	256	0	1
0	0	1	1	8	1	254	1	1

# sg_scan
/dev/sg0: scsi0 channel=0 id=32 lun=0
/dev/sg1: scsi0 channel=2 id=0 lun=0
/dev/sg2: scsi1 channel=0 id=5 lun=0
/dev/sg3: scsi1 channel=0 id=7 lun=0
/dev/sg4: scsi1 channel=0 id=7 lun=1

tapeinfo -f /dev/sg2

Source:
How do I rescan the SCSI bus to add or remove a SCSI device without rebooting the computer

https://access.redhat.com/site/solutions/3941

I know more about SCSI now than I ever wished to know.