Jun 24 2015
 

Intro

I’d been searching for a long time for a monitoring solution that fits my needs. I tested the usual suspects like Nagios and Zabbix and wasn’t impressed. At some point a friend of mine recommended OMD (The Open Monitoring Distribution); initial tests were promising, so I went with a full-scale deployment. Thanks Lysy!


After more than a year with it I can certainly say I love it: an elegant, flexible and simple solution. I’m currently monitoring around 100 machines and the whole thingy happily runs on a relatively lightweight (2 vCores, 2GB RAM) Debian-based VM.

Another useful feature comes with the bundled NagVis: it allows pictures of the racks to be used as backgrounds of “maps”, to visualise and quickly identify a problematic server:

nagvis_datacentre

 

Adding new host/checks is as simple as:

  • edit a single file ~/etc/check_mk/main.mk
  • re-generate configuration and restart services with cmk -O
  • that’s it

At the core of OMD lies Check_MK combined with Icinga – what I especially liked about it is how easy it is to define and run local checks. Think about it as custom scripts that do some stuff and return output to monitoring server. Only your imagination is a limit here, among the things I monitor with local checks are:

  • available tapes in tape library (warning when 2 tapes are left, critical when fewer than 2)
  • detailed status of RAID controller (pulling details with omreport, hpacucli or MegaCli64)
  • status of license servers
  • Grid Engine queue status
  • ThinLinc server status
  • and more…

 

Adding a local check is easy: assuming the check_mk agent is installed, simply drop your script into /usr/lib/check_mk_agent/local/. During the next run the agent will execute it and pass the result to the server. It doesn’t matter what scripting language you use, as long as the output is a text string matching check_mk expectations.
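To make "a text string matching check_mk expectations" concrete, here is a minimal skeleton. The service name Demo_Check and the helper function are made up for illustration. The line format (status, service name without spaces, perfdata or "-", free text) is the same one the checks later in this post use.

```shell
#!/bin/bash
# Minimal local check skeleton (illustrative; "Demo_Check" is a made-up name).
# Output format: <status> <service_name> <perfdata> <free text>
#   status:   0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
#   perfdata: name=value;warn;crit  (or "-" if there is none)

local_check_line() {
    local status=$1 name=$2 perfdata=$3 text=$4
    echo "$status $name $perfdata $text"
}

# Example: report the number of logged-in users, always OK.
USERS=$(who | wc -l)
local_check_line 0 Demo_Check "users=$USERS" "OK - $USERS users logged in"
```

Dropped into the local/ directory and made executable, a script like this should show up as one new service after the next inventory.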

Testing is also easy: simply telnet from the server to port 6556 of your client and you’ll find the local check results at the bottom of the returned output.
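If scrolling to the bottom gets tedious, the agent output can be filtered down to the local section only. This is a rough sketch that assumes the section header is <<<local>>> (optionally with a :option suffix). The function name is my own.

```shell
#!/bin/bash
# Print only the <<<local>>> section of check_mk agent output read from stdin.
extract_local_section() {
    awk '
        # Every section header toggles the flag: on for "local", off for anything else.
        /^<<<.*>>>$/ { in_local = ($0 ~ /^<<<local(:.*)?>>>$/); next }
        in_local     { print }
    '
}

# Usage sketch (assumes nc is installed; the hostname is a placeholder):
# nc client.example 6556 | extract_local_section
```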

Anyway, this is my online notepad so there we go:

 


 

Adding client:

Preferably automated with Puppet. For standalone hosts, a manual installation can be triggered with one of the following, depending on the OS:

wget http://monitoring.server/service/debian.sh && bash debian.sh
or
wget http://monitoring.server/service/rh.sh && bash rh.sh

 

I know, blindly executing bash scripts as root, yay! Here goes debian.sh:

wget http://monitoring.server/service/check-mk-agent_1.2.4p5-2_all.deb
wget http://monitoring.server/service/check-mk-agent-logwatch_1.2.4p5-2_all.deb
apt-get install xinetd snmpd -y
dpkg -i check-mk-agent_1.2.4p5-2_all.deb check-mk-agent-logwatch_1.2.4p5-2_all.deb
wget http://monitoring.server/service/snmpd.conf -O /etc/snmp/snmpd.conf
wget http://monitoring.server/service/snmpdDebian -O /etc/default/snmpd
wget http://monitoring.server/service/check_mk -O /etc/xinetd.d/check_mk
service snmpd restart

 

and rh.sh has

wget http://monitoring.server/service/check_mk-agent-1.2.4p5-1.noarch.rpm
wget http://monitoring.server/service/check_mk-agent-logwatch-1.2.4p5-1.noarch.rpm
yum install xinetd net-snmp -y
rpm -Uvh check_mk-agent-1.2.4p5-1.noarch.rpm check_mk-agent-logwatch-1.2.4p5-1.noarch.rpm
wget http://monitoring.server/service/snmpd.conf -O /etc/snmp/snmpd.conf
wget http://monitoring.server/service/snmpd -O /etc/sysconfig/snmpd
wget http://monitoring.server/service/check_mk -O /etc/xinetd.d/check_mk
service snmpd restart
chkconfig snmpd on
echo "Add this line to /etc/sysconfig/iptables before REJECT"
echo "-A INPUT -s 10.100.10.5 -j ACCEPT"
echo "and then"
echo "service iptables restart"
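About that "blindly executing bash scripts as root" bit: one cheap improvement is to compare the downloaded script against a known checksum before running it. A minimal sketch, assuming sha256sum is available on the client. The function name is mine, and how you distribute the expected checksum is up to you.

```shell
#!/bin/bash
# verify_sha256 FILE EXPECTED_HEX
# Returns 0 only if FILE exists and its SHA-256 matches EXPECTED_HEX.
verify_sha256() {
    local file=$1 expected=$2 actual
    [ -f "$file" ] || return 1
    actual=$(sha256sum "$file" | cut -d' ' -f1)
    [ "$actual" = "$expected" ]
}

# Usage sketch (the checksum is a placeholder you would obtain out of band):
# wget http://monitoring.server/service/debian.sh
# verify_sha256 debian.sh "<expected sha256>" && bash debian.sh
```

It doesn’t help if the checksum comes from the same plain HTTP server, of course, so keep it somewhere else (Puppet, a wiki, your notes).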

 

My copy of /etc/xinetd.d/check_mk has “only_from = monitoring.server” line in it so we do have some sort of protection.

Modify the local firewall ingress rules too, so that check_mk traffic on 6556/tcp is accepted from the monitoring server only.
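The rule echoed by rh.sh accepts everything from 10.100.10.5. Since the agent is served by xinetd over TCP on port 6556, a narrower rule limited to that port is enough. A small sketch that only prints the rule line for /etc/sysconfig/iptables; the helper name is made up.

```shell
#!/bin/bash
# Print an iptables rule line accepting check_mk agent traffic (TCP 6556)
# from a single source address. Nothing is applied; the line is only printed.
agent_rule() {
    local server_ip=$1
    echo "-A INPUT -s $server_ip -p tcp -m tcp --dport 6556 -j ACCEPT"
}

agent_rule 10.100.10.5
```

As before, the line has to go before the final REJECT rule, followed by a restart of iptables.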


 

Defining client on server:

ssh monitoring.server
sudo su - mysite
vim ~/etc/check_mk/main.mk

See documentation for more details on how to fill this file.

 


 

Local checks, starting with Monitoring Available Bacula Tapes

Cronjob gathers information about tape library status twice a day:

crontab -l
0 8,16 * * * /usr/local/bin/baculaTapes.sh

 

Actual cron script:

#!/bin/bash
echo "status slots Storage=Autochanger"| bconsole > /var/tmp/bacula.out

 

Now, we use check_mk agent capability to run local checks with the following script, vim /usr/lib/check_mk_agent/local/bacula_tapes.sh

 

#!/bin/bash
ERRORTAPES=`grep Error /var/tmp/bacula.out|wc -l`
APPENDTAPES=`egrep 'Append|Purged' /var/tmp/bacula.out|wc -l`
FULLTAPES=`grep Full /var/tmp/bacula.out|wc -l`

if [ $APPENDTAPES -lt 2 ] ; then
status=2
statustxt=CRITICAL
elif [ $APPENDTAPES -lt 3 ] || [ $ERRORTAPES -gt 0 ] ; then
status=1
statustxt=WARNING
else
status=0
statustxt=OK
fi

echo "$status Backup_AvailableTapes APPENDTAPES=$APPENDTAPES;3;1;25; $statustxt - $APPENDTAPES Usable Tapes, $FULLTAPES Full Tapes, $ERRORTAPES Error Tapes"
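One weakness of the cron-plus-cache-file pattern is that the check keeps reporting old numbers if the cron job silently stops. A possible guard is to look at the age of the cache file first. This is a sketch only: it assumes GNU stat, and the function name and the 17 hour limit (the longest gap between the 08:00 and 16:00 runs is 16 hours) are my own choices.

```shell
#!/bin/bash
# cache_is_fresh FILE MAX_AGE_SECONDS
# Returns 0 if FILE exists and was modified within the last MAX_AGE_SECONDS.
# Relies on GNU stat (-c %Y prints the modification time as epoch seconds).
cache_is_fresh() {
    local file=$1 max_age=$2 now mtime
    [ -f "$file" ] || return 1
    now=$(date +%s)
    mtime=$(stat -c %Y "$file") || return 1
    [ $(( now - mtime )) -le "$max_age" ]
}

# Possible use at the top of bacula_tapes.sh:
# if ! cache_is_fresh /var/tmp/bacula.out $((17 * 3600)); then
#     echo "3 Backup_AvailableTapes - UNKNOWN - /var/tmp/bacula.out is missing or stale"
#     exit 0
# fi
```

The same guard would fit the SGE and RAID checks further down, with a limit matching their cron interval.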

Ubuntu security patches

On some servers we don’t enable automated patching (especially Ubuntu!) but we want to know if there are security patches available in order to apply them in a controlled manner. We rely on update-notifier-common, so make sure it’s installed.

vim /usr/lib/check_mk_agent/local/ubuntu-security-updates.sh

#!/bin/bash
SECUPD=`/usr/lib/update-notifier/update-motd-updates-available|tail -n2|head -n1|cut -d" " -f1`
SECTEXT=`/usr/lib/update-notifier/update-motd-updates-available |xargs`

if [ $SECUPD -gt 2 ] ; then
        status=2
        statustxt=CRITICAL
elif [ $SECUPD -gt 1 ] ; then
        status=1
        statustxt=WARNING
else
        status=0
        statustxt=OK
fi


echo "$status UBUNTU_SECURITY PACKAGES=$SECUPD;1;5 $statustxt - $SECTEXT"

 


SAMBA status

/usr/lib/check_mk_agent/local/samba-test.sh

#!/bin/bash
SESSIONS=`smbstatus -b -d 0|egrep '^[0-9]'|wc -l|tr -d ' '`
if [ $SESSIONS -gt 200 ] ; then
        status=2
        statustxt=CRITICAL
else
        status=0
        statustxt=OK
fi


echo "$status SAMBA USERS=$SESSIONS;200;300 $statustxt - $SESSIONS user sessions, `smbstatus -d0 |grep version`"
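Most of these checks repeat the same if/elif ladder, and it is easy for the thresholds in the ladder to drift away from the warn;crit values in the perfdata (here the script goes critical above 200 while the perfdata says 200;300). One way to keep them in sync is a small helper. This is a sketch with made-up function names, using "greater or equal" comparisons, which differs slightly from the script above.

```shell
#!/bin/bash
# threshold_status VALUE WARN CRIT
# "Higher is worse": prints 2 if VALUE >= CRIT, 1 if VALUE >= WARN, else 0.
# A non-numeric or empty VALUE prints 3 (UNKNOWN) instead of breaking the test.
threshold_status() {
    local value=$1 warn=$2 crit=$3
    case $value in
        ''|*[!0-9]*) echo 3; return ;;
    esac
    if [ "$value" -ge "$crit" ]; then echo 2
    elif [ "$value" -ge "$warn" ]; then echo 1
    else echo 0
    fi
}

# Map the numeric status to the text used in the check output.
status_text() {
    case $1 in
        0) echo OK ;; 1) echo WARNING ;; 2) echo CRITICAL ;; *) echo UNKNOWN ;;
    esac
}

# Example with a fixed number standing in for the smbstatus count.
SESSIONS=250
WARN=200; CRIT=300
status=$(threshold_status "$SESSIONS" "$WARN" "$CRIT")
echo "$status SAMBA USERS=$SESSIONS;$WARN;$CRIT $(status_text "$status") - $SESSIONS user sessions"
```

The empty-value guard also covers the case where the command feeding the variable prints nothing, which would otherwise make [ ... -gt ... ] fail with a syntax error.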

 


 

FreeNX

/usr/lib/check_mk_agent/local/FreeNXusers.sh

#!/bin/bash
SESSIONS=`nxserver --list|grep abc|wc -l`
STRING=`nxserver --status|tail -n2|head -n1`
STRINGOK="NX> 110 NX Server is running"
# STRINGOK="110 NX Server is running"
if [ "$STRING" == "$STRINGOK" ];
then
status=0
statustxt=OK
else
status=2
statustxt=CRITICAL
fi
echo "$status FreeNX SESSIONS=$SESSIONS;25;30 $statustxt - $STRING, $SESSIONS users currently logged in"

 

SunRAY

 
/usr/lib/check_mk_agent/local/SunRayUsers.sh

#!/bin/bash
CARDS=`/opt/SUNWut/sbin/utsession -p|egrep 'Payflex'|wc -l`
SESSIONS=`/opt/SUNWut/sbin/utsession -p|egrep 'Payflex|pseudo'|wc -l`

if [ $SESSIONS -lt 1 ] ; then
status=2
statustxt=CRITICAL
elif [ $SESSIONS -lt 3 ] ; then
status=1
statustxt=WARNING
else
status=0
statustxt=OK
fi
echo "$status SunRays USERS=$CARDS;28;30 $statustxt - $SESSIONS Sunrays connected, $CARDS users currently logged in"

 


ThinLinc server

/usr/lib/check_mk_agent/local/ThinLinc.sh

#!/bin/bash
SESSIONS=`who|wc -l`
ThinLinc=`tail -n1 /var/log/thinlinc-user-licenses`
SERVICES=`for i in vsmserver vsmagent; do service $i status;done|grep running|wc -l`

if [ $SERVICES -lt 2 ] ; then
status=2
statustxt=CRITICAL
else
status=0
statustxt=OK
fi

echo "$status ThinLinc USERS=$SESSIONS;28;30 $statustxt - $SESSIONS sessions, $ThinLinc"

 


 

Grid Engine queue status

 

A dirty script to check the number of online workers. First, root dumps the queue status to a temp file (so check_mk doesn’t need access to the SGE environment); then the check_mk local check scans that temp file.

vim /usr/local/bin/sge.sh

#!/bin/bash
. /etc/profile.d/sge.sh
/apps/sge/bin/lx24-amd64/qstat -f -u "*" > /tmp/sge.status

Trigger every 5 minutes

crontab -l
*/5 * * * * /usr/local/bin/sge.sh
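Because the cron job and the agent run independently, the local check can in theory read /tmp/sge.status while it is still being written. A possible fix is to write to a temporary file and rename it into place. A sketch, with a function name of my own:

```shell
#!/bin/bash
# write_atomically DEST
# Reads stdin into a temp file next to DEST, then renames it over DEST.
# A rename within one filesystem is atomic, so readers see either the old
# or the new file, never a half-written one.
write_atomically() {
    local dest=$1 tmp
    tmp=$(mktemp "${dest}.XXXXXX") || return 1
    if cat > "$tmp"; then
        chmod 644 "$tmp"   # mktemp creates mode 0600; 644 is what a plain redirect usually gives
        mv -f "$tmp" "$dest"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Usage sketch inside /usr/local/bin/sge.sh:
# /apps/sge/bin/lx24-amd64/qstat -f -u "*" | write_atomically /tmp/sge.status
```

Note that this does not detect a failing qstat: an empty output would still replace the old file.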

Actual local check /usr/lib/check_mk_agent/local/sge.sh

#!/bin/bash
OFFLINE=`cat /tmp/sge.status|grep au|wc -l`
ONLINE=`cat /tmp/sge.status|grep BIP|wc -l`

if [ "$OFFLINE" == 0 ];
then
status=0
statustxt=OK
else
status=2
statustxt=CRITICAL
fi

echo "$status GridEngine OFFLINE=$OFFLINE;1;2 $statustxt - $ONLINE out of 3 SGE workers online"

 


 

Dell PERC RAID status

cronjob

*/15 * * * * /usr/local/bin/raid.sh

where /usr/local/bin/raid.sh

 

#!/bin/bash
omreport storage vdisk controller=0  > /var/tmp/raid.txt
omreport storage pdisk controller=0  >> /var/tmp/raid.txt

and actual check_mk local check /usr/lib/check_mk_agent/local/raid.sh has:

#!/bin/bash
ONLINEDISKS=`grep Online /var/tmp/raid.txt|wc -l`
VDISKSTATUS=`head -n10 /var/tmp/raid.txt|grep Status|awk -F" " '{print $3}'`

if [ $ONLINEDISKS -lt 3 ] ; then
status=2
statustxt=CRITICAL
else
status=0
statustxt=OK
fi

echo "$status PERC_H310_Mini_Status ONLINEDISKS=$ONLINEDISKS;3;1;25; $statustxt - $ONLINEDISKS Online Disks in 3x RAID5. Vdisk status $VDISKSTATUS"

 

HP Smart Array P400

vim /usr/local/bin/p400.sh

#!/bin/bash
/usr/sbin/hpacucli ctrl slot=0 physicaldrive all show status > /var/tmp/raid-p.txt
/usr/sbin/hpacucli ctrl slot=0 logicaldrive all show status > /var/tmp/raid-v.txt

crontab -l

*/5 * * * * /usr/local/bin/p400.sh

vim /usr/lib/check_mk_agent/local/raid.sh

#!/bin/bash
ONLINEDISKS=`grep physicaldrive /var/tmp/raid-p.txt|wc -l`
HOTSPARE=`grep spare /var/tmp/raid-p.txt|wc -l`
VDISKSTATUS=`grep logicaldrive /var/tmp/raid-v.txt`

if [ $ONLINEDISKS -lt 12 ] ; then
status=2
statustxt=CRITICAL
else
status=0
statustxt=OK
fi

# varname=value;warn;crit;min;max, while the values ;warn;crit;min;max are optional values.
# varname=value;warn;crit

echo "$status SmartArrayP400 ONLINEDISKS=$ONLINEDISKS;11;10 $statustxt - $ONLINEDISKS Online disks: $HOTSPARE HotSpare, Vdisk status:$VDISKSTATUS"

 


 

Supermicro LSI Mega RAID SAS 9240-8i

cron

*/5 * * * * /usr/local/bin/raid.sh

vim /usr/local/bin/raid.sh

#!/bin/bash
/usr/local/bin/MegaCli64 -PDList -aALL|grep "Firmware state"|wc -l > /var/tmp/raid-p.txt
/usr/local/bin/MegaCli64 -PDList -aALL|grep "Firmware state"|grep Hotspare|wc -l > /var/tmp/raid-hs.txt
/usr/local/bin/MegaCli64 -LDInfo -Lall -aALL > /var/tmp/raid-v.txt

vim /usr/lib/check_mk_agent/local/raid.sh

#!/bin/bash
ONLINEDISKS=`cat /var/tmp/raid-p.txt`
HOTSPARE=`cat /var/tmp/raid-hs.txt`
VDISKSTATUS=`cat /var/tmp/raid-v.txt|grep State|xargs`

if [ $ONLINEDISKS -lt 12 ] ; then
status=2
statustxt=CRITICAL
else
status=0
statustxt=OK
fi

# varname=value;warn;crit;min;max, while the values ;warn;crit;min;max are optional values.
# varname=value;warn;crit

echo "$status LSIMegaRAID9240-8i ONLINEDISKS=$ONLINEDISKS;13;12 $statustxt - $ONLINEDISKS Online disks: $HOTSPARE HotSpare, Vdisks: $VDISKSTATUS"

 


 

Bonus, monitoring ESXi with OMD

Not really a local check, but it took me a while to figure out how to pull status from a standalone ESXi server, so I’m adding it for reference. Add a user on ESXi and grant R/O access. Test from the CLI to make sure we’ve got it right:

/omd/sites/mysite/share/check_mk/agents/special/agent_vsphere -D --debug -u 'xxxxxxxxxx' -s 'xxxxxxxxx' -i hostsystem,virtualmachine,datastore,counters --timeout 5 esxhost01.domain

Add to etc/check_mk/main.mk:

#-- Custom checks for ESXi servers ---
datasource_programs.append((
"/omd/sites/mysite/share/check_mk/agents/special/agent_vsphere -u 'xxxxxxxxx' -s 'xxxxxxxxxxxx' "
"-i hostsystem,virtualmachine,datastore,counters --direct "
"--hostname '<HOST>' --timeout 5 <IP>", [ "esxhost01.domain" ]
))

 

Bonus 2: Monitoring execution of Bacula backup jobs

This mechanism pulls job status from the MySQL database and flags any job that hasn’t completed successfully within the last, say, 25 hours (for daily jobs).
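To give an idea of what "pulling job status from MySQL" boils down to, here is a rough sketch of such a query, wrapped in a shell function. It is not the query used by the Perl plugin. It assumes the standard Bacula catalog layout (a Job table with Name, JobStatus and EndTime columns, where JobStatus 'T' means terminated normally). The function name, database name and user are made up.

```shell
#!/bin/bash
# recent_success_query JOB_NAME HOURS
# Prints a SQL statement counting successful runs of JOB_NAME in the last HOURS hours.
# JOB_NAME is pasted into the SQL as-is, so only use it with trusted job names.
recent_success_query() {
    local job=$1 hours=$2
    printf "SELECT COUNT(*) FROM Job WHERE Name = '%s' AND JobStatus = 'T' AND EndTime > NOW() - INTERVAL %d HOUR;\n" "$job" "$hours"
}

recent_success_query Studies2010-1 25

# Usage sketch (database "bacula" and user "bacula_ro" are placeholders):
# mysql -N -B -u bacula_ro -p bacula -e "$(recent_success_query Studies2010-1 25)"
```

A result of 0 would mean no successful run inside the window, which is the condition to alert on.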

File /opt/omd/sites/prod/etc/nagios/conf.d/templates.cfg has the following generic service defined:

define service{
 name bacula-prod-backup-generic-service 
 host_name prod-backup
 service_description check-production-backup-jobs
 check_command check_bacula_at_prod-backup
 servicegroups backup-servers
 active_checks_enabled 1 ; Active service checks are enabled
 passive_checks_enabled 1 ; Passive service checks are enabled/accepted
 parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
 obsess_over_service 1 ; We should obsess over this service (if necessary)
 check_freshness 0 ; Default is to NOT check service 'freshness'
 notifications_enabled 1 ; Service notifications are enabled
 event_handler_enabled 1 ; Service event handler is enabled
 failure_prediction_enabled 1 ; Failure prediction is enabled
 process_perf_data 1 ; Process performance data
 retain_status_information 1 ; Retain status information across program restarts
 retain_nonstatus_information 1 ; Retain non-status information across program restarts
 notification_interval 0 ; Only send notifications on status change by default.
 is_volatile 0
 check_period 24x7
 normal_check_interval 5
 retry_check_interval 1
 max_check_attempts 4
 notification_period 24x7
 notification_options w,u,c,r
 # contact_groups admins
 register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
 }

File /opt/omd/sites/prod/etc/nagios/conf.d/bacula_prod-backup.cfg has one service per job defined

define service {
use bacula-prod-backup-generic-service 
service_description Backup: Studies2010-1
check_command check_bacula_at_prod-backup!744!1!1!Studies2010-1
}
define service {
use bacula-prod-backup-generic-service 
service_description Backup: Studies2010-2
check_command check_bacula_at_prod-backup!744!1!1!Studies2010-2
}
define service {
use bacula-prod-backup-generic-service 
service_description Backup: Studies2010-3
check_command check_bacula_at_prod-backup!744!1!1!Studies2010-3
}
and so on.

and file /opt/omd/sites/prod/etc/nagios/conf.d/commands.cfg has the following command defined

define command {
 command_name check_bacula_at_prod-backup
 command_line $USER2$/check_bacula_prod-backup.pl -H '$ARG1$' -w '$ARG2$' -c '$ARG3$' -j '$ARG4$'
}

Finally, Perl script /omd/sites/prod/local/lib/nagios/plugins/check_bacula_prod-backup.pl has bacula database connection details embedded (sql user with R/O privileges is fine). I got a copy of check_bacula.pl version 0.0.3 from Bacula website, submitted by Julian Hein NETWAYS GmbH and modified by Silver Salonen – good job gentlemen.


Bonus 3, monitoring NAS4FREE with OMD

Click here.

 

Thanks

And a final note: many thanks to Mathias Kettner and his team for providing this fantastic tool under the GPL license. Well done chaps.


Jun 16 2015
 

Intro

Sadly, ZoneMinder is not available in Debian 8. Here is my recipe for building a Debian package and installing ZoneMinder, currently version 1.28.1 from git. It is based on slightly amended instructions for Debian 8 Jessie from README.md.


Prerequisite

aptitude install -y apache2 mysql-server php5 php5-mysql build-essential libmysqlclient-dev libssl-dev libbz2-dev libpcre3-dev libdbi-perl libarchive-zip-perl libdate-manip-perl libdevice-serialport-perl libmime-perl libpcre3 libwww-perl libdbd-mysql-perl libsys-mmap-perl yasm automake autoconf libjpeg8 apache2-mpm-prefork libapache2-mod-php5 php5-cli libphp-serialization-perl libavcodec-dev libavformat-dev libswscale-dev libavutil-dev libv4l-dev libtool ffmpeg libnetpbm10-dev libavdevice-dev libmime-lite-perl dh-autoreconf dpatch

Build

cd /usr/src
git clone https://github.com/ZoneMinder/ZoneMinder.git zoneminder;
cd zoneminder
ln -s distros/debian8 debian
dpkg-checkbuilddeps # and apt install if anything is missing

vim debian/changelog # add a new version on top, with an email address matching your GPG key, and then

dpkg-buildpackage
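For reference, a new changelog stanza has a fairly strict shape (two spaces before the bullet, one space before the "--", two spaces between the maintainer and the date). The sketch below just prints such a stanza so it can be pasted on top of debian/changelog. The version string, name and email are placeholders, and date -R assumes GNU date.

```shell
#!/bin/bash
# changelog_entry PACKAGE VERSION DISTRIBUTION "NAME <EMAIL>" RFC2822_DATE
# Prints one Debian changelog stanza in the format dpkg-buildpackage expects.
changelog_entry() {
    local pkg=$1 version=$2 dist=$3 maint=$4 date=$5
    printf '%s (%s) %s; urgency=medium\n\n  * Local rebuild.\n\n -- %s  %s\n' \
        "$pkg" "$version" "$dist" "$maint" "$date"
}

# Placeholder values; the email should match your GPG key.
changelog_entry zoneminder 1.28.1-1local1 jessie "Jane Admin <jane@example.org>" "$(date -R)"
```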

 

Installation

One level up you should hopefully find a deb package matching the architecture of the build host:

ls -l /usr/src/zoneminder*.deb
cd /usr/src
for i in *.deb;do echo "------------- $i";dpkg -i "$i";done
apt-get install -f
dpkg-reconfigure zoneminder

 

Let’s see:

systemctl status zoneminder

 zoneminder.service - ZoneMinder CCTV recording and security system
 Loaded: loaded (/lib/systemd/system/zoneminder.service; enabled)
 Active: active (running) since Tue 2015-06-16 13:28:12 BST; 3min 21s ago
 Main PID: 5602 (zmdc.pl)
 CGroup: /system.slice/zoneminder.service
 /usr/bin/perl -wT /usr/bin/zmdc.pl startup
 /usr/bin/perl -wT /usr/bin/zmfilter.pl
 /usr/bin/perl -wT /usr/bin/zmaudit.pl -c
 /usr/bin/perl -wT /usr/bin/zmwatch.pl
 /usr/bin/perl -w /usr/bin/zmupdate.pl -c
 Jun 16 13:28:11 sambor.home zmdc[5632]: INF ['zmaudit.pl -c' started at 15/06/16 13:28:11]
 Jun 16 13:28:11 sambor.home zmfilter[5626]: INF [Scanning for events]
 Jun 16 13:28:11 sambor.home zmdc[5638]: INF ['zmwatch.pl' started at 15/06/16 13:28:11]
 Jun 16 13:28:11 sambor.home zmdc[5602]: INF ['zmwatch.pl' starting at 15/06/16 13:28:11, pid = 5638]
 Jun 16 13:28:12 sambor.home zmwatch[5638]: INF [Watchdog starting]
 Jun 16 13:28:12 sambor.home zmwatch[5638]: INF [Watchdog pausing for 30 seconds]
 Jun 16 13:28:12 sambor.home zmdc[5602]: INF ['zmupdate.pl -c' starting at 15/06/16 13:28:12, pid = 5645]
 Jun 16 13:28:12 sambor.home zmdc[5645]: INF ['zmupdate.pl -c' started at 15/06/16 13:28:12]
 Jun 16 13:28:12 sambor.home zmupdate[5645]: INF [Checking for updates]
 Jun 16 13:28:12 sambor.home zmupdate[5645]: INF [Got version: '1.28.1']

Well, well, well. Look who’s up and running!

Configuration

Alright, we can now connect to the server’s IP address and start adding monitors:

http://10.10.0.231/zm/index.php

Start with checking your cameras:

zmu -d /dev/video0 -q -v

It took me a while to experiment with the settings and find the optimal configuration. My camera is connected via a BT878 video card and it worked with:

Capture Method v4l2
 Device Channel 1
 Device Format NTSC
 Capture Palette Auto
 Target Colorspace 32bit
 Capture Width 768
 Capture Height 480
 # although Crop Capabilities in debug advertised it as 924 x 576
 

I use it in Mocord mode.

Last hint, make sure you have plenty of space to store data produced by Zoneminder.
I have changed the symlinks /usr/share/zoneminder/{events,images,temp} and pointed them to a dedicated ZFS filesystem with ample space:

for i in events images temp;do mkdir /tank/zoneminder/$i;ln -s /tank/zoneminder/$i /usr/share/zoneminder/$i;done
 chown www-data. /tank/zoneminder/*

Logging

Is default Zoneminder inclination to log bloody EVERYTHING to blooming EVERYWHERE driving you mad? Well, welcome to the club!

vim /etc/rsyslog.conf

and replace line 62 adding local1.!*

# *.*;auth,authpriv.none -/var/log/syslog
# As per http://www.zoneminder.com/wiki/index.php/Documentation#Logging
*.*;local1.!*;auth,authpriv.none -/var/log/syslog

And restart rsyslog

systemctl restart rsyslog