Sep 23, 2016
 

Intro

Back in 2011 we bought a Coraid SR2421, a simple, fast and relatively cheap network storage device based on AoE (ATA over Ethernet) technology. AoE is a lightweight, secure protocol that lets the hardware perform at maximum capacity by sitting directly on top of the Ethernet layer. It's fast because there is no routing and no IP, TCP, or iSCSI overhead to slow down traffic. Quite cool if you ask me.

In 2015 the company behind Coraid went bust, but I kept using this box as it was (and still is!) rock solid. Now Coraid is apparently back: its intellectual property and trademark were purchased by SouthSuite, a new company started by the creator of AoE. They changed their model, though. Instead of selling appliances they sell software only, which makes sense: you don't need a middleman selling you pricey hardware when you can get the same equipment directly from the manufacturer for less. With Coraid software you could even buy used hardware and still get better performance than from expensive proprietary SAN devices.

Some features of Coraid:

  • Mix and match SSD, SAS, and SATA, including 4K Advanced Format and NVMe.
  • Unlimited capacity dictated by disk choice.
  • JBOD or RAID 0, 1, 5, 6, or 10 with multiple automatic global spares.
  • 1G or 10G SFP+, CX4, and twisted pair NIC options.
  • Command Line Interface enables simple provisioning, configuration, and monitoring.
  • No TCP/IP overhead.
  • Unlimited scale-out storage capacity and IOPS performance.
  • Supported natively in Linux kernel since 2005.
  • VMware HBAs are available.
  • 10x price and performance advantage over Fibre Channel and iSCSI.

What's even cooler is that they published guides on how to build your own Coraid box!

 

Connecting

The SR2421 has 2x 1Gbps network cards that can be connected to a switch if you plan on connecting more servers. Coraid will automagically use all interfaces to maximize throughput.

I've got a single Linux file server with 4x 1Gbps NICs: 2x NICs are directly connected to Coraid and the remaining 2x NICs are bonded and used to provide service. Important things to remember:

  • make sure you bring the Coraid-facing interfaces up and set the MTU to 9000 (jumbo frames) during system boot. Example script here; a minimal sketch also follows this list.
  • use an init script like this one here to mount stuff
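For reference, here's a rough sketch of what such a boot-time script can look like. It assumes the interface names (em1, em2) and the volume group name (coraid) from my setup, so adjust to match yours:

#!/bin/bash
# Bring the Coraid-facing NICs up with jumbo frames (no IP needed for AoE itself)
for nic in em1 em2; do
    ip link set dev "$nic" mtu 9000 up
done

# Load the AoE driver, discover the shelf and activate/mount the LVM volumes on it
modprobe aoe
aoe-discover
sleep 5                  # give the shelf a moment to answer
vgchange -ay coraid      # volume group backed by the Coraid LUNs
mount -a -t ext4         # or mount the relevant /etc/fstab entries one by one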

Note: a Linux box connected to Coraid doesn't need an IP address on the Coraid-facing network card/cards – all data communication happens at the lower Ethernet layer.

But you do need to set an IP on the Coraid-connected interface if you want the syslog feature to work, see the Monitoring section below.
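Something along these lines will do (the 10.0.0.x addresses match the Monitoring example further down, and em1 is just the Coraid-facing interface name on my box):

ip addr add 10.0.0.1/24 dev em1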

 

Coraid CLI commands

You can either hook up a KVM directly to the Coraid box or use the Coraid Ethernet Console (cec). Once you get to the CLI:

 

SR shelf 0> iostats -d             # disk I/O statistics
SR shelf 0> iostats -l             # LUN I/O statistics
SR shelf 0> sos                    # dump diagnostic info (useful for support)
SR shelf 0> ifstat -a              # Ethernet interface statistics
SR shelf 0> disks -a               # list the disks in the shelf
SR shelf 0> make 2 raid5 0.12-16   # create LUN 2 as RAID 5 from disks 0.12-0.16
SR shelf 0> make 3 raid5 0.17-21   # create LUN 3 as RAID 5 from disks 0.17-0.21

RAID levels

• raidL—A linear RAID
• raid0—A striped RAID
• raid1—A mirrored RAID
• raid5—A round-robin parity RAID
• raid6rs—A double fault-tolerant round-robin parity RAID using Reed-Solomon syndromes.
• raid10—A stripe of mirrors RAID.

JBOD

SR shelf 0> jbod 1.0
SR shelf 0> make 0 raidL 1.0
SR shelf 0> online 0

 

Creating LUN

SR shelf 0> make 2 raid5 0.12-16 
SR shelf 0> make 3 raid5 0.17-21
SR shelf 0> online 2 
SR shelf 0> online 3
SR shelf 0> label 2 Data Vol 3
SR shelf 0> list    
 0 15002.965GB online 'Data Vol 1'
 1 15002.965GB online 'Data Vol 2'
 2 12002.372GB online 'Data Vol 3'
 3 12002.372GB online 'Data Vol 4'


SR shelf 0> list -l
 0 15002.965GB online 'Data Vol 1'
  0.0   15002.965GB raid5 normal 
    0.0.0  normal   3000.593GB 0.0 
    0.0.1  normal   3000.593GB 0.1 
    0.0.2  normal   3000.593GB 0.2 
    0.0.3  normal   3000.593GB 0.3 
    0.0.4  normal   3000.593GB 0.4 
    0.0.5  normal   3000.593GB 0.5 
 1 15002.965GB online 'Data Vol 2'
  1.0   15002.965GB raid5 normal 
    1.0.0  normal   3000.593GB 0.6 
    1.0.1  normal   3000.593GB 0.7 
    1.0.2  normal   3000.593GB 0.8 
    1.0.3  normal   3000.593GB 0.9 
    1.0.4  normal   3000.593GB 0.10 
    1.0.5  normal   3000.593GB 0.11 
 2 12002.372GB online 'Data Vol 3'
  2.0   12002.372GB raid5 normal 
    2.0.0  normal   3000.593GB 0.12 
    2.0.1  normal   3000.593GB 0.13 
    2.0.2  normal   3000.593GB 0.14 
    2.0.3  normal   3000.593GB 0.15 
    2.0.4  normal   3000.593GB 0.16 
 3 12002.372GB online 'Data Vol 4'
  3.0   12002.372GB raid5 normal 
    3.0.0  normal   3000.593GB 0.17 
    3.0.1  normal   3000.593GB 0.18 
    3.0.2  normal   3000.593GB 0.19 
    3.0.3  normal   3000.593GB 0.20 
    3.0.4  normal   3000.593GB 0.21 


SR shelf 7> when
0.0 1.29% 235073 KBps 0:46:06 left

Hot Spares

SR shelf 0> spare                
SR shelf 0> spare 0.22-23   
SR shelf 0> rmspare 7.6
SR shelf 0> replace 8.0.1 7.12

Restricting access with mac and vlan commands

SR shelf 0> vlan 3 100
SR shelf 0> vlan 4 200
SR shelf 0> vlan 3
3	100

SRX shelf 7> mask -?
usage: mask lun ... [ +mac ... ] [ -mac ... ]
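So, for example, to restrict LUN 2 to a single initiator you would add that server's MAC address to the LUN's mask list (the MAC below is made up):

SR shelf 0> mask 2 +0026b90001aa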

Other useful commands

SR shelf 0> slotled 0 locate # or fault,rebuild,reset,spare

SR shelf 0> fans
FAN#      RPM
fan0     4623
fan1     4545
fan2     4591
fan3     3781
fan4     3970


SR shelf 0> power
PSU#    STATUS  TEMP   FAN1RPM
ps0         up   41C     12366
ps1         up   40C     12480

SR shelf 0> temp
LOCATION   TEMP
cpu         63C
ps0         41C
ps1         41C

# smart status

SR shelf 0> disks -s 
DISK     STATUS
0.0      normal
0.1      normal
0.2      normal
0.3      normal
0.4      normal
0.5      normal
0.6      normal
0.7      normal
0.8      normal
0.9      normal

SR shelf 0> eject 4
Are you sure you want to perform this action? y/n? [N]
Ejecting lun(s): 4

# eject command is useful when you want to move a LUN from one shelf to another without shutting down the SR

Using Coraid LUN as LVM backend

When you create a new LUN on the Coraid console, it should be detected by the Linux kernel:

# Just making sure we have aoe tools present
root@file-srv:~ # yum info aoetools
Name : aoetools
Arch : x86_64
Version : 36
Release : 3.el6
Size : 90 k
Repo : installed
From repo : epel
Summary : ATA over Ethernet Tools
URL : http://aoetools.sourceforge.net
License : GPLv2
Description : The aoetools are programs that assist in using ATA over Ethernet on
 : systems with version 2.6 and newer Linux kernels.


root@file-srv:~ # aoe-stat 
 e0.0 15002.964GB em2,em1 8704 up 
 e0.1 15002.964GB em2,em1 8704 up 
 e0.2 12002.371GB em1,em2 8704 up 
 e0.3 12002.371GB em1,em2 8704 up 

# other tools in the package: aoe-discover, aoe-version, aoe-mkshelf
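If a freshly created LUN doesn't show up straight away, a manual rescan usually does the trick:

root@file-srv:~ # aoe-discover
root@file-srv:~ # aoe-stat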

root@file-srv:~ # dmesg 

aoe: e0.2: setting 8704 byte data frames
aoe: 00259022357e e0.2 vace0 has 23442132477 sectors
 etherd/e0.2: unknown partition table
aoe: e0.3: setting 8704 byte data frames
aoe: 00259022357e e0.3 vace0 has 23442132477 sectors
 etherd/e0.3: unknown partition table

OK, so we are looking at working with devices e0.2 and e0.3:

root@file-srv:~ # parted /dev/etherd/e0.2 

(parted) mktable gpt                                                      
(parted) p                                                                
Model: Unknown (unknown)
Disk /dev/etherd/e0.2: 12.0TB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start  End  Size  File system  Name  Flags

(parted) mkpart primary ext4 0% 100%
(parted) set 1 lvm on                                                     
(parted) p                                                                
Model: Unknown (unknown)
Disk /dev/etherd/e0.2: 12.0TB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      1049kB  12.0TB  12.0TB               primary  lvm

(parted)                                                                  

We have a partition created; let's use it to expand an existing LVM volume group:

root@file-srv:~ # pvcreate /dev/etherd/e0.2p1 
  Physical volume "/dev/etherd/e0.2p1" successfully created

root@file-srv:~ # pvcreate /dev/etherd/e0.3p1 
  Physical volume "/dev/etherd/e0.3p1" successfully created

root@file-srv:~ # vgextend coraid /dev/etherd/e0.2p1
  Volume group "coraid" successfully extended

root@file-srv:~ # vgextend coraid /dev/etherd/e0.3p1
  Volume group "coraid" successfully extended

root@file-srv:~ # vgs
  VG        #PV #LV #SN Attr   VSize   VFree 
  coraid      4  14   0 wz--n-  49.12t 27.69t


root@file-srv:~ # vgdisplay -v coraid



--- Physical volumes ---
  PV Name               /dev/etherd/e0.1p1     
  PV UUID               OmPkNP-HJmi-xaeda-1obx-iFHW-n84f-7tbbfo
  PV Status             allocatable
  Total PE / Free PE    3576985 / 748850
   
  PV Name               /dev/etherd/e0.0p1     
  PV UUID               y2W3GJ-Rp0E-gxgE-asd-fpQM-DT26-n6Sz0C
  PV Status             allocatable
  Total PE / Free PE    3576985 / 786432
   
  PV Name               /dev/etherd/e0.2p1     
  PV UUID               f5pUm2-asqw-tO1q-962I-rGDJ-LzZi-b20zft
  PV Status             allocatable
  Total PE / Free PE    2861587 / 2861587
   
  PV Name               /dev/etherd/e0.3p1     
  PV UUID               94ABPd-ucTT-8vg6-WX5R-awew-Mw4b-dYdoHB
  PV Status             allocatable
  Total PE / Free PE    2861587 / 2861587
   

All done. Nice and easy. We can now get on with creating LVM volumes, etc.
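For completeness, carving a new logical volume out of the extended group and putting a filesystem on it looks something like this (the LV name, size and mount point are just examples):

root@file-srv:~ # lvcreate -L 2T -n data05 coraid
root@file-srv:~ # mkfs.ext4 /dev/coraid/data05
root@file-srv:~ # mkdir -p /srv/data05
root@file-srv:~ # mount /dev/coraid/data05 /srv/data05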

 

Monitoring

One important aspect to bear in mind is Coraid monitoring. My dirty way of solving that is to configure Coraid to send logs to syslog (UDP port 514) on a connected server and then run a cron job that alerts me on any Coraid-related event.

Preparing Coraid

# syslog -cp ServerDestinationIP CoraidSourceIP LocalSRinterface
syslog -cp 10.0.0.1 10.0.0.100 ether0
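On the Linux side rsyslog has to accept UDP syslog on port 514. On a stock CentOS 6 box something like this should do (the drop-in file name is arbitrary):

root@file-srv:~ # cat > /etc/rsyslog.d/coraid.conf <<'EOF'
$ModLoad imudp
$UDPServerRun 514
EOF
root@file-srv:~ # service rsyslog restart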

Preparing the Linux server, with a cron job script that runs every hour or so:

#!/bin/bash
#########################################################################################
# 2011-09-01: Quick and dirty Coraid monitoring
# Send bug reports, fixes, enhancements, t-shirts, money, beer & pizza to devnull at mielnet.pl
TODAYONLY=$(date "+%b %e")
grep "$TODAYONLY" /var/log/messages | grep shelf > /tmp/coraid.out
######### if coraid.out exists and size > 0
if [ -s /tmp/coraid.out ]
then
######## send it to me
	mail -s "Problem with ABC Coraid" [email protected] < /tmp/coraid.out
fi
######### cleanup (runs whether or not anything was found)
rm -f /tmp/coraid.out
######################## eof ############################################################
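And the matching crontab entry (the path is whatever you saved the script as):

# crontab -e
0 * * * * /root/bin/check_coraid.sh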

 

 

 

Why

I’m a big fan of this solution. If you are still not convinced, I’ll show you why:

SR shelf 0> uptime
 up 632 days, 17:36:38
Sep 16, 2016
 

My notes for installing Son of Grid Engine (SGE) on a commodity cluster.


Intro

Grab the following RPM packages from here:

gridengine-8.1.9-1.el6.x86_64.rpm
gridengine-debuginfo-8.1.9-1.el6.x86_64.rpm
gridengine-devel-8.1.9-1.el6.noarch.rpm
gridengine-drmaa4ruby-8.1.9-1.el6.noarch.rpm
gridengine-execd-8.1.9-1.el6.x86_64.rpm
gridengine-guiinst-8.1.9-1.el6.noarch.rpm
gridengine-qmaster-8.1.9-1.el6.x86_64.rpm
gridengine-qmon-8.1.9-1.el6.x86_64.rpm

(version 8.1.9 at the time of writing).

For your convenience, the following one-liner should fetch these for you 🙂

cd /tmp; for i in gridengine-8.1.9-1.el6.x86_64.rpm gridengine-debuginfo-8.1.9-1.el6.x86_64.rpm gridengine-devel-8.1.9-1.el6.noarch.rpm gridengine-drmaa4ruby-8.1.9-1.el6.noarch.rpm gridengine-execd-8.1.9-1.el6.x86_64.rpm gridengine-guiinst-8.1.9-1.el6.noarch.rpm gridengine-qmaster-8.1.9-1.el6.x86_64.rpm gridengine-qmon-8.1.9-1.el6.x86_64.rpm; do wget https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/$i;done

Pick one server that will serve as the master node in your cluster, referred to later as qmaster.
For smaller clusters it can happily run on a small VM (say 2x vCPU, 2GB RAM), maximising your resource usage.

Install EPEL on all nodes

rpm -Uvh http://dl.fedoraproject.org/pub/epel/epel-release-latest-6.noarch.rpm

Install prerequisites on all nodes

yum install -y perl-Env.noarch perl-Exporter.noarch perl-File-BaseDir.noarch perl-Getopt-Long.noarch perl-libs perl-POSIX-strptime.x86_64 perl-XML-Simple.noarch jemalloc munge-libs hwloc lesstif csh ruby xorg-x11-fonts xterm java xorg-x11-fonts-ISO8859-1-100dpi xorg-x11-fonts-ISO8859-1-75dpi mailx

Install GridEngine packages on all nodes

cd /tmp/
yum localinstall gridengine-*

Install Qmaster

cd /opt/sge
./install_qmaster

Accepting the defaults should just work, though you might want to run it under a different user than r00t, so:

"Please enter a valid user name >> sgeadmin"

Make sure to add GridEngine to the global environment:

cp /opt/sge/default/common/settings.sh /etc/profile.d/sge.sh

NFS export SGE root to nodes in your cluster

vim /etc/exports

/opt/sge 10.10.80.0/255.255.255.0(rw,no_root_squash,sync,no_subtree_check,nohide)

and mount the share on the exec nodes

vim /etc/fstab

qmaster:/opt/sge 	/opt/sge nfs	tcp,intr,noatime	0	0
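Then re-export the filesystems on qmaster and mount the share on each exec node, something like:

# on qmaster
exportfs -ra

# on every exec node
mkdir -p /opt/sge
mount /opt/sge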

 

Installing exec nodes

cd /opt/sge
./install_execd

Just go with the flow here. Once done you should be able to see your exec nodes:

# qhost 
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
execnode01              lx-amd64        8    2    8    8  0.12   15.6G    5.2G   20.0G  104.9M
execnode02              lx-amd64        8    2    8    8  0.00   15.7G    1.3G   21.1G     0.0
execnode03              lx-amd64        8    2    8    8  0.00   15.7G    1.4G   21.1G   18.6M

That means you can start submitting jobs to your cluster, either interactively with qlogin or qrsh, or as batch jobs with qsub.
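A trivial batch job is enough to check that everything is wired up (the job script below is just an example):

cat > hello.sh <<'EOF'
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
hostname
EOF

qsub hello.sh
qstat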

Adding queues (for FSL)

In most cases it’s enough to have a default queue called all.q

This example will define new queues with different priorities (nice levels):

# change defaults for all.q
qconf -sq all.q |\
    sed -e 's/bin\/csh/bin\/sh/' |\
    sed -e 's/posix_compliant/unix_behavior/' |\
    sed -e 's/priority              0/priority 20/' >\
    /tmp/q.tmp
qconf -Mq /tmp/q.tmp

# add other queues
sed -e 's/all.q/verylong.q/' /tmp/q.tmp >\
   /tmp/verylong.q
qconf -Aq /tmp/verylong.q

sed -e 's/all.q/long.q/' /tmp/q.tmp |\
   sed -e 's/priority *20/priority 15/' >\
   /tmp/long.q
qconf -Aq /tmp/long.q

sed -e 's/all.q/short.q/' /tmp/q.tmp |\
   sed -e 's/priority *20/priority 10/' >\
   /tmp/short.q
qconf -Aq /tmp/short.q

sed -e 's/all.q/veryshort.q/' /tmp/q.tmp |\
   sed -e 's/priority *20/priority 5/' >\
   /tmp/veryshort.q
qconf -Aq /tmp/veryshort.q
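A quick sanity check: qconf -sql should now list the new queues alongside all.q:

# qconf -sql
all.q
long.q
short.q
verylong.q
veryshort.q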

Monitoring your cluster

 

Use the qmon GUI or the following commands:

# qstat -f

queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@execnode01               BIP   0/0/8          0.12     lx-amd64
---------------------------------------------------------------------------------
all.q@execnode02               BIP   0/0/8          0.00     lx-amd64
---------------------------------------------------------------------------------
all.q@execnode03               BIP   0/0/8          0.00     lx-amd64
---------------------------------------------------------------------------------
long.q@execnode01              BIP   0/0/8          0.12     lx-amd64
---------------------------------------------------------------------------------
long.q@execnode02              BIP   0/0/8          0.00     lx-amd64
---------------------------------------------------------------------------------
long.q@execnode03              BIP   0/0/8          0.00     lx-amd64
---------------------------------------------------------------------------------
short.q@execnode01             BIP   0/0/8          0.12     lx-amd64
---------------------------------------------------------------------------------
short.q@execnode02             BIP   0/0/8          0.00     lx-amd64
---------------------------------------------------------------------------------
short.q@execnode03             BIP   0/0/8          0.00     lx-amd64
---------------------------------------------------------------------------------
verylong.q@execnode01          BIP   0/0/8          0.12     lx-amd64
---------------------------------------------------------------------------------
verylong.q@execnode02          BIP   0/0/8          0.00     lx-amd64
---------------------------------------------------------------------------------
verylong.q@execnode03          BIP   0/0/8          0.00     lx-amd64
---------------------------------------------------------------------------------
veryshort.q@execnode01         BIP   0/0/8          0.12     lx-amd64
---------------------------------------------------------------------------------
veryshort.q@execnode02         BIP   0/0/8          0.00     lx-amd64
---------------------------------------------------------------------------------
veryshort.q@execnode03         BIP   0/0/8          0.00     lx-amd64

# qhost -q

HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
execnode01              lx-amd64        8    2    8    8  0.12   15.6G    5.2G   20.0G  104.9M
   all.q                BIP   0/0/8         
   long.q               BIP   0/0/8         
   short.q              BIP   0/0/8         
   veryshort.q          BIP   0/0/8         
   verylong.q           BIP   0/0/8         
execnode02              lx-amd64        8    2    8    8  0.00   15.7G    1.3G   21.1G     0.0
   all.q                BIP   0/0/8         
   long.q               BIP   0/0/8         
   short.q              BIP   0/0/8         
   veryshort.q          BIP   0/0/8         
   verylong.q           BIP   0/0/8         
execnode03              lx-amd64        8    2    8    8  0.00   15.7G    1.4G   21.1G   18.6M
   all.q                BIP   0/0/8         
   long.q               BIP   0/0/8         
   short.q              BIP   0/0/8         
   veryshort.q          BIP   0/0/8         
   verylong.q           BIP   0/0/8