Monday, November 5, 2012

SETTING UP RAID 5 IN THE CLOUD / Amazon Linux, EC2

A few days back I was asked to optimize one of the Elasticsearch nodes. I identified disk I/O as one of the bottlenecks and went on to solve it with 5 disks: 4 for stripes and 1 for parity.

 

SETTING RAID LEVEL 5

INSTANCE

~$ ec2-describe-instances | grep XX.XX.XX.XXX
INSTANCE i-axxxxxxx ami-0xxxxxxx running Pradeep-Keys 0 m1.large 2012-10-30T11:44:20+0000 us-east-1c  monitoring-disabled XXXX vpc-sdsd2 subnet-1dsdsdsd ebs paravirtual xen sCFML1351596348533 sg-a99f80c5, sg-7ea0bf12 default false
NIC eni-4c1sdsds7 subnet-1dsdsd vpc-sdsd2 323323244233 in-use XX.XX.XX.XXX true
PRIVATEIPADDRESS XX.XX.XX.XXX

CREATING DISKS

~$ ec2-create-volume -s 128 -z us-east-1c
VOLUME vol-0 128 us-east-1c creating 2012-11-05T09:50:30+0000 standard
~$ ec2-create-volume -s 128 -z us-east-1c
VOLUME vol-4 128 us-east-1c creating 2012-11-05T09:51:08+0000 standard
~$ ec2-create-volume -s 128 -z us-east-1c
VOLUME vol-1 128 us-east-1c creating 2012-11-05T09:51:13+0000 standard
~$ ec2-create-volume -s 128 -z us-east-1c
VOLUME vol-5 128 us-east-1c creating 2012-11-05T09:51:17+0000 standard
~$ ec2-create-volume -s 128 -z us-east-1c
VOLUME vol-8 128 us-east-1c creating 2012-11-05T09:51:22+0000 standard
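
Since the five volumes are identical, you could also create them in a loop. A minimal sketch, assuming the EC2 API tools are already configured with your credentials:

for i in 1 2 3 4 5; do
    ec2-create-volume -s 128 -z us-east-1c
done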

ATTACH THE VOLUMES

saurajeet.d:~$ ec2-attach-volume vol-0 -i i-axxxxxxx -d sdf 
ATTACHMENT vol-0 i-axxxxxxx sdf attaching 2012-11-05T10:02:43+0000
saurajeet.d:~$ ec2-attach-volume vol-4 -i i-axxxxxxx -d sdg
ATTACHMENT vol-4 i-axxxxxxx sdg attaching 2012-11-05T10:03:35+0000
saurajeet.d:~$ ec2-attach-volume vol-1 -i i-axxxxxxx -d sdh
ATTACHMENT vol-1 i-axxxxxxx sdh attaching 2012-11-05T10:04:13+0000
saurajeet.d:~$ ec2-attach-volume vol-5 -i i-axxxxxxx -d sdi
ATTACHMENT vol-5 i-axxxxxxx sdi attaching 2012-11-05T10:04:49+0000
saurajeet.d:~$ ec2-attach-volume vol-8 -i i-axxxxxxx -d sdj
ATTACHMENT vol-8 i-axxxxxxx sdj attaching 2012-11-05T10:05:19+0000
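
The attachments can be scripted the same way. A sketch that pairs each volume with its device name (it reuses the volume and instance IDs from above; substitute your own):

for pair in vol-0:sdf vol-4:sdg vol-1:sdh vol-5:sdi vol-8:sdj; do
    ec2-attach-volume "${pair%%:*}" -i i-axxxxxxx -d "${pair##*:}"
done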

CREATING AND STARTING THE RAID ARRAY

[saurajeet.d@es1 ~]$ sudo mdadm --create --verbose /dev/md0 --level=5 --raid-devices=5 /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi /dev/xvdj
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: chunk size defaults to 512K
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: size set to 134216192K
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
[saurajeet.d@es1 ~]$

SET THE ARRAY TO ASSEMBLE AT BOOT BY POPULATING THE CONF FILE

[saurajeet.d@es1 ~]$ sudo mdadm --detail --scan 
ARRAY /dev/md0 metadata=1.2 spares=1 name=0 UUID=e47f9d29:cf9f0299:e2305a7c:0d03d014
  
[saurajeet.d@es1 ~]$ sudo vim /etc/mdadm.conf
[saurajeet.d@es1 ~]$ cat /etc/mdadm.conf 
DEVICE /dev/xvd[f-j]
ARRAY /dev/md0 metadata=1.2 spares=1 name=0 UUID=e47f9d29:cf9f0299:e2305a7c:0d03d014
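
Instead of pasting the scan output into vim by hand, you can write the file directly. A sketch that reproduces the mdadm.conf shown above:

echo 'DEVICE /dev/xvd[f-j]' | sudo tee /etc/mdadm.conf
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm.conf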

SET THE READ-AHEAD BUFFER TO 65536 SECTORS (32 MiB)

[saurajeet.d@es1 ~]$ sudo blockdev --setra 65536 /dev/md0
[saurajeet.d@es1 ~]$
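
You can verify the new value with the matching --getra flag (reported in 512-byte sectors). Note that the setting does not survive a reboot; one option, as a sketch, is to re-apply it from a boot script such as /etc/rc.local:

sudo blockdev --getra /dev/md0
echo 'blockdev --setra 65536 /dev/md0' | sudo tee -a /etc/rc.local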

THE DEFAULT PHYSICAL BLOCK SIZE TAKEN WAS

[saurajeet.d@es1 ~]$ cat /sys/block/md0/queue/physical_block_size 
512

CREATING A FILESYSTEM ON THE ARRAY

[saurajeet.d@es1 ~]$ sudo mkfs.xfs /dev/md0
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/md0               isize=256    agcount=16, agsize=8388480 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=134215680, imaxpct=25
         =                       sunit=128    swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=65536, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
In case you are using an ext-based filesystem, you can calculate the optimal options with the calculator at http://uclibc.org/~aldot/mkfs_stride.html.
You can also use tune2fs to tune the filesystem options past this point.
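
For this array's geometry (512 KiB chunk, 4 data disks, 4 KiB filesystem blocks) the ext4 equivalents work out to stride = 512/4 = 128 and stripe-width = 128 x 4 = 512. A hypothetical invocation, only if you prefer ext4 over XFS here (the option may be spelled stripe_width on some e2fsprogs versions):

sudo mkfs.ext4 -b 4096 -E stride=128,stripe-width=512 /dev/md0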

BUILDING RAID

[saurajeet.d@es1 ~]$ sudo cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid5 xvdj[5] xvdi[3] xvdh[2] xvdg[1] xvdf[0]
 536864768 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/4] [UUUU_]
 [====>................] recovery = 22.1% (29781288/134216192) finish=125.9min speed=13822K/sec
 
unused devices: <none>
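
To follow the initial resync without retyping the command, something like:

watch -n 10 cat /proc/mdstat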

YOU CAN CHECK THE HEALTH OF THE ARRAY

[saurajeet.d@es1 ~]$ sudo mdadm --detail /dev/md0
/dev/md0:
 Version : 1.2
 Creation Time : Mon Nov 5 10:19:55 2012
 Raid Level : raid5
 Array Size : 536864768 (511.99 GiB 549.75 GB)
 Used Dev Size : 134216192 (128.00 GiB 137.44 GB)
 Raid Devices : 5
 Total Devices : 5
 Persistence : Superblock is persistent
Update Time : Mon Nov 5 11:00:47 2012
 State : clean, degraded, recovering 
 Active Devices : 4
Working Devices : 5
 Failed Devices : 0
 Spare Devices : 1
Layout : left-symmetric
 Chunk Size : 512K
Rebuild Status : 30% complete
Name : 0
 UUID : e47f9d29:cf9f0299:e2305a7c:0d03d014
  
 Events : 9
Number Major Minor RaidDevice State
 0 202 80 0 active sync /dev/sdf
 1 202 96 1 active sync /dev/sdg
 2 202 112 2 active sync /dev/sdh
 3 202 128 3 active sync /dev/sdi
 5 202 144 4 spare rebuilding /dev/sdj
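
The resync speed shown earlier (about 14 MB/s) is throttled by the md sysctls. If you want the build to finish sooner at the cost of foreground I/O, raising the limits is one option (values are in KB/s; revert them once the rebuild completes):

sudo sysctl -w dev.raid.speed_limit_min=50000
sudo sysctl -w dev.raid.speed_limit_max=200000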

NOW THE RAID BUILD-UP IS DONE. IF NOT, WAIT UNTIL `cat /proc/mdstat` SHOWS THE REBUILD AT 100%.

We have a potential problem. Storage devices are brought up during the kernel boot process, and sometimes the array comes up renamed from md0 to md127 (or some other number). That happens because software RAID is assembled by a kernel module before the root filesystem, and hence /etc/mdadm.conf, is available, so it falls back to an auto-generated name. We solve this problem by putting mdadm.conf into the initramfs.
The initramfs is a vital piece of the boot process. Any screw-up here may render the system unbootable.
NOTICE THE MATCHING VMLINUZ AND INITRD VERSION NUMBERS
[saurajeet.d@es1 ~]$ cd /boot
[saurajeet.d@es1 boot]$ ls
config-3.2.21-1.32.6.amzn1.x86_64 grub initramfs-3.2.21-1.32.6.amzn1.x86_64.img System.map-3.2.21-1.32.6.amzn1.x86_64 vmlinuz-3.2.21-1.32.6.amzn1.x86_64
[saurajeet.d@es1 boot]$ lsinitrd initramfs-3.2.21-1.32.6.amzn1.x86_64 | grep mdadm.conf 
[saurajeet.d@es1 boot]$
NOTICE THAT NO FILE CALLED mdadm.conf WAS FOUND. WE NEED TO INCLUDE IT USING THE DRACUT FACILITY.
[root@es1 boot]# cp initramfs-3.2.21-1.32.6.amzn1.x86_64.img initramfs-3.2.21-1.32.6.amzn1.x86_64.img.1 
[root@es1 boot]# dracut "initramfs-$(uname -r).img" $(uname -r)
E: Will not override existing initramfs (/boot/initramfs-3.2.21-1.32.6.amzn1.x86_64.img) without --force
[root@es1 boot]# dracut "initramfs-$(uname -r).img" $(uname -r) --force
E: Will not override existing initramfs (/boot/initramfs-3.2.21-1.32.6.amzn1.x86_64.img) without --force
[root@es1 boot]# ls -l initramfs-3.2.21-1.32.6.amzn1.x86_64.img
-rw-r--r-- 1 root root 5955095 Jun 25 22:51 initramfs-3.2.21-1.32.6.amzn1.x86_64.img
[root@es1 boot]# rm initramfs-3.2.21-1.32.6.amzn1.x86_64.img
rm: remove regular file `initramfs-3.2.21-1.32.6.amzn1.x86_64.img'? y
[root@es1 boot]# dracut "initramfs-$(uname -r).img" $(uname -r)
[root@es1 boot]# ls -l
total 18684
-rw-r--r-- 1 root root 60125 Jun 23 02:34 config-3.2.21-1.32.6.amzn1.x86_64
drwxr-xr-x 2 root root 4096 Sep 4 12:14 grub
-rw-r--r-- 1 root root 8963954 Nov 5 14:35 initramfs-3.2.21-1.32.6.amzn1.x86_64.img
-rw-r--r-- 1 root root 5955095 Nov 5 14:32 initramfs-3.2.21-1.32.6.amzn1.x86_64.img.1
-rw------- 1 root root 1402139 Jun 23 02:34 System.map-3.2.21-1.32.6.amzn1.x86_64
-rwxr-xr-x 1 root root 2739088 Jun 23 02:34 vmlinuz-3.2.21-1.32.6.amzn1.x86_64
[root@es1 boot]# sudo reboot^C
[root@es1 boot]# lsinitrd initramfs-3.2.21-1.32.6.amzn1.x86_64.img | grep mdadm
-rwxr-xr-x 1 root root 451344 Mar 16 2012 sbin/mdadm
-rwxr-xr-x 1 root root 106 Jan 6 2012 sbin/mdadm_auto
-rw-r--r-- 1 root root 106 Nov 5 10:23 etc/mdadm.conf
You can try setting
mdadmconf="{yes|no}"
in /etc/dracut.conf if the dracut run still does not include mdadm.conf.
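
A minimal sketch of that approach (check your dracut version's man page for the exact option name):

echo 'mdadmconf="yes"' | sudo tee -a /etc/dracut.conf
sudo dracut --force "/boot/initramfs-$(uname -r).img" "$(uname -r)"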

REBOOT AND TEST

saurajeet@saurajeet-THX:~$ ssh 10.10.XX.XXXX
Last login: Mon Nov 5 13:42:42 2012 from XX.XX.XX.XXX
__| __|_ )
 _| ( / Amazon Linux AMI
 ___|\___|___|
  https://aws.amazon.com/amazon-linux-ami/2012.03-release-notes/
  
There are 30 security update(s) out of 167 total update(s) available
Run "sudo yum update" to apply all updates.
Amazon Linux version 2012.09 is available.
[saurajeet.d@es1 ~]$ ls /dev/md
md/ md0 
[saurajeet.d@es1 ~]$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid5 xvdh[2] xvdj[5] xvdf[0] xvdi[3] xvdg[1]
 536864768 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
 
unused devices: <none>
[saurajeet.d@es1 ~]$ sudo mdadm -D /dev/md0
/dev/md0:
 Version : 1.2
 Creation Time : Mon Nov 5 10:19:55 2012
 Raid Level : raid5
 Array Size : 536864768 (511.99 GiB 549.75 GB)
 Used Dev Size : 134216192 (128.00 GiB 137.44 GB)
 Raid Devices : 5
 Total Devices : 5
 Persistence : Superblock is persistent
Update Time : Mon Nov 5 13:03:33 2012
 State : clean 
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
 Spare Devices : 0
Layout : left-symmetric
 Chunk Size : 512K
Name : 0
 UUID : e47f9d29:cf9f0299:e2305a7c:0d03d014
  
 Events : 26
Number Major Minor RaidDevice State
 0 202 80 0 active sync /dev/sdf
 1 202 96 1 active sync /dev/sdg
 2 202 112 2 active sync /dev/sdh
 3 202 128 3 active sync /dev/sdi
 5 202 144 4 active sync /dev/sdj
[saurajeet.d@es1 ~]$

TROUBLESHOOTING

If the RAID is not built up at boot time, try `mdadm -I` (incremental assembly), which adds newly appearing devices to the appropriate array based on the configuration file. If that fails for some reason, look up `mdadm -As` (assemble and scan), which scans the drives' superblocks and attempts to assemble the array from them.
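
A couple of assembly commands worth keeping handy (same member devices as above):

# Assemble every array listed in /etc/mdadm.conf (or found by scanning superblocks)
sudo mdadm --assemble --scan
# Or assemble /dev/md0 explicitly from its member devices
sudo mdadm --assemble /dev/md0 /dev/xvd[f-j]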

COMPLETE THE SETUP WITH MOUNTING

[saurajeet.d@es1 ~]$ sudo cat /etc/fstab
#
LABEL=/ / ext4 defaults,noatime 1 1
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
/dev/sda3 none swap sw,comment=cloudconfig 0 0
/dev/md0 /data auto defaults,noatime 0 0
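
If you prefer not to edit fstab by hand, appending the line works too (a sketch; double-check the file before rebooting, since a bad fstab entry can stall the boot):

echo '/dev/md0 /data auto defaults,noatime 0 0' | sudo tee -a /etc/fstab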

MOUNT

[saurajeet.d@es1 ~]$ sudo mkdir /data
[saurajeet.d@es1 ~]$ sudo mount /dev/md0
[saurajeet.d@es1 ~]$ ls /data -l
total 0
[saurajeet.d@es1 ~]$ mount
/dev/xvda1 on / type ext4 (rw,noatime)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
/dev/md0 on /data type xfs (rw,noatime)
[saurajeet.d@es1 ~]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 8.0G 1.1G 6.9G 14% /
tmpfs 3.7G 0 3.7G 0% /dev/shm
/dev/md0 512G 33M 512G 1% /data
[saurajeet.d@es1 ~]$

WHY 512G

Capacity of RAID 5 = S_min x (n - 1) = 128 GB x (5 - 1) = 128 GB x 4 = 512 GB

FOLLOW UPS

http://www.tldp.org/HOWTO/Software-RAID-HOWTO-5.html
http://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5
http://zackreed.me/articles/38-software-raid-5-in-debian-with-mdadm
https://bugzilla.redhat.com/show_bug.cgi?id=606481
https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Deployment_Guide/sec-Verifying_the_Initial_RAM_Disk_Image.html

FUTURE TASKS

CAN'T INCLUDE THE CODE FOR AMAZON CLOUDFORMATION. YOU KNOW WHY. HAPPY RAIDING!