A few days back I was asked to optimize one of our Elasticsearch nodes. I identified disk I/O as one of the bottlenecks, and went on to solve it with a 5-disk RAID 5 array: the capacity of four disks striped, with parity distributed across all five.
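For context, here is a quick way to confirm that disk I/O really is the bottleneck before reaching for RAID. This is my own hedged example, not part of the original session (iostat comes from the sysstat package):

$ iostat -x 5
# Watch %util and await on the EBS device: sustained %util near 100%
# with a climbing await means the disk, not the CPU, is the limit.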
SETTING UP RAID LEVEL 5
INSTANCE
~$ ec2-describe-instances | grep XX.XX.XX.XXX
INSTANCE i-axxxxxxx ami-0xxxxxxx running Pradeep-Keys 0 m1.large 2012-10-30T11:44:20+0000 us-east-1c monitoring-disabled XXXX vpc-sdsd2 subnet-1dsdsdsd ebs paravirtual xen sCFML1351596348533 sg-a99f80c5, sg-7ea0bf12 default false
NIC eni-4c1sdsds7 subnet-1dsdsd vpc-sdsd2 323323244233 in-use XX.XX.XX.XXX true
PRIVATEIPADDRESS XX.XX.XX.XXX
CREATING DISKS
~$ ec2-create-volume -s 128 -z us-east-1c
VOLUME vol-0 128 us-east-1c creating 2012-11-05T09:50:30+0000 standard
~$ ec2-create-volume -s 128 -z us-east-1c
VOLUME vol-4 128 us-east-1c creating 2012-11-05T09:51:08+0000 standard
~$ ec2-create-volume -s 128 -z us-east-1c
VOLUME vol-1 128 us-east-1c creating 2012-11-05T09:51:13+0000 standard
~$ ec2-create-volume -s 128 -z us-east-1c
VOLUME vol-5 128 us-east-1c creating 2012-11-05T09:51:17+0000 standard
~$ ec2-create-volume -s 128 -z us-east-1c
VOLUME vol-8 128 us-east-1c creating 2012-11-05T09:51:22+0000 standard
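The five identical calls could also be scripted; a minimal sketch, assuming the same EC2 API tools are on the PATH:

# Create the five 128G volumes in a loop instead of one by one
for i in $(seq 1 5); do
    ec2-create-volume -s 128 -z us-east-1c
done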
ATTACH THE VOLUMES
saurajeet.d:~$ ec2-attach-volume vol-0 -i i-axxxxxxx -d sdf
ATTACHMENT vol-0 i-axxxxxxx sdf attaching 2012-11-05T10:02:43+0000
saurajeet.d:~$ ec2-attach-volume vol-4 -i i-axxxxxxx -d sdg
ATTACHMENT vol-4 i-axxxxxxx sdg attaching 2012-11-05T10:03:35+0000
saurajeet.d:~$ ec2-attach-volume vol-1 -i i-axxxxxxx -d sdh
ATTACHMENT vol-1 i-axxxxxxx sdh attaching 2012-11-05T10:04:13+0000
saurajeet.d:~$ ec2-attach-volume vol-5 -i i-axxxxxxx -d sdi
ATTACHMENT vol-5 i-axxxxxxx sdi attaching 2012-11-05T10:04:49+0000
saurajeet.d:~$ ec2-attach-volume vol-8 -i i-axxxxxxx -d sdj
ATTACHMENT vol-8 i-axxxxxxx sdj attaching 2012-11-05T10:05:19+0000
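Volumes attached as sdf..sdj show up as /dev/xvdf../dev/xvdj under the Amazon Linux Xen kernel, so it is worth confirming the device nodes before building the array:

ls -l /dev/xvd[f-j]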
CREATING AND STARTING THE RAID ARRAY
[saurajeet.d@es1 ~]$ sudo mdadm --create --verbose /dev/md0 --level=5 --raid-devices=5 /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi /dev/xvdj
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: chunk size defaults to 512K
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: size set to 134216192K
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
[saurajeet.d@es1 ~]$
SET THE ARRAY TO ASSEMBLE AT BOOT BY POPULATING THE CONF FILE
[saurajeet.d@es1 ~]$ sudo mdadm --detail --scan
ARRAY /dev/md0 metadata=1.2 spares=1 name=0 UUID=e47f9d29:cf9f0299:e2305a7c:0d03d014
[saurajeet.d@es1 ~]$ sudo vim /etc/mdadm.conf
[saurajeet.d@es1 ~]$ cat /etc/mdadm.conf
DEVICE /dev/xvd[f-j]
ARRAY /dev/md0 metadata=1.2 spares=1 name=0 UUID=e47f9d29:cf9f0299:e2305a7c:0d03d014
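Instead of hand-editing with vim, the scan output can be appended directly; a small shortcut, not from the original session:

echo 'DEVICE /dev/xvd[f-j]' | sudo tee /etc/mdadm.conf
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm.conf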
SET THE READ-AHEAD BUFFER TO 65536 SECTORS (32 MiB)
[saurajeet.d@es1 ~]$ sudo blockdev --setra 65536 /dev/md0
[saurajeet.d@es1 ~]$
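Note that a read-ahead value set with blockdev does not persist across reboots. A small sketch to verify the value and reapply it at boot, assuming this AMI still executes /etc/rc.local:

# Verify the current read-ahead, in 512-byte sectors
sudo blockdev --getra /dev/md0
# Reapply at every boot
echo 'blockdev --setra 65536 /dev/md0' | sudo tee -a /etc/rc.local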
THE DEFAULT PHYSICAL BLOCK SIZE WAS
[saurajeet.d@es1 ~]$ cat /sys/block/md0/queue/physical_block_size
512
CREATING A FILE SYSTEM ON THE ARRAY
[saurajeet.d@es1 ~]$ sudo mkfs.xfs /dev/md0
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/md0               isize=256    agcount=16, agsize=8388480 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=134215680, imaxpct=25
         =                       sunit=128    swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=65536, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

In case you are using an ext-based file system, you can calculate the optimal mkfs options with the calculator at http://uclibc.org/~aldot/mkfs_stride.html.
You can also use tune2fs to adjust ext file system options after creation.
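For this array's geometry (512 KiB chunk, 4 data disks, 4 KiB blocks) the ext values work out to stride = 512 KiB / 4 KiB = 128 and stripe-width = 128 * 4 = 512, matching the sunit/swidth that mkfs.xfs picked above. A hedged sketch, in case you go with ext4 instead of XFS:

# Not part of the original setup: ext4 aligned to the RAID geometry
# stride       = chunk / block      = 512KiB / 4KiB = 128
# stripe-width = stride * data disks = 128 * 4      = 512
sudo mkfs.ext4 -b 4096 -E stride=128,stripe-width=512 /dev/md0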
BUILDING THE RAID
[saurajeet.d@es1 ~]$ sudo cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 xvdj[5] xvdi[3] xvdh[2] xvdg[1] xvdf[0]
      536864768 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/4] [UUUU_]
      [====>................]  recovery = 22.1% (29781288/134216192) finish=125.9min speed=13822K/sec

unused devices: <none>
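To follow the rebuild without retyping the command, you can poll it; a convenience not in the original session:

watch -n 5 cat /proc/mdstat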
YOU CAN CHECK THE HEALTH OF THE ARRAY
[saurajeet.d@es1 ~]$ sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Mon Nov  5 10:19:55 2012
     Raid Level : raid5
     Array Size : 536864768 (511.99 GiB 549.75 GB)
  Used Dev Size : 134216192 (128.00 GiB 137.44 GB)
   Raid Devices : 5
  Total Devices : 5
    Persistence : Superblock is persistent

    Update Time : Mon Nov  5 11:00:47 2012
          State : clean, degraded, recovering
 Active Devices : 4
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

 Rebuild Status : 30% complete

           Name : 0
           UUID : e47f9d29:cf9f0299:e2305a7c:0d03d014
         Events : 9

    Number   Major   Minor   RaidDevice State
       0     202       80        0      active sync   /dev/sdf
       1     202       96        1      active sync   /dev/sdg
       2     202      112        2      active sync   /dev/sdh
       3     202      128        3      active sync   /dev/sdi
       5     202      144        4      spare rebuilding   /dev/sdj
Wait until the RAID build is done; proceed only once `cat /proc/mdstat` shows the array 100% rebuilt.
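If you would rather script the wait than check by hand, a minimal sketch (keyed off the "recovery" line in /proc/mdstat shown above):

# Block until the recovery line disappears from /proc/mdstat
while grep -q recovery /proc/mdstat; do
    sleep 60
done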
We have a potential problem here. Storage devices are brought up while the kernel is loading, and sometimes /dev/md0 ends up renamed to /dev/md127 (or some other number). That happens because software RAID is assembled by a kernel module early in boot; if the initramfs contains no mdadm.conf at that point, the array is auto-assembled under a default name. We solve this by putting mdadm.conf into the initramfs. The initramfs is a vital piece of the boot process, so be careful: any mistake here may render the system unbootable.
NOTICE MATCHING VMLINUZ AND INITRD VERSION NUMBERS
[saurajeet.d@es1 ~]$ cd /boot
[saurajeet.d@es1 boot]$ ls
config-3.2.21-1.32.6.amzn1.x86_64  grub  initramfs-3.2.21-1.32.6.amzn1.x86_64.img  System.map-3.2.21-1.32.6.amzn1.x86_64  vmlinuz-3.2.21-1.32.6.amzn1.x86_64
[saurajeet.d@es1 boot]$ lsinitrd initramfs-3.2.21-1.32.6.amzn1.x86_64 | grep mdadm.conf
[saurajeet.d@es1 boot]$

Notice that no file called mdadm.conf was found inside the initramfs. We need to include this file using the dracut facility.
[root@es1 boot]# cp initramfs-3.2.21-1.32.6.amzn1.x86_64.img initramfs-3.2.21-1.32.6.amzn1.x86_64.img.1
[root@es1 boot]# dracut "initramfs-$(uname -r).img" $(uname -r)
E: Will not override existing initramfs (/boot/initramfs-3.2.21-1.32.6.amzn1.x86_64.img) without --force
[root@es1 boot]# dracut "initramfs-$(uname -r).img" $(uname -r) --force
E: Will not override existing initramfs (/boot/initramfs-3.2.21-1.32.6.amzn1.x86_64.img) without --force
[root@es1 boot]# ls -l initramfs-3.2.21-1.32.6.amzn1.x86_64.img
-rw-r--r-- 1 root root 5955095 Jun 25 22:51 initramfs-3.2.21-1.32.6.amzn1.x86_64.img
[root@es1 boot]# rm initramfs-3.2.21-1.32.6.amzn1.x86_64.img
rm: remove regular file `initramfs-3.2.21-1.32.6.amzn1.x86_64.img'? y
[root@es1 boot]# dracut "initramfs-$(uname -r).img" $(uname -r)
[root@es1 boot]# ls -l
total 18684
-rw-r--r-- 1 root root   60125 Jun 23 02:34 config-3.2.21-1.32.6.amzn1.x86_64
drwxr-xr-x 2 root root    4096 Sep  4 12:14 grub
-rw-r--r-- 1 root root 8963954 Nov  5 14:35 initramfs-3.2.21-1.32.6.amzn1.x86_64.img
-rw-r--r-- 1 root root 5955095 Nov  5 14:32 initramfs-3.2.21-1.32.6.amzn1.x86_64.img.1
-rw------- 1 root root 1402139 Jun 23 02:34 System.map-3.2.21-1.32.6.amzn1.x86_64
-rwxr-xr-x 1 root root 2739088 Jun 23 02:34 vmlinuz-3.2.21-1.32.6.amzn1.x86_64
[root@es1 boot]# sudo reboot^C
[root@es1 boot]# lsinitrd initramfs-3.2.21-1.32.6.amzn1.x86_64.img | grep mdadm
-rwxr-xr-x 1 root root 451344 Mar 16  2012 sbin/mdadm
-rwxr-xr-x 1 root root    106 Jan  6  2012 sbin/mdadm_auto
-rw-r--r-- 1 root root    106 Nov  5 10:23 etc/mdadm.conf
If the dracut run still doesn't include mdadm.conf, you can try setting mdadmconf="yes" in /etc/dracut.conf (see dracut.conf(5)).
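A minimal sketch of that setting (assuming the stock dracut.conf option name):

# /etc/dracut.conf
mdadmconf="yes"   # copy /etc/mdadm.conf into the generated initramfs

Then regenerate the initramfs with dracut as above.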
REBOOT AND TEST
saurajeet@saurajeet-THX:~$ ssh 10.10.XX.XXXX
Last login: Mon Nov  5 13:42:42 2012 from XX.XX.XX.XXX

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2012.03-release-notes/
There are 30 security update(s) out of 167 total update(s) available
Run "sudo yum update" to apply all updates.
Amazon Linux version 2012.09 is available.
[saurajeet.d@es1 ~]$ ls /dev/md
md/  md0
[saurajeet.d@es1 ~]$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 xvdh[2] xvdj[5] xvdf[0] xvdi[3] xvdg[1]
      536864768 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]

unused devices: <none>
[saurajeet.d@es1 ~]$ sudo mdadm -D /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Mon Nov  5 10:19:55 2012
     Raid Level : raid5
     Array Size : 536864768 (511.99 GiB 549.75 GB)
  Used Dev Size : 134216192 (128.00 GiB 137.44 GB)
   Raid Devices : 5
  Total Devices : 5
    Persistence : Superblock is persistent

    Update Time : Mon Nov  5 13:03:33 2012
          State : clean
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : 0
           UUID : e47f9d29:cf9f0299:e2305a7c:0d03d014
         Events : 26

    Number   Major   Minor   RaidDevice State
       0     202       80        0      active sync   /dev/sdf
       1     202       96        1      active sync   /dev/sdg
       2     202      112        2      active sync   /dev/sdh
       3     202      128        3      active sync   /dev/sdi
       5     202      144        4      active sync   /dev/sdj
[saurajeet.d@es1 ~]$
TROUBLESHOOTING
If the RAID is not assembled at boot time, try `mdadm -I`, which incrementally assembles arrays from the configuration file as member devices appear. If that fails for some reason, look up `mdadm -As` (assemble and scan), which manually scans the drives for RAID superblocks and attempts to assemble the array from them; both are sketched below.
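A hedged sketch of those two commands (the device names are assumed from this setup):

sudo mdadm -As                                 # --assemble --scan: assemble arrays listed in /etc/mdadm.conf
sudo mdadm --assemble /dev/md0 /dev/xvd[f-j]   # explicit manual assembly as a fallback

COMPLETE THE SETUP WITH MOUNTING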
[saurajeet.d@es1 ~]$ sudo cat /etc/fstab
#
LABEL=/     /           ext4    defaults,noatime        1 1
tmpfs       /dev/shm    tmpfs   defaults                0 0
devpts      /dev/pts    devpts  gid=5,mode=620          0 0
sysfs       /sys        sysfs   defaults                0 0
proc        /proc       proc    defaults                0 0
/dev/sda3   none        swap    sw,comment=cloudconfig  0 0
/dev/md0    /data       auto    defaults,noatime        0 0
MOUNT
[saurajeet.d@es1 ~]$ sudo mkdir /data
[saurajeet.d@es1 ~]$ sudo mount /dev/md0
[saurajeet.d@es1 ~]$ ls /data -l
total 0
[saurajeet.d@es1 ~]$ mount
/dev/xvda1 on / type ext4 (rw,noatime)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
/dev/md0 on /data type xfs (rw,noatime)
[saurajeet.d@es1 ~]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvda1            8.0G  1.1G  6.9G  14% /
tmpfs                 3.7G     0  3.7G   0% /dev/shm
/dev/md0              512G   33M  512G   1% /data
[saurajeet.d@es1 ~]$
WHY 512G
Capacity of RAID 5 = Smin * (n - 1) = 128G * (5 - 1) = 128G * 4 = 512G

FOLLOW UPS
http://www.tldp.org/HOWTO/Software-RAID-HOWTO-5.html
http://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5
http://zackreed.me/articles/38-software-raid-5-in-debian-with-mdadm
https://bugzilla.redhat.com/show_bug.cgi?id=606481
https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Deployment_Guide/sec-Verifying_the_Initial_RAM_Disk_Image.html
FUTURE TASKS
I can't include the Amazon CloudFormation code for this (you know why). Happy RAIDing!