Wed, 11/22/2006 - 01:49 — Tuba
Today a server failed, and all data seemed lost.
I was sent to the data centre to investigate and to salvage what I could. After a pleasant three-hour drive I was ready to get cracking. The server in question was set up to run Linux software RAID, with LVM2 on top of the array. Booting was done from a manually synced ext2 partition kept on both of the disks in the server. Not the ideal way of doing things, but usually quite OK for low-end solutions. Normally, we'd resolve a primary disk failure by booting from the secondary disk. Only that failed in this particular case.
The first step was to secure a disk image before tampering with the disk in any way. Once I had the image securely backed up (yay for dd and ssh), I did a quick analysis of the situation. The primary disk was beyond any hope of recovery: it made a nasty clicking sound, along with all the tell-tale grinding noises of repeated head crashes. Not good. The secondary disk was mechanically sound, but the boot partition had been smashed to bits at the superblock level; even the backup superblock was all zeroes. Luckily, the RAID partition seemed okay.
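For the curious, the imaging step can be sketched roughly as below. The exact command is not recorded in this post, so the block size, the conv options, the image_disk helper name, and the host and path in the usage comment are all assumptions for illustration.

```shell
# Hypothetical helper: stream a raw image of SRC into whatever command follows.
# conv=noerror,sync tells dd to continue past read errors and to pad short
# reads with zeros, so offsets in the image stay aligned with the disk.
# Side effect: the last block is padded too, so the image can be slightly
# larger than the source if its size is not a multiple of bs.
image_disk() {
    src=$1
    shift
    dd if="$src" bs=64k conv=noerror,sync | "$@"
}

# Usage (host and path are placeholders, not the ones actually used):
#   image_disk /dev/sdb ssh backup@backuphost 'cat > /srv/images/sdb.img'
```

Piping into an arbitrary command keeps the helper usable both over ssh and for a local copy.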
To make sure I had a nice platform for data recovery, I replaced the primary disk with a larger disk and made a single-disk OS install. Then I started on the fun part:
mdadm --examine --scan /dev/sdb2
This revealed a nice functional intact half of an array, which I proceeded to make active by redirecting the output of the above command into /etc/mdadm.conf and adding the missing pieces. This is what I wound up with:
DEVICE partitions
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=20c6826a:4e5edc36:d95c68f1:25d7cc76
   devices=/dev/sdb2,missing
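The skeleton of a file like this can be generated rather than typed. A minimal sketch, assuming a hypothetical helper named write_mdadm_conf; it only writes the DEVICE line and whatever ARRAY lines mdadm reports, so the devices= continuation line still has to be added by hand.

```shell
# Hypothetical helper: write a skeleton mdadm.conf for the given member devices.
# The first argument is the config file to write; the rest are passed to mdadm.
write_mdadm_conf() {
    conf=$1
    shift
    {
        echo 'DEVICE partitions'
        # Prints one ARRAY line per array found on the listed devices.
        mdadm --examine --scan "$@"
    } > "$conf"
}

# Usage, as root on the recovery install:
#   write_mdadm_conf /etc/mdadm.conf /dev/sdb2
```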
I was then able to restore md0 to life using the command "mdadm -A -s". Time for LVM2, which proved equally easy:
[root@host ~]# pvscan
PV /dev/md0 VG RAID1 lvm2 [74.38 GB / 32.00 MB free]
Total: 1 [74.38 GB] / in use: 1 [74.38 GB] / in no VG: 0 [0 ]
The LVM2 logical volumes seemed intact as well, as confirmed by:
[root@host ~]# lvscan
ACTIVE '/dev/RAID1/Root' [70.38 GB] inherit
ACTIVE '/dev/RAID1/Swap' [3.97 GB] inherit
A simple "mount /dev/RAID1/Root /mnt" later, I was restoring data, and after two hours of work the server was temporarily back online on the spare disk. Not particularly fast, but all data was recovered.
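The final copy step can be sketched along these lines. This is not the exact procedure used here: the rescue_copy helper and the destination path are hypothetical, and mounting read-only is an extra precaution assumed for the sketch rather than something done in the account above.

```shell
# Hypothetical helper: mount a recovered volume read-only, copy its contents
# to DEST, then unmount again. Returns the exit status of the copy.
rescue_copy() {
    lv=$1
    mnt=$2
    dest=$3
    mkdir -p "$mnt" "$dest" || return 1
    mount -o ro "$lv" "$mnt" || return 1
    # -a preserves ownership, permissions, timestamps and symlinks.
    cp -a "$mnt"/. "$dest"/
    status=$?
    umount "$mnt"
    return $status
}

# Usage, as root (destination path is a placeholder):
#   vgchange -ay RAID1    # only needed if lvscan lists the volumes as inactive
#   rescue_copy /dev/RAID1/Root /mnt /srv/restore
```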