
Recovering from a power dip in your DRBD setup

Mar 03, 2023

As sysadmins, we've all been here. Well, hopefully not. But, if you stay in a country with an unreliable power grid like I do, then this article may be of some use to you.

I have had the utmost pleasure of being in a situation where my Supermicro 36-bay servers tripped due to power cuts and generators not being started on time. Believe me, getting the LVM disks back online and the LINSTOR service running again is quite the challenge.

First things first: when you SSH into your node, you'll probably see that the satellite is shown as offline when running

# linstor n l
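On my two-node setup the output looks roughly like this (the table layout is LINSTOR's own; the node types and addresses here are illustrative, so yours will differ):

╭─────────────────────────────────────────────────────────╮
┊ Node      ┊ NodeType  ┊ Addresses              ┊ State   ┊
╞═════════════════════════════════════════════════════════╡
┊ blackhole ┊ SATELLITE ┊ 10.0.0.11:3366 (PLAIN) ┊ OFFLINE ┊
┊ nebula    ┊ COMBINED  ┊ 10.0.0.12:3366 (PLAIN) ┊ Online  ┊
╰─────────────────────────────────────────────────────────╯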

You'll probably also see that all the disks are missing when checking

# linstor v l

As well as

# lsblk -f

On the satellite node.

Oh, I forgot to mention that my setup runs on Proxmox. So, with all this in mind, I was unable to create new VMs or restore any VM disks from my NFS-mounted backup storage.

My power failure was a bad one: I discovered that the LVM config had been zapped from /dev/sda.

After much cursing, swearing and research, the following steps helped me recover the replication setup and get the disks back online:

Firstly, we need to copy the first 2K of the disk to a backup file in order to preserve the header info:

# dd if=/dev/sda of=/somewhere/pve bs=1K count=2

I usually just stick the file into /root as I am the superuser on the system most of the time.
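A quick sanity check doesn't hurt here: confirm the copy is exactly 2048 bytes and eyeball the contents. On a healthy disk you'd see the LABELONE LVM label in the second 512-byte sector; on my zapped disk there was mostly nothing to see.

# ls -l /root/pve
# hexdump -C /root/pve | head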

There should also be a backup file from the Proxmox/DRBD setup under 

/etc/lvm/backup/pve

We then need to nuke the first section of the disk, recreate the physical volume with its old UUID, and restore the volume group config:

# dd if=/dev/zero bs=1K count=2 of=/dev/sda
# sync
# pvcreate -ff --uuid 06kvOm-xN2Y-iX1y-Tt69-LfFh-ltkG-TtRsgy --restorefile /etc/lvm/backup/pve /dev/sda
# vgcfgrestore --force drbdpool

You need to use the UUID from the /etc/lvm/backup/pve file; it lives in the physical_volumes section, usually around line 25.
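Rather than counting lines, you can grep it straight out of the backup file; the pv0 block holds the id for /dev/sda (the UUID shown is from my system):

# grep -A 2 'pv0 {' /etc/lvm/backup/pve
		pv0 {
			id = "06kvOm-xN2Y-iX1y-Tt69-LfFh-ltkG-TtRsgy"
			device = "/dev/sda"

Once the restore has run, pvs and vgs should show /dev/sda and the volume groups again:

# pvs
# vgs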

A little side note: sometimes the kernel module drbd_transport_tcp won't be loaded. Just load it with

# modprobe drbd_transport_tcp
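If you want the module to survive the next reboot (or the next power dip), systemd can load it automatically; dropping the name into /etc/modules-load.d is the standard Debian/Proxmox mechanism:

# echo drbd_transport_tcp > /etc/modules-load.d/drbd_transport_tcp.conf

You can confirm it's loaded with lsmod | grep drbd.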

You will probably see some sort of LVM error when checking

# systemctl status linstor-satellite

We still need to repair the LVM thin pool (think of it as an fsck for the pool metadata):

# lvchange -an pve/data
# lvconvert --repair pve/data
# lvchange -ay pve/data
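To check that the repair took, list the volumes including the hidden internal ones; pve/data should be active again, along with its _tmeta and _tdata metadata/data volumes (these names are the stock Proxmox ones, so adjust if your pool differs):

# lvs -a pve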

Then simply restart the satellite and controller services on both nodes:

# systemctl restart linstor-satellite && systemctl restart linstor-controller
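Give the services a few seconds, then verify that the satellite shows as Online again and that DRBD itself is happy; drbdadm status should report UpToDate on the recovered resources:

# linstor n l
# drbdadm status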

You may have to recreate some resources on the controller node, as you might see Diskless or other red errors when running

# linstor r l

or

# linstor v l

Delete and recreate each affected resource:

# linstor r d blackhole vm-142-disk-1
# linstor r c blackhole vm-142-disk-1 --storage-pool drbdpool

r = resource, d = delete and c = create. My nodes are called blackhole and nebula.
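If a whole batch of resources is red, a small shell loop saves some typing. This is only a sketch: vm-142-disk-1 is real, but vm-143-disk-0 is a made-up example, so substitute whatever linstor r l flags on your node:

# for res in vm-142-disk-1 vm-143-disk-0; do
>   linstor r d blackhole "$res"
>   linstor r c blackhole "$res" --storage-pool drbdpool
> done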

That should be it!

I sincerely hope that this article has helped you, and that I haven't missed anything. Please feel free to let me know if I've missed a step or made an error anywhere. This is my very first technical article, and I will be writing about more Linux topics fairly frequently for you all.

If you've enjoyed this and it has helped you, please consider buying me a coffee. I am very sleepy after getting through this recovery process :D
