GitLab VM Disaster Recovery: Service Restoration with NBD Mount and Backup Restore
How to recover a failed GitLab VM using NBD, backup extraction, and troubleshooting Prometheus/NFS issues

Overview
A critical outage occurred when the GitLab VM suddenly entered rescue mode, causing a complete service interruption. GRUB recovery was impossible, but fortunately, key data was stored on NFS, preventing total data loss.
This post walks through the full process: mounting the damaged VM's qcow2 disk with NBD (Network Block Device), extracting the backup data, and restoring GitLab in a new environment. It also covers two problems hit during recovery: a Prometheus permission error on NFS and a GitLab version downgrade.
Incident Analysis
Key Symptoms
- GitLab VM booted into rescue mode
- GRUB recovery attempts failed
- System was unbootable
Positive Factors
- Core data was safely stored on NFS
- VM disk file (qcow2) was accessible
Recovery Process
1. Mounting the GitLab VM Disk with NBD
When direct access to the damaged VM was impossible, the qcow2 disk image was mounted using NBD.
# Load NBD module
modprobe nbd max_part=8
# Connect the GitLab VM qcow2 disk to an NBD device
qemu-nbd --connect=/dev/nbd0 /path/to/gitlab-vm.qcow2
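Since the qcow2 image may be the only copy of the failed system, it can be safer to attach it read-only. The following is a minimal sketch, assuming a `qemu-nbd` build that supports `--read-only`. The helper name `nbd_attach_cmds` is illustrative and not part of the original procedure; it only prints the commands so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Hypothetical helper: prints the commands for a read-only NBD attach.
# Nothing is executed here; review the output, then run it as root.
nbd_attach_cmds() {
    image=$1                 # path to the qcow2 image
    device=${2:-/dev/nbd0}   # NBD device to use (defaults to /dev/nbd0)
    printf 'modprobe nbd max_part=8\n'
    printf 'qemu-nbd --read-only --connect=%s %s\n' "$device" "$image"
}

nbd_attach_cmds /path/to/gitlab-vm.qcow2
```

With a read-only attach, the mount in the next step would also need `-o ro`. A filesystem that was not cleanly shut down may additionally need an option that skips journal replay (for ext4, `noload`).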
2. Check Partitions and Mount
# Check partitions on the connected disk
fdisk -l /dev/nbd0
# Mount the GitLab root partition (usually /dev/nbd0p1)
mkdir -p /mnt/gitlab-root
mount /dev/nbd0p1 /mnt/gitlab-root
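The root filesystem is not always on the first partition. As a rough sketch, the partition list from `lsblk` can be filtered for the largest partition, which is often (but not necessarily) the root filesystem. The function name `largest_partition` is an assumption made for illustration, and the heuristic does not apply to LVM layouts, where the volume group has to be activated first.

```shell
#!/bin/sh
# Hypothetical helper: reads "NAME SIZE TYPE" lines on stdin and prints
# the name of the largest entry whose type is "part".
largest_partition() {
    awk '$3 == "part" && $2 > max { max = $2; name = $1 }
         END { if (name != "") print name }'
}

# Usage (as root, after the disk is connected):
#   lsblk -bnro NAME,SIZE,TYPE /dev/nbd0 | largest_partition
```

Checking the filesystem type with `lsblk -f /dev/nbd0` before mounting is a useful sanity check on whatever the heuristic picks.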
3. Extract Backup Data
Copy the GitLab configuration, data, and home directories off the mounted disk.
# Back up GitLab configuration (8.7M)
cp -r /mnt/gitlab-root/etc/gitlab /tmp/gitlab-config-backup
# Back up GitLab data (1.6G) - main database and repositories
cp -r /mnt/gitlab-root/var/opt/gitlab /tmp/gitlab-data-backup
# Back up home directories (88K)
cp -r /mnt/gitlab-root/home /tmp/home-backup
Example of extracted data structure:
- gitlab-config-backup (8.7M): /etc/gitlab/ - GitLab configuration files
- gitlab-data-backup (1.6G): /var/opt/gitlab/ - database, repositories, logs, etc.
- home-backup (88K): /home/ - user home directories
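A plain `cp -r` does not keep file ownership or timestamps, which can matter if the copied directories are reused directly. One option is `cp -a` followed by a quick check that nothing was skipped. This is a sketch under assumptions: it relies on GNU `cp`, the helper name `copy_and_verify` is hypothetical, the destination must not exist yet, and comparing file counts is only a coarse check, not a content comparison.

```shell
#!/bin/sh
# Hypothetical helper: copies a directory with attributes preserved,
# then compares the number of regular files in source and destination.
# Assumes the destination path does not exist yet.
copy_and_verify() {
    src=$1
    dst=$2
    cp -a "$src" "$dst" || return 1
    src_count=$(find "$src" -type f | wc -l)
    dst_count=$(find "$dst" -type f | wc -l)
    [ "$src_count" -eq "$dst_count" ]
}

# Usage (as root, with the disk still mounted):
#   copy_and_verify /mnt/gitlab-root/etc/gitlab /tmp/gitlab-config-backup
```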
4. Unmount and Clean Up
umount /mnt/gitlab-root
qemu-nbd --disconnect /dev/nbd0
5. Restore GitLab Backup
Restore GitLab in a new environment using the extracted backup.
sudo gitlab-backup restore BACKUP=1754284731_2025_08_04_18.1.1
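The value passed to `BACKUP=` is the archive name without its `_gitlab_backup.tar` suffix, and its last field is the GitLab version that created the backup. The sketch below derives both from a file name. It assumes the archive follows the usual `<timestamp>_<date>_<version>_gitlab_backup.tar` naming; the function names are illustrative, and the full file name used in the usage lines is inferred from the backup ID above rather than taken from the incident.

```shell
#!/bin/sh
# Hypothetical helpers for reading a GitLab backup archive name.

# Strip the directory and the "_gitlab_backup.tar" suffix to get the
# value expected by BACKUP=.
backup_id() {
    basename "$1" _gitlab_backup.tar
}

# The version is the last underscore-separated field of the backup ID
# (it may carry an edition suffix such as "-ee").
backup_version() {
    id=$(backup_id "$1")
    printf '%s\n' "${id##*_}"
}

# Usage:
#   backup_id 1754284731_2025_08_04_18.1.1_gitlab_backup.tar
#   backup_version 1754284731_2025_08_04_18.1.1_gitlab_backup.tar
```

Comparing the extracted version with the installed package version before running the restore would have caught the mismatch described in step 7.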
6. Resolve Prometheus Permission Issues
During recovery, the Prometheus service failed with a permission error:
caller=query_logger.go:114 level=error component=activeQueryTracker
msg="Error opening query log file"
file=/mnt/nfs/gitlab/prometheus/data/queries.active
err="open /mnt/nfs/gitlab/prometheus/data/queries.active: permission denied"
panic: Unable to create mmap-ed active query log
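The cause was ownership: the Prometheus data directory on the NFS mount was not owned by the gitlab-prometheus user, so the service could not create its query log. Changing the ownership fixed it: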
sudo chown -R gitlab-prometheus:gitlab-prometheus /mnt/nfs/gitlab/prometheus/
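A small pre-check can catch this before the service is started. The sketch below assumes GNU `stat` (for `-c %U`), and the helper name `owned_by` is hypothetical. Note also that on NFS the `chown` itself can be refused when the export squashes root, and that the user's numeric ID generally has to match between client and server.

```shell
#!/bin/sh
# Hypothetical pre-check: succeeds only if the path is owned by the
# expected user. Relies on GNU stat's -c option.
owned_by() {
    path=$1
    expected=$2
    [ "$(stat -c %U "$path")" = "$expected" ]
}

# Usage:
#   owned_by /mnt/nfs/gitlab/prometheus/data gitlab-prometheus \
#       || echo "prometheus data dir has the wrong owner"
```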
7. GitLab Downgrade
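The original VM ran GitLab 18.1.1, but the new environment was mistakenly installed with the latest release, 18.2.1. A backup can only be restored onto the same GitLab version that created it, so the restore failed. The fix was to downgrade the package to 18.1.1, skipping the automatic backup that the package takes before changing versions:
# Skip auto-backup
sudo touch /etc/gitlab/skip-auto-backup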
# Downgrade to a specific version
sudo apt-get install -y gitlab-ce=18.1.1-ce.0 --allow-downgrades
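To keep the version from drifting again before the restore is finished, the package can be held after the downgrade. The following sketch prints the install and hold commands for a given version. The helper name `pin_cmds` is an assumption, and the `-ce.0` package suffix is taken from the command above and may differ for other editions or releases.

```shell
#!/bin/sh
# Hypothetical helper: prints the commands that install a specific
# gitlab-ce version and then hold it. Nothing is executed here.
pin_cmds() {
    version=$1   # e.g. 18.1.1, the version that created the backup
    printf 'sudo apt-get install -y gitlab-ce=%s-ce.0 --allow-downgrades\n' "$version"
    printf 'sudo apt-mark hold gitlab-ce\n'
}

pin_cmds 18.1.1
```

Once the restore has been verified, `sudo apt-mark unhold gitlab-ce` releases the hold so that planned upgrades can proceed.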
Lessons Learned and Improvements
- Enhanced Monitoring
  - Closely monitor services like Prometheus
  - Establish pre-checks for NFS permissions
- Backup Strategy Improvement
  - Regular full system and config file backups
  - Validate backup data with recovery tests
- Version Management Policy
  - Test upgrades in staging before production
  - Document rollback and downgrade procedures
- Infrastructure Architecture
  - Consider high availability to reduce single VM dependency
  - Evaluate migration to container-based deployments
Conclusion
This GitLab incident demonstrated the value of NBD disk mounting for data recovery in extreme failure scenarios. Even with GRUB unrecoverable, the data could be read directly from the qcow2 image, which is a major advantage of virtualization.
Keeping key data on NFS prevented total data loss, but it also exposed the complexity of NFS permission management. Issues like the Prometheus ownership error reinforce the need for detailed operational procedures.
Going forward, we plan to build a more robust backup and monitoring system and consider a more flexible, container-based architecture. While failures are inevitable, thorough preparation and systematic recovery can minimize downtime and ensure service continuity.