OLVM Engine Disaster Recovery Procedures
Peter Goldthorp, Dito. February 2022
This document describes procedures to recover from an OLVM engine failure.
Actions
- Gather information and identify severity.
- Run health checks on VMs provisioned using OLVM. Identify which (if any) are affected by the outage. OLVM is primarily used as a metadata repository and provisioning mechanism for VMs and storage. Once provisioned, the VMs should continue to operate independently.
- Identify the type of disaster. Examples include accidental deletion or shutdown of the VM where the OLVM Engine is installed, corruption of the OLVM engine database, or human error while configuring resources.
- Identify when the disaster occurred. It may have occurred some time ago. Check the log files in cloud storage to find the last valid backup.
- If the OLVM engine VM is running, check the log files in the /var/log/ovirt-engine directory.
- Notify affected individuals and take steps to prevent further damage. For example, if the issue was caused by a person or process make sure they have stopped.
- Take a snapshot of the OLVM engine VM’s boot disk for later analysis
- Develop and execute a recovery plan using the following scenarios for guidance.
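The log review step in the list above can be narrowed with a small filter. This is a minimal sketch, assuming the main engine log is engine.log inside /var/log/ovirt-engine and that each entry carries a space-delimited severity token such as ERROR or FATAL. The function name recent_engine_errors is illustrative and not part of OLVM.

```shell
#!/bin/bash
# Print the last N ERROR/FATAL lines from an engine log file.
# Assumption: the severity appears as a space-delimited token (e.g. " ERROR ").
recent_engine_errors() {
    local log_file=${1:-/var/log/ovirt-engine/engine.log}
    local count=${2:-20}
    if [[ ! -r $log_file ]]; then
        echo "cannot read ${log_file}" >&2
        return 1
    fi
    grep -E ' (ERROR|FATAL) ' "$log_file" | tail -n "$count"
}
```

Calling `recent_engine_errors /var/log/ovirt-engine/engine.log 50` prints up to 50 of the most recent matching entries. Their timestamps can help narrow down when the problem started.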
Scenarios
Accidental shutdown or deletion of the OLVM VM
Symptoms: Calls to the OLVM Engine UI fail with an HTTP 502 error originating from the load balancer associated with the instance group.
Diagnostics
- Log into the cloud console and navigate to the VM instances page. If the OLVM engine VM is in a stopped state, try starting it. Note it may take a few minutes for the load balancer to recognize the VM is back online.
- If the VM is running, check to make sure it is associated with an instance group.
- If the VM is missing it will need to be recreated
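The same diagnostics can be run from a shell instead of the cloud console. The sketch below assumes the gcloud CLI is installed and authenticated, and reuses the example values from this document (my-gcp-project, us-west2-c, olvm-w2, olvm-unmanaged-ig). The function names are illustrative only.

```shell
#!/bin/bash
# Example values from this document; override through the environment as needed.
PROJECT=${PROJECT:-my-gcp-project}
ZONE=${ZONE:-us-west2-c}
VM_NAME=${VM_NAME:-olvm-w2}
INSTANCE_GROUP=${INSTANCE_GROUP:-olvm-unmanaged-ig}

# Print the VM status (e.g. RUNNING, TERMINATED); fails if describe fails.
vm_status() {
    gcloud compute instances describe "$VM_NAME" \
        --project "$PROJECT" --zone "$ZONE" --format='value(status)'
}

# Succeed if the VM is listed as a member of the unmanaged instance group.
# Matches either a bare instance name or a full instance URL.
vm_in_group() {
    gcloud compute instance-groups unmanaged list-instances "$INSTANCE_GROUP" \
        --project "$PROJECT" --zone "$ZONE" --format='value(instance)' |
        grep -Eq "(^|/)${VM_NAME}\$"
}

# Map the three diagnostic outcomes to a one-line summary.
diagnose_vm() {
    local status
    if ! status=$(vm_status 2>/dev/null); then
        # describe failed: the VM is missing (or the caller lacks access)
        echo "missing: recreate the VM"
    elif [[ $status != RUNNING ]]; then
        echo "${status}: try starting the VM"
    elif vm_in_group; then
        echo "RUNNING: member of ${INSTANCE_GROUP}"
    else
        echo "RUNNING: not in ${INSTANCE_GROUP}"
    fi
}
```

Run `diagnose_vm` to get the summary. A stopped VM can then be started with `gcloud compute instances start "$VM_NAME" --project "$PROJECT" --zone "$ZONE"`.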
Recovery: Drop and recreate the OLVM Engine VM
- Make a note of the OLVM Engine VM’s properties for future use. Example:

  Property                        Value
  GCP Project                     my-gcp-project
  VM Name                         olvm-w2
  Machine Type                    e2-standard-2
  Zone                            us-west2-c
  Network                         gcp-shared-vpc-vpc
  Subnet                          gcp-shared-vpc-west2-subnet
  Internal IP Address             olvm-w2-internal-ip
  External IP Address             None
  Network Tags                    iap-forwarding, bms-websocket, devops-olvm-manager, devops-remote-access, olvm, use-gcp-default-gw-route
  Unmanaged Instance Group Name   olvm-unmanaged-ig

- Open the VM image details screen and use the Create Machine Image button at the top of the page to create a copy of the corrupted VM.
- Delete the corrupt VM. Note: if the delete option is not enabled, you may need to edit the VM and turn off deletion protection.
- Navigate to Compute Engine - Snapshots in the GCP console and locate the latest disk image or snapshot of the VM's boot disk.
- Select the image and use CREATE INSTANCE to recreate the VM.
  - Use the same VM name as the deleted instance (example: olvm-w2) and the other values from the OLVM Engine VM properties table.
  - Hit Create.
- Navigate to Compute Engine - Instance groups in the GCP console. Update the OLVM unmanaged instance group to associate the recreated VM with it.
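For reference, the console steps above map roughly onto the gcloud commands sketched below. This is an illustration under stated assumptions, not a tested procedure: the snapshot name and the machine image name are hypothetical placeholders, the other values come from the example properties, and a shared VPC subnet may need to be given as a full resource path. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/bash
PROJECT=${PROJECT:-my-gcp-project}
ZONE=${ZONE:-us-west2-c}
VM_NAME=${VM_NAME:-olvm-w2}
SNAPSHOT=${SNAPSHOT:-olvm-w2-boot-snapshot}   # hypothetical snapshot name
DRY_RUN=${DRY_RUN:-1}                          # 1 = print commands only

# Print the command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [[ $DRY_RUN == 0 ]]; then "$@"; fi
}

recreate_olvm_vm() {
    # 1. Keep a copy of the corrupted VM for later analysis.
    run gcloud compute machine-images create "${VM_NAME}-corrupt" \
        --project "$PROJECT" --source-instance "$VM_NAME" --source-instance-zone "$ZONE"
    # 2. Turn off deletion protection, then delete the VM.
    run gcloud compute instances update "$VM_NAME" \
        --project "$PROJECT" --zone "$ZONE" --no-deletion-protection
    run gcloud compute instances delete "$VM_NAME" \
        --project "$PROJECT" --zone "$ZONE" --quiet
    # 3. Recreate the VM from the boot disk snapshot with the recorded properties.
    #    Assumption: the internal IP value is accepted as written in the table.
    run gcloud compute instances create "$VM_NAME" \
        --project "$PROJECT" --zone "$ZONE" --machine-type e2-standard-2 \
        --source-snapshot "$SNAPSHOT" --subnet gcp-shared-vpc-west2-subnet \
        --private-network-ip olvm-w2-internal-ip --no-address \
        --tags iap-forwarding,bms-websocket,devops-olvm-manager,devops-remote-access,olvm,use-gcp-default-gw-route
    # 4. Put the recreated VM back behind the load balancer.
    run gcloud compute instance-groups unmanaged add-instances olvm-unmanaged-ig \
        --project "$PROJECT" --zone "$ZONE" --instances "$VM_NAME"
}
```

Calling `recreate_olvm_vm` with the default DRY_RUN=1 lists the five commands for review. Keeping dry-run as the default is a deliberate choice because the second step is destructive.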
OLVM database corruption
Symptoms: The OLVM UI is available but database connections fail with a database related error. Example: server_error: Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
Diagnostics
- Log in to the VM and verify the database is running:
systemctl --all | grep postgres
postgresql.service                  not-found inactive dead    postgresql.service
rh-postgresql10-postgresql.service  loaded    active   running PostgreSQL database server
- The OLVM database service is rh-postgresql10-postgresql.service
Recovery
- If the database status is not "active running", it can be started using
systemctl start rh-postgresql10-postgresql.service
- If the problem persists, review the log files in the /var/log/ovirt-engine directory to find the date and time that the corruption occurred.
- Perform a point in time recovery (see below)
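The check-then-start step can be wrapped in a small helper. This is a minimal sketch that assumes it runs as root on the OLVM engine VM. The function name ensure_db_running is illustrative.

```shell
#!/bin/bash
DB_UNIT=rh-postgresql10-postgresql.service

# Start the OLVM database service if systemd does not report it as active.
# Returns non-zero if the service is still not active after the start attempt.
ensure_db_running() {
    if systemctl is-active --quiet "$DB_UNIT"; then
        echo "${DB_UNIT} is already active"
        return 0
    fi
    echo "${DB_UNIT} is not active, starting it"
    systemctl start "$DB_UNIT" && systemctl is-active --quiet "$DB_UNIT"
}
```

A non-zero return from `ensure_db_running` suggests the service could not be started, in which case the log review and point in time recovery steps apply.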
Point in time recovery
Symptoms: OLVM engine needs to be reset to an earlier configuration as a result of an operator error or database corruption.
Diagnostics
- Review the log files in cloud storage to identify the most recent error free backup
- If the backup was in the last 7 days use the VM disk snapshot to recover as described in the Accidental shutdown section.
- Use the recovery procedures below if a VM disk image is not available.
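Once the time of the error is known, picking the backup to restore can be scripted. The sketch below assumes backup names follow the pattern seen in the restore example in this document (ovirt-engine-backup-YYYYMMDD-HHMM) and that matching .log files sit next to them. The function name latest_backup_before is illustrative.

```shell
#!/bin/bash
# Read backup file names on stdin and print the newest one whose timestamp is
# not later than the cutoff (format YYYYMMDD-HHMM). Prints nothing if none qualify.
latest_backup_before() {
    local cutoff=$1 name stamp best="" best_stamp=""
    while read -r name; do
        # Ignore the restoration/backup log files that share the same prefix.
        [[ $name == *.log ]] && continue
        # Pull the YYYYMMDD-HHMM stamp out of the name; skip anything else.
        [[ $name =~ ovirt-engine-backup-([0-9]{8}-[0-9]{4}) ]] || continue
        stamp=${BASH_REMATCH[1]}
        # Fixed-width stamps compare correctly as plain strings.
        if [[ ! $stamp > $cutoff && $stamp > $best_stamp ]]; then
            best=$name
            best_stamp=$stamp
        fi
    done
    if [[ -n $best ]]; then
        echo "$best"
    fi
}
```

The input can come from a listing of the backup location, for example the output of `gsutil ls` on the backup bucket piped into `latest_backup_before 20220125-1200`. This only selects by timestamp; whether that backup is error free still has to be confirmed from the logs.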
Recovery
- [optional] Perform a fresh OLVM install on a new VM but do not run engine-setup
- Use gsutil or the download_olvmbkps_fromgcs.sh script to download the backup
- Use the olvm_restore.sh script to perform the restoration. It accepts 2 parameters:
  - Full path with file name for the backup file
  - Full path with file name for the restoration log file
  Example:
  bash olvm_restore.sh /root/backup/local_olvm_bkp/ovirt-engine-backup-20220125-1645 \
      /root/backup/local_olvm_bkp/ovirt-engine-backup-20220125-1645.log
- If installing in a new VM:
  - Run /usr/share/ovirt-engine/setup/bin/ovirt-engine-rename
  - Update the unmanaged instance group to point at the new server
Note: The olvm_restore.sh script will fail with a postgres login error if engine-setup has been run prior to starting the restoration.
Account locked
Symptoms: OLVM login fails with the error “Unable To Log In Because The User Account Is Disabled Or Locked. Contact The System Administrator.”
Recovery
Log in to the OLVM engine VM and run
sudo ovirt-aaa-jdbc-tool user unlock <user>
Lost or forgotten credentials
Recovery
SSH into the OLVM engine VM and run
ovirt-aaa-jdbc-tool user password-reset <user>
Password: <password>
Reenter password: <password>
updating user admin...
user updated successfully
Invalid SSL Certificate
Symptoms: Login fails with the message “The provided authorization grant for the auth code has expired”
Diagnostics
- Download the certificate from
https:///ovirt-engine/services/pki-resource?resource=ca-certificate&format=X509-PEM-CA
- Examine using openssl x509 -text -in
Example:
openssl x509 -text -in "pki-resource(3)"
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 4096 (0x1000)
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: C=US, O=c.dito-oracle-migration-dev.internal, CN=olvm5.c.dito-oracle-migration-dev.internal.21128
        Validity
            Not Before: Jan 19 15:52:40 2022 GMT
            Not After : Jan 18 15:52:40 2032 GMT
... file continues
- Verify the CN entry (olvm5.c.dito-oracle-migration-dev.internal in the example) matches the VM's FQDN
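The CN comparison can also be done without reading the full certificate dump. The sketch below assumes, based on the example output, that the CN is the engine FQDN optionally followed by a numeric suffix such as .21128. The function names are illustrative, and the subject is requested in RFC 2253 form so that it is easier to parse.

```shell
#!/bin/bash
# Extract the CN value from an RFC 2253 style subject line read on stdin,
# e.g. "subject=CN=host.example.com.21128,O=example.com,C=US".
extract_cn() {
    sed -e 's/^subject= *//' | tr ',' '\n' | sed -n 's/^ *CN *= *//p' | head -n 1
}

# Succeed if the CN equals the FQDN, or the FQDN plus a ".<digits>" suffix.
cn_matches_fqdn() {
    local cn=$1 fqdn=$2 rest
    [[ $cn == "$fqdn"* ]] || return 1
    rest=${cn#"$fqdn"}
    [[ $rest =~ ^(\.[0-9]+)?$ ]]
}

# Compare the subject CN of a PEM certificate with an FQDN (default: this host).
check_cert() {
    local pem=$1 fqdn=${2:-$(hostname -f)} cn
    cn=$(openssl x509 -noout -subject -nameopt RFC2253 -in "$pem" | extract_cn)
    if cn_matches_fqdn "$cn" "$fqdn"; then
        echo "OK: CN ${cn} matches ${fqdn}"
    else
        echo "MISMATCH: CN ${cn} does not match ${fqdn}"
        return 1
    fi
}
```

After saving the downloaded certificate to a file, `check_cert <file>` run on the engine VM reports whether the names agree.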
Recovery
- Use the oVirt engine rename tool (/usr/share/ovirt-engine/setup/bin/ovirt-engine-rename) if the names do not match
Unregistered endpoint
Symptoms: Login fails with the message “The FQDN used to access the system is not a valid engine FQDN. You must access the system using the engine FQDN or one of the engine alternate FQDNs.”
Recovery
- Create or edit the 99-custom-sso-setup.conf engine configuration file, set the alternate FQDNs in it, and restart the engine:
vi /etc/ovirt-engine/engine.conf.d/99-custom-sso-setup.conf
SSO_ALTERNATE_ENGINE_FQDNS="alias1.example.com alias2.example.com"
systemctl restart ovirt-engine
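Where editing the file by hand is inconvenient, the setting can be written by a script. This is a minimal sketch. The function name set_alternate_fqdns is illustrative, and it replaces any existing SSO_ALTERNATE_ENGINE_FQDNS line while leaving other lines in the file untouched.

```shell
#!/bin/bash
# Write (or replace) the SSO_ALTERNATE_ENGINE_FQDNS setting in the given file.
# Usage: set_alternate_fqdns <conf_file> <alias> [<alias> ...]
set_alternate_fqdns() {
    local conf_file=$1 tmp
    shift
    tmp=$(mktemp) || return 1
    # Keep every existing line except a previous SSO_ALTERNATE_ENGINE_FQDNS entry.
    if [[ -f $conf_file ]]; then
        grep -v '^SSO_ALTERNATE_ENGINE_FQDNS=' "$conf_file" > "$tmp" || true
    fi
    # "$*" joins the aliases with single spaces, the format used in the setting.
    printf 'SSO_ALTERNATE_ENGINE_FQDNS="%s"\n' "$*" >> "$tmp"
    # Overwrite the contents in place so an existing file keeps its permissions.
    cat "$tmp" > "$conf_file" && rm -f "$tmp"
}
```

For example, `set_alternate_fqdns /etc/ovirt-engine/engine.conf.d/99-custom-sso-setup.conf alias1.example.com alias2.example.com` followed by `systemctl restart ovirt-engine`.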
VM Health Check
- Log in to one of the DNS VMs
- Use the check_vm_status.sh script shown below and included in the OLVM backup/restore scripts to check the status of the VMs in /etc/hosts. Run the script a number of times using different port numbers (e.g. 22 ssh, 443 https, 1521 oracle, 3128 squid) to identify any VMs that are not responding on their expected ports.
#!/bin/bash
### https://gist.github.com/perfecto25/4581e99c95df80b12896113cc2b6d958
### This script reads in a file in /etc/hosts format <ip> <hostname>, then attempts to netcat to the host using provided port
### if no port is provided, it will attempt to connect via port 22
### if no file is provided, it will use /etc/hosts to read in IPs
### Usage: ./check_vm_status.sh <port> <file>
### Example: ./check_vm_status.sh <- this will try scanning /etc/hosts and connect to each IP via port 22
### Example: ./check_vm_status.sh 21500 /home/user/testfile
# nctest
port=${1:-22}             # default 22
file=${2:-"/etc/hosts"}   # default /etc/hosts

RED='\033[1;31m'
GREEN='\033[1;32m'
NC='\033[0m' # no color

## check if netcat is installed (discard both stdout and stderr)
if command -v nc >/dev/null 2>&1
then
    echo "netcat is installed, proceeding.."
else
    echo -e "${RED}[ERROR]${NC} netcat is not installed on this host"
    exit 1
fi

while read -r line
do
    ## skip blank lines and comment lines
    if [[ -n $line ]] && [[ "${line}" != \#* ]]
    then
        ip=$(echo "$line" | awk '{print $1}')
        hostname=$(echo "$line" | awk '{print $2}')
        ## if ipv4
        if [[ $ip =~ ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
            echo "--------------------------------------"
            ## attempt netcat connection, timeout of 2 seconds
            if nc -z -w 2 "$ip" "$port" >/dev/null 2>&1
            then
                echo -e "${hostname} nc ${ip} ${port} ... ${GREEN}ok${NC}"
            else
                echo -e "${hostname} nc ${ip} ${port} ... ${RED}[FAIL]${NC}"
            fi
        fi
    fi
done < "$file"
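Running the script once per port, as described in the health check step, can be wrapped in a loop that reports only the failures. This is a minimal sketch that relies on the "[FAIL]" marker printed by check_vm_status.sh. The helper names scan_ports and only_failures are illustrative.

```shell
#!/bin/bash
# Run the health check script once per port against the same hosts file.
# Usage: scan_ports <script> <hosts_file> <port> [<port> ...]
scan_ports() {
    local script=$1 hosts_file=$2 port
    shift 2
    for port in "$@"; do
        "$script" "$port" "$hosts_file"
    done
}

# Strip ANSI colour codes and keep only the failed checks.
# Produces no output (and still succeeds) when nothing failed.
only_failures() {
    sed $'s/\033\\[[0-9;]*m//g' | grep -F '[FAIL]' || true
}
```

For example, `scan_ports ./check_vm_status.sh /etc/hosts 22 443 1521 3128 | only_failures` lists each host and port pair that did not respond. A failure only indicates a problem if that host is expected to listen on that port.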
Copyright © Dito LLC, 2023