OLVM Engine Disaster Recovery Procedures

Peter Goldthorp, Dito. February 2022

This document describes procedures to recover from an OLVM engine failure.

Actions

Gather information and identify severity.
- Run health checks on VMs provisioned using OLVM. Identify which (if any) are affected by the outage. OLVM is primarily used as a metadata repository and provisioning mechanism for VMs and storage. Once provisioned, the VMs should continue to operate independently of the engine.
- Identify the type of disaster. Examples could include an accidental deletion or shutdown of the VM where the OLVM Engine is installed, a database corruption of the OLVM engine or human error while configuring resources.
- Identify when the disaster occurred. It may have occurred some time ago. Check the log files in cloud storage to find the last valid backup.
- If the OLVM engine VM is running, check the log files in the /var/log/ovirt-engine directory.
Notify affected individuals and take steps to prevent further damage. For example, if the issue was caused by a person or process, make sure they have stopped.
Take a snapshot of the OLVM engine VM’s boot disk for later analysis.
Develop and execute a recovery plan using the following scenarios for guidance.
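
As an illustration of the log review step, the sketch below lists the most recent error entries from an engine log so the time of the failure can be narrowed down. It assumes the main log is /var/log/ovirt-engine/engine.log and that error entries contain the word ERROR; the function name is hypothetical.

```shell
#!/bin/bash
# Sketch only: print the last N lines containing the word ERROR.
# Assumptions: the log path and the ERROR marker match your install.

recent_engine_errors() {
    local logfile=${1:-/var/log/ovirt-engine/engine.log}
    local count=${2:-20}
    if [[ ! -r $logfile ]]; then
        echo "cannot read ${logfile}" >&2
        return 1
    fi
    # -w matches ERROR as a whole word; tail keeps the newest entries
    grep -w 'ERROR' "$logfile" | tail -n "$count"
}

# Usage (on the engine VM):
#   recent_engine_errors /var/log/ovirt-engine/engine.log 50
```

The timestamps on the earliest and latest error lines give a rough window for choosing a backup that predates the problem.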

Scenarios

Accidental shutdown or deletion of the OLVM VM

Symptoms: Calls to the OLVM Engine UI fail with an HTTP 502 error originating from the load balancer associated with the instance group.

Diagnostics

Log into the cloud console and navigate to the VM instances page. If the OLVM engine VM is in a stopped state, try starting it. Note that it may take a few minutes for the load balancer to recognize that the VM is back online.
If the VM is running, check to make sure it is associated with an instance group.
If the VM is missing, it will need to be recreated.
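
The same decision can be sketched from the command line. The helper below is hypothetical; it maps a Compute Engine instance status to the next diagnostic step described above. It assumes the status string comes from `gcloud compute instances describe` and is empty when the instance does not exist.

```shell
#!/bin/bash
# Sketch only: choose the next step from an instance status string.
# Assumption: a stopped instance reports TERMINATED and a missing
# instance yields an empty string (describe fails).

next_step_for_status() {
    case "$1" in
        RUNNING)    echo "check instance group membership" ;;
        TERMINATED) echo "start the VM and wait for the load balancer" ;;
        "")         echo "recreate the VM" ;;
        *)          echo "wait and re-check (status: $1)" ;;
    esac
}

# Usage, with the example values from this document:
#   status=$(gcloud compute instances describe olvm-w2 --zone us-west2-c \
#              --project my-gcp-project --format='value(status)' 2>/dev/null)
#   next_step_for_status "$status"
```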

Recovery: Drop and recreate the OLVM Engine VM

Make a note of the OLVM Engine VM’s properties for future use. Example:

Property	Value
GCP Project	my-gcp-project
VM Name	olvm-w2
Machine Type	e2-standard-2
Zone	us-west2-c
Network	gcp-shared-vpc-vpc
Subnet	gcp-shared-vpc-west2-subnet
Internal IP Address	olvm-w2-internal-ip
External IP Address	None
Network Tags	`iap-forwarding` `bms-websocket` `devops-olvm-manager` `devops-remote-access` `olvm` `use-gcp-default-gw-route`
Unmanaged Instance Group Name	olvm-unmanaged-ig

Open the VM instance details screen and use the Create Machine Image button at the top of the page to create a copy of the corrupted VM.
Delete the corrupt VM. Note: if the delete option is not enabled, you may need to edit the VM and turn off deletion protection.
Navigate to Compute Engine - Snapshots in the GCP console and locate the latest disk image or snapshot of the VM's boot disk.
Select the image and use CREATE INSTANCE to recreate the VM.
- Use the same VM name as the deleted instance (for example, olvm-w2) and the other values from the OLVM Engine VM properties table.
- Click Create.
Navigate to Compute Engine - Instance groups in the GCP console. Update the OLVM unmanaged instance group to associate the recreated VM with it.
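
For reference, the console steps above correspond roughly to the gcloud commands printed by the sketch below. It only prints the commands and does not run them. The function name and the snapshot name are hypothetical, and the create command omits the machine type, network, subnet, internal IP, and tag flags, which would need to be added from the properties table. Verify each flag with `gcloud help` before use.

```shell
#!/bin/bash
# Sketch only: print (do not execute) gcloud equivalents of the console steps.
# Assumption: the boot disk is restored from a snapshot whose name you supply.

print_recreate_commands() {
    local project=$1 zone=$2 vm=$3 group=$4 snapshot=$5
    cat <<EOF
gcloud compute instances update ${vm} --project ${project} --zone ${zone} --no-deletion-protection
gcloud compute instances delete ${vm} --project ${project} --zone ${zone}
gcloud compute instances create ${vm} --project ${project} --zone ${zone} --source-snapshot ${snapshot}
gcloud compute instance-groups unmanaged add-instances ${group} --project ${project} --zone ${zone} --instances ${vm}
EOF
}

# Usage (snapshot name is a placeholder):
#   print_recreate_commands my-gcp-project us-west2-c olvm-w2 olvm-unmanaged-ig <snapshot-name>
```

Printing the commands first makes it possible to review them against the properties table before anything is deleted.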

OLVM database corruption

Symptoms: The OLVM UI is available but database connections fail with a database-related error. Example: server_error: Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.

Diagnostics

  systemctl --all |grep postgres

  postgresql.service                 not-found inactive dead      postgresql.service
  rh-postgresql10-postgresql.service loaded    active   running   PostgreSQL database server

The OLVM database service is rh-postgresql10-postgresql.service.
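
A small filter can pick the loaded PostgreSQL unit out of that listing automatically. This is a sketch with a hypothetical function name; it assumes the column layout shown in the sample output above (unit name first, load state second).

```shell
#!/bin/bash
# Sketch only: print the names of postgres units whose LOAD column is "loaded".
# Reads `systemctl --all` output on stdin.

loaded_postgres_units() {
    awk '{ sub(/^[^[:alnum:]]+/, "") }          # drop leading spaces or status bullet
         /postgres/ && $2 == "loaded" { print $1 }'
}

# Usage:
#   systemctl --all | loaded_postgres_units
```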

Recovery

If the database status is not active (running), start it using:
```
 systemctl start rh-postgresql10-postgresql.service
```
If the problem persists, review the log files in the /var/log/ovirt-engine directory to find the date and time that the corruption occurred.
Perform a point in time recovery (see below).

Point in time recovery

Symptoms: OLVM engine needs to be reset to an earlier configuration as a result of an operator error or database corruption.

Diagnostics

Review the log files in cloud storage to identify the most recent error-free backup.
If that backup was taken within the last 7 days, use the VM disk snapshot to recover as described in the "Accidental shutdown or deletion of the OLVM VM" scenario.
Use the recovery procedures below if a VM disk image is not available.
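
The 7-day check can be sketched as a small helper that computes a backup's age from its file name. It assumes backup names follow the ovirt-engine-backup-YYYYMMDD-HHMM pattern seen in the example later in this section, and it relies on GNU date. The function name is hypothetical.

```shell
#!/bin/bash
# Sketch only: print the age in whole days of a backup, based on its name.
# Assumptions: name pattern ovirt-engine-backup-YYYYMMDD-HHMM; GNU date.

backup_age_days() {
    local name=$1 now=${2:-$(date +%Y%m%d)}
    local stamp
    stamp=$(basename "$name" |
        sed -n 's/^ovirt-engine-backup-\([0-9]\{8\}\)-[0-9]\{4\}.*$/\1/p')
    [[ -n $stamp ]] || return 1     # name did not match the pattern
    echo $(( ( $(date -u -d "$now" +%s) - $(date -u -d "$stamp" +%s) ) / 86400 ))
}

# Usage:
#   if (( $(backup_age_days ovirt-engine-backup-20220125-1645) <= 7 )); then
#       echo "use the VM disk snapshot"
#   fi
```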

Recovery

[optional] Perform a fresh OLVM install on a new VM but do not run engine-setup.
Use gsutil or the download_olvmbkps_fromgcs.sh script to download the backup.
Use the olvm_restore.sh script to perform the restoration. It accepts 2 parameters:
1. Full path with file name for the backup file
2. Full path with file name for the restoration log file
Example:
```
 bash olvm_restore.sh /root/backup/local_olvm_bkp/ovirt-engine-backup-20220125-1645 \
 /root/backup/local_olvm_bkp/ovirt-engine-backup-20220125-1645.log
```
If installing in a new VM:
1. Run /usr/share/ovirt-engine/setup/bin/ovirt-engine-rename
2. Update the unmanaged instance group to point at the new server

Note: The olvm_restore.sh script will fail with a postgres login error if engine-setup has been run prior to starting the restoration.
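
The restore invocation can be wrapped so that the log path is always derived from the backup path, as in the example above. This is a sketch with hypothetical function names; it assumes olvm_restore.sh takes exactly the two parameters described and that the backup has already been downloaded.

```shell
#!/bin/bash
# Sketch only: run the restore script with a log file named after the backup.
# Assumption: olvm_restore.sh <backup-file> <log-file>, as described above.

restore_log_path() {
    printf '%s.log\n' "$1"
}

run_restore() {
    local backup=$1 script=${2:-./olvm_restore.sh}
    if [[ ! -f $backup ]]; then
        echo "backup not found: ${backup}" >&2
        return 1
    fi
    bash "$script" "$backup" "$(restore_log_path "$backup")"
}

# Usage:
#   run_restore /root/backup/local_olvm_bkp/ovirt-engine-backup-20220125-1645
```

Checking that the backup file exists before calling the script avoids starting a restoration against a failed or partial download.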

Account locked

Symptoms: Login fails with the error “Unable to log in because the user account is disabled or locked. Contact the system administrator.”

Recovery

SSH into the OLVM engine VM and run

  sudo ovirt-aaa-jdbc-tool user unlock <user>

Lost or forgotten credentials

Recovery

SSH into the OLVM engine VM and run

  ovirt-aaa-jdbc-tool user password-reset <user>
  Password: <password>
  Reenter password: <password>
  updating user admin...
  user updated successfully

Invalid SSL Certificate

Symptoms: Login fails with the message “The provided authorization grant for the auth code has expired”

Diagnostics

Download the certificate from https://<engine-fqdn>/ovirt-engine/services/pki-resource?resource=ca-certificate&format=X509-PEM-CA

Examine it using openssl x509 -text -in <file>. Example:

  openssl x509 -text -in "pki-resource(3)"
  Certificate:
      Data:
          Version: 3 (0x2)
          Serial Number: 4096 (0x1000)
      Signature Algorithm: sha256WithRSAEncryption
          Issuer: C=US, O=c.dito-oracle-migration-dev.internal, CN=olvm5.c.dito-oracle-migration-dev.internal.21128
          Validity
              Not Before: Jan 19 15:52:40 2022 GMT
              Not After : Jan 18 15:52:40 2032 GMT

              ... file continues

Verify that the CN entry (olvm5.c.dito-oracle-migration-dev.internal in the example) matches the VM's FQDN.
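
The comparison can be scripted. The sketch below extracts the host part of the CN from an issuer line. It assumes, based on the example above, that the CN is the engine FQDN followed by a dot and a numeric suffix (.21128 in the example); the function name is hypothetical.

```shell
#!/bin/bash
# Sketch only: read an Issuer line on stdin and print the CN without its
# trailing numeric suffix.
# Assumption: CN has the form <fqdn>.<digits>, as in the example output.

ca_cn_host() {
    sed -n 's/.*CN *= *\([^,\/]*\).*/\1/p' | sed 's/\.[0-9][0-9]*$//'
}

# Usage (certificate file name is a placeholder):
#   cn=$(openssl x509 -noout -issuer -in <file> | ca_cn_host)
#   [[ $cn == "$(hostname -f)" ]] && echo "CN matches" || echo "CN mismatch"
```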

Recovery

1. Use the oVirt engine rename tool (/usr/share/ovirt-engine/setup/bin/ovirt-engine-rename) if the names do not match.

Unregistered endpoint

Symptoms: Login fails with the message “The FQDN used to access the system is not a valid engine FQDN. You must access the system using the engine FQDN or one of the engine alternate FQDNs.”

Recovery

Create or edit the 99-custom-sso-setup.conf engine configuration file and restart the engine:

 vi /etc/ovirt-engine/engine.conf.d/99-custom-sso-setup.conf

 SSO_ALTERNATE_ENGINE_FQDNS="alias1.example.com alias2.example.com"

 systemctl restart ovirt-engine
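
The edit can also be made non-interactively. The sketch below sets the SSO_ALTERNATE_ENGINE_FQDNS line in a given file, replacing an existing entry or appending a new one. The function name is hypothetical, and it assumes GNU sed and that the alias names contain no characters special to sed.

```shell
#!/bin/bash
# Sketch only: set SSO_ALTERNATE_ENGINE_FQDNS in a conf file idempotently.
# Assumptions: GNU sed (-i); aliases are plain host names.

set_alternate_fqdns() {
    local conf=$1; shift
    local line="SSO_ALTERNATE_ENGINE_FQDNS=\"$*\""
    if [[ -f $conf ]] && grep -q '^SSO_ALTERNATE_ENGINE_FQDNS=' "$conf"; then
        # replace the existing entry in place
        sed -i "s|^SSO_ALTERNATE_ENGINE_FQDNS=.*|${line}|" "$conf"
    else
        echo "$line" >> "$conf"
    fi
}

# Usage (as root on the engine VM):
#   set_alternate_fqdns /etc/ovirt-engine/engine.conf.d/99-custom-sso-setup.conf \
#       alias1.example.com alias2.example.com
#   systemctl restart ovirt-engine
```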

VM Health Check

Log in to one of the DNS VMs.
Use the check_vm_status.sh script shown below and included in the OLVM backup/restore scripts to check the status of the VMs in /etc/hosts. Run the script a number of times using different port numbers (e.g. 22 ssh, 443 https, 1521 oracle, 3128 squid) to identify any VMs that are not responding on their expected ports.

#!/bin/bash
### https://gist.github.com/perfecto25/4581e99c95df80b12896113cc2b6d958

### This script reads in a file in /etc/hosts format <ip> <hostname>, then attempts to netcat to the host using provided port
### if no port is provided, it will attempt to connect via port 22
### if no file is provided, it will use /etc/hosts to read in IPs
### Usage: ./check_vm_status.sh <port> <file>
### Example: ./check_vm_status.sh  <- this will try scanning /etc/hosts and connect to each IP via port 22
### Example: ./check_vm_status.sh 21500 /home/user/testfile


# nctest
port=${1:-22}  # default 22
file=${2:-"/etc/hosts"}  # default /etc/hosts
RED='\033[1;31m'
GREEN='\033[1;32m'
NC='\033[0m'  # no color

## check if netcat is installed
if command -v nc >/dev/null 2>&1
then
    echo "netcat is installed, proceeding.."
else
    echo -e "${RED}[ERROR]${NC} netcat is not installed on this host"
    exit 1
fi


while read -r line
do
    if [[ -n $line ]] && [[ "${line}" != \#* ]]
    then
        ip=$(echo "$line" | awk '{print $1}')
        hostname=$(echo "$line" | awk '{print $2}')

        ## if ipv4
        if [[ $ip =~ ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
            echo "--------------------------------------"

            ## attempt netcat connection, timeout of 2
            if nc -z -w 2 "$ip" "$port" >/dev/null 2>&1 </dev/null
            then
                echo -e "${hostname}   nc ${ip} ${port} ... ${GREEN}ok${NC}"
            else
                echo -e "${hostname}   nc ${ip} ${port} ... ${RED}[FAIL]${NC}"
            fi
        fi
    fi

done < "$file"
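
To run the check across several ports and keep only the failures, the output can be filtered as sketched below. The function name is hypothetical, and it assumes the output line format produced by the script above (host name first, port fourth, [FAIL] marker on failing lines).

```shell
#!/bin/bash
# Sketch only: read check_vm_status.sh output on stdin and print
# "<hostname> <port>" for every failed connection.

failed_hosts() {
    grep -F '[FAIL]' | awk '{ print $1, $4 }'
}

# Usage, with the example ports from this section:
#   for port in 22 443 1521 3128; do
#       ./check_vm_status.sh "$port"
#   done | failed_hosts
```

An empty result means every host in /etc/hosts answered on every port tested.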