OLVM Engine Disaster Recovery Procedures

Peter Goldthorp, Dito. February 2022

This document describes procedures to recover from an OLVM engine failure.

Actions

  1. Gather information and identify severity.
    • Run health checks on VMs provisioned using OLVM. Identify which (if any) are affected by the outage. OLVM is primarily used as a metadata repository and provisioning mechanism for VMs and storage. Once provisioned, the VMs should continue to operate independently.
    • Identify the type of disaster. Examples could include an accidental deletion or shutdown of the VM where the OLVM Engine is installed, a database corruption of the OLVM engine or human error while configuring resources.
    • Identify when the disaster occurred. It may have occurred some time ago. Check the log files in cloud storage to find the last valid backup.
    • If the OLVM engine VM is running, check the log files in the /var/log/ovirt-engine directory.
  2. Notify affected individuals and take steps to prevent further damage. For example, if the issue was caused by a person or process make sure they have stopped.
  3. Take a snapshot of the OLVM engine VM’s boot disk for later analysis.
  4. Develop and execute a recovery plan using the following scenarios for guidance.
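
Step 3 can also be done from the command line. The following is a minimal sketch, assuming the gcloud CLI is installed and authenticated and that the boot disk has the same name as the VM (the Compute Engine default). The function name, the DRY_RUN switch and the snapshot naming scheme are illustrative and not part of the original procedure.

```shell
#!/bin/bash
# Sketch: snapshot the OLVM engine boot disk before attempting recovery.
# Assumption: the boot disk is named after the VM (Compute Engine default).
# With DRY_RUN=1 (the default here) the command is printed, not executed.

snapshot_boot_disk() {
    local disk=$1 zone=$2 project=$3
    # Hypothetical naming scheme: <disk>-dr-<UTC timestamp>
    local name=${4:-"${disk}-dr-$(date -u +%Y%m%d-%H%M%S)"}
    local cmd=(gcloud compute disks snapshot "$disk"
               --project="$project" --zone="$zone"
               --snapshot-names="$name")
    if [[ ${DRY_RUN:-1} == 1 ]]; then
        echo "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}
```

Example: review the command with snapshot_boot_disk olvm-w2 us-west2-c my-gcp-project, then repeat it with DRY_RUN=0 in front to take the snapshot.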

Scenarios

Accidental shutdown or deletion of the OLVM VM

Symptoms: Calls to the OLVM Engine UI fail with an HTTP 502 error originating from the load balancer associated with the instance group.

Diagnostics

  • Log into the cloud console and navigate to the VM instances page. If the OLVM engine VM is in a stopped state, try starting it. Note it may take a few minutes for the load balancer to recognize the VM is back online.
  • If the VM is running, check to make sure it is associated with an instance group.
  • If the VM is missing, it will need to be recreated.
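
The same triage can be sketched from a shell. This is an illustration only: it assumes the gcloud CLI is available, and the next_action helper is hypothetical. Compute Engine reports a stopped VM as TERMINATED; an empty status is treated here as "VM not found".

```shell
#!/bin/bash
# Sketch: map a Compute Engine instance status to the next diagnostic step.
# An empty status is taken to mean that the describe call found no such VM.

next_action() {
    case "$1" in
        RUNNING)    echo "check instance group membership" ;;
        TERMINATED) echo "start the VM" ;;
        SUSPENDED)  echo "resume the VM" ;;
        "")         echo "recreate the VM" ;;
        *)          echo "wait and re-check" ;;  # transitional states
    esac
}

# Example (VM name and zone are the sample values used in this document):
#   status=$(gcloud compute instances describe olvm-w2 \
#       --zone=us-west2-c --format='value(status)' 2>/dev/null)
#   next_action "$status"
```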

Recovery: Drop and recreate the OLVM Engine VM

  1. Make a note of the OLVM Engine VM’s properties for future use. Example:

    Property                        Value
    GCP Project                     my-gcp-project
    VM Name                         olvm-w2
    Machine Type                    e2-standard-2
    Zone                            us-west2-c
    Network                         gcp-shared-vpc-vpc
    Subnet                          gcp-shared-vpc-west2-subnet
    Internal IP Address             olvm-w2-internal-ip
    External IP Address             None
    Network Tags                    iap-forwarding bms-websocket devops-olvm-manager devops-remote-access olvm use-gcp-default-gw-route
    Unmanaged Instance Group Name   olvm-unmanaged-ig
  2. Open the VM instance details screen and use the Create Machine Image button at the top of the page to create a copy of the corrupted VM.
  3. Delete the corrupt VM. Note: if the delete option is not enabled, you may need to edit the VM and turn off deletion protection.
  4. Navigate to Compute Engine - Snapshots in the GCP console and locate the latest disk image or snapshot of the VM's boot disk.
  5. Select the image and use CREATE INSTANCE to recreate the VM.
    • Use the same VM name as the deleted instance (example: olvm-w2) and the other values from the OLVM Engine VM properties table.
    • Click Create.
  6. Navigate to Compute Engine - Instance groups in the GCP console. Update the OLVM unmanaged instance group to associate the recreated VM with it.
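
Step 6 has a command-line equivalent. The sketch below only prints the gcloud commands so they can be reviewed first; the reattach_cmds helper is hypothetical, and the group, VM and zone values are the sample values from the properties table.

```shell
#!/bin/bash
# Sketch: print the gcloud commands that re-attach a recreated VM to the
# unmanaged instance group and then list the group members for verification.
# Commands are printed for review rather than executed.

reattach_cmds() {
    local group=$1 vm=$2 zone=$3
    echo "gcloud compute instance-groups unmanaged add-instances ${group} --zone=${zone} --instances=${vm}"
    echo "gcloud compute instance-groups unmanaged list-instances ${group} --zone=${zone}"
}

# Example:
#   reattach_cmds olvm-unmanaged-ig olvm-w2 us-west2-c
```

Once the printed commands look right, they can be pasted into a shell or piped to bash.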

OLVM database corruption

Symptoms: The OLVM UI is available but database connections fail with a database related error. Example: server_error: Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.

Diagnostics

  • Log in to the VM and verify the database is running:
      systemctl --all |grep postgres
    
      postgresql.service                 not-found inactive dead      postgresql.service
      rh-postgresql10-postgresql.service loaded    active   running   PostgreSQL database server
    
  • The OLVM database service is rh-postgresql10-postgresql.service
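
If the unit name differs on another installation, it can be picked out of the systemctl listing instead of being hard-coded. A minimal sketch, assuming the column layout shown above (unit, load state, active state, sub state); find_pg_unit is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: read a systemctl unit listing on stdin and print the PostgreSQL
# service unit that is actually loaded (skipping "not-found" entries).

find_pg_unit() {
    awk '$1 ~ /postgresql.*\.service$/ && $2 == "loaded" { print $1 }'
}

# Example:
#   unit=$(systemctl list-units --all --plain --no-legend | find_pg_unit)
#   systemctl is-active "$unit"
```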

Recovery

  1. If the database status is not active (running), start it using:
     systemctl start rh-postgresql10-postgresql.service
    
  2. If the problem persists, review the log files in the /var/log/ovirt-engine directory to find the date and time that the corruption occurred.
  3. Perform a point in time recovery (see below).
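
To narrow down when the problem started, it can help to count ERROR entries per day in the engine log. This is a sketch under the assumption that each log line starts with a date, a time and then the log level, as engine.log normally does; errors_per_day is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: count ERROR lines per day in an oVirt/OLVM engine log.
# Assumes lines look like: "2022-01-25 16:45:01,123-08 ERROR [...] message".

errors_per_day() {
    local logfile=${1:-/var/log/ovirt-engine/engine.log}
    awk '$3 == "ERROR" { count[$1]++ }
         END { for (d in count) print d, count[d] }' "$logfile" | sort
}

# Example:
#   errors_per_day /var/log/ovirt-engine/engine.log
```

A day on which the count jumps is a reasonable starting point for choosing the recovery time.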

Point in time recovery

Symptoms: OLVM engine needs to be reset to an earlier configuration as a result of an operator error or database corruption.

Diagnostics

  • Review the log files in cloud storage to identify the most recent error-free backup.
  • If the backup was taken in the last 7 days, use the VM disk snapshot to recover as described in the Accidental shutdown or deletion of the OLVM VM section.
  • Use the recovery procedures below if a VM disk image is not available.
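
Choosing a backup can be sketched as a filter over the bucket listing. The example assumes backups are named ovirt-engine-backup-YYYYMMDD-HHMM, as in the restore example in this document; the latest_backup_before helper and the bucket path in the usage comment are hypothetical.

```shell
#!/bin/bash
# Sketch: from a listing on stdin, print the newest backup whose timestamp
# is at or before the given cutoff (format YYYYMMDD-HHMM).
# Companion .log files are ignored because they do not end in the timestamp.

latest_backup_before() {
    local cutoff=$1
    sed -n 's/.*\(ovirt-engine-backup-[0-9]\{8\}-[0-9]\{4\}\)$/\1/p' | sort |
        awk -v c="ovirt-engine-backup-${cutoff}" \
            '$0 <= c { last = $0 } END { if (last != "") print last }'
}

# Example (bucket name is a placeholder):
#   gsutil ls gs://<backup-bucket>/ | latest_backup_before 20220125-1200
```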

Recovery

  1. [optional] Perform a fresh OLVM install on a new VM but do not run engine-setup
  2. Use gsutil or the download_olvmbkps_fromgcs.sh script to download the backup.
  3. Use the olvm_restore.sh script to perform the restoration. It accepts 2 parameters:
    1. Full path with file name for the backup file
    2. Full path with file name for the restoration log file

    Example:

     bash olvm_restore.sh /root/backup/local_olvm_bkp/ovirt-engine-backup-20220125-1645 \
     /root/backup/local_olvm_bkp/ovirt-engine-backup-20220125-1645.log
    
  4. If installing in a new VM:
    1. Run /usr/share/ovirt-engine/setup/bin/ovirt-engine-rename
    2. Update the unmanaged instance group to point at the new server

Note: The olvm_restore.sh script will fail with a postgres login error if engine-setup has been run prior to starting the restoration.
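
Before starting a restore, the two script arguments can be sanity-checked. This is a small sketch, not part of olvm_restore.sh; restore_preflight is a hypothetical helper, and it does not detect whether engine-setup has already been run.

```shell
#!/bin/bash
# Sketch: sanity-check the two olvm_restore.sh arguments before restoring.
# Returns 0 when the backup file is readable and the log directory is writable.

restore_preflight() {
    local backup=$1 logfile=$2
    if [[ ! -r $backup ]]; then
        echo "backup file not readable: $backup" >&2
        return 1
    fi
    if [[ ! -w $(dirname "$logfile") ]]; then
        echo "log directory not writable: $(dirname "$logfile")" >&2
        return 1
    fi
    echo "ok: $backup"
}

# Example:
#   bk=/root/backup/local_olvm_bkp/ovirt-engine-backup-20220125-1645
#   restore_preflight "$bk" "$bk.log" && bash olvm_restore.sh "$bk" "$bk.log"
```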

Account locked

OLVM: Login Fails With Error:

Unable To Log In Because The User Account Is Disabled Or Locked. Contact The System Administrator.

Log in to the OLVM engine VM and run

sudo ovirt-aaa-jdbc-tool user unlock <user>

Lost or forgotten credentials

Recovery

SSH into the OLVM engine VM and run

ovirt-aaa-jdbc-tool user password-reset <user>
Password: <password>
Reenter password: <password>
updating user admin...
user updated successfully

Invalid SSL Certificate

Symptoms: Login fails with the message “The provided authorization grant for the auth code has expired”

Diagnostics

  • Download the certificate from https:///ovirt-engine/services/pki-resource?resource=ca-certificate&format=X509-PEM-CA
  • Examine it using openssl x509 -text -in with the downloaded file. Example:
      openssl x509 -text -in "pki-resource(3)"
      Certificate:
          Data:
              Version: 3 (0x2)
              Serial Number: 4096 (0x1000)
          Signature Algorithm: sha256WithRSAEncryption
              Issuer: C=US, O=c.dito-oracle-migration-dev.internal, CN=olvm5.c.dito-oracle-migration-dev.internal.21128
              Validity
                  Not Before: Jan 19 15:52:40 2022 GMT
                  Not After : Jan 18 15:52:40 2032 GMT
    
                  ... file continues
    
  • Verify the CN entry (olvm5.c.dito-oracle-migration-dev.internal in the example) matches the VM's FQDN
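
The comparison can be scripted. The sketch below assumes the CN has the form <fqdn>.<number>, as in the example output above, and strips the numeric suffix; issuer_cn_host is a hypothetical helper and ca.pem stands for the downloaded certificate file.

```shell
#!/bin/bash
# Sketch: read an openssl "Issuer:" line on stdin and print the host part
# of the CN, with the trailing numeric suffix removed.
# Handles both "CN=value" and "CN = value" output styles.

issuer_cn_host() {
    sed -n 's/^.*CN *= *\([^,/]*\).*$/\1/p' | sed 's/\.[0-9][0-9]*$//'
}

# Example:
#   cn=$(openssl x509 -noout -issuer -in ca.pem | issuer_cn_host)
#   [ "$cn" = "$(hostname -f)" ] && echo "CN matches" || echo "CN mismatch"
```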

Recovery

  1. Use the oVirt engine rename tool (/usr/share/ovirt-engine/setup/bin/ovirt-engine-rename) if the names do not match.

Unregistered endpoint

Symptoms: Login fails with the message “The FQDN used to access the system is not a valid engine FQDN. You must access the system using the engine FQDN or one of the engine alternate FQDNs.”

Recovery

  1. Create or edit the 99-custom-sso-setup.conf engine configuration file and restart the engine:

     vi /etc/ovirt-engine/engine.conf.d/99-custom-sso-setup.conf
    
     SSO_ALTERNATE_ENGINE_FQDNS="alias1.example.com alias2.example.com"
    
     systemctl restart ovirt-engine
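
Where an interactive editor is not convenient, the setting can be generated instead. A minimal sketch; sso_alternate_fqdns is a hypothetical helper, and the tee command in the usage comment overwrites any existing content in the file.

```shell
#!/bin/bash
# Sketch: emit the SSO_ALTERNATE_ENGINE_FQDNS setting for a list of aliases.
# The aliases are joined with single spaces inside one quoted value.

sso_alternate_fqdns() {
    printf 'SSO_ALTERNATE_ENGINE_FQDNS="%s"\n' "$*"
}

# Example (replaces the file contents, then restarts the engine):
#   sso_alternate_fqdns alias1.example.com alias2.example.com |
#       sudo tee /etc/ovirt-engine/engine.conf.d/99-custom-sso-setup.conf
#   sudo systemctl restart ovirt-engine
```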
    

VM Health Check

  1. Log in to one of the DNS VMs
  2. Use the check_vm_status.sh script shown below and included in the OLVM backup/restore scripts to check the status of the VMs in /etc/hosts. Run the script a number of times using different port numbers (e.g. 22 ssh, 443 https, 1521 oracle, 3128 squid) to identify any VMs that are not responding on their expected ports.
#!/bin/bash
### https://gist.github.com/perfecto25/4581e99c95df80b12896113cc2b6d958

### This script reads in a file in /etc/hosts format <ip> <hostname>, then attempts to netcat to the host using provided port
### if no port is provided, it will attempt to connect via port 22
### if no file is provided, it will use /etc/hosts to read in IPs
### Usage: ./check_vm_status.sh <port> <file>
### Example: ./check_vm_status.sh  <- this will try scanning /etc/hosts and connect to each IP via port 22
### Example: ./check_vm_status.sh 21500 /home/user/testfile


# nctest
port=${1:-22}  # default 22
file=${2:-"/etc/hosts"}  # default /etc/hosts
RED='\033[1;31m'
GREEN='\033[1;32m'
NC='\033[0m'  # no color

## check if netcat is installed
if type nc >/dev/null 2>&1
then
    echo "netcat is installed, proceeding.."
else
    echo -e "${RED}[ERROR]${NC} netcat is not installed on this host"
    exit 1
fi


while read -r line
do
    if [[ -n $line ]] && [[ "${line}" != \#* ]]
    then
        ip=$(echo "$line" | awk '{print $1}')
        hostname=$(echo "$line" | awk '{print $2}')

        ## if ipv4
        if [[ $ip =~ ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
            echo "--------------------------------------"

            ## attempt netcat connection, timeout of 2
            if nc -z -w 2 "$ip" "$port" >/dev/null 2>&1
            then
                echo -e "${hostname}   nc ${ip} ${port} ... ${GREEN}ok${NC}"
            else
                echo -e "${hostname}   nc ${ip} ${port} ... ${RED}[FAIL]${NC}"
            fi
        fi
    fi

done < "$file"
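
Running the script once per port, as described in step 2, can be wrapped in a small loop. This is a sketch; check_ports is a hypothetical helper, and the script path is passed in rather than assumed.

```shell
#!/bin/bash
# Sketch: run a port-check script once for each port given on the command
# line, printing a header before each run so the output is easy to scan.

check_ports() {
    local script=$1
    shift
    local port
    for port in "$@"; do
        echo "== port ${port} =="
        bash "$script" "$port"
    done
}

# Example (ports from step 2: ssh, https, oracle, squid):
#   check_ports ./check_vm_status.sh 22 443 1521 3128
```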

Copyright © Dito LLC, 2023