Migrating data from Oracle to BigQuery

Peter Goldthorp, Dito, April 2022

If all you have is a hammer, everything looks like a nail. The same is often true in software development. Organizations adapt their infrastructure to meet new requirements. A company running Oracle for transaction processing chooses Oracle to support their data warehouse. The warehouse evolves over a number of years into a data lake still running in an Oracle relational database. The result is an expensive environment that is not optimized for the task.

Selecting a database to meet your application requirements

Oracle vs BigQuery

Oracle is a general purpose relational database. The same software is used to support OLTP and OLAP applications. It does both reasonably well, but as with any general purpose tool, compromises have to be made. Oracle works best with structured data conforming to a relational schema. It runs on dedicated hardware that needs to be sized for peak load, and scaling up is expensive: Oracle license fees are calculated based on the CPU capacity of the computers the software is installed on, and Oracle runs best on enterprise-grade hardware with access to NAS or SAN storage. This equipment isn’t cheap.

BigQuery is Google’s data warehouse offering: a fully managed, serverless data warehouse that enables scalable analysis over petabytes of data. It is a Platform as a Service (PaaS) that supports querying using ANSI SQL and has built-in machine learning capabilities. BigQuery was designed as a data warehouse from the start; no compromises were made to meet OLTP or other requirements. It handles structured and semi-structured data equally well.

BigQuery’s architecture is massively scalable. For example, it can allocate 300 disks, 3000 CPU cores, and 300 Gigabits of switching capacity on demand for the duration of a query. It does this by separating storage from compute capacity. Data is stored on a low cost, highly secure, distributed file system. Compute resources are added on-demand as needed and returned to a pool when no longer required. Queries that might take hours to complete in an Oracle RAC database can be completed in minutes using Google BigQuery.
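
To illustrate what this looks like in practice, the query below aggregates one of Google’s public sample tables (the Wikipedia revision sample, chosen here only as an example). There is no cluster to size or provision; BigQuery allocates the compute needed for the scan and releases it when the query finishes.

    -- A minimal sketch: aggregate a multi-gigabyte public sample table.
    -- No capacity planning is required; compute is allocated on demand
    -- for the duration of the query.
    SELECT wp_namespace,
           COUNT(*)            AS revisions,
           AVG(num_characters) AS avg_characters
    FROM   `bigquery-public-data.samples.wikipedia`
    GROUP  BY wp_namespace
    ORDER  BY revisions DESC;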

In addition to this, customers save on the up-front license fees and administrative overhead associated with dedicated hardware.

BMS to GCP

Migrating to Google Bare Metal Solution (BMS) allows you to reduce your dependency on Oracle by leveraging Google ecosystem services like BigQuery and Cloud SQL for database operations. Each BMS environment has a low-latency, high-speed connection to a GCP data center, which provides easy access to GCP data services like BigQuery.

Dito can help integrate Oracle databases running on BMS with our BigQuery inaBOX platform and open source offerings from Visulate. BigQuery inaBOX is a managed service offering from Dito which provides a security hardened BigQuery implementation. Visulate provides tools to catalog and query Oracle databases.

Migrating an Oracle Data Warehouse to BigQuery

Migration to BigQuery is a multi-stage project. A typical project might include the following stages.

Assess

  1. Review the Oracle data warehouse - Use Visulate for Oracle to review the structure and content of the current data warehouse (see the example dictionary query after this list).
    • Examine the database structure.
    • Identify the fact and dimension tables
    • Review cardinality and data volumes
    • Examine the logic and SQL in any stored procedures
    • Look for database links and materialized views that connect to other databases
  2. Review existing ETL - Data warehouses are typically assembled using data from other systems. Data is extracted from those systems, transformed in structure and then loaded into the warehouse. This process is referred to as ETL (extract, transform, load). Examine the current ETL procedures (see the job inventory sketch after this list) to:
    • Document existing data sources
    • Identify extraction method (full vs incremental) and transformation logic
    • Estimate data volume and update frequency
    • Estimate the effort to convert to BigQuery
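
For step 1, a query against the Oracle data dictionary is a quick way to size the warehouse before (or alongside) a Visulate review. The sketch below lists tables by size and row count; large, heavily populated tables are usually the fact tables, while the smaller tables they reference are dimensions. The DW_OWNER schema name is a placeholder, and NUM_ROWS is only as accurate as the optimizer statistics.

    -- Sketch: list warehouse tables with row counts and on-disk size.
    -- 'DW_OWNER' is a placeholder for the warehouse schema.
    SELECT t.owner,
           t.table_name,
           t.num_rows,
           ROUND(SUM(s.bytes) / 1024 / 1024 / 1024, 2) AS size_gb
    FROM   dba_tables t
           JOIN dba_segments s
             ON  s.owner        = t.owner
             AND s.segment_name = t.table_name
    WHERE  t.owner = 'DW_OWNER'
    GROUP  BY t.owner, t.table_name, t.num_rows
    ORDER  BY size_gb DESC;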
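
For step 2, if the existing loads run through the Oracle scheduler, an inventory of those jobs helps document data sources and update frequency. This is only a sketch; ETL driven by tools outside the database will not appear here.

    -- Sketch: inventory scheduled load jobs to estimate update frequency.
    SELECT owner,
           job_name,
           repeat_interval,
           last_start_date,
           next_run_date,
           enabled
    FROM   dba_scheduler_jobs
    ORDER  BY owner, job_name;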

Plan

  1. Identify requirements - Talk to the business users. Find out how they are using the system and identify existing limitations, paying particular attention to capacity-related ones. BigQuery’s serverless architecture is massively scalable, which allows it to address problems that couldn’t be attempted with legacy systems. Develop a “MoSCoW” list for the migrated warehouse:
    • Must have - absolute non-negotiable requirements
    • Should have - highly desirable features
    • Could have - optional features and design compromises
    • Won’t have - items explicitly excluded from this phase of the migration
  2. Identify machine learning opportunities - Data warehouses are often used to support decision making. Given the correct training data, BigQuery’s machine learning capabilities can be leveraged to provide recommendations (see the sketch after this list). The training data consists of:
    • The inputs that a decision maker uses
    • Metrics used to measure success
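
As an illustration of the kind of model that can be trained directly in the warehouse, the BigQuery ML statement below builds a logistic regression classifier. The dataset, table and column names are hypothetical placeholders: the input columns stand in for whatever the decision maker actually uses, and the label for the metric that measures success.

    -- Sketch: train a classifier in BigQuery ML on historical outcomes.
    -- `mydataset.order_history` and its columns are hypothetical.
    CREATE OR REPLACE MODEL `mydataset.reorder_model`
    OPTIONS (
      model_type       = 'logistic_reg',
      input_label_cols = ['reordered']   -- the success metric to predict
    ) AS
    SELECT
      customer_segment,                  -- inputs the decision maker uses
      days_since_last_order,
      total_prior_orders,
      reordered                          -- label: 1 = reordered, 0 = did not
    FROM `mydataset.order_history`;

    -- Recommendations for current rows can then be generated with:
    -- SELECT * FROM ML.PREDICT(MODEL `mydataset.reorder_model`,
    --                          TABLE `mydataset.current_customers`);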

Deploy

  1. Design the new data model - Review and apply Google’s guidelines for BigQuery schema design (a denormalization sketch follows this list)
    • Start with the existing model identified in step 1.
    • Denormalize dimension tables larger than 10GB using nested and repeated fields
    • Keep dimension tables smaller than 10GB normalized, unless the table rarely undergoes UPDATE and DELETE operations.
    • Add structures to support new requirements
  2. Develop ETL/ELT procedures - Review the current ETL procedures and identify any that can be adapted to populate BigQuery instead of Oracle. Develop new procedures for the remainder. Consider:
    • Full vs Incremental - Some data is ingested in its entirety with each data load. For example, crime statistics and census data updates are typically self contained within a time period. Other sources require incremental updates. The system must track changes over time and load only the delta. This process is referred to as change data capture.
    • Change Data Capture (CDC) can be application or infrastructure driven. An application-based solution might run SQL queries against the source data to identify changes since the last synchronization. An infrastructure-based solution performs an initial replication of the source tables and then mines the archive log files for ongoing changes. Visulate CSV file generation can be used to implement an application-based solution; Fivetran can be used to perform infrastructure CDC. Both approaches are compatible with Dito BigQuery inaBOX. Fivetran can be configured to write directly to BigQuery, or to Cloud Storage if data transformations are required (a sample delta-merge statement follows this list).
    • ETL vs ELT - The data pipelines that populate data warehouses usually follow an ETL pattern. Data is transformed before loading. Pipelines for data lakes often forgo this transformation. Data is replicated in its raw state and transformed as needed.
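
To make the denormalization guideline in step 1 concrete, the sketch below folds a large line-item table into its parent orders table as a nested, repeated field instead of a join. All table and column names are hypothetical.

    -- Sketch: replace a parent/child join with a repeated STRUCT column.
    CREATE TABLE mydataset.orders (
      order_id    STRING NOT NULL,
      customer_id STRING,
      order_date  DATE,
      line_items  ARRAY<STRUCT<
        product_id STRING,
        quantity   INT64,
        unit_price NUMERIC
      >>
    );

Queries then read the line items with UNNEST(line_items) rather than joining a separate dimension table.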
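
For the incremental loads in step 2, an application-driven CDC pipeline typically lands each extracted delta in a staging table (for example from Visulate CSV files or Fivetran’s Cloud Storage output) and then merges it into the warehouse table. A minimal sketch, with hypothetical table and column names (op_type marks deletions in the delta):

    -- Sketch: apply a change-data-capture delta to a warehouse table.
    MERGE mydataset.orders_current AS target
    USING mydataset.orders_delta   AS source
    ON target.order_id = source.order_id
    WHEN MATCHED AND source.op_type = 'D' THEN
      DELETE
    WHEN MATCHED THEN
      UPDATE SET customer_id = source.customer_id,
                 order_date  = source.order_date
    WHEN NOT MATCHED AND source.op_type != 'D' THEN
      INSERT (order_id, customer_id, order_date)
      VALUES (source.order_id, source.customer_id, source.order_date);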

Refine

  1. Migrate and Test - Migrate reports and develop new functionality based on the business requirements
    • Review the query costs
    • Limit the number of columns each query selects to reduce the bytes scanned and therefore the cost
    • Refine the data model using partitioning and clustering to optimize expensive queries (see the sketch after this list)
    • Run the old and new systems in parallel and compare results. Investigate any discrepancies.
  2. Decommission the Oracle data warehouse
    • Export
    • Shut down
    • Delete
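
As an example of the partitioning and clustering work in step 1, a large fact table (hypothetical names again) can be rebuilt so that date-bounded queries scan only the relevant partitions, with the clustered columns pruning the data further and reducing the bytes billed:

    -- Sketch: partition by the date most queries filter on and cluster by
    -- the columns most often used in WHERE clauses and joins.
    CREATE TABLE mydataset.orders_partitioned
    PARTITION BY order_date
    CLUSTER BY customer_id, product_id AS
    SELECT order_id, customer_id, product_id, order_date, order_total
    FROM   mydataset.orders_raw;

    -- A date-bounded query that selects only the columns it needs scans a
    -- fraction of the table, which is what keeps the per-query cost down.
    SELECT customer_id,
           SUM(order_total) AS total_spend
    FROM   mydataset.orders_partitioned
    WHERE  order_date >= DATE '2022-01-01'
    GROUP  BY customer_id;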

Copyright © Dito LLC, 2023