Efficiently Update Data from OLTP to OLAP

OLTP to OLAP: Acronym Time!

Online Transaction Processing, or OLTP, serves data to clients for line-of-business applications, enabling core business functionality. Online Analytical Processing, or OLAP, consists of analyzing OLTP data with the intent of providing reporting and business intelligence to management. One of our most important tasks is providing highly up-to-date data from the OLTP system into the OLAP system. Fresh data in the OLAP system allows business management to react quickly, whereas stale data in the OLAP system prevents quick, accurate decision making.

copying data from oltp to olap

      Copy all the things from OLTP to OLAP!

The last thing you want to do to a large, busy, OLTP system is to start running reporting against it. Even though this might provide business management with the freshest data possible, it will have the unfortunate side-effect of slowing down customer interaction. That might even have a negative effect on core business functionality, such as sales via a website, etc.

Get to the point!

This post presents a method that allows you to efficiently copy rows from OLTP tables into OLAP tables without needing to compare rows to see what changes have been made.

The code below creates a set of test tables; dbo.OLTP which will be the source data, and dbo.OLAP which we’ll efficiently copy fresh data into.

IF OBJECT_ID(N'dbo.OLTP', N'U') IS NOT NULL
DROP TABLE dbo.OLTP;

IF OBJECT_ID(N'dbo.OLAP', N'U') IS NOT NULL
DROP TABLE dbo.OLAP;

IF OBJECT_ID(N'dbo.OLAP_Updates', N'U') IS NOT NULL
DROP TABLE dbo.OLAP_Updates;
GO

CREATE TABLE dbo.OLTP
(
    Source_id int NOT NULL
        PRIMARY KEY
        CLUSTERED
    , oltp_data char(100) NOT NULL
    , rv rowversion NOT NULL
) ON [PRIMARY]
WITH (DATA_COMPRESSION = PAGE);

CREATE UNIQUE NONCLUSTERED INDEX OLTP_rv
ON dbo.OLTP(rv)
WITH (
    DATA_COMPRESSION = PAGE
    , FILLFACTOR = 100
    );

Note, the dbo.OLTP table has a column named rv, that is of type rowversion1. This column is solely used to track changes made to rows, and allows us to efficiently seek to new rows via the unique, non-clustered, index named OLTP_rv.

This code inserts 50,000,000 rows into the OLTP table:

INSERT INTO dbo.OLTP WITH (TABLOCKX) (Source_id, oltp_data) 
SELECT TOP(50000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    , oltp_data = REPLICATE('A', 100)
FROM sys.syscolumns sc
    CROSS JOIN sys.syscolumns sc2
    CROSS JOIN sys.syscolumns sc3;

Here, I create a couple of tables to use for analytics, or OLAP:

CREATE TABLE dbo.OLAP
(
    Source_id int NOT NULL
        PRIMARY KEY
        CLUSTERED
    , oltp_data char(100) NOT NULL
) ON [PRIMARY]
WITH (DATA_COMPRESSION = PAGE);

CREATE TABLE dbo.OLAP_Updates
(
    SourceSchema sysname NOT NULL
    , SourceTableName sysname NOT NULL
    , DestSchema sysname NOT NULL
    , DestTableName sysname NOT NULL
    , max_rv_captured binary(8) NOT NULL
    , CONSTRAINT OLA_Updates_pk
        PRIMARY KEY 
        CLUSTERED
        (SourceSchema, SourceTableName)
) ON [PRIMARY]
WITH (DATA_COMPRESSION = PAGE);

The 2nd table, dbo.OLAP_Updates, supports multiple source and destination tables, one per row, and is used to store the maximum row version we’ve migrated from the source OLTP table into the destination OLAP table.

These statements show the state of the OLTP and OLAP tables:

SELECT [OLTP] = OLTP.oltp_data 
    , [Rows] = COUNT(1) 
FROM dbo.OLTP
GROUP BY OLTP.oltp_data

SELECT [OLAP] = OLAP.oltp_data 
    , [Rows] = COUNT(1)
FROM dbo.OLAP
GROUP BY OLAP.oltp_data;

Initially, the output looks like this, since we haven’t copied any rows into the OLAP table:

╔════════════════╦══════════╗
║   OLTP         ║   Rows   ║
╠════════════════╬══════════╣
║ AAAAAAAAAAA... ║ 50000000 ║
╚════════════════╩══════════╝

╔════════════════╦══════════╗
║   OLAP         ║   Rows   ║
╠════════════════╬══════════╣
║                ║          ║
╚════════════════╩══════════╝

The Fun Bit!

This code migrates new rows, as well as modified rows, in an efficient manner, using the MIN_ACTIVE_ROWVERSION()2 function:

SET NOCOUNT ON;
DECLARE @MaxRvCaptured binary(8);
    
IF OBJECT_ID(N'tempdb..#cci', N'U') IS NOT NULL
DROP TABLE #cci;
IF OBJECT_ID(N'tempdb..#rows', N'U') IS NOT NULL
DROP TABLE #rows;
IF OBJECT_ID(N'tempdb..#olap_updates', N'U') IS NOT NULL
DROP TABLE #olap_updates;

-- #olap_udpates is used to facilitate an efficient UPSERT operation 
CREATE TABLE #olap_updates
(
    indicator char(1) NOT NULL
);
-- #rows stores primary keys for rows that need to be copied into the OLAP table
CREATE TABLE #rows
(
    Source_id int NOT NULL
        PRIMARY KEY
        CLUSTERED
);
-- #cci is a dummy table used to allow BATCH mode for certain operations
CREATE TABLE #cci
(
    i int
    , INDEX cci_cx CLUSTERED COLUMNSTORE 
);

BEGIN TRANSACTION;
BEGIN TRY
    --get the last used rowversion from the previous time this code ran
    SET @MaxRvCaptured = (
        SELECT o.max_rv_captured 
        FROM dbo.OLAP_Updates o 
        WHERE o.SourceSchema = N'dbo' 
            AND o.SourceTableName = N'OLTP' 
            AND o.DestSchema = N'dbo' 
            AND o.DestTableName = N'OLAP'
            );

    --initialize @MaxRvCaptured if this is the first time this code has ran.
    IF @MaxRvCaptured IS NULL
    BEGIN
        SET @MaxRvCaptured = CONVERT(binary(8), 0);
    END

    --capture the primary keys for modified rows from the source table into a #temp table
    INSERT INTO #rows WITH (TABLOCKX) (Source_id)
    SELECT OLTP.Source_id
    FROM dbo.OLTP WITH (INDEX (OLTP_rv))
        LEFT JOIN #cci ON (1 = 0) --this allows batch mode on row-store tables in 2014+
    WHERE OLTP.rv >= @MaxRvCaptured;

    --delete rows from the destination table that are going to be updated
    DELETE FROM dbo.OLAP
    FROM dbo.OLAP
        INNER JOIN #rows r ON OLAP.Source_id = r.Source_id
        LEFT JOIN #cci ON (1 = 0);

    --insert new and modified rows from the source into the destination
    INSERT INTO dbo.OLAP (Source_id, oltp_data)
    SELECT OLTP.Source_id, OLTP.oltp_data
    FROM dbo.OLTP
        INNER JOIN #rows r ON OLTP.Source_id = r.Source_id
        LEFT JOIN #cci ON (1 = 0);

    --capture the MIN_ACTIVE_ROWVERSION() 
    UPDATE dbo.OLAP_Updates
    SET max_rv_captured = MIN_ACTIVE_ROWVERSION()
    OUTPUT 'Y' INTO #olap_updates (indicator)
    WHERE SourceSchema = N'dbo' 
            AND SourceTableName = N'OLTP' 
            AND DestSchema = N'dbo' 
            AND DestTableName = N'OLAP';

    --if OLAP_Updates doesn't contain a row for the set of tables we're dealing with, insert one.
    INSERT INTO dbo.OLAP_Updates (SourceSchema, SourceTableName, DestSchema, DestTableName, max_rv_captured)
    SELECT *
    FROM (VALUES (N'dbo', N'OLTP', N'dbo', N'OLAP', MIN_ACTIVE_ROWVERSION())) 
        v(SourceSchema, SourceTableName, DestSchema, DestTableName, max_rv_captured)
    WHERE NOT EXISTS (SELECT 1 FROM #olap_updates);

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
    BEGIN
        ROLLBACK TRANSACTION
    END
    PRINT ERROR_MESSAGE()
END CATCH

The MIN_ACTIVE_ROWVERSION() function captures the minimum rowversion value currently active in the database. Capturing that value ensures we’ll never miss rows that have been updated in the OLTP data.

After the first run of the migrate code, we see the OLAP table contains all the rows from the OLTP table; the initial run will take a fair amount of resources if your source table has lots of pre-existing data:

╔════════════════╦══════════╗
║   OLTP         ║   Rows   ║
╠════════════════╬══════════╣
║ AAAAAAAAAAA... ║ 50000000 ║
╚════════════════╩══════════╝


╔════════════════╦══════════╗
║   OLAP         ║   Rows   ║
╠════════════════╬══════════╣
║ AAAAAAAAAAA... ║ 50000000 ║
╚════════════════╩══════════╝

The OLTP table implements a rowversion column that is automatically updated whenever a row is updated or inserted. The rowversion number is unique at the database level, and increments monotonically for all transactions that take place within the context of that database. The dbo.OLTP_Updates table is used to store the minimum row version available inside the transaction used to copy data from the OLTP table into the OLAP table. Each time this code runs it captures incremental changes. This is far more efficient than comparing all the rows in both tables using a hashing function since this method doesn’t require reading any data other than the source data that is either new, or has changed.

Note the code above uses the temporary clustered columnstore index, #cci, to prompt the query processor to use batch mode for certain operations; this is only available in SQL Server 2014+.

Results Time!

In the code below, we’re updating a portion of the 50 million rows, simulating the OLTP system making changes to data as a result of client interaction. 1 in every 1000 rows are updated to B..., 1 in every 400 rows are updated to ‘C…’, then 1 in every 201 rows are updated to D.... The results shown reflect running the rowversion-based migration code after each update.

UPDATE dbo.OLTP
SET oltp_data = REPLICATE('B', 100)
WHERE Source_id % 1000 = 0;
╔════════════════╦══════════╗
║   OLTP         ║   Rows   ║
╠════════════════╬══════════╣
║ AAAAAAAAAAA... ║ 49950000 ║
║ BBBBBBBBBBB... ║ 50000    ║
╚════════════════╩══════════╝


╔════════════════╦══════════╗
║   OLAP         ║   Rows   ║
╠════════════════╬══════════╣
║ AAAAAAAAAAA... ║ 49950000 ║
║ BBBBBBBBBBB... ║ 50000    ║
╚════════════════╩══════════╝
UPDATE dbo.OLTP
SET oltp_data = REPLICATE('C', 100)
WHERE Source_id % 400 = 0;
╔════════════════╦══════════╗
║   OLTP         ║   Rows   ║
╠════════════════╬══════════╣
║ AAAAAAAAAAA... ║ 49850000 ║
║ BBBBBBBBBBB... ║ 25000    ║
║ CCCCCCCCCCC... ║ 125000   ║
╚════════════════╩══════════╝


╔════════════════╦══════════╗
║   OLAP         ║   Rows   ║
╠════════════════╬══════════╣
║ AAAAAAAAAAA... ║ 49850000 ║
║ BBBBBBBBBBB... ║ 25000    ║
║ CCCCCCCCCCC... ║ 125000   ║
╚════════════════╩══════════╝
UPDATE dbo.OLTP
SET oltp_data = REPLICATE('D', 100)
WHERE Source_id % 201 = 0;
╔════════════════╦══════════╗
║   OLTP         ║   Rows   ║
╠════════════════╬══════════╣
║ AAAAAAAAAAA... ║ 49601989 ║
║ BBBBBBBBBBB... ║ 24876    ║
║ CCCCCCCCCCC... ║ 124379   ║
║ DDDDDDDDDDD... ║ 248756   ║
╚════════════════╩══════════╝


╔════════════════╦══════════╗
║   OLAP         ║   Rows   ║
╠════════════════╬══════════╣
║ AAAAAAAAAAA... ║ 49601989 ║
║ BBBBBBBBBBB... ║ 24876    ║
║ CCCCCCCCCCC... ║ 124379   ║
║ DDDDDDDDDDD... ║ 248756   ║
╚════════════════╩══════════╝

Consider implementing an ON DELETE trigger on the OLTP table that captures the pertinent primary key values into a third table, if your OLTP system deletes rows. Something like this should work:

IF OBJECT_ID(N'dbo.OLTP_Deletes', N'U') IS NOT NULL
DROP TABLE dbo.OLTP_Deletes;
CREATE TABLE dbo.OLTP_Deletes
(
    Source_id int NOT NULL
        CONSTRAINT OLTP_Deletes_pk
        PRIMARY KEY CLUSTERED
);
GO
CREATE TRIGGER OLTP_delete_trigger ON dbo.OLTP
AFTER DELETE
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.OLTP_Deletes (Source_id)
    SELECT Source_id
    FROM deleted d;
END;
GO

This code will remove the deleted rows from the dbo.OLAP table that have been deleted in the dbo.OLTP table:

BEGIN TRANSACTION
BEGIN TRY
    DELETE FROM dbo.OLAP
    FROM dbo.OLAP o
        INNER JOIN dbo.OLTP_Deletes d ON o.Source_id = d.Source_id
        LEFT JOIN #cci ON (1=0);
    TRUNCATE TABLE dbo.OLTP_Deletes;
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    PRINT ERROR_MESSAGE();
END CATCH

Gotchas!

Ensure there is enough CPU and memory in the SQL Server instance to support both the OLTP and the OLAP load. Running both loads in the same SQL Server instance can be difficult. If resource contention on the OLTP system is a primary concern, you may want to use a linked server on the OLAP system to separate resources into two separate SQL Servers. Run the migrate code above on the OLAP system, while monitoring for resource consumption on the OLTP system so you have an understanding of the potential impact.

This method is presented as an efficient alternative to copying rows based on the hash of columns in the source table against a hash of the columns in the destination table. For extreme systems, it may be required to use SSIS or some other bulk-load method of migrating the data from the OLTP system into a staging system to perform the actual rowversion comparison work. If you do that, you’ll need to ensure the destination table in the staging system uses a binary(8) data type for the rv column to ensure the rowversion values from the OLTP system are preserved in staging.

Wrap-up!

Instead of comparing rows column-by-column, use the rowversion data type to efficiently identify rows that have been modified since the last time you looked at them.

Let me know if you have any comments or questions about this code!


1rowversion T-SQL data type
2MIN_ACTIVE_ROWVERSION() T-SQL function