Efficiently Update Data from OLTP to OLAP

2019-02-11 · by Hannah Vernon · in reporting, t-sql

OLTP to OLAP: Acronym Time!

Online Transaction Processing, or OLTP, serves data to clients for line-of-business applications, enabling core business functionality. Online Analytical Processing, or OLAP, consists of analyzing OLTP data with the intent of providing reporting and business intelligence to management. One of our most important tasks is providing highly up-to-date data from the OLTP system into the OLAP system. Fresh data in the OLAP system allows business management to react quickly, whereas stale data in the OLAP system prevents quick, accurate decision making.

Copy all the things from OLTP to OLAP!

The last thing you want to do to a large, busy, OLTP system is to start running reporting against it. Even though this might provide business management with the freshest data possible, it will have the unfortunate side-effect of slowing down customer interaction. That might even have a negative effect on core business functionality, such as sales via a website, etc.

Get to the point!

This post presents a method that allows you to efficiently copy rows from OLTP tables into OLAP tables without needing to compare rows to see what changes have been made.

The code below creates a set of test tables; dbo.OLTP which will be the source data, and dbo.OLAP which we’ll efficiently copy fresh data into.

IF OBJECT_ID(N'dbo.OLTP', N'U') IS NOT NULL
DROP TABLE dbo.OLTP;

IF OBJECT_ID(N'dbo.OLAP', N'U') IS NOT NULL
DROP TABLE dbo.OLAP;

IF OBJECT_ID(N'dbo.OLAP_Updates', N'U') IS NOT NULL
DROP TABLE dbo.OLAP_Updates;
GO

CREATE TABLE dbo.OLTP
(
    Source_id int NOT NULL
        PRIMARY KEY
        CLUSTERED
    , oltp_data char(100) NOT NULL
    , rv rowversion NOT NULL
) ON [PRIMARY]
WITH (DATA_COMPRESSION = PAGE);

CREATE UNIQUE NONCLUSTERED INDEX OLTP_rv
ON dbo.OLTP(rv)
WITH (
    DATA_COMPRESSION = PAGE
    , FILLFACTOR = 100
    );

IF OBJECT_ID(N'dbo.OLTP', N'U') IS NOT NULL

DROP TABLE dbo.OLTP;

IF OBJECT_ID(N'dbo.OLAP', N'U') IS NOT NULL

DROP TABLE dbo.OLAP;

IF OBJECT_ID(N'dbo.OLAP_Updates', N'U') IS NOT NULL

DROP TABLE dbo.OLAP_Updates;

CREATE TABLE dbo.OLTP

(

Source_id int NOT NULL

PRIMARY KEY

CLUSTERED

, oltp_data char(100) NOT NULL

, rv rowversion NOT NULL

) ON [PRIMARY]

WITH (DATA_COMPRESSION = PAGE);

CREATE UNIQUE NONCLUSTERED INDEX OLTP_rv

ON dbo.OLTP(rv)

WITH (

DATA_COMPRESSION = PAGE

, FILLFACTOR = 100

);

Note, the dbo.OLTP table has a column named rv, that is of type rowversion¹. This column is solely used to track changes made to rows, and allows us to efficiently seek to new rows via the unique, non-clustered, index named OLTP_rv.

This code inserts 50,000,000 rows into the OLTP table:

INSERT INTO dbo.OLTP WITH (TABLOCKX) (Source_id, oltp_data) 
SELECT TOP(50000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    , oltp_data = REPLICATE('A', 100)
FROM sys.syscolumns sc
    CROSS JOIN sys.syscolumns sc2
    CROSS JOIN sys.syscolumns sc3;

INSERT INTO dbo.OLTP WITH (TABLOCKX) (Source_id, oltp_data)

SELECT TOP(50000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

, oltp_data = REPLICATE('A', 100)

FROM sys.syscolumns sc

CROSS JOIN sys.syscolumns sc2

CROSS JOIN sys.syscolumns sc3;

Here, I create a couple of tables to use for analytics, or OLAP:

CREATE TABLE dbo.OLAP
(
    Source_id int NOT NULL
        PRIMARY KEY
        CLUSTERED
    , oltp_data char(100) NOT NULL
) ON [PRIMARY]
WITH (DATA_COMPRESSION = PAGE);

CREATE TABLE dbo.OLAP_Updates
(
    SourceSchema sysname NOT NULL
    , SourceTableName sysname NOT NULL
    , DestSchema sysname NOT NULL
    , DestTableName sysname NOT NULL
    , max_rv_captured binary(8) NOT NULL
    , CONSTRAINT OLA_Updates_pk
        PRIMARY KEY 
        CLUSTERED
        (SourceSchema, SourceTableName)
) ON [PRIMARY]
WITH (DATA_COMPRESSION = PAGE);

CREATE TABLE dbo.OLAP

(

Source_id int NOT NULL

PRIMARY KEY

CLUSTERED

, oltp_data char(100) NOT NULL

) ON [PRIMARY]

WITH (DATA_COMPRESSION = PAGE);

CREATE TABLE dbo.OLAP_Updates

(

SourceSchema sysname NOT NULL

, SourceTableName sysname NOT NULL

, DestSchema sysname NOT NULL

, DestTableName sysname NOT NULL

, max_rv_captured binary(8) NOT NULL

, CONSTRAINT OLA_Updates_pk

PRIMARY KEY

CLUSTERED

(SourceSchema, SourceTableName)

) ON [PRIMARY]

WITH (DATA_COMPRESSION = PAGE);

The 2nd table, dbo.OLAP_Updates, supports multiple source and destination tables, one per row, and is used to store the maximum row version we’ve migrated from the source OLTP table into the destination OLAP table.

These statements show the state of the OLTP and OLAP tables:

SELECT [OLTP] = OLTP.oltp_data 
    , [Rows] = COUNT(1) 
FROM dbo.OLTP
GROUP BY OLTP.oltp_data

SELECT [OLAP] = OLAP.oltp_data 
    , [Rows] = COUNT(1)
FROM dbo.OLAP
GROUP BY OLAP.oltp_data;

SELECT [OLTP] = OLTP.oltp_data

, [Rows] = COUNT(1)

FROM dbo.OLTP

GROUP BY OLTP.oltp_data

SELECT [OLAP] = OLAP.oltp_data

, [Rows] = COUNT(1)

FROM dbo.OLAP

GROUP BY OLAP.oltp_data;

Initially, the output looks like this, since we haven’t copied any rows into the OLAP table:

╔════════════════╦══════════╗
║   OLTP         ║   Rows   ║
╠════════════════╬══════════╣
║ AAAAAAAAAAA... ║ 50000000 ║
╚════════════════╩══════════╝

╔════════════════╦══════════╗
║   OLAP         ║   Rows   ║
╠════════════════╬══════════╣
║                ║          ║
╚════════════════╩══════════╝

The Fun Bit!

This code migrates new rows, as well as modified rows, in an efficient manner, using the MIN_ACTIVE_ROWVERSION()² function:

SET NOCOUNT ON;
DECLARE @MaxRvCaptured binary(8);
    
IF OBJECT_ID(N'tempdb..#cci', N'U') IS NOT NULL
DROP TABLE #cci;
IF OBJECT_ID(N'tempdb..#rows', N'U') IS NOT NULL
DROP TABLE #rows;
IF OBJECT_ID(N'tempdb..#olap_updates', N'U') IS NOT NULL
DROP TABLE #olap_updates;

-- #olap_udpates is used to facilitate an efficient UPSERT operation 
CREATE TABLE #olap_updates
(
    indicator char(1) NOT NULL
);
-- #rows stores primary keys for rows that need to be copied into the OLAP table
CREATE TABLE #rows
(
    Source_id int NOT NULL
        PRIMARY KEY
        CLUSTERED
);
-- #cci is a dummy table used to allow BATCH mode for certain operations
CREATE TABLE #cci
(
    i int
    , INDEX cci_cx CLUSTERED COLUMNSTORE 
);

BEGIN TRANSACTION;
BEGIN TRY
    --get the last used rowversion from the previous time this code ran
    SET @MaxRvCaptured = (
        SELECT o.max_rv_captured 
        FROM dbo.OLAP_Updates o 
        WHERE o.SourceSchema = N'dbo' 
            AND o.SourceTableName = N'OLTP' 
            AND o.DestSchema = N'dbo' 
            AND o.DestTableName = N'OLAP'
            );

    --initialize @MaxRvCaptured if this is the first time this code has ran.
    IF @MaxRvCaptured IS NULL
    BEGIN
        SET @MaxRvCaptured = CONVERT(binary(8), 0);
    END

    --capture the primary keys for modified rows from the source table into a #temp table
    INSERT INTO #rows WITH (TABLOCKX) (Source_id)
    SELECT OLTP.Source_id
    FROM dbo.OLTP WITH (INDEX (OLTP_rv))
        LEFT JOIN #cci ON (1 = 0) --this allows batch mode on row-store tables in 2014+
    WHERE OLTP.rv >= @MaxRvCaptured;

    --delete rows from the destination table that are going to be updated
    DELETE FROM dbo.OLAP
    FROM dbo.OLAP
        INNER JOIN #rows r ON OLAP.Source_id = r.Source_id
        LEFT JOIN #cci ON (1 = 0);

    --insert new and modified rows from the source into the destination
    INSERT INTO dbo.OLAP (Source_id, oltp_data)
    SELECT OLTP.Source_id, OLTP.oltp_data
    FROM dbo.OLTP
        INNER JOIN #rows r ON OLTP.Source_id = r.Source_id
        LEFT JOIN #cci ON (1 = 0);

    --capture the MIN_ACTIVE_ROWVERSION() 
    UPDATE dbo.OLAP_Updates
    SET max_rv_captured = MIN_ACTIVE_ROWVERSION()
    OUTPUT 'Y' INTO #olap_updates (indicator)
    WHERE SourceSchema = N'dbo' 
            AND SourceTableName = N'OLTP' 
            AND DestSchema = N'dbo' 
            AND DestTableName = N'OLAP';

    --if OLAP_Updates doesn't contain a row for the set of tables we're dealing with, insert one.
    INSERT INTO dbo.OLAP_Updates (SourceSchema, SourceTableName, DestSchema, DestTableName, max_rv_captured)
    SELECT *
    FROM (VALUES (N'dbo', N'OLTP', N'dbo', N'OLAP', MIN_ACTIVE_ROWVERSION())) 
        v(SourceSchema, SourceTableName, DestSchema, DestTableName, max_rv_captured)
    WHERE NOT EXISTS (SELECT 1 FROM #olap_updates);

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
    BEGIN
        ROLLBACK TRANSACTION
    END
    PRINT ERROR_MESSAGE()
END CATCH

SET NOCOUNT ON;

DECLARE @MaxRvCaptured binary(8);

IF OBJECT_ID(N'tempdb..#cci', N'U') IS NOT NULL

DROP TABLE #cci;

IF OBJECT_ID(N'tempdb..#rows', N'U') IS NOT NULL

DROP TABLE #rows;

IF OBJECT_ID(N'tempdb..#olap_updates', N'U') IS NOT NULL

DROP TABLE #olap_updates;

-- #olap_udpates is used to facilitate an efficient UPSERT operation

CREATE TABLE #olap_updates

(

indicator char(1) NOT NULL

);

-- #rows stores primary keys for rows that need to be copied into the OLAP table

CREATE TABLE #rows

(

Source_id int NOT NULL

PRIMARY KEY

CLUSTERED

);

-- #cci is a dummy table used to allow BATCH mode for certain operations

CREATE TABLE #cci

(

i int

, INDEX cci_cx CLUSTERED COLUMNSTORE

);

BEGIN TRANSACTION;

BEGIN TRY

--get the last used rowversion from the previous time this code ran

SET @MaxRvCaptured = (

SELECT o.max_rv_captured

FROM dbo.OLAP_Updates o

WHERE o.SourceSchema = N'dbo'

AND o.SourceTableName = N'OLTP'

AND o.DestSchema = N'dbo'

AND o.DestTableName = N'OLAP'

);

--initialize @MaxRvCaptured if this is the first time this code has ran.

IF @MaxRvCaptured IS NULL

BEGIN

SET @MaxRvCaptured = CONVERT(binary(8), 0);

END

--capture the primary keys for modified rows from the source table into a #temp table

INSERT INTO #rows WITH (TABLOCKX) (Source_id)

SELECT OLTP.Source_id

FROM dbo.OLTP WITH (INDEX (OLTP_rv))

LEFT JOIN #cci ON (1 = 0) --this allows batch mode on row-store tables in 2014+

WHERE OLTP.rv >= @MaxRvCaptured;

--delete rows from the destination table that are going to be updated

DELETE FROM dbo.OLAP

FROM dbo.OLAP

INNER JOIN #rows r ON OLAP.Source_id = r.Source_id

LEFT JOIN #cci ON (1 = 0);

--insert new and modified rows from the source into the destination

INSERT INTO dbo.OLAP (Source_id, oltp_data)

SELECT OLTP.Source_id, OLTP.oltp_data

FROM dbo.OLTP

INNER JOIN #rows r ON OLTP.Source_id = r.Source_id

LEFT JOIN #cci ON (1 = 0);

--capture the MIN_ACTIVE_ROWVERSION()

UPDATE dbo.OLAP_Updates

SET max_rv_captured = MIN_ACTIVE_ROWVERSION()

OUTPUT 'Y' INTO #olap_updates (indicator)

WHERE SourceSchema = N'dbo'

AND SourceTableName = N'OLTP'

AND DestSchema = N'dbo'

AND DestTableName = N'OLAP';

--if OLAP_Updates doesn't contain a row for the set of tables we're dealing with, insert one.

INSERT INTO dbo.OLAP_Updates (SourceSchema, SourceTableName, DestSchema, DestTableName, max_rv_captured)

SELECT *

FROM (VALUES (N'dbo', N'OLTP', N'dbo', N'OLAP', MIN_ACTIVE_ROWVERSION()))

v(SourceSchema, SourceTableName, DestSchema, DestTableName, max_rv_captured)

WHERE NOT EXISTS (SELECT 1 FROM #olap_updates);

COMMIT TRANSACTION;

END TRY

BEGIN CATCH

IF @@TRANCOUNT > 0

BEGIN

ROLLBACK TRANSACTION

END

PRINT ERROR_MESSAGE()

END CATCH

The MIN_ACTIVE_ROWVERSION() function captures the minimum rowversion value currently active in the database. Capturing that value ensures we’ll never miss rows that have been updated in the OLTP data.

After the first run of the migrate code, we see the OLAP table contains all the rows from the OLTP table; the initial run will take a fair amount of resources if your source table has lots of pre-existing data:

╔════════════════╦══════════╗
║   OLTP         ║   Rows   ║
╠════════════════╬══════════╣
║ AAAAAAAAAAA... ║ 50000000 ║
╚════════════════╩══════════╝


╔════════════════╦══════════╗
║   OLAP         ║   Rows   ║
╠════════════════╬══════════╣
║ AAAAAAAAAAA... ║ 50000000 ║
╚════════════════╩══════════╝

The OLTP table implements a rowversion column that is automatically updated whenever a row is updated or inserted. The rowversion number is unique at the database level, and increments monotonically for all transactions that take place within the context of that database. The dbo.OLTP_Updates table is used to store the minimum row version available inside the transaction used to copy data from the OLTP table into the OLAP table. Each time this code runs it captures incremental changes. This is far more efficient than comparing all the rows in both tables using a hashing function since this method doesn’t require reading any data other than the source data that is either new, or has changed.

Note the code above uses the temporary clustered columnstore index, #cci, to prompt the query processor to use batch mode for certain operations; this is only available in SQL Server 2014+.

Results Time!

In the code below, we’re updating a portion of the 50 million rows, simulating the OLTP system making changes to data as a result of client interaction. 1 in every 1000 rows are updated to B..., 1 in every 400 rows are updated to ‘C…’, then 1 in every 201 rows are updated to D.... The results shown reflect running the rowversion-based migration code after each update.

UPDATE dbo.OLTP
SET oltp_data = REPLICATE('B', 100)
WHERE Source_id % 1000 = 0;

UPDATE dbo.OLTP

SET oltp_data = REPLICATE('B', 100)

WHERE Source_id % 1000 = 0;

╔════════════════╦══════════╗
║   OLTP         ║   Rows   ║
╠════════════════╬══════════╣
║ AAAAAAAAAAA... ║ 49950000 ║
║ BBBBBBBBBBB... ║ 50000    ║
╚════════════════╩══════════╝


╔════════════════╦══════════╗
║   OLAP         ║   Rows   ║
╠════════════════╬══════════╣
║ AAAAAAAAAAA... ║ 49950000 ║
║ BBBBBBBBBBB... ║ 50000    ║
╚════════════════╩══════════╝

UPDATE dbo.OLTP
SET oltp_data = REPLICATE('C', 100)
WHERE Source_id % 400 = 0;

UPDATE dbo.OLTP

SET oltp_data = REPLICATE('C', 100)

WHERE Source_id % 400 = 0;

╔════════════════╦══════════╗
║   OLTP         ║   Rows   ║
╠════════════════╬══════════╣
║ AAAAAAAAAAA... ║ 49850000 ║
║ BBBBBBBBBBB... ║ 25000    ║
║ CCCCCCCCCCC... ║ 125000   ║
╚════════════════╩══════════╝


╔════════════════╦══════════╗
║   OLAP         ║   Rows   ║
╠════════════════╬══════════╣
║ AAAAAAAAAAA... ║ 49850000 ║
║ BBBBBBBBBBB... ║ 25000    ║
║ CCCCCCCCCCC... ║ 125000   ║
╚════════════════╩══════════╝

UPDATE dbo.OLTP
SET oltp_data = REPLICATE('D', 100)
WHERE Source_id % 201 = 0;

UPDATE dbo.OLTP

SET oltp_data = REPLICATE('D', 100)

WHERE Source_id % 201 = 0;

╔════════════════╦══════════╗
║   OLTP         ║   Rows   ║
╠════════════════╬══════════╣
║ AAAAAAAAAAA... ║ 49601989 ║
║ BBBBBBBBBBB... ║ 24876    ║
║ CCCCCCCCCCC... ║ 124379   ║
║ DDDDDDDDDDD... ║ 248756   ║
╚════════════════╩══════════╝


╔════════════════╦══════════╗
║   OLAP         ║   Rows   ║
╠════════════════╬══════════╣
║ AAAAAAAAAAA... ║ 49601989 ║
║ BBBBBBBBBBB... ║ 24876    ║
║ CCCCCCCCCCC... ║ 124379   ║
║ DDDDDDDDDDD... ║ 248756   ║
╚════════════════╩══════════╝

Consider implementing an ON DELETE trigger on the OLTP table that captures the pertinent primary key values into a third table, if your OLTP system deletes rows. Something like this should work:

IF OBJECT_ID(N'dbo.OLTP_Deletes', N'U') IS NOT NULL
DROP TABLE dbo.OLTP_Deletes;
CREATE TABLE dbo.OLTP_Deletes
(
    Source_id int NOT NULL
        CONSTRAINT OLTP_Deletes_pk
        PRIMARY KEY CLUSTERED
);
GO
CREATE TRIGGER OLTP_delete_trigger ON dbo.OLTP
AFTER DELETE
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.OLTP_Deletes (Source_id)
    SELECT Source_id
    FROM deleted d;
END;
GO

IF OBJECT_ID(N'dbo.OLTP_Deletes', N'U') IS NOT NULL

DROP TABLE dbo.OLTP_Deletes;

CREATE TABLE dbo.OLTP_Deletes

(

Source_id int NOT NULL

CONSTRAINT OLTP_Deletes_pk

PRIMARY KEY CLUSTERED

);

CREATE TRIGGER OLTP_delete_trigger ON dbo.OLTP

AFTER DELETE

BEGIN

SET NOCOUNT ON;

INSERT INTO dbo.OLTP_Deletes (Source_id)

SELECT Source_id

FROM deleted d;

END;

This code will remove the deleted rows from the dbo.OLAP table that have been deleted in the dbo.OLTP table:

BEGIN TRANSACTION
BEGIN TRY
    DELETE FROM dbo.OLAP
    FROM dbo.OLAP o
        INNER JOIN dbo.OLTP_Deletes d ON o.Source_id = d.Source_id
        LEFT JOIN #cci ON (1=0);
    TRUNCATE TABLE dbo.OLTP_Deletes;
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    PRINT ERROR_MESSAGE();
END CATCH

BEGIN TRANSACTION

BEGIN TRY

DELETE FROM dbo.OLAP

FROM dbo.OLAP o

INNER JOIN dbo.OLTP_Deletes d ON o.Source_id = d.Source_id

LEFT JOIN #cci ON (1=0);

TRUNCATE TABLE dbo.OLTP_Deletes;

COMMIT TRANSACTION;

END TRY

BEGIN CATCH

IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;

PRINT ERROR_MESSAGE();

END CATCH

Gotchas!

Ensure there is enough CPU and memory in the SQL Server instance to support both the OLTP and the OLAP load. Running both loads in the same SQL Server instance can be difficult. If resource contention on the OLTP system is a primary concern, you may want to use a linked server on the OLAP system to separate resources into two separate SQL Servers. Run the migrate code above on the OLAP system, while monitoring for resource consumption on the OLTP system so you have an understanding of the potential impact.

This method is presented as an efficient alternative to copying rows based on the hash of columns in the source table against a hash of the columns in the destination table. For extreme systems, it may be required to use SSIS or some other bulk-load method of migrating the data from the OLTP system into a staging system to perform the actual rowversion comparison work. If you do that, you’ll need to ensure the destination table in the staging system uses a binary(8) data type for the rv column to ensure the rowversion values from the OLTP system are preserved in staging.

Wrap-up!

Instead of comparing rows column-by-column, use the rowversion data type to efficiently identify rows that have been modified since the last time you looked at them.

Let me know if you have any comments or questions about this code!

¹rowversion T-SQL data type
²MIN_ACTIVE_ROWVERSION() T-SQL function

Tags: OLAP, OLTP

Efficiently Update Data from OLTP to OLAP

OLTP to OLAP: Acronym Time!

Get to the point!

The Fun Bit!

Results Time!

Gotchas!

Wrap-up!

Related

One Response

Leave a ReplyCancel reply

Search

Categories

Pages

Meta