Archive Data by Date
Tables with a large amount of data can become unwieldy. As a result, it’s not uncommon to archive data by date into an archive table that contains older rows we care about, but don’t need for day-to-day operations. Typically, we might run a daily job to move rows from the main table into the archive table. However, the first time you run the archive job, you may need to move many millions or billions of rows. If you move that many rows in a single batch, the SQL Server transaction log may grow out of control, causing a variety of problems.
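For contrast, the naive approach is a single statement like the sketch below (it uses the dbo.Transactions demo table created later in this post); every deleted row is logged inside one transaction, which is exactly what causes the runaway log growth:

-- What to avoid on a large table: one giant, fully-logged transaction.
-- The log cannot be truncated past this open transaction until it commits.
DELETE FROM dbo.Transactions
WHERE TransactionDate < DATEADD(DAY, -180, GETDATE());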
The code below shows how to efficiently archive rows based on a datetime
column. The archival process moves rows, in small batches, from the main table to the archive table. If the database uses the SIMPLE recovery model, we run a CHECKPOINT between batches to keep the transaction log from growing out of control. If the database uses the FULL or BULK_LOGGED recovery model, we instead take a log backup to the NUL device and CHECKPOINT the database, which likewise keeps log growth in check. Be aware that those log backups taken to the NUL device effectively break the log backup chain. Once this script has archived all the desired rows, you should immediately take a full database backup, plus a log backup to a permanent location, for disaster recovery purposes.
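As a sketch of that post-archive step, the sequence below takes a full backup followed by a log backup to re-establish the chain. The database name and file paths are placeholders; substitute your own:

-- Hypothetical names and paths: adapt to your environment.
BACKUP DATABASE [YourDatabase]
    TO DISK = N'X:\Backups\YourDatabase_full.bak'
    WITH INIT, CHECKSUM, STATS = 10;

BACKUP LOG [YourDatabase]
    TO DISK = N'X:\Backups\YourDatabase_log.trn'
    WITH INIT, CHECKSUM, STATS = 10;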
The Setup
This first bit of code creates the dbo.Transactions
and dbo.Transactions_Archived
tables as part of a minimal, complete, and verifiable example. The dbo.Transactions
table is populated with 129,600 rows: one row per minute for 90 days, starting 2019-01-01:
USE tempdb;
SET NOCOUNT ON;

DROP TABLE IF EXISTS dbo.Transactions;
DROP TABLE IF EXISTS dbo.Transactions_Archived;

CREATE TABLE dbo.Transactions
(
    TransactionID int IDENTITY(1,1) NOT NULL
        CONSTRAINT Transactions_pk
        PRIMARY KEY CLUSTERED
    , TransactionDate datetime NOT NULL
        INDEX Transactions_TransactionDate NONCLUSTERED
    , SomeData char(100) NOT NULL
);

-- Populate one row per minute for 90 days (129,600 rows), starting 2019-01-01.
INSERT INTO dbo.Transactions (TransactionDate, SomeData)
SELECT TOP(129600)
    DATEADD(MINUTE, ROW_NUMBER() OVER (ORDER BY (SELECT (NULL))) - 1, '2019-01-01 00:00:00')
    , REPLICATE('A', 100)
FROM sys.syscolumns sc1
    CROSS JOIN sys.syscolumns sc2;

CREATE TABLE dbo.Transactions_Archived
(
    TransactionID int NOT NULL
        CONSTRAINT Transactions_Archived_pk
        PRIMARY KEY CLUSTERED
    , TransactionDate datetime NOT NULL
        INDEX Transactions_Archived_TransactionDate NONCLUSTERED
    , SomeData char(100) NOT NULL  -- matches the main table's column width
);
The Archive Script
Now that we have some data in the dbo.Transactions
table, let's archive some of it! We'll archive all data older than 180 days into the dbo.Transactions_Archived
table. The code uses DELETE
with the OUTPUT ... INTO ...
clause to ensure rows that are deleted from the main table are inserted into the archive table in a single, efficient operation.
The code executes two nested loops: the outer loop advances one day at a time, and the inner loop moves one batch-sized set of rows at a time. Progress messages are displayed as it runs. The code can be stopped and restarted, and will automatically continue where it left off, allowing you to run it over the course of several outage windows if required.
DECLARE @SimpleRecovery bit;
DECLARE @StartDate datetime;
DECLARE @EndDate datetime;
DECLARE @BatchSize bigint;
DECLARE @RowsAffected bigint;
DECLARE @Msg nvarchar(1000);
DECLARE @StartRow bigint;
DECLARE @EndRow bigint;
DECLARE @LogUsedPercent real;

-- Detect whether the current database uses the SIMPLE recovery model.
SET @SimpleRecovery = COALESCE(
    (
        SELECT CONVERT(bit, 1)
        FROM sys.databases d
        WHERE d.database_id = DB_ID()
            AND d.recovery_model_desc = N'SIMPLE'
    )
    , CONVERT(bit, 0));

SET @BatchSize = 100;

-- Resume where a prior run left off: start at the oldest remaining row.
SET @StartDate = COALESCE(
    (
        SELECT MIN(t.TransactionDate)
        FROM dbo.Transactions t
    )
    , '1900-01-01 00:00:00');
SET @EndDate = DATEADD(DAY, 1, @StartDate);
SET @RowsAffected = 1;

-- Outer loop: one day at a time, up to the 180-day cutoff. Also stops
-- early if the previous day's final batch moved zero rows.
WHILE @EndDate < DATEADD(DAY, -180, DATEADD(DAY, DATEDIFF(DAY, 0, GETDATE()), 0))
    AND @RowsAffected > 0
BEGIN
    SET @StartRow = 1;
    SET @EndRow = 0;
    SET @RowsAffected = @BatchSize;

    -- Inner loop: move @BatchSize rows at a time for the current day.
    WHILE @RowsAffected = @BatchSize
    BEGIN
        -- Delete from the main table and insert the deleted rows into
        -- the archive table in a single operation.
        DELETE TOP(@BatchSize) t
        OUTPUT deleted.TransactionID
            , deleted.TransactionDate
            , deleted.SomeData
        INTO dbo.Transactions_Archived
            (
                TransactionID
                , TransactionDate
                , SomeData
            )
        FROM dbo.Transactions t
        WHERE t.TransactionDate >= @StartDate
            AND t.TransactionDate < @EndDate;

        SET @RowsAffected = ROWCOUNT_BIG();
        SET @EndRow += @RowsAffected;

        SELECT @LogUsedPercent = su.used_log_space_in_percent
        FROM sys.dm_db_log_space_usage su;

        -- Report progress immediately via a severity-0 message.
        SET @Msg = LEFT(N'Archived rows ' + CONVERT(nvarchar(1000), @StartRow)
                + N' to ' + CONVERT(nvarchar(1000), @EndRow)
                + REPLICATE(' ', 30), 30)
            + N', starting at ' + CONVERT(nvarchar(30), @StartDate, 120)
            + N' and earlier than ' + CONVERT(nvarchar(30), @EndDate, 120)
            + N'. Using ' + CONVERT(nvarchar(100), CONVERT(decimal(10,2), @LogUsedPercent))
            + N' percent of the log file.';
        RAISERROR (@Msg, 0, 1) WITH NOWAIT;

        SET @StartRow += @RowsAffected;

        -- Keep the transaction log in check between batches.
        IF @SimpleRecovery = 1
        BEGIN
            CHECKPOINT;
        END
        ELSE
        BEGIN
            CHECKPOINT;
            DECLARE @cmd nvarchar(max);
            SET @cmd = N'BACKUP LOG ' + QUOTENAME(DB_NAME()) + N' TO DISK = N''NUL:'';';
            EXEC sys.sp_executesql @cmd;
        END
    END

    -- Advance to the next day.
    SET @StartDate = @EndDate;
    SET @EndDate = DATEADD(DAY, 1, @StartDate);
END
Results
The following code summarizes the contents of each table:
SELECT TableName = N'dbo.Transactions_Archived'
    , StartDate = MIN(ta.TransactionDate)
    , EndDate = MAX(ta.TransactionDate)
    , NumRows = COUNT(1)
FROM dbo.Transactions_Archived ta;

SELECT TableName = N'dbo.Transactions'
    , StartDate = MIN(t.TransactionDate)
    , EndDate = MAX(t.TransactionDate)
    , NumRows = COUNT(1)
FROM dbo.Transactions t;
You can control the number of rows archived per batch by adjusting the @BatchSize variable. I have it set to 100 because it makes a nice demo; in practice you probably want 1,000 or 10,000 rows per batch, depending on the size of each row, how large your transaction log is, how fast your disk subsystem is, and whether you want to prevent lock escalation. SQL Server attempts to escalate to a table lock once a single statement holds roughly 5,000 locks, so batches below that threshold tend to keep each DELETE at row or page locks.
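If you prefer larger batches, one option is to disable lock escalation on the main table for the duration of the archive job. A minimal sketch, assuming you accept the trade-off of holding more individual row and page locks (and the extra lock memory that requires):

-- Disable lock escalation on the main table while archiving,
-- then restore the default afterward. Assumes SQL Server 2008 or later.
ALTER TABLE dbo.Transactions SET (LOCK_ESCALATION = DISABLE);

-- ... run the archive script ...

ALTER TABLE dbo.Transactions SET (LOCK_ESCALATION = TABLE);  -- TABLE is the default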
When I ran the code on August 15th, 2019, the output looked like:
╔════════════════════════════╦═════════════════════════╦═════════════════════════╦═════════╗
║ TableName                  ║ StartDate               ║ EndDate                 ║ NumRows ║
╠════════════════════════════╬═════════════════════════╬═════════════════════════╬═════════╣
║ dbo.Transactions_Archived  ║ 2019-01-01 00:00:00.000 ║ 2019-02-14 23:59:00.000 ║ 64800   ║
╚════════════════════════════╩═════════════════════════╩═════════════════════════╩═════════╝

╔══════════════════╦═════════════════════════╦═════════════════════════╦═════════╗
║ TableName        ║ StartDate               ║ EndDate                 ║ NumRows ║
╠══════════════════╬═════════════════════════╬═════════════════════════╬═════════╣
║ dbo.Transactions ║ 2019-02-15 00:00:00.000 ║ 2019-03-31 23:59:00.000 ║ 64800   ║
╚══════════════════╩═════════════════════════╩═════════════════════════╩═════════╝
Summary
As you can see above, half of the rows were moved to the dbo.Transactions_Archived
table, while the most recent half were retained in the dbo.Transactions
table. If you run this code on a different date, you'll see different row counts, reflecting the fact that the archive script moves only rows older than 180 days. I hope this script proves useful when you need to archive data by date. Let me know if you have any questions about it.
Check out the rest of our tools for SQL Server.