How to Corrupt a Database

Microsoft goes to great lengths to ensure data integrity remains its paramount concern with SQL Server data. After all, ACID compliance doesn’t mean much if the underlying data is corrupt. Having said that, there are certainly points-of-failure in every computer system that make corruption inevitable given enough activity over enough time. This post talks about how to corrupt a database, so you’re prepared when it happens in production.

The Hopes of the Party, by James Gillray.  No Corruption Visible Here.

The Hopes of the Party, by James Gillray. No Corruption Visible Here.

Learning how to deal with a corrupt database can be one of the most challenging tasks a DBA will ever undertake. Typically, corrupt databases aren’t a big deal outside of production – you simply restore a new masked-data-copy from production and continue on. But, what if it’s your business’ primary production database that suddenly has corruption, and you’d rather not have to restore a backup? The best way to be prepared for something like that is to have an intentionally corrupted database that you can use for learning purposes. This post shows you how to corrupt a database so you can discover what works best for you before you’re in a Critical Moment™.

The Setup

First things first; let’s create a database:

Since we need some data to corrupt, let’s create a clustered index and add some data to it. In fact, we’ll add 2,000 pages or 16MB of data.

Let’s Corrupt It!

SQL Server includes a built-in undocumented command purpose-built for writing to database pages directly, DBCC WRITEPAGE. That command can be used with equal efficacy to corrupt a page, or to fix corruption on a page.

The syntax for DBCC WRITEPAGE is:

The “offset” parameter indicates the starting address on the page where we should start writing data. The “length” parameter indicates how much data we’ll be writing. The “value” parameter is a hexadecimal value to write into the page; the length of this parameter must match the “length” parameter. The “write_direct” parameter should be 0 if you want DBCC WRITEPAGE to write to a buffer in memory, or 1 if you can put the database into single-user-mode, and have DBCC WRITEPAGE write directly to the data on-disk.

If you attempt to use “write_direct” with a value of 1, but don’t have the database in single user mode, you’ll receive the following error:

Msg 7919, Level 16, State 5, Line 91
Repair statement not processed. Database needs to be in single user mode.

This next piece of code will thoroughly corrupt the dbo.corrupt_table by placing the database into single user mode, then running DBCC WRITEPAGE to randomly alter a single byte on each page of the table. We’re inserting the list of pages used by the table into a temporary table since we won’t be able to directly access the corrupt_db database while using DBCC WRITEPAGE. We then iterate a cursor over the temporary table, calling DBCC WRITEPAGE once against each page. The DBCC WRITEPAGE writes a randomly generated single byte to a randomly chosen location within the page.

What Happened?

The last statement in the code above brings the database back into multi user mode, so we can query the table:

The first page the query attempts to read is detected as corrupt, via the checksum calculation that is performed when each page is read into the buffer pool:

Msg 824, Level 24, State 2, Line 135
SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x00caae6e; actual: 0x80caae26). It occurred during a read of page (1:328) in database ID 15 at offset 0x00000000290000 in file 'C:\data\corrupt_db_system.mdf'.  Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

DBCC CHECKDB, super-hero?

Next, lets run a DBCC CHECKDB against the database, as instructed in the error message:

Hmmm… seems there are a boat-load of errors that need fixing. As detailed in the last line of the output, repair_allow_data_loss is the minimum level of repair that will bring the database back to a consistent state. Let’s try that, but first we need to put the database into single user mode:

Ok, the output reports a bunch of pages were fixed. Below is a portion of the output showing items that were repaired in the database.

Repair: The page (1:2354) has been deallocated from object ID 565577053, index ID 1, partition ID 72057594041270272, alloc unit ID 72057594046906368 (type In-row data).
Repair: The page (1:2355) has been deallocated from object ID 565577053, index ID 1, partition ID 72057594041270272, alloc unit ID 72057594046906368 (type In-row data).
Repair: The page (1:2356) has been deallocated from object ID 565577053, index ID 1, partition ID 72057594041270272, alloc unit ID 72057594046906368 (type In-row data).
Repair: The page (1:2357) has been deallocated from object ID 565577053, index ID 1, partition ID 72057594041270272, alloc unit ID 72057594046906368 (type In-row data).
Repair: The page (1:2358) has been deallocated from object ID 565577053, index ID 1, partition ID 72057594041270272, alloc unit ID 72057594046906368 (type In-row data).
Msg 8945, Level 16, State 1, Line 140
Table error: Object ID 565577053, index ID 1 will be rebuilt.
        The error has been repaired.
Msg 8928, Level 16, State 1, Line 140
Object ID 565577053, index ID 1, partition ID 72057594041270272, alloc unit ID 72057594046906368 (type In-row data): Page (1:328) could not be processed.  See other errors for details.
        The error has been repaired.
Msg 8939, Level 16, State 98, Line 140
Table error: Object ID 565577053, index ID 1, partition ID 72057594041270272, alloc unit ID 72057594046906368 (type In-row data), page (1:328). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 2057 and -4.
        The error has been repaired.

Now let’s look at the table, with a simple SELECT statement:

Ok, so some rows still exist, but a whole bunch have been deleted by the DBCC CHECKDB with repair_allow_data_loss option. Since we’re randomly choosing an offset and value to write to each page, the effects of that corruption will vary each time you run the test code.

In the above example, where we lost data, you might need to choose between restoring the entire database or restoring a temporary copy of the database somewhere, so you can copy data from the temporary copy into the production copy. That may or may not work for the business, but it should definitely be a business decision.

Summary

Using DBCC WRITEPAGE in a variety of ways to intentionally corrupt a database is an important learning exercise. Determining a plan for dealing with corruption, and executing the plan to see the outcome, is priceless experience.

Check out the rest of our tools, and let me know if you have any questions or suggestions about this post. Thanks for reading, and please share this post if you found it useful!