The Transactional Outbox: Reliable Messaging Without Distributed Transactions

2026-07-02 · by Hannah Vernon · in Data Architecture

The previous post made retries safe: when the same message arrives twice, the database refuses to act on it twice. That solves the consuming end. This post is about the producing end, and a problem that sounds trivial until you try to do it correctly: how do you change your data and tell the outside world about it, when those are two different systems and you cannot wrap both in one transaction?

This is the dual-write problem, and it is one of the most reliable ways to corrupt a distributed system.

A database with an attached outbox tray, messages relayed by a courier robot to a separate message bus

The Dual-Write Problem

A great many operations need to do two things at once. Create the order and publish an “order placed” event. Record the payment and notify the fulfillment service. Update an aggregate and emit a domain event a saga is waiting on.^[1] The data lives in SQL Server; the event needs to go to a message broker, Service Bus, Kafka, a queue. Two systems, and the operation is only correct if both happen or neither does.

The obvious approaches both fail. If you commit the database transaction and then publish the message, the process can stop between the two, and now the order exists but nobody was told – a lost event. If you publish the message first and then commit, the publish can succeed and the commit can roll back, and now the world has been told about an order that does not exist – a phantom event. Swapping the order just swaps which failure you get. There is no sequence of “do A, then do B” that is safe when the process can stop in the middle.

The classic fix is a distributed transaction spanning the database and the broker, with a two-phase commit coordinator holding both until both are ready. In practice this is the cure that is worse than the disease. It couples the availability of your database to the availability of your broker, so either one being slow stalls the other. It scales poorly, it is operationally heavy, and most modern brokers do not support enlisting in it at all. The pattern exists, and almost nobody who has run it wants to run it again.

One Transaction You Already Have

The insight behind the outbox pattern is that you do not actually need to write to two systems atomically. You need to write to the database atomically, which you already know how to do, and you need that single write to capture the intent to publish. So you add a table.

When the operation runs, it writes its business data and a row describing the message into an outbox table, both in the same local transaction.^[1] One transaction, one atomic outcome: either the order and its pending event both commit, or neither does. The dual write has become a single write, and the impossible guarantee has become an ordinary one.

CREATE TABLE [dbo].[outbox]
(
      [outbox_id]       bigint          NOT NULL 
      IDENTITY(1, 1) 
      CONSTRAINT [pk_outbox] 
      PRIMARY KEY
    , [aggregate_type]  varchar(100)    NOT NULL
    , [aggregate_id]    varchar(100)    NOT NULL
    , [event_type]      varchar(100)    NOT NULL
    , [payload]         nvarchar(max)   NOT NULL
    , [created_utc]     datetime2(3)    NOT NULL 
      CONSTRAINT [df_outbox_created] 
      DEFAULT (SYSUTCDATETIME())
    , [processed_utc]   datetime2(3)    NULL
);
GO

CREATE PROCEDURE [api].[order_place]
(
      @order_id     int
    , @customer_id  int
    , @amount       decimal(19, 4)
)
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;

    BEGIN TRANSACTION;

    /* Business data and the intent to publish commit together or not at all. */
    INSERT INTO [dbo].[order] ([order_id], [customer_id], [amount])
    VALUES (@order_id, @customer_id, @amount);

    INSERT INTO [dbo].[outbox] ([aggregate_type], [aggregate_id], [event_type], [payload])
    VALUES
    (
          'order'
        , CONVERT(varchar(11), @order_id)
        , 'order_placed'
        , N'{"order_id":' + CONVERT(varchar(11), @order_id) + N',"amount":' + CONVERT(varchar(40), @amount) + N'}'
    );

    COMMIT TRANSACTION;
END;

CREATE TABLE [dbo].[outbox]

(

[outbox_id] bigint NOT NULL

IDENTITY(1, 1)

CONSTRAINT [pk_outbox]

PRIMARY KEY

, [aggregate_type] varchar(100) NOT NULL

, [aggregate_id] varchar(100) NOT NULL

, [event_type] varchar(100) NOT NULL

, [payload] nvarchar(max) NOT NULL

, [created_utc] datetime2(3) NOT NULL

CONSTRAINT [df_outbox_created]

DEFAULT (SYSUTCDATETIME())

, [processed_utc] datetime2(3) NULL

);

CREATE PROCEDURE [api].[order_place]

(

@order_id int

, @customer_id int

, @amount decimal(19, 4)

)

BEGIN

SET NOCOUNT ON;

SET XACT_ABORT ON;

BEGIN TRANSACTION;

/* Business data and the intent to publish commit together or not at all. */

INSERT INTO [dbo].[order] ([order_id], [customer_id], [amount])

VALUES (@order_id, @customer_id, @amount);

INSERT INTO [dbo].[outbox] ([aggregate_type], [aggregate_id], [event_type], [payload])

VALUES

(

'order'

, CONVERT(varchar(11), @order_id)

, 'order_placed'

, N'{"order_id":' + CONVERT(varchar(11), @order_id) + N',"amount":' + CONVERT(varchar(40), @amount) + N'}'

);

COMMIT TRANSACTION;

END;

Nothing here touches the broker. The procedure’s only job is to record, atomically, that something happened and that a message about it is owed. Getting that message to the broker is a separate job, done by a separate process, and that separation is the whole point.

The Relay: Getting Messages Out of the Outbox

A relay process reads unprocessed rows from the outbox and publishes them to the broker. There are two established ways to build it.

The polling publisher periodically queries the outbox for unprocessed rows, publishes each to the broker, and marks it done.^[2] It is simple, it works on any database, and it is where most teams start. The one piece of care it needs is claiming rows so that two relay instances do not publish the same row twice. In SQL Server you can claim a batch atomically and skip rows another worker has locked:

/* Claim a batch: read the oldest unprocessed rows, skipping any a sibling
   worker already holds, and stamp them so this worker owns them. */
UPDATE
    [o]
SET
    [o].[processed_utc] = SYSUTCDATETIME()
OUTPUT
      [inserted].[outbox_id]
    , [inserted].[aggregate_type]
    , [inserted].[event_type]
    , [inserted].[payload]
FROM
    (
        SELECT TOP (100)
            *
        FROM
            [dbo].[outbox] WITH (ROWLOCK, READPAST, UPDLOCK)
        WHERE
            [processed_utc] IS NULL
        ORDER BY
            [outbox_id]
    ) AS [o];

/* Claim a batch: read the oldest unprocessed rows, skipping any a sibling

worker already holds, and stamp them so this worker owns them. */

UPDATE

[o]

SET

[o].[processed_utc] = SYSUTCDATETIME()

OUTPUT

[inserted].[outbox_id]

, [inserted].[aggregate_type]

, [inserted].[event_type]

, [inserted].[payload]

FROM

(

SELECT TOP (100)

FROM

[dbo].[outbox] WITH (ROWLOCK, READPAST, UPDLOCK)

WHERE

[processed_utc] IS NULL

ORDER BY

[outbox_id]

) AS [o];

The READPAST hint is what lets multiple relay workers run side by side: each skips the rows another worker has locked rather than blocking on them, so they divide the backlog instead of fighting over it. The relay publishes the rows the OUTPUT clause hands back, and because publishing happens after the claim, a crash mid-publish simply leaves the row marked processed-but-unsent, which is why the consuming side must be idempotent. More on that in a moment.

Transaction log tailing takes a different route: instead of polling a table, a process reads the database transaction log and turns committed inserts into the outbox into messages.^[3] On SQL Server this is the territory of Change Data Capture and tools like Debezium that consume it. It avoids the polling latency and the write back to mark rows processed, at the cost of more moving parts and a tighter coupling to the engine’s internals. The right choice depends on your latency budget and your appetite for operational complexity; polling is the honest default, log tailing is the optimization.

Where Idempotency Comes Back

The outbox guarantees at-least-once publishing, not exactly-once. A relay can publish a message, then stop before it records that it did, and on restart publish it again. This is not a flaw to be engineered away; it is the same fundamental limit from the last post, and it has the same resolution. At-least-once publishing on the producing side, plus idempotent processing on the consuming side, equals effectively-once end to end. The outbox and the idempotency key are two halves of one design: the outbox makes sure the message is never lost, and the idempotency key makes sure a message delivered twice is only acted on once. Either alone is incomplete.

A few operational realities round it out. The outbox table needs a retention job, because processed rows accumulate; purge them on a schedule once you are confident they have been delivered. A message that fails to publish repeatedly is a poison message, and the relay needs somewhere to set it aside after enough attempts so it does not block the rest of the backlog. And if consumers depend on event order, you have to preserve it deliberately – publishing in outbox_id order, and being careful about how many relay workers run in parallel for a given aggregate.

A Behavioral Promise the Schema Does Not Show

The outbox table looks like nothing special: an identity column, some varchars, a processed timestamp. What makes it a contract is the behavior wrapped around it – that every committed state change writes its event in the same transaction, that a relay delivers each event at least once, that consumers are built to tolerate the duplicate that guarantee implies. None of that is visible in CREATE TABLE. It is a promise the system makes about the relationship between its data and its events, kept by discipline rather than declared by a column, which is the thread running through every post in this back half of the series.

The next promise is about the past rather than the future: not “tell the world what just changed,” but “be able to answer, later, what changed, when, and who changed it.” That is the job of temporal tables and change data capture.

Have you run an outbox in production, or fought the dual-write problem another way? I would like to hear about it on Bluesky or LinkedIn.

References and Footnotes

Pattern: Transactional outbox – Chris Richardson, microservices.io. The pattern of writing business data and a message into an outbox table in the same local transaction to avoid the dual-write problem. ↩
Pattern: Polling publisher – Chris Richardson, microservices.io. Publishing outbox messages by periodically polling the table for unprocessed rows. ↩
Pattern: Transaction log tailing – Chris Richardson, microservices.io. Publishing outbox messages by reading the database transaction log rather than polling, the approach behind Change Data Capture tooling such as Debezium. ↩

Tags: CDC, Data Architecture, Messaging, Outbox, Reliability, SQL Server, t-sql

The Transactional Outbox: Reliable Messaging Without Distributed Transactions

The Dual-Write Problem

One Transaction You Already Have

The Relay: Getting Messages Out of the Outbox

Where Idempotency Comes Back

A Behavioral Promise the Schema Does Not Show

References and Footnotes

Related

Let us know your thoughts on this postCancel reply

Search

Categories

Pages

Meta

The Transactional Outbox: Reliable Messaging Without Distributed Transactions

The Dual-Write Problem

One Transaction You Already Have

The Relay: Getting Messages Out of the Outbox

Where Idempotency Comes Back

A Behavioral Promise the Schema Does Not Show

References and Footnotes

Share this:

Related

Let us know your thoughts on this postCancel reply

Search

Categories

Pages

Meta