Backup Failure and Gap Detection with AI

You don’t discover a backup gap during a calm Tuesday afternoon review. You discover it at 2 AM when a production database needs a point-in-time restore and your last transaction log backup was four hours ago instead of fifteen minutes. Your RPO was 15 minutes. Your actual data loss is 4 hours. That’s a career-defining conversation with the CTO.

Backup jobs can fail silently in dozens of ways. The Agent job succeeds but skips a database because it was in a restoring state. The maintenance plan runs but compression fails on one database and it moves on. A new database gets created and nobody adds it to the backup schedule. An AG failover happens and the backup job on the new primary isn’t configured. These gaps are invisible until you need the backup that doesn’t exist.

I used an AI agent to build a monitoring solution that catches these gaps proactively. Here’s how the conversation went.

The Initial Prompt

What the Agent Produced

The agent built a solid structure: sys.databases as the outer base, LEFT JOIN to subqueries against msdb.dbo.backupset for the latest full (type = 'D'), differential (type = 'I'), and log (type = 'L') backups. It correctly used LEFT JOIN so databases with zero backup history still appeared in the results. It excluded tempdb, model, and snapshots (source_database_id IS NULL).

What I liked: it handled the “completed with errors” check by looking at msdb.dbo.backupset for rows where backup_finish_date exists but the associated msdb.dbo.sysjobhistory entry shows a non-success status. Actually — that approach was overengineered and fragile, because it assumed backups run through Agent jobs. I pushed back:

The agent revised the approach — cleaner and works regardless of how backups are executed (Agent, maintenance plans, Ola Hallengren, third-party tools).

Note: The has_incomplete_metadata and backupmediafamily orphan checks described above are not included in the final consolidated script below. They were useful during initial investigation but added complexity without catching issues that the RPO-based gap detection doesn’t already cover. If you need those checks — for example, to detect corrupted backup records — add them as a separate validation query against msdb.dbo.backupset AS bs LEFT JOIN msdb.dbo.backupmediafamily AS bmf ON bs.[media_set_id] = bmf.[media_set_id] WHERE bmf.[media_set_id] IS NULL.

The Iteration: RPO-Aware Thresholds

Hardcoded thresholds work for a single instance, but my environment has different RPO requirements across tiers. I followed up:

The agent created the config table and rewrote the threshold logic to use a CROSS APPLY that finds the best-matching RPO config for each database. This is the kind of refinement that’s tedious to build from scratch but easy to describe in a prompt — you’re telling the agent the business requirement and letting it handle the join logic.

The Final Solution

What I Validated

Before deploying, I ran this on three production instances with known backup configurations:

  1. The “new database” test. I created a test database and confirmed it immediately appeared in the next monitoring cycle as “NO FULL BACKUP.” This is the most common real-world gap — someone creates a database and forgets to add it to the backup schedule.
  2. The copy-only trap. One instance had a database where the only full backup in backupset was a copy-only taken for a migration test. The scheduled backup job had been failing silently for three days. The COPY_ONLY ONLY flag caught this.
  3. AG failover gap. After a planned AG failover, the backup job on the new primary wasn’t configured (the old primary had the job). The log backup gap alert fired within 30 minutes.
  4. RPO config matching. The LIKE-pattern matching correctly applied tighter thresholds to the Orders databases and relaxed thresholds to the Archive databases. The priority ordering worked as expected when a database matched multiple patterns.

Try This Yourself

  1. Run the core query (without the Database Mail wrapper) on any instance. Compare the results against what you believe your backup situation looks like. Most DBAs find at least one surprise.
  2. Create the BackupRPOConfig table and populate it with your actual RPO requirements. This forces a conversation with the business about what “acceptable backup frequency” actually means — a conversation many teams skip.
  3. Ask the agent to add a check for backup chain integrity: flag databases where a RESTORE HEADERONLY of the most recent full backup would succeed but the chain of log backups since then is incomplete.
  4. Deploy as an Agent job running every 15 minutes. The log backup check is the time-sensitive one — 15 minutes gives you one missed cycle before the alert fires.
  5. Extend with the fleet-wide PowerShell pattern from Post 6 to run across all instances and aggregate results.

For the complete approach to building custom monitoring and alert tuning, see Post 13: Building Custom Monitoring Queries and Alerts. For the broader health check and inventory pattern, see Post 5: Health Checks and Inventory.


Part of the ALTER DBA ADD AGENT series.