AI-Generated Availability Group Failover Runbook

Planned AG failovers aren’t technically difficult. ALTER AVAILABILITY GROUP ... FAILOVER is one line of T-SQL. But anyone who’s done them in production knows the real work isn’t the failover command — it’s the 30 minutes of checks before and after. Are all replicas synchronized? Is the redo queue drained? Are applications reconnecting to the listener? Did that third-party monitoring tool lose its mind?

I’ve done enough of these that I have a mental checklist. The problem with mental checklists is that at 6 AM during a patching window, you skip steps. I wanted a scripted runbook that enforces the checklist — and I wanted it built faster than the three hours it would take me to write from scratch.

Here’s the prompt I used:

What the Agent Produced

The agent built a well-structured Invoke-AgFailoverRunbook function with the four phases clearly separated. A few things it got right out of the gate:

  • Pre-checks that actually block. Each check returns a pass/fail, and the script exits if any pre-check fails. No “warning, continuing anyway” — a hard stop.
  • A readable confirmation prompt. It shows a formatted summary (current primary, target replica, number of databases, AG name) that is clear enough for a sleep-deprived DBA to read and confirm at 6 AM.
  • A -DryRun mode that stops short. It runs all Phase 1 checks, displays the planned action, then exits, which is exactly what I need for pre-validating during change management.
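
To make the "hard stop" and -DryRun behavior concrete, here is a minimal sketch of a pre-check gate. It is not the code the agent produced: the function name Invoke-PreCheckGate, the shape of the -Checks parameter, and the 'DryRun'/'Proceed' return values are all assumptions made for illustration.

```powershell
# Hypothetical helper (not the runbook's actual code): runs named checks,
# prints PASS/FAIL for each, and refuses to continue if any check fails.
function Invoke-PreCheckGate {
    [CmdletBinding()]
    param(
        # Ordered map of check name -> scriptblock returning $true (pass) or $false (fail)
        [Parameter(Mandatory)]
        [System.Collections.Specialized.OrderedDictionary]$Checks,

        # When set, report what would happen and stop before any action
        [switch]$DryRun
    )

    $failed = @()
    foreach ($name in $Checks.Keys) {
        $check = $Checks[$name]
        try   { $passed = [bool](& $check) }
        catch { $passed = $false }   # a check that errors counts as a failure

        Write-Host ('[{0}] {1}' -f $(if ($passed) { 'PASS' } else { 'FAIL' }), $name)
        if (-not $passed) { $failed += $name }
    }

    # Hard stop: no "warning, continuing anyway"
    if ($failed.Count -gt 0) {
        throw "Pre-checks failed: $($failed -join ', ')"
    }

    if ($DryRun) { return 'DryRun' }
    return 'Proceed'
}

# Example with placeholder checks; real ones would query the AG.
$checks = [ordered]@{
    'Target differs from current primary' = { 'SQL02' -ne 'SQL01' }
    'No conflicting maintenance jobs'     = { $true }
}
Invoke-PreCheckGate -Checks $checks -DryRun
```

Throwing rather than returning $false means a caller cannot accidentally ignore a failed check; the failover step is simply never reached.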

What needed work:

  • Job detection was too broad. The agent checked for any running job. I refined it to look for jobs with names matching patterns like %backup%, %CHECKDB%, %integrity%, %index% — the maintenance jobs that would conflict with a failover.
  • It didn’t check the AG listener. Pre-failover, I want to verify the listener is resolving to the current primary. Post-failover, I want to verify it resolves to the new primary. This confirms DNS and networking are working, not just the AG metadata.
  • Timeout handling was missing. The failover usually completes in seconds, but sometimes the redo queue needs to drain. I asked the agent to add a configurable timeout with a polling loop.
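
Two of the refinements above are easy to sketch in isolation: the job-name filter and the polling loop with a timeout. The function names Test-MaintenanceJobName and Wait-ForCondition are hypothetical, and the SQL LIKE patterns (%backup%) are translated to PowerShell wildcards (*backup*) here; the real runbook may do the matching in T-SQL instead.

```powershell
# Hypothetical helper: does a job name look like maintenance that would
# conflict with a failover? -like is case-insensitive by default.
function Test-MaintenanceJobName {
    param(
        [Parameter(Mandatory)][string]$JobName,
        [string[]]$Patterns = @('*backup*', '*CHECKDB*', '*integrity*', '*index*')
    )
    foreach ($pattern in $Patterns) {
        if ($JobName -like $pattern) { return $true }
    }
    return $false
}

# Hypothetical helper: poll a condition until it is true or the timeout expires.
function Wait-ForCondition {
    param(
        [Parameter(Mandatory)][scriptblock]$Condition,
        [int]$TimeoutSeconds = 60,
        [int]$PollSeconds = 2
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while ($true) {
        if (& $Condition) { return $true }             # condition met
        if ((Get-Date) -ge $deadline) { return $false } # gave up
        Start-Sleep -Seconds $PollSeconds
    }
}

# Filter a list of running job names down to the ones that matter.
$running = 'Nightly Full Backup', 'ETL Load', 'Weekly DBCC CheckDB'
$running | Where-Object { Test-MaintenanceJobName -JobName $_ }

# Sketch of a listener check; $ListenerName and $ExpectedIp are placeholders
# you would supply, and Resolve-DnsName requires the Windows DnsClient module.
# Wait-ForCondition -TimeoutSeconds 120 -Condition {
#     (Resolve-DnsName -Name $ListenerName -Type A).IPAddress -contains $ExpectedIp
# }
```

Returning $false on timeout, rather than throwing, leaves it to the caller to decide whether a slow listener update is fatal or just worth a warning.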

The Iteration

The agent handled all of them. The listener DNS check used Resolve-DnsName, which is exactly right.

The Final Runbook

What I Validated

I ran this with -DryRun against a non-production AG first. A few things I caught:

  1. The running-job query needs to cover both replicas. The original only checked the primary. If a CHECKDB job is running on the target secondary, failing over to it would be disruptive. Passing both instances to Invoke-DbaQuery (-SqlInstance $currentPrimary, $TargetReplica) runs the query against each of them.
  2. Listener DNS propagation isn’t instant. After failover, Resolve-DnsName may return the old IP for a few seconds. The polling loop with a timeout handles this gracefully — but set your timeout appropriately for your DNS TTL.
  3. The confirmation prompt is crucial. During a patching window with multiple AGs, it’s easy to pass the wrong target replica. The explicit “type FAILOVER” prompt is intentional friction: a safety net against mistakes at 5 AM.
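
The "type FAILOVER" friction can be sketched as follows. This is an illustration, not the runbook's code: the function name Confirm-Failover and the -Reader parameter (which exists only so the prompt can be exercised without a keyboard) are assumptions.

```powershell
# Hypothetical helper: show a summary and require the exact word FAILOVER.
function Confirm-Failover {
    param(
        [Parameter(Mandatory)][string]$AvailabilityGroup,
        [Parameter(Mandatory)][string]$CurrentPrimary,
        [Parameter(Mandatory)][string]$TargetReplica,
        [int]$DatabaseCount = 0,
        # Injectable input source; defaults to an interactive prompt
        [scriptblock]$Reader = { Read-Host 'Type FAILOVER to proceed' }
    )

    # Cheap guard against passing the wrong replica name
    if ($TargetReplica -eq $CurrentPrimary) {
        throw "Target replica '$TargetReplica' is already the primary."
    }

    Write-Host '---------- PLANNED FAILOVER ----------'
    Write-Host "AG name         : $AvailabilityGroup"
    Write-Host "Current primary : $CurrentPrimary"
    Write-Host "Target replica  : $TargetReplica"
    Write-Host "Databases       : $DatabaseCount"
    Write-Host '--------------------------------------'

    $answer = & $Reader
    # -ceq is case-sensitive: 'failover' or 'y' will not pass
    return ($answer -ceq 'FAILOVER')
}

# Non-interactive example; omit -Reader to get the real prompt.
Confirm-Failover -AvailabilityGroup 'AG1' -CurrentPrimary 'SQL01' `
    -TargetReplica 'SQL02' -DatabaseCount 4 -Reader { 'FAILOVER' }
```

Requiring a full, case-sensitive word instead of Y/N is the point: it is hard to type by reflex.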

This script lives in our shared DBA tools repository. Every planned failover uses it. The -DryRun output is included in our change management tickets as evidence that pre-checks passed before we committed to the failover.

For the fleet health checks that feed into this runbook, see Post 5: Health Checks and Inventory. For more PowerShell automation patterns, see Post 6: PowerShell Automation.

Try This Yourself

Take this runbook and run it with -DryRun against one of your AGs. The pre-checks alone are valuable — they’ll tell you whether your AG is actually ready for a failover right now. Most DBAs find at least one surprise: a replica that’s in ASYNCHRONOUS_COMMIT mode when it should be synchronous, a redo queue that’s larger than expected, or a backup job that’s still running from last night.
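
If you want to see the kind of surprise those pre-checks surface before running the full runbook, a rough sketch looks like this. The query uses the documented sys.dm_hadr_database_replica_states and sys.availability_replicas views; the function name Get-FailoverBlocker, its parameters, and the zero-KB default threshold are assumptions for illustration, not part of the runbook.

```powershell
# T-SQL to run on the current primary, which reports state for every replica.
$readinessQuery = @'
SELECT ar.replica_server_name,
       ar.availability_mode_desc,
       DB_NAME(drs.database_id) AS database_name,
       drs.synchronization_state_desc,
       drs.redo_queue_size          -- KB of log not yet redone on the secondary
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = drs.replica_id;
'@

# Hypothetical helper: turn query rows into a list of human-readable blockers.
function Get-FailoverBlocker {
    param(
        [Parameter(Mandatory)][object[]]$Rows,
        [Parameter(Mandatory)][string]$TargetReplica,
        [long]$MaxRedoQueueKB = 0
    )
    $targetRows = @($Rows | Where-Object { $_.replica_server_name -eq $TargetReplica })
    if ($targetRows.Count -eq 0) {
        return "No rows found for replica $TargetReplica"
    }
    foreach ($row in $targetRows) {
        $db = $row.database_name
        if ($row.availability_mode_desc -ne 'SYNCHRONOUS_COMMIT') {
            "${db}: replica is in $($row.availability_mode_desc) mode"
        }
        if ($row.synchronization_state_desc -ne 'SYNCHRONIZED') {
            "${db}: state is $($row.synchronization_state_desc)"
        }
        $redo = $row.redo_queue_size
        if ($redo -isnot [System.DBNull] -and $null -ne $redo -and $redo -gt $MaxRedoQueueKB) {
            "${db}: redo queue is $redo KB"
        }
    }
}

# With dbatools installed; $currentPrimary and $TargetReplica are yours to supply.
# $rows = Invoke-DbaQuery -SqlInstance $currentPrimary -Query $readinessQuery
# Get-FailoverBlocker -Rows $rows -TargetReplica $TargetReplica
```

An empty result means none of these three conditions blocks the failover; it says nothing about running jobs or the listener, which the runbook checks separately.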

Then customize it. Ask the agent to add checks specific to your environment: verify that a particular Windows service is running on the target, check that enough disk space exists for tempdb on the failover target, or add a post-failover step that updates a CMDB. Each iteration takes one prompt and a few minutes of review.


Part of the ALTER DBA ADD AGENT series.