Availability Group Health Monitoring with AI

Availability Groups are deceptively quiet when they’re healthy and deceptively cryptic when they’re not. The AG dashboard in SSMS shows green checkmarks until something goes wrong — and by “wrong” I mean a replica has been disconnected for 20 minutes and nobody noticed because the primary is still serving requests just fine. You find out when you need to fail over and can’t.

I needed comprehensive AG health monitoring that goes beyond “is the AG online” to answer the questions that actually matter: Is every replica synchronized? How far behind is the async replica? Is the redo queue growing? Can I fail over right now without data loss?

The Prompt
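The exact prompt isn't preserved here; a reconstruction along these lines covers the requirements described above (wording is mine, not the original):

```
Write a T-SQL monitoring script for a SQL Server Availability Group that:
- runs only on the replica that currently holds the PRIMARY role,
- checks replica connectivity, synchronization state, log send queue,
  redo queue, estimated data loss, and failover readiness,
- collects every finding in a temp table with severity, check name,
  and a detail message,
- sends one consolidated HTML email via Database Mail if any finding exists,
- is safe to schedule as a SQL Agent job every few minutes.
```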

What the Agent Produced

The agent built a multi-check script with a pattern I liked: each check populates rows in a temp table (#AGHealthIssues) with a severity, check name, and detail message. At the end, if the temp table has any rows, it builds an HTML email from all of them. This is cleaner than separate IF blocks for each check — you get a single consolidated alert showing everything that’s wrong.
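A minimal sketch of that pattern, with illustrative names; the connectivity check shown is one example of a check that populates the table, not the full set:

```sql
-- Sketch of the collect-then-alert pattern (names are illustrative).
IF OBJECT_ID('tempdb..#AGHealthIssues') IS NOT NULL
    DROP TABLE #AGHealthIssues;

CREATE TABLE #AGHealthIssues
(
    Severity  varchar(10)   NOT NULL,  -- 'CRITICAL' / 'WARNING'
    CheckName sysname       NOT NULL,
    Detail    nvarchar(max) NOT NULL
);

-- Each check INSERTs rows; for example, a replica connectivity check:
INSERT INTO #AGHealthIssues (Severity, CheckName, Detail)
SELECT 'CRITICAL',
       'Replica connectivity',
       ar.replica_server_name + N' is ' + ars.connected_state_desc
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_replicas ar
    ON ar.replica_id = ars.replica_id
WHERE ars.connected_state_desc = N'DISCONNECTED';

-- At the end: if any rows exist, build one consolidated HTML alert.
IF EXISTS (SELECT 1 FROM #AGHealthIssues)
BEGIN
    -- build HTML from #AGHealthIssues and call msdb.dbo.sp_send_dbmail
    PRINT 'Issues found - send consolidated alert';
END
```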

The “only run on primary” guard was implemented correctly:
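The original guard isn't shown; an equivalent check against the replica-state DMV looks like this:

```sql
-- Exit the batch unless this instance currently holds the PRIMARY role.
IF NOT EXISTS (
    SELECT 1
    FROM sys.dm_hadr_availability_replica_states
    WHERE is_local = 1
      AND role_desc = N'PRIMARY'
)
    RETURN;
```

(`sys.fn_hadr_is_primary_replica('SomeAGDatabase')` is a per-database alternative on SQL Server 2014 and later.)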

This is critical — you don’t want the monitoring job firing on every replica in the AG.

The Iteration

The first version was missing two things. First, the “estimated data loss” check was naive — it compared LSN values directly, but LSN differences don’t translate linearly to time. I asked for a better approach:
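The improved approach compares commit timestamps rather than raw LSNs. A sketch using `sys.dm_hadr_database_replica_states`, run from the primary (where rows for all replicas are visible); the thresholds you'd compare against are up to you:

```sql
-- Estimated data loss per secondary database: the gap between the
-- primary's last_commit_time and the secondary's, in seconds.
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id)        AS database_name,
       drs.log_send_queue_size         AS send_queue_kb,
       DATEDIFF(SECOND,
                drs.last_commit_time,
                p.last_commit_time)    AS est_data_loss_seconds
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
    ON ar.replica_id = drs.replica_id
JOIN sys.dm_hadr_database_replica_states p
    ON p.database_id = drs.database_id
   AND p.is_primary_replica = 1
WHERE drs.is_primary_replica = 0;
```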

Second, the script didn’t check listener health. On a multi-subnet AG, the listener can be online but pointing to the wrong IP after a failover if DNS hasn’t propagated. I added:
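A sketch of the listener-state portion of that check. True DNS verification needs an external lookup; what the catalog views can tell you is whether any listener has no ONLINE IP address at all (on a multi-subnet AG, exactly one IP per listener is normally online):

```sql
-- Flag any listener with no ONLINE IP address.
IF EXISTS (
    SELECT 1
    FROM sys.availability_group_listeners agl
    WHERE NOT EXISTS (
        SELECT 1
        FROM sys.availability_group_listener_ip_addresses lip
        WHERE lip.listener_id = agl.listener_id
          AND lip.state_desc = N'ONLINE'
    )
)
    PRINT 'CRITICAL: an AG listener has no ONLINE IP address';

-- For diagnostics, list every listener IP and its state:
SELECT agl.dns_name, agl.port, lip.ip_address, lip.state_desc
FROM sys.availability_group_listeners agl
JOIN sys.availability_group_listener_ip_addresses lip
    ON lip.listener_id = agl.listener_id;
```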

What I Validated

I tested this on a three-node AG (one synchronous secondary, one asynchronous secondary, one configuration-only replica) in a lab environment by simulating various failure conditions:

  • Pausing the async secondary. The redo queue check fired within two polling cycles as the queue grew past the threshold. Good.
  • Stopping SQL Server on a secondary. The connectivity check detected DISCONNECTED immediately on the next poll. The database join state also flagged correctly.
  • Blocking redo on a secondary (by running a long-running query on the readable secondary whose schema-stability lock blocked the redo thread). The redo queue grew and triggered the alert. The detail message showed both the queue size in MB and the affected database name.
  • Checking failover readiness. With the sync secondary healthy, the check passed. After I paused the secondary, it correctly identified that automatic failover would result in data loss.
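The failover-readiness check behind that last test can be expressed with `is_failover_ready`; a sketch:

```sql
-- Databases on automatic-failover replicas that are NOT failover-ready.
-- Any row here means automatic failover would risk data loss right now.
SELECT ar.replica_server_name,
       dcs.database_name,
       dcs.is_failover_ready
FROM sys.dm_hadr_database_replica_cluster_states dcs
JOIN sys.availability_replicas ar
    ON ar.replica_id = dcs.replica_id
WHERE ar.failover_mode_desc = N'AUTOMATIC'
  AND dcs.is_failover_ready = 0;
```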

One adjustment I made: the initial threshold for log_send_queue_size (200 MB) was too aggressive for our async replica during batch processing windows. I parameterized all thresholds and added a time-based exclusion for the nightly ETL window.
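The shape of that parameterization looks roughly like this (the values here are examples, not the tuned production numbers):

```sql
-- Thresholds as variables, with a nightly ETL exclusion window.
DECLARE @SendQueueThresholdKB bigint = 500 * 1024;   -- raised from 200 MB
DECLARE @RedoQueueThresholdKB bigint = 1024 * 1024;  -- 1 GB
DECLARE @EtlWindowStart       time   = '01:00';      -- placeholder window
DECLARE @EtlWindowEnd         time   = '04:00';
DECLARE @Now                  time   = CAST(SYSDATETIME() AS time);

-- Skip the queue-size checks (but not connectivity or failover-readiness
-- checks) while the nightly ETL is running.
DECLARE @InEtlWindow bit =
    CASE WHEN @Now >= @EtlWindowStart AND @Now < @EtlWindowEnd
         THEN 1 ELSE 0 END;
```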

The Final Solution
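The full script isn't reproduced here; its overall shape, with placeholder thresholds, recipients, and Database Mail profile, is roughly:

```sql
SET NOCOUNT ON;

-- 1. Run checks only on the current primary.
IF NOT EXISTS (SELECT 1 FROM sys.dm_hadr_availability_replica_states
               WHERE is_local = 1 AND role_desc = N'PRIMARY')
    RETURN;

-- 2. Collect findings from every check into one table.
CREATE TABLE #AGHealthIssues
(
    Severity  varchar(10)   NOT NULL,
    CheckName sysname       NOT NULL,
    Detail    nvarchar(max) NOT NULL
);

-- Representative check: disconnected replicas. The real script runs the
-- full set: sync state, send/redo queues, estimated data loss,
-- failover readiness, and listener IP state.
INSERT INTO #AGHealthIssues (Severity, CheckName, Detail)
SELECT 'CRITICAL', 'Replica connectivity',
       ar.replica_server_name + N' is ' + ars.connected_state_desc
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_replicas ar ON ar.replica_id = ars.replica_id
WHERE ars.connected_state_desc = N'DISCONNECTED';

-- 3. One consolidated HTML alert for everything found this cycle.
IF EXISTS (SELECT 1 FROM #AGHealthIssues)
BEGIN
    DECLARE @html nvarchar(max) =
        N'<h3>AG health issues on ' + @@SERVERNAME + N'</h3>'
      + N'<table border="1"><tr><th>Severity</th><th>Check</th>'
      + N'<th>Detail</th></tr>'
      + CAST((SELECT td = Severity, N'', td = CheckName, N'', td = Detail
              FROM #AGHealthIssues
              ORDER BY Severity, CheckName
              FOR XML PATH(N'tr'), TYPE) AS nvarchar(max))
      + N'</table>';

    EXEC msdb.dbo.sp_send_dbmail
         @profile_name = N'DBA Alerts',            -- placeholder profile
         @recipients   = N'dba-team@example.com',  -- placeholder address
         @subject      = N'AG health issues detected',
         @body         = @html,
         @body_format  = N'HTML';
END
```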

What This Catches That the Dashboard Doesn’t

The SSMS AG dashboard only updates while you're looking at it. This script catches the problems that happen at 3 AM when nobody's looking. Specifically:

  • Redo queue growing during off-hours maintenance. A large index rebuild on the primary can cause the redo queue to grow significantly on secondaries. If you’re reading from the secondary (readable routing), users see stale data until redo catches up.
  • Sync replica falling behind after a network blip. The replica reconnects and starts synchronizing, but the dashboard shows “SYNCHRONIZING,” which looks fine even though the redo queue is 2 GB and growing. The queue-size check catches this.
  • Automatic failover readiness silently lost. A sync replica that’s technically connected but not healthy won’t support automatic failover. You won’t know until you need it.
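The redo-queue condition described in these bullets can be detected directly from the replica DMV; a sketch with an illustrative 1 GB threshold (`redo_queue_size` is reported in KB):

```sql
-- Secondaries whose redo queue has grown past the threshold: read traffic
-- routed to these replicas is seeing stale data until redo catches up.
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id)   AS database_name,
       drs.redo_queue_size / 1024 AS redo_queue_mb,
       drs.redo_rate              AS redo_rate_kb_per_sec
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
    ON ar.replica_id = drs.replica_id
WHERE drs.is_primary_replica = 0
  AND drs.redo_queue_size > 1024 * 1024;  -- 1 GB, in KB; tune for your AG
```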

Try This Yourself

  1. Deploy the script on the primary replica of a test AG. Run it manually to see the baseline output.
  2. Simulate a failure: pause the SQL Server service on a secondary, or suspend data movement on a database (ALTER DATABASE [YourDB] SET HADR SUSPEND).
  3. Run the script again — you should see the appropriate checks fire.
  4. Adjust thresholds for your environment. Async replicas across a WAN may need higher send queue thresholds. Sync replicas on a fast LAN should have tight thresholds.
  5. Schedule as an Agent job every 2–5 minutes on the primary. If you have an AG with automatic failover, deploy the job on both replicas — the “primary only” guard ensures it only runs checks on whichever node is currently primary.
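Scheduling in step 5 can be sketched with the standard msdb procedures; the job name, schedule name, and the wrapper procedure `DBA.dbo.usp_AGHealthCheck` are hypothetical:

```sql
USE msdb;
EXEC dbo.sp_add_job
     @job_name = N'AG Health Check';
EXEC dbo.sp_add_jobstep
     @job_name      = N'AG Health Check',
     @step_name     = N'Run checks',
     @subsystem     = N'TSQL',
     @command       = N'EXEC DBA.dbo.usp_AGHealthCheck;',  -- hypothetical proc
     @database_name = N'master';
EXEC dbo.sp_add_jobschedule
     @job_name            = N'AG Health Check',
     @name                = N'Every 5 minutes',
     @freq_type           = 4,   -- daily
     @freq_interval       = 1,
     @freq_subday_type    = 4,   -- units of minutes
     @freq_subday_interval = 5;
EXEC dbo.sp_add_jobserver
     @job_name = N'AG Health Check';  -- defaults to the local server
```

Deploy the same job on every automatic-failover replica; the primary-only guard inside the script makes it a no-op everywhere else.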

For the full approach to building custom monitoring solutions, see Post 13: Building Custom Monitoring Queries and Alerts. For more on using AI for health checks and infrastructure inventory, see Post 5: Health Checks and Inventory.


Part of the ALTER DBA ADD AGENT series.