Mapping Availability Group Topology with AI

If you manage more than a couple of Availability Groups, you’ve probably experienced the moment where someone asks “which replicas are synchronous on AG3?” and you have to connect to the primary, open the dashboard, and click around. Or worse, someone changes a replica from asynchronous to synchronous commit “for testing” and forgets to change it back.

I wanted two things: a diagnostic query that shows the complete AG topology in one result set, and a way to detect misconfigurations automatically. The AI agent built both — and the iterative process of refining the output is a good example of how these conversations work in practice.

The First Prompt

What the Agent Produced

The agent generated three separate queries — a smart choice, because cramming all this into one result set would be unreadable:

  1. Replica overview: AG name, replica server name, role, availability mode, failover mode, seeding mode, synchronization health, connected state, endpoint URL.
  2. Database-level detail: AG name, database name, replica server name, synchronization state, log send queue size (KB), redo queue size (KB), last commit time.
  3. Listener and routing: AG name, listener DNS name, IP address, port, read-only routing URL per replica, read-only routing list.

The join structure was correct — sys.availability_replicas joined to sys.dm_hadr_availability_replica_states on replica_id, with the database-level DMV joined on both replica_id and group_database_id. The agent correctly used LEFT JOIN for the listener queries since not every AG has a listener configured (distributed AGs, for example).
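The database-level join described above looks roughly like this (a sketch of the join shape, not the generated script verbatim):

```sql
-- Query 2 sketch: per-database state joined to replica metadata.
-- sys.dm_hadr_database_replica_states carries both replica_id and
-- group_database_id, which is what makes the two-key join work.
SELECT
    ag.name                        AS AGName,
    adc.database_name              AS DatabaseName,
    ar.replica_server_name         AS ReplicaServer,
    drs.synchronization_state_desc AS SyncState,
    drs.log_send_queue_size        AS LogSendQueueKB,
    drs.redo_queue_size            AS RedoQueueKB,
    drs.last_commit_time           AS LastCommitTime
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar
    ON ar.group_id = ag.group_id
JOIN sys.dm_hadr_database_replica_states drs
    ON drs.replica_id = ar.replica_id
LEFT JOIN sys.availability_databases_cluster adc
    ON adc.group_database_id = drs.group_database_id;
```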

The Iteration: Misconfiguration Detection

The raw topology data is useful, but I wanted the script to actively flag problems. Here’s the follow-up prompt:

The agent produced a UNION ALL query that checked each condition and returned a uniform result set: [Severity], [Category], [AGName], [ReplicaServer], [DatabaseName], [Finding]. Clean and actionable.
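The UNION ALL pattern is worth seeing in miniature. This sketch shows two of the checks in the uniform six-column shape; the real script had more branches, and the specific findings text here is mine:

```sql
-- Detection sketch: each check is one SELECT returning the same columns,
-- stitched together with UNION ALL into a single findings list.
SELECT 'WARNING' AS [Severity], 'DataMovement' AS [Category],
       ag.name AS [AGName], ar.replica_server_name AS [ReplicaServer],
       DB_NAME(drs.database_id) AS [DatabaseName],
       'Data movement suspended: ' + ISNULL(drs.suspend_reason_desc, 'unknown') AS [Finding]
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar ON ar.replica_id = drs.replica_id
JOIN sys.availability_groups ag   ON ag.group_id   = ar.group_id
WHERE drs.is_suspended = 1
UNION ALL
SELECT 'CRITICAL', 'Health',
       ag.name, ar.replica_server_name, NULL,
       'Replica synchronization health is ' + ars.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_replicas ar ON ar.replica_id = ars.replica_id
JOIN sys.availability_groups ag   ON ag.group_id   = ar.group_id
WHERE ars.synchronization_health_desc <> 'HEALTHY';
```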

I refined one thing: the redo and log send queue thresholds. 500 MB is a reasonable starting point, but in environments with high transaction volumes or large databases, a sustained queue of 200 MB might already indicate a problem. I parameterized the thresholds:
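A sketch of what that parameterization can look like, applied to one of the queue checks (the default values here are illustrative, not the post's final numbers):

```sql
-- Thresholds declared once at the top so they can be tuned per environment.
DECLARE @LogSendQueueThresholdKB bigint = 512000;  -- ~500 MB, adjust to taste
DECLARE @RedoQueueThresholdKB    bigint = 512000;  -- ~500 MB, adjust to taste

SELECT 'WARNING' AS [Severity], 'RedoQueue' AS [Category],
       ag.name AS [AGName], ar.replica_server_name AS [ReplicaServer],
       DB_NAME(drs.database_id) AS [DatabaseName],
       'Redo queue is ' + CAST(drs.redo_queue_size AS varchar(20)) + ' KB' AS [Finding]
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar ON ar.replica_id = drs.replica_id
JOIN sys.availability_groups ag   ON ag.group_id   = ar.group_id
WHERE drs.redo_queue_size > @RedoQueueThresholdKB;
```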

The Final Script

What I Validated

A few things the agent got right that I might have missed if I’d written this from scratch:

  • The suspend_reason_desc column. When data movement is suspended, this tells you why — administrator action, partner shutdown, redo error, etc. The agent included it automatically, which saves a troubleshooting step.
  • Commit lag estimation. The last_commit_time difference gives you a rough lag estimate, though it’s not a precise SLA metric. The agent added a helpful comment noting that this is an approximation, not a guaranteed latency measurement.
  • Left joins on listener tables. Not every AG has a listener (especially in dev environments or distributed AG configurations), so inner joins would silently hide those AGs.
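The commit-lag approximation from the second bullet can be sketched as a self-join on the database-replica states, comparing each secondary's last_commit_time against the primary's for the same database (a sketch, and only an estimate, as noted above):

```sql
-- Rough lag estimate per secondary database. Not an SLA metric:
-- last_commit_time is only updated as commits flow, so idle databases
-- can show stale-looking values.
SELECT
    DB_NAME(sec.database_id) AS DatabaseName,
    ar.replica_server_name   AS SecondaryReplica,
    DATEDIFF(SECOND, sec.last_commit_time, pri.last_commit_time) AS EstimatedLagSeconds
FROM sys.dm_hadr_database_replica_states pri
JOIN sys.dm_hadr_availability_replica_states pri_state
    ON pri_state.replica_id = pri.replica_id
   AND pri_state.role_desc  = 'PRIMARY'
JOIN sys.dm_hadr_database_replica_states sec
    ON sec.group_database_id = pri.group_database_id
   AND sec.replica_id       <> pri.replica_id
JOIN sys.availability_replicas ar
    ON ar.replica_id = sec.replica_id;
```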

One thing I added post-generation: a note that this query must run on the primary replica to see full topology. On a secondary, sys.dm_hadr_availability_replica_states only shows the local replica’s state. The agent didn’t mention this, and it’s a common gotcha.
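A small guard along those lines, which could sit at the top of the script (my addition, not the agent's):

```sql
-- Warn when the script runs on a non-primary instance, since secondaries
-- only expose their own local replica state in this DMV.
IF NOT EXISTS (
    SELECT 1
    FROM sys.dm_hadr_availability_replica_states
    WHERE is_local = 1 AND role_desc = 'PRIMARY'
)
    PRINT 'This instance is not a primary replica: topology results will be incomplete.';
```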

Beyond the Query: Topology Documentation

One powerful follow-up is asking the agent to generate a human-readable topology document from the query output. I pasted a sample result set into the conversation and prompted:

The agent produced a clean document with tables and callouts for each misconfiguration. That’s the kind of documentation that usually doesn’t get written because nobody has time — but with the query output and one prompt, you’ve got it.

Try This Yourself

Run the topology query on any instance with Availability Groups. Even if you think your AG configuration is clean, the misconfiguration detection section often surfaces surprises: a secondary that’s been quietly accumulating redo queue, a read-only routing URL that points to a decommissioned server, or a replica someone switched to synchronous commit during a failover test and never switched back.

For deeper AG monitoring — health tracking over time rather than point-in-time snapshots — see the approach discussed in Post 5. The point-in-time diagnostic here complements continuous monitoring; it doesn't replace it.


Part of the ALTER DBA ADD AGENT series.