Monitoring SQL Server Agent Job Failures with SolarWinds DPA Custom Alerts

You’ve been using SolarWinds DPA for a while. It catches slow queries, highlights wait stats, and sends you pretty reports. But its built-in SQL Server Agent monitoring has a blind spot that trips up a lot of DBAs.

DPA ships with a “Job Step Failure” alert. Sounds useful — until you realize it fires on every step failure, including the intentional ones. If you’ve ever built a job with on-failure flow control — step 1 tries the fast path, fails by design, and step 2 picks up the alternate route — you know this generates false positives all day long.

What you actually want is a job-level failure alert. SQL Server Agent writes a summary row to [msdb].[dbo].[sysjobhistory] with [step_id] = 0 after every job execution. That row’s [run_status] tells you the overall outcome: 0 = Failed, 1 = Succeeded, 2 = Retry, 3 = Canceled. Filter to step_id = 0 and run_status = 0, and you get the jobs that genuinely failed — no noise from intentional step failures.

Here are two DPA custom alert queries that do exactly this.

A woman DBA reviewing a control panel of server alerts, with a robotic DPA sentinel spotlighting red failure indicators while ignoring amber step-level noise

Option 1: Simple Count (Single Numeric Return)

This version returns a single number: how many jobs failed in the polling interval. Clean and simple — if the count is greater than zero, something failed.

The #FREQUENCY# tag is a DPA placeholder — at runtime, DPA replaces it with the alert’s execution interval in minutes. So if you set the alert to poll every 10 minutes, the query automatically looks back 10 minutes. No hardcoded values to drift out of sync.

Option 2: Which Jobs Failed (Multiple Numeric Return)

The count is useful for triggering the alert, but the first question you’ll ask is “which job?” This version returns job names and their failure counts, so the alert email tells you exactly what broke.

This uses DPA’s “Custom SQL Alert — Multiple Numeric Return” type. DPA expects the first column to be a string identifier and the second column to be the numeric value. When any row’s value meets the threshold, the alert fires — and the notification includes the job names.

Setting It Up in DPA

The setup is the same for both variants (see the SolarWinds DPA custom alert documentation for full details) — just pick the matching alert type:

  1. From the DPA menu, click Alerts, then Manage Alerts.
  2. Select Custom as the category. Choose either Custom SQL Alert — Single Numeric Return (Option 1) or Custom SQL Alert — Multiple Numeric Return (Option 2).
  3. Set the execution interval (e.g., 10 minutes). This must match the lookback window you want — the #FREQUENCY# tag handles the rest.
  4. Paste the query, set Execute Against to “Monitored Database Instance,” and set Units to “Failed Jobs” or “Failures.”
  5. Set thresholds. For a simple alert: Info Min = 1. For tiered alerting: Info 1–2, Low 2–5, Medium 5–10, High 10+.
  6. Configure notification recipients, test the alert (no email is sent during test), and save.

Why step_id = 0?

This is the key insight that makes these queries work. SQL Server Agent records one row per step execution, plus a summary row with step_id = 0 for the overall job outcome. The built-in DPA alert looks at all rows — which means a job that “fails” step 1 intentionally (to route to step 2) shows up as a failure even though the job completed successfully.

By filtering to step_id = 0, you only see genuine job-level failures. The [j].[enabled] = 1 filter skips disabled jobs — no point alerting on a job you’ve intentionally turned off.

The [msdb].[dbo].[agent_datetime] function converts SQL Server Agent’s run_date (int, YYYYMMDD) and run_time (int, HHMMSS) columns into a proper datetime for comparison. It’s a built-in utility function in msdb that replaces the ugly CONVERT/STUFF pattern you’ll see in older scripts.

A Note on Performance

The [msdb].[dbo].[agent_datetime] function is a traditional scalar UDF it won’t be inlined by SQL Server’s query optimizer, even on SQL Server 2019+. Wrapping [run_date] in a function call (or in the CONVERT/STUFF expression it replaces) prevents an index seek on that column, so the query scans [sysjobhistory] and evaluates the function row by row.

For most servers this is a non-issue. The query runs every few minutes, sysjobhistory is typically modest in size, and the scan finishes in milliseconds. But if you’re running on a server with thousands of jobs and years of ungroomed history in msdb, the row-by-row evaluation could add up. If you notice the alert query showing up in DPA’s own wait-time analysis (yes, the irony), consider purging old job history with [msdb].[dbo].[sp_purge_jobhistory] or switching back to the explicit CONVERT approach with a computed column or indexed view.

Making It Sargable

Both versions of the query wrap [run_date] inside a function call whether that’s agent_datetime or the old CONVERT/STUFF expression. Either way, the optimizer can’t seek on [run_date] because it has to evaluate the function for every row before it can compare the result. That’s what “non-sargable” means: the predicate can’t drive an index seek.

The fix is to compare [run_date] and [run_time] as raw integers against precomputed cutoff values. A CTE gives us a clean place to compute those values once, and because it produces a single constant row, the optimizer folds it directly into the plan no temp table, no extra scan.

Here’s Option 1 rewritten:

And Option 2:

The OR in the WHERE clause handles the midnight boundary. If your DPA alert polls at 00:05 with a 10-minute window, the cutoff is 23:55 on the previous day. The first branch ([run_date] > [cut_date]) catches anything from today. The second branch catches jobs from yesterday that ran at or after 23:55. Without both branches, you’d miss failures that straddle midnight.

The tradeoff is readability the agent_datetime version is easier to scan at a glance. For most servers, that’s the right choice. But if your msdb has millions of history rows and the alert query is adding measurable overhead, this version lets the optimizer work with the indexes instead of fighting them.

While We’re Talking About DPA

If you’ve ever wondered why DPA asks for sysadmin permissions, part of the answer is trace flag 2861. DPA runs DBCC TRACEON(2861, -1) at startup to enable what it believes is “zero-cost plan caching.” The problem: TF 2861 only existed in SQL Server 2000 SP3 and SP4. It was removed before SQL Server 2005 shipped. On every version of SQL Server you’re likely running today, this trace flag does nothing at all.

Sean Gallardy who works with SQL Server at Microsoft and has access to the source code confirmed this back in 2019 and contacted SolarWinds about it. As of his last update in 2021, DPA still executes the command. It’s worth checking whether your DPA installation is still doing this and, if so, whether the sysadmin requirement it creates is one you’re comfortable with.

If you’ve got questions or a DPA alert you’ve built that you’re proud of, I’d love to hear about it — find me on Bluesky or LinkedIn.