- 01What SQL Server monitoring should catch
- 02How to build SQL Server monitoring coverage
- 03SQL Server monitoring tools and where they fit
- 04Best third-party SQL Server monitoring tools
- 05How to choose a SQL Server monitoring tool
- 06What to monitor in SQL Server
- 07SQL Server monitoring alerts that need action
- 08How to monitor SQL Server performance and blocking
- 09How to monitor SQL Server backups, jobs, and DBCC
- 10How to monitor SQL Server storage, tempdb, and capacity
- 11How to monitor SQL Server HA, DR, and failover
- 12How to build a SQL Server monitoring baseline
- 13What not to do with SQL Server monitoring
- 14When to request a SQL Server monitoring review
sql server hub / monitoring guide
SQL Servermonitoring guide
SQL Server monitoring should catch the problems that matter before a user report is the first alert.
A useful setup checks backups, jobs, DBCC, blocking, deadlocks, waits, file latency, tempdb, Query Store, HA and DR state, and the alert route people actually use.
What SQL Server monitoring should catch
Uptime is only one check. A SQL Server can accept connections while backups are stale, DBCC has not run, jobs are failing, tempdb is filling, and blocking is already hurting users.
Good monitoring watches the work SQL Server depends on, not only whether the service is online. It should notice when recovery work is missing, when scheduled work is failing, when users are waiting, when storage is becoming the bottleneck, and when HA or DR state has changed.
Monitoring is not a dashboard by itself. It needs collection, retention, thresholds, alert routing, first checks, and someone responsible for response. The useful question is not just whether there is a graph, but whether the graph, alert, or history would help someone act before the problem becomes a user report.
| Monitoring area | What it should catch | Why it matters |
|---|---|---|
| Recovery | Missed backups, broken log backup patterns, stale restore testing, and missing backup alerts. | Backup history is not useful if nobody notices gaps until a restore is needed. |
| Maintenance | Failed jobs, stale CHECKDB, cleanup failures, long-running maintenance, and schedule overlap. | SQL Agent can look quiet while important work has stopped or drifted into business hours. |
| Workload | Blocking chains, deadlocks, high waits, slow queries, plan regressions, and Query Store changes. | Performance issues need timing, query text, plans, and session details, not only average CPU graphs. |
| Storage | File latency, free space, log growth, tempdb pressure, version-store growth, and abnormal file growth. | Storage pressure can appear as slow queries, blocking, failed backups, or database growth incidents. |
| Availability | AG replica health, synchronization state, listener problems, failover events, and DR jobs that stopped running. | HA and DR only help when the monitoring system sees role changes and broken dependencies. |
| Response | Alerts that are routed, acknowledged, repeated when needed, and tied to a first-check note. | A correct alert still fails if it goes to an abandoned mailbox or gives nobody enough context to start. |
How to build SQL Server monitoring coverage
Build the monitoring setup from production failures backward. Decide which failures must be caught, which SQL Server data source can show them, how often to collect it, how long to keep it, and who gets the alert.
The order matters: critical failures first, data sources second, collection method third, then retention, thresholds, alert routing, response notes, and a regular review cycle.
| Build step | What to decide | Why it matters |
|---|---|---|
| Failure list | Backups, jobs, DBCC, blocking, deadlocks, storage, tempdb, Query Store, HA, and error-log events. | Monitoring should catch failures that affect recovery, performance, or support work. |
| Data source | DMV, Query Store, Extended Events, SQL Agent history, msdb, error log, Performance Monitor, or vendor collector. | The source controls what can be trusted, retained, and alerted on. |
| Collection | Polling interval, collector location, permissions, repository, and retention. | A one-time query is useful, but it is not monitoring. |
| Alert rule | Threshold, delay, repeat behavior, severity, and suppression rules. | Bad thresholds create noise; weak thresholds miss incidents. |
| Response | First check, escalation path, and the person responsible for action. | An alert without a response path becomes another ignored email. |
| Review cycle | Monthly or quarterly alert review, false-positive cleanup, and missed-failure review. | The setup needs tuning as databases, jobs, workload, and HA topology change. |
How to check SQL Server instance context
Reads server identity, edition, startup time, CPU, memory, clustered state, and HADR state.
sqlserver_start_time tells you how long DMV counters have been accumulating. CPU and memory values give context for pressure checks. IsClustered and IsHadrEnabled show whether cluster or AG checks may apply. This query needs metadata access and may need VIEW SERVER STATE for DMV details.
SQL Server monitoring tools and where they fit
SQL Server already exposes useful monitoring data through DMVs, Query Store, Extended Events, SQL Agent history, msdb backup tables, the SQL Server error log, and Performance Monitor counters.
Performance Monitor is still useful when SQL Server and Windows counters need to be read together. Keep it focused; collecting too many counters adds overhead. Extended Events is the modern lightweight event framework for deadlocks, blocked-process reports, long queries, and targeted troubleshooting.
| Tool model | Use it for |
|---|---|
| DMVs | Current requests, waits, file latency, memory, schedulers, sessions, and AG state. |
| Query Store | Query runtime history, plans, regressions, and release or upgrade comparisons. |
| Extended Events | Deadlocks, blocked-process reports, expensive statements, login failures, and targeted event capture. |
| SQL Agent and Database Mail | Job scheduling, job history, operators, alerts, and email delivery. |
| msdb history | Backups, restores, SQL Agent job output, maintenance jobs, and some CHECKDB history if jobs log it. |
| SQL Server error log | Severity errors, startup messages, backup problems, login problems, AG messages, and failed service-level events. |
| Performance Monitor | Windows and SQL counters together: CPU, memory, disk I/O, buffer manager, and workload rate. |
| Custom SQL Agent or PowerShell checks | Targeted backup, job, DBCC, file, blocking, or capacity checks missing from the main tool. |
| SQL-focused third-party tools | Central alerting, longer history, query diagnosis, baselines, estate views, and DBA workflows. |
| APM or observability platforms | SQL metrics and query data beside application traces, logs, services, and infrastructure. |
| Azure Monitor and database watcher | Azure SQL Database and Managed Instance monitoring through supported Azure paths. |
| Recurring DBA review | Regular review of backups, jobs, alerts, changes, and SQL Server health; not a replacement for live alerting. |
Best third-party SQL Server monitoring tools
There is no single best SQL Server monitoring tool for every company. The better question is which tool catches the failures your people must act on, keeps enough history, and fits the platform you run.
Treat vendor feature lists as a shortlist, not the final decision. Trial backup alerts, SQL Agent coverage, stale DBCC detection, blocking capture, deadlocks, Query Store detail, file latency, tempdb, AG state, retention, permissions, and alert routing before trusting the tool.
Redgate Monitor
Strong fit for SQL Server and database estate monitoring, alerting, query diagnosis, deployment context, and practical DBA workflows.
Trial check: failed backups, SQL Agent failures, blocking, deadlocks, Query Store details, alert routing, and repository sizing.
Common gap: a good rollout still needs alert tuning and a clear response path for each production instance.
SolarWinds SQL Sentry
Strong fit for SQL Server and Azure SQL performance monitoring, Top SQL, blocking, deadlocks, Event Calendar, Always On, and forecasting.
Trial check: collection service placement, query capture settings, retention, deadlock and blocking capture, and alert integration.
Common gap: job, backup, DBCC, and support-process coverage still need checking against the local setup.
IDERA SQL Diagnostic Manager
Strong fit for SQL Server monitoring across on-prem, hybrid, and cloud setups with waits, blocking, deadlocks, tempdb, I/O, query, and root-cause workflows.
Trial check: permissions, repository growth, retention, wait and blocking history, tempdb, file latency, and useful alert defaults.
Common gap: noise control matters; a broad alert set can become background email if nobody tunes it.
Datadog SQL Server integration and Database Monitoring
Strong fit when Datadog already monitors applications, infrastructure, logs, and service health, and SQL Server metrics need to sit beside that data.
Trial check: agent setup, database monitoring configuration, query sampling, service correlation, cost, and retention.
Common gap: SQL Agent, backups, DBCC, restore checks, and DBA-specific runbooks may need custom checks.
New Relic Microsoft SQL monitoring
Strong fit when New Relic is already the application and infrastructure monitoring system and SQL instance metrics need to join that view.
Trial check: integration depth, query-level requirements, alert rules, retention, and incident workflow.
Common gap: DBA-specific checks may need custom collection when the main requirement is SQL operations rather than app observability.
Azure Monitor and database watcher
Strong fit for Azure SQL Database, Azure SQL Managed Instance, and Azure estates already using Azure alerts, logs, and dashboards.
Trial check: platform support, database watcher availability, data store choice, private connectivity, and alert routing.
Common gap: SQL Insights is retired, and SQL Server on Azure VM uses different monitoring options than Azure SQL Database or Managed Instance.
How to choose a SQL Server monitoring tool
Choose the tool from the failure list, not from the dashboard screenshots. The best SQL Server monitoring tool is the one that catches the failures people actually need to act on.
| Trial check | What to verify |
|---|---|
| Failed backup detection | Full, differential, and log backup age by database, including failed backup jobs and stale databases. |
| SQL Agent failure detection | Failed, retried, disabled, long-running, and no-recent-run jobs. |
| Stale CHECKDB detection | Last successful integrity check by database, failed DBCC jobs, and output retention. |
| Blocking chain capture | Head blocker, blocked sessions, wait resource, query text, host, login, and transaction age. |
| Deadlock graph capture | Deadlock XML or graph, victim, process list, resources, query text, and application context. |
| Query text and plan capture | Top SQL, plans, Query Store data, regressions, parameter patterns, and retention. |
| File latency and tempdb history | Read/write latency, file growth, log growth, tempdb space, version store, and spills. |
| AG and listener coverage | Replica role, synchronization, queue sizes, failover state, listener behavior, and replica-specific jobs. |
| Retention and repository sizing | Enough history for releases, incidents, month-end work, and normal baseline comparison. |
| Alert routing and escalation | Severity, first check, repeat behavior, suppression, responsible person, and escalation integration. |
| Permissions and collection overhead | Least-privilege collection, SQL Server 2022 permission changes, sampling cost, and repository growth. |
What to monitor in SQL Server
A useful SQL Server monitoring setup checks recovery, workload, storage, maintenance, and topology together. CPU and memory are not enough.
| Area | Monitor |
|---|---|
| Availability | Instance up, database state, application connection failures, SQL Server error log, service restarts, and recent failovers. |
| Recovery | Full, differential, and log backup age; backup failures; restore test age; restore job output; and log-chain gaps. |
| Maintenance | SQL Agent failures, long-running jobs, stale CHECKDB, index/statistics jobs, cleanup jobs, and Database Mail failures. |
| Performance | Waits, active requests, blocking, deadlocks, Query Store regressions, CPU, memory, worker pressure, and batch rate. |
| Storage | Free space, file growth, log growth, I/O latency, disk queues, backup throughput, and data/log placement. |
| tempdb | File sizes, growth, free space, internal objects, user objects, version store, spills, and contention symptoms. |
| HA and DR | AG synchronization, redo/log send queues, replica health, backup preference, failover readiness, jobs on each replica, and listener behavior. |
| Security-relevant changes | Failed login bursts, new sysadmins, changed service accounts, changed Agent proxies, and audit or compliance job failures where used. |
SQL Server monitoring alerts that need action
Alerts should be routed by production impact and first action. Do not page people for informational events they cannot change.
Each alert needs a threshold, delay, repeat rule, responsible person, first check, and escalation route. Otherwise it becomes background noise.
| Alert | First check |
|---|---|
| Backup is stale or failed | msdb backup history, job output, storage path, backup target, and recent database state. |
| SQL Agent job failed | Failed step, job owner, proxy or credential, command path, job duration, and last successful run. |
| DBCC CHECKDB is stale | Maintenance job history, database list, runtime window, output table or log file, and last known success. |
| Blocking is above threshold | Head blocker, running statement, transaction age, wait resource, login, host, and application. |
| Deadlock burst | Deadlock graph, application path, object/index names, isolation level, retry behavior, and release timing. |
| Data or log file is low on space | Growth setting, drive free space, log reuse wait, recent bulk work, and backup/log backup state. |
| AG replica is unhealthy | Synchronization state, queues, replica connection, failover mode, backup preference, and listener state. |
| High file latency | Affected files, read/write split, storage path, workload timing, backup activity, and VM/storage events. |
How to monitor SQL Server performance and blocking
Monitoring should capture waits, active requests, blockers, query text, reads, writes, CPU, and duration while the issue is happening. After the incident, the useful details are usually gone unless the tool captured them.
How to read SQL Server wait stats
Lists top cumulative waits while excluding common background waits.
wait_seconds helps identify the largest cumulative waits since startup. signal_wait_time_ms can point toward scheduler pressure. High wait totals need workload timing and baseline context. Do not reset wait stats casually in production. Recent SQL Server versions can require VIEW SERVER PERFORMANCE STATE for some performance DMVs.
How to capture current waiting tasks
Shows sessions waiting right now, including wait type, duration, blocking session, resource, command, and database.
wait_duration_ms shows which waits are active long enough to matter. blocking_session_id points to live blocking when the value is not zero. resource_description can identify locks, pages, latches, or other resources. Use this with active requests before changing server settings.
How to capture active requests and blockers
Shows current requests with blocking session, wait details, resource use, database, and running statement.
blocking_session_id identifies live blocking chains. wait_type and wait_resource show what the request is waiting on right now. logical_reads, reads, writes, and cpu_time help prioritize active work. Use the blocking guide when blocking is active and the chain needs deeper review.
How to find Query Store workload candidates
Run inside a user database with Query Store enabled to find recent high-duration or high-read queries.
This query is a candidate list, not a full before-and-after comparison. Capture a baseline window before releases, upgrades, or major maintenance. Compare the same workload window after the change: duration, CPU, reads, executions, and plans. Use Query Store output with waits, plans, and release timing. Query Store queries need database access and Query Store enabled in the user database.
How to monitor SQL Server backups, jobs, and DBCC
Monitoring that misses backups, restores, DBCC, and SQL Agent jobs is incomplete. The application can still work while the recovery path is already broken.
How to check backup monitoring coverage
Reads msdb backup history to show recent backup coverage by backup type.
last_full_backup is the base recovery input. last_log_backup matters for databases in full or bulk-logged recovery model. A missing backup date needs review before trusting the monitoring setup. Backup history does not replace restore testing. The msdb backup tables require access to msdb history.
How to check SQL Agent failures
Reads recent SQL Agent history for failed, retrying, canceled, in-progress, or long-running job steps.
run_status identifies failed, retrying, canceled, and in-progress work. duration_seconds helps catch jobs that now exceed the normal window. job_owner should be valid and intentional. Also check job steps, proxies, credentials, operators, alerts, Database Mail, CmdExec, PowerShell, SSIS, and output files.
How to find stale CHECKDB monitoring
Looks for DBCC CHECKDB mentions in SQL Agent history. Use this as a starting point, not a complete integrity-check audit.
This depends on job history text and job retention, so it can miss custom logging. A better setup stores CHECKDB output in a table or log path with alerting. Monitor both stale CHECKDB and failed CHECKDB. Use the backup guide when restore testing and CHECKDB coverage need deeper review.
How to monitor SQL Server storage, tempdb, and capacity
Free space is not enough. Storage monitoring needs file latency, growth behavior, log growth, data and log placement, backup throughput, tempdb space use, version store, internal objects, and capacity trend.
How to check database file latency
Reads database file paths, sizes, reads/writes, and average I/O stall from sys.dm_io_virtual_file_stats.
physical_name shows where data and log files live. avg_read_ms and avg_write_ms are starting points for storage review. Log files with high write latency need separate attention. Compare this with workload timing before blaming storage. File latency DMVs usually require server-level performance permissions.
How to check tempdb usage
Run in tempdb to review file sizes, free space, user objects, internal objects, version store, and growth settings.
version_store_mb matters when row versioning or long transactions are active. internal_object_mb can rise with sorts, hashes, spools, and spills. growth_setting should be predictable and large enough for production use. Use the tempdb guide when pressure keeps returning.
How to monitor SQL Server HA, DR, and failover
HA monitoring should check whether failover can actually work. Replica health, synchronization state, queues, backup preference, jobs, listener behavior, and failover drill output all matter.
How to check Availability Group monitoring inputs
Shows AG primary, replica mode, failover mode, connection state, synchronization health, and queue size.
primary_replica and role_desc show the current AG layout. availability_mode_desc and failover_mode_desc show configured behavior. synchronization_state_desc and queue sizes need attention before relying on failover. No AG rows means this check does not apply to that instance.
How to read recent SQL Server error log entries
Reads recent SQL Server error log entries containing Error:. Filter further for AG, backup, login, memory, I/O, or severity patterns.
The error log is noisy, but it catches service-level problems other queries can miss. Monitor severity errors, login bursts, AG messages, backup failures, memory errors, I/O errors, and restart messages. Use targeted filters or Extended Events for recurring patterns. Some environments restrict xp_readerrorlog; use approved logging access where required.
How to build a SQL Server monitoring baseline
A threshold only makes sense when it is compared with normal behavior. Build baseline windows before deciding what should page someone.
Capture normal business hours, batch windows, month-end or reporting work, maintenance windows, and the period after releases, storage changes, VM changes, SQL patches, or compatibility-level changes.
How to collect baseline performance counters
Reads common SQL Server counters that can seed monitoring baselines. This is a starting point, not complete monitoring.
Batch Requests/sec is workload rate, not health by itself. Page life expectancy needs local baseline context and memory size context. Some per-second counters are cumulative and need interval sampling. Use this with job schedules and business timing before setting thresholds.
| Baseline window | Why it matters |
|---|---|
| Business hours | Shows normal user workload, concurrency, waits, and request volume. |
| Batch windows | Separates expected job pressure from unexpected production pressure. |
| Month-end or reporting | Prevents heavy but expected workload from looking like an incident. |
| Maintenance windows | Keeps backup, CHECKDB, index, statistics, and cleanup pressure separate from user workload. |
| After releases or changes | Helps catch changed query shape, job timing, resource use, or alert noise. |
What not to do with SQL Server monitoring
Do not monitor only CPU, memory, and disk free space.
Do not buy a monitoring tool and leave alerts without a response path.
Do not alert on every transient wait or one-off CPU spike.
Do not ignore backups, jobs, DBCC, and error logs because the application is online.
Do not copy thresholds from another server without checking the local workload.
Do not give monitoring logins more permissions than they need.
Do not rely on screenshots when history and retention are needed.
Do not treat a green dashboard as restore, failover, or performance validation.
When to request a SQL Server monitoring review
Get help when alerts are noisy, failures are missed, incidents have too little history, or nobody can say which SQL Server checks matter most.
A one-time health audit can check the current setup. Monthly SQL Server DBA support makes more sense when backups, jobs, alerts, changes, and monitoring need regular review.
Send the SQL Server version, instance count, current monitoring tool, alert list, backup history, job history, DBCC history, wait stats, blocking or deadlock examples, Query Store output, storage and tempdb data, and AG or failover status where relevant.
Next step
Use the SQL Server health audit when monitoring is one part of a wider SQL Server check.
Use monthly SQL Server DBA support when backups, jobs, alerts, performance checks, and planned changes need regular attention.
Next useful reads: the SQL Server backup guide, the SQL Server blocking guide, the SQL Server waits guide, the SQL Server failover guide, and the SQL Server tempdb guide.
