sql server hub / monitoring guide

SQL Server monitoring guide

SQL Server monitoring should catch the problems that matter before a user report is the first alert.

A useful setup checks backups, jobs, DBCC, blocking, deadlocks, waits, file latency, tempdb, Query Store, HA and DR state, and the alert route people actually use.

Guide

Guide~14 min readUpdated 9 Jul 2026

LinkedIn X Email

01What SQL Server monitoring should catch
02How to build SQL Server monitoring coverage
03How to set up SQL Server monitoring without expensive tools
04SQL Server monitoring tools and where they fit
05Best third-party SQL Server monitoring tools
06How to choose a SQL Server monitoring tool
07What to monitor in SQL Server
08SQL Server monitoring alerts that need action
09SQL Server error log monitoring
10How to monitor SQL Server performance and blocking
11How to monitor SQL Server backups, jobs, and DBCC
12How to monitor SQL Server storage, tempdb, and capacity
13How to monitor SQL Server HA, DR, and failover
14How to build a SQL Server monitoring baseline
15What not to do with SQL Server monitoring
16When to request a SQL Server monitoring review

What SQL Server monitoring should catch

Uptime is only one check. A SQL Server can accept connections while backups are stale, DBCC has not run, jobs are failing, tempdb is filling, and blocking is already hurting users.

Good monitoring watches the work SQL Server depends on, not only whether the service is online. It should notice when recovery work is missing, when scheduled work is failing, when users are waiting, when storage is becoming the bottleneck, and when HA or DR state has changed.

Monitoring is not a dashboard by itself. It needs collection, retention, thresholds, alert routing, first checks, and someone responsible for response. The useful question is not just whether there is a graph, but whether the graph, alert, or history would help someone act before the problem becomes a user report.

Monitoring area	What it should catch	Why it matters
Recovery	Missed backups, broken log backup patterns, stale restore testing, and missing backup alerts.	Backup history is not useful if nobody notices gaps until a restore is needed.
Maintenance	Failed jobs, stale CHECKDB, cleanup failures, long-running maintenance, and schedule overlap.	SQL Agent can look quiet while important work has stopped or drifted into business hours.
Workload	Blocking chains, deadlocks, high waits, slow queries, plan regressions, and Query Store changes.	Performance issues need timing, query text, plans, and session details, not only average CPU graphs.
Storage	File latency, free space, log growth, tempdb pressure, version-store growth, and abnormal file growth.	Storage pressure can appear as slow queries, blocking, failed backups, or database growth incidents.
Availability	AG replica health, synchronization state, listener problems, failover events, and DR jobs that stopped running.	HA and DR only help when the monitoring system sees role changes and broken dependencies.
Response	Alerts that are routed, acknowledged, repeated when needed, and tied to a first-check note.	A correct alert still fails if it goes to an abandoned mailbox or gives nobody enough context to start.

How to build SQL Server monitoring coverage

Build the monitoring setup from production failures backward. Decide which failures must be caught, which SQL Server data source can show them, how often to collect it, how long to keep it, and who gets the alert.

The order matters: critical failures first, data sources second, collection method third, then retention, thresholds, alert routing, response notes, and a regular review cycle.

Build step	What to decide	Why it matters
Failure list	Backups, jobs, DBCC, blocking, deadlocks, storage, tempdb, Query Store, HA, and error-log events.	Monitoring should catch failures that affect recovery, performance, or support work.
Data source	DMV, Query Store, Extended Events, SQL Agent history, msdb, error log, Performance Monitor, or vendor collector.	The source controls what can be trusted, retained, and alerted on.
Collection	Polling interval, collector location, permissions, repository, and retention.	A one-time query is useful, but it is not monitoring.
Alert rule	Threshold, delay, repeat behavior, severity, and suppression rules.	Bad thresholds create noise; weak thresholds miss incidents.
Response	First check, escalation path, and the person responsible for action.	An alert without a response path becomes another ignored email.
Review cycle	Monthly or quarterly alert review, false-positive cleanup, and missed-failure review.	The setup needs tuning as databases, jobs, workload, and HA topology change.

How to check SQL Server instance context

Reads server identity, edition, startup time, CPU, memory, clustered state, and HADR state.

SELECT
    SERVERPROPERTY('ServerName') AS server_name,
    SERVERPROPERTY('MachineName') AS machine_name,
    SERVERPROPERTY('Edition') AS edition,
    SERVERPROPERTY('ProductVersion') AS product_version,
    SERVERPROPERTY('ProductLevel') AS product_level,
    SERVERPROPERTY('ProductUpdateLevel') AS product_update_level,
    SERVERPROPERTY('IsClustered') AS is_clustered,
    SERVERPROPERTY('IsHadrEnabled') AS is_hadr_enabled,
    si.sqlserver_start_time,
    si.cpu_count,
    si.scheduler_count,
    si.physical_memory_kb / 1024 AS physical_memory_mb,
    si.committed_kb / 1024 AS sql_committed_mb,
    si.committed_target_kb / 1024 AS sql_target_mb
FROM sys.dm_os_sys_info AS si;

sqlserver_start_time tells you how long DMV counters have been accumulating. CPU and memory values give context for pressure checks. IsClustered and IsHadrEnabled show whether cluster or AG checks may apply. This query needs metadata access and may need VIEW SERVER STATE for DMV details.

How to set up SQL Server monitoring without expensive tools

You can set up useful SQL Server monitoring without buying a large monitoring platform first. Start with the checks SQL Server already gives you: SQL Agent history, Database Mail, operators, alerts, backup history, the error log, Query Store, Extended Events, and Performance Monitor.

The tradeoff is maintenance. A low-cost setup still needs alert routing, history retention, thresholds, a simple response note, and someone responsible for checking whether alerts still work.

This approach fits small environments where the first goal is to catch missed backups, failed SQL Agent jobs, serious errors, blocking, deadlocks, storage pressure, and query regressions. If the number of instances grows, a SQL-focused monitoring tool can become cheaper than maintaining custom checks by hand.

Input	Use it for	Watch out
SQL Agent history	Failed jobs, long-running jobs, retried jobs, canceled steps, job owners, and recent run patterns.	History retention must be long enough to investigate missed failures.
Database Mail, operators, and Agent alerts	Job failure emails, severity alerts, performance alerts, and the people or groups receiving them.	Mail setup and operator addresses need testing; an alert that cannot send is decoration.
msdb backup history	Last full, differential, and log backup by database, plus gaps in expected backup chains.	Backup history shows job output, but restore testing still needs its own check.
SQL Server error log and Windows Application log	Severity errors, login bursts, I/O messages, service restarts, AG messages, and backup errors.	Logs are noisy, so filters and a review rhythm matter.
Query Store	Query runtime history, plan changes, regressions, and release comparisons in each database where it is enabled.	It needs database-level setup, retention decisions, and review before changes.
Extended Events	Deadlocks, blocked-process reports, login failures, and targeted query or error capture.	Keep sessions focused and test collection overhead before broad rollout.
Performance Monitor	CPU, memory, disk, SQL Server counters, and simple threshold alerts.	Counters need sampling and local baselines; one snapshot rarely explains an incident.
Existing infrastructure monitoring	Service status, disk free space, VM health, storage latency symptoms, and basic host alerts.	It still needs SQL-specific checks for jobs, backups, waits, blocking, and Query Store.

SQL Server monitoring tools and where they fit

SQL Server already exposes useful monitoring data through DMVs, Query Store, Extended Events, SQL Agent history, msdb backup tables, the SQL Server error log, and Performance Monitor counters.

Performance Monitor is still useful when SQL Server and Windows counters need to be read together. Keep it focused; collecting too many counters adds overhead. Extended Events is the modern lightweight event framework for deadlocks, blocked-process reports, long queries, and targeted troubleshooting.

Tool model	Use it for
DMVs	Current requests, waits, file latency, memory, schedulers, sessions, and AG state.
Query Store	Query runtime history, plans, regressions, and release or upgrade comparisons.
Extended Events	Deadlocks, blocked-process reports, expensive statements, login failures, and targeted event capture.
SQL Agent and Database Mail	Job scheduling, job history, operators, alerts, and email delivery.
msdb history	Backups, restores, SQL Agent job output, maintenance jobs, and some CHECKDB history if jobs log it.
SQL Server error log	Severity errors, startup messages, backup problems, login problems, AG messages, and failed service-level events.
Performance Monitor	Windows and SQL counters together: CPU, memory, disk I/O, buffer manager, and workload rate.
Custom SQL Agent or PowerShell checks	Targeted backup, job, DBCC, file, blocking, or capacity checks missing from the main tool.
SQL-focused third-party tools	Central alerting, longer history, query diagnosis, baselines, estate views, and DBA workflows.
APM or observability platforms	SQL metrics and query data beside application traces, logs, services, and infrastructure.
Azure Monitor and database watcher	Azure SQL Database and Managed Instance monitoring through supported Azure paths.
Recurring DBA review	Regular review of backups, jobs, alerts, changes, and SQL Server health; not a replacement for live alerting.

Best third-party SQL Server monitoring tools

There is no single best SQL Server monitoring tool for every company. The better question is which tool catches the failures your people must act on, keeps enough history, and fits the platform you run.

Treat vendor feature lists as a shortlist, not the final decision. Trial backup alerts, SQL Agent coverage, stale DBCC detection, blocking capture, deadlocks, Query Store detail, file latency, tempdb, AG state, retention, permissions, and alert routing before trusting the tool.

For a deeper tool-by-tool comparison, read the 10 best SQL Server monitoring tools article.

Redgate Monitor

Strong fit for SQL Server and database estate monitoring, alerting, query diagnosis, deployment context, and practical DBA workflows.

Trial check: failed backups, SQL Agent failures, blocking, deadlocks, Query Store details, alert routing, and repository sizing.

Common gap: a good rollout still needs alert tuning and a clear response path for each production instance.

SolarWinds SQL Sentry

Strong fit for SQL Server and Azure SQL performance monitoring, Top SQL, blocking, deadlocks, Event Calendar, Always On, and forecasting.

Trial check: collection service placement, query capture settings, retention, deadlock and blocking capture, and alert integration.

Common gap: job, backup, DBCC, and support-process coverage still need checking against the local setup.

IDERA SQL Diagnostic Manager

Strong fit for SQL Server monitoring across on-prem, hybrid, and cloud setups with waits, blocking, deadlocks, tempdb, I/O, query, and root-cause workflows.

Trial check: permissions, repository growth, retention, wait and blocking history, tempdb, file latency, and useful alert defaults.

Common gap: noise control matters; a broad alert set can become background email if nobody tunes it.

Datadog SQL Server integration and Database Monitoring

Strong fit when Datadog already monitors applications, infrastructure, logs, and service health, and SQL Server metrics need to sit beside that data.

Trial check: agent setup, database monitoring configuration, query sampling, service correlation, cost, and retention.

Common gap: SQL Agent, backups, DBCC, restore checks, and DBA-specific runbooks may need custom checks.

New Relic Microsoft SQL monitoring

Strong fit when New Relic is already the application and infrastructure monitoring system and SQL instance metrics need to join that view.

Trial check: integration depth, query-level requirements, alert rules, retention, and incident workflow.

Common gap: DBA-specific checks may need custom collection when the main requirement is SQL operations rather than app observability.

Azure Monitor and database watcher

Strong fit for Azure SQL Database, Azure SQL Managed Instance, and Azure estates already using Azure alerts, logs, and dashboards.

Trial check: platform support, database watcher availability, data store choice, private connectivity, and alert routing.

Common gap: SQL Insights is retired, and SQL Server on Azure VM uses different monitoring options than Azure SQL Database or Managed Instance.

How to choose a SQL Server monitoring tool

Choose the tool from the failure list, not from the dashboard screenshots. The best SQL Server monitoring tool is the one that catches the failures people actually need to act on.

Trial check	What to verify
Failed backup detection	Full, differential, and log backup age by database, including failed backup jobs and stale databases.
SQL Agent failure detection	Failed, retried, disabled, long-running, and no-recent-run jobs.
Stale CHECKDB detection	Last successful integrity check by database, failed DBCC jobs, and output retention.
Blocking chain capture	Head blocker, blocked sessions, wait resource, query text, host, login, and transaction age.
Deadlock graph capture	Deadlock XML or graph, victim, process list, resources, query text, and application context.
Query text and plan capture	Top SQL, plans, Query Store data, regressions, parameter patterns, and retention.
File latency and tempdb history	Read/write latency, file growth, log growth, tempdb space, version store, and spills.
AG and listener coverage	Replica role, synchronization, queue sizes, failover state, listener behavior, and replica-specific jobs.
Retention and repository sizing	Enough history for releases, incidents, month-end work, and normal baseline comparison.
Alert routing and escalation	Severity, first check, repeat behavior, suppression, responsible person, and escalation integration.
Permissions and collection overhead	Least-privilege collection, SQL Server 2022 permission changes, sampling cost, and repository growth.

What to monitor in SQL Server

A useful SQL Server monitoring setup checks recovery, workload, storage, maintenance, and topology together. CPU and memory are not enough.

Area	Monitor
Availability	Instance up, database state, application connection failures, SQL Server error log, service restarts, and recent failovers.
Recovery	Full, differential, and log backup age; backup failures; restore test age; restore job output; and log-chain gaps.
Maintenance	SQL Agent failures, long-running jobs, stale CHECKDB, index/statistics jobs, cleanup jobs, and Database Mail failures.
Performance	Waits, active requests, blocking, deadlocks, Query Store regressions, CPU, memory, worker pressure, and batch rate.
Storage	Free space, file growth, log growth, I/O latency, disk queues, backup throughput, and data/log placement.
tempdb	File sizes, growth, free space, internal objects, user objects, version store, spills, and contention symptoms.
HA and DR	AG synchronization, redo/log send queues, replica health, backup preference, failover readiness, jobs on each replica, and listener behavior.
Security-relevant changes	Failed login bursts, new sysadmins, changed service accounts, changed Agent proxies, and audit or compliance job failures where used.

SQL Server monitoring alerts that need action

Alerts should be routed by production impact and first action. Do not page people for informational events they cannot change.

Each alert needs a threshold, delay, repeat rule, responsible person, first check, and escalation route. Otherwise it becomes background noise.

Alert	First check
Backup is stale or failed	msdb backup history, job output, storage path, backup target, and recent database state.
SQL Agent job failed	Failed step, job owner, proxy or credential, command path, job duration, and last successful run.
DBCC CHECKDB is stale	Maintenance job history, database list, runtime window, output table or log file, and last known success.
Blocking is above threshold	Head blocker, running statement, transaction age, wait resource, login, host, and application.
Deadlock burst	Deadlock graph, application path, object/index names, isolation level, retry behavior, and release timing.
Data or log file is low on space	Growth setting, drive free space, log reuse wait, recent bulk work, and backup/log backup state.
AG replica is unhealthy	Synchronization state, queues, replica connection, failover mode, backup preference, and listener state.
High file latency	Affected files, read/write split, storage path, workload timing, backup activity, and VM/storage events.

SQL Server error log monitoring

SQL Server error log monitoring should catch serious engine messages, failed service starts, I/O warnings, corruption messages, login bursts, backup and restore errors, Always On state changes, and repeated restart patterns.

The log is noisy, so monitor it with filters and a review rhythm. Treat new severity 17-25 errors, 823/824/825 I/O messages, failed recovery messages, login failures, and AG state changes as reasons to check the surrounding jobs, storage, backups, and Windows Application log.

Log pattern	First check
Severity 17-25 errors	Error number, database, time window, recent workload, storage state, and whether the error repeats.
823, 824, or 825 messages	Storage path, Windows logs, backups, CHECKDB output, and vendor/storage events.
Recovery or restore messages	Database state, restore sequence, startup timing, crash recovery, and recent failover or restart.
Login failures	Source host, login name, application, password changes, failed-login bursts, and security review path.
Always On messages	Replica role, synchronization health, listener state, cluster log, failover timing, and AG alerts.

How to sample SQL Server error log errors

Reads recent SQL Server error log entries containing Error: from the current error log.

EXEC sys.xp_readerrorlog 0, 1, N'Error:', NULL, NULL, NULL, N'desc';

xp_readerrorlog is useful for targeted review, but the output needs filtering. Search for specific error numbers such as 823, 824, 825, 9002, 1205, or AG messages when the symptom is known. Pair SQL Server error log review with Windows Application and System logs for service, disk, and cluster events. Do not build alerting from one broad text search without noise tuning.

How to monitor SQL Server performance and blocking

Monitoring should capture waits, active requests, blockers, query text, reads, writes, CPU, and duration while the issue is happening. After the incident, the useful details are usually gone unless the tool captured them.

How to read SQL Server wait stats

Lists top cumulative waits while excluding common background waits.

SELECT TOP (25)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms,
    wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms,
    CONVERT(decimal(18,2), wait_time_ms / 1000.0) AS wait_seconds
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (
    N'BROKER_EVENTHANDLER', N'BROKER_RECEIVE_WAITFOR', N'BROKER_TASK_STOP',
    N'BROKER_TO_FLUSH', N'BROKER_TRANSMITTER', N'CHECKPOINT_QUEUE',
    N'CHKPT', N'CLR_AUTO_EVENT', N'CLR_MANUAL_EVENT', N'LAZYWRITER_SLEEP',
    N'LOGMGR_QUEUE', N'REQUEST_FOR_DEADLOCK_SEARCH', N'RESOURCE_QUEUE',
    N'SERVER_IDLE_CHECK', N'SLEEP_TASK', N'SLEEP_SYSTEMTASK',
    N'SQLTRACE_BUFFER_FLUSH', N'WAITFOR', N'XE_DISPATCHER_WAIT',
    N'XE_TIMER_EVENT'
)
ORDER BY wait_time_ms DESC;

wait_seconds helps identify the largest cumulative waits since startup. signal_wait_time_ms can point toward scheduler pressure. High wait totals need workload timing and baseline context. Do not reset wait stats casually in production. Recent SQL Server versions can require VIEW SERVER PERFORMANCE STATE for some performance DMVs.

How to capture current waiting tasks

Shows sessions waiting right now, including wait type, duration, blocking session, resource, command, and database.

SELECT TOP (50)
    wt.session_id,
    wt.exec_context_id,
    wt.wait_type,
    wt.wait_duration_ms,
    wt.blocking_session_id,
    wt.resource_description,
    r.status,
    r.command,
    DB_NAME(r.database_id) AS database_name,
    r.cpu_time,
    r.logical_reads,
    r.reads,
    r.writes
FROM sys.dm_os_waiting_tasks AS wt
LEFT JOIN sys.dm_exec_requests AS r
    ON r.session_id = wt.session_id
ORDER BY wt.wait_duration_ms DESC;

wait_duration_ms shows which waits are active long enough to matter. blocking_session_id points to live blocking when the value is not zero. resource_description can identify locks, pages, latches, or other resources. Use this with active requests before changing server settings.

How to capture active requests and blockers

Shows current requests with blocking session, wait details, resource use, database, and running statement.

SELECT
    r.session_id,
    s.host_name,
    s.program_name,
    s.login_name,
    DB_NAME(r.database_id) AS database_name,
    r.status,
    r.command,
    r.blocking_session_id,
    r.wait_type,
    r.wait_time,
    r.wait_resource,
    r.cpu_time,
    r.logical_reads,
    r.reads,
    r.writes,
    SUBSTRING(
        t.text,
        (r.statement_start_offset / 2) + 1,
        (
            (CASE r.statement_end_offset
                WHEN -1 THEN DATALENGTH(t.text)
                ELSE r.statement_end_offset
             END - r.statement_start_offset) / 2
        ) + 1
    ) AS running_statement
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s
    ON s.session_id = r.session_id
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID
ORDER BY r.cpu_time DESC, r.logical_reads DESC;

blocking_session_id identifies live blocking chains. wait_type and wait_resource show what the request is waiting on right now. logical_reads, reads, writes, and cpu_time help prioritize active work. Use the blocking guide when blocking is active and the chain needs deeper review.

How to find Query Store workload candidates

Run inside a user database with Query Store enabled to find recent high-duration or high-read queries.

SELECT TOP (25)
    OBJECT_NAME(q.object_id) AS object_name,
    qt.query_sql_text,
    p.plan_id,
    SUM(rs.count_executions) AS executions,
    CONVERT(decimal(18,2), SUM(rs.avg_duration * rs.count_executions) / NULLIF(SUM(rs.count_executions), 0) / 1000.0) AS avg_duration_ms,
    CONVERT(decimal(18,2), SUM(rs.avg_cpu_time * rs.count_executions) / NULLIF(SUM(rs.count_executions), 0) / 1000.0) AS avg_cpu_ms,
    CONVERT(decimal(18,2), SUM(rs.avg_logical_io_reads * rs.count_executions) / NULLIF(SUM(rs.count_executions), 0)) AS avg_logical_reads,
    MAX(rsi.end_time) AS last_interval_end
FROM sys.query_store_runtime_stats AS rs
JOIN sys.query_store_runtime_stats_interval AS rsi
    ON rsi.runtime_stats_interval_id = rs.runtime_stats_interval_id
JOIN sys.query_store_plan AS p
    ON p.plan_id = rs.plan_id
JOIN sys.query_store_query AS q
    ON q.query_id = p.query_id
JOIN sys.query_store_query_text AS qt
    ON qt.query_text_id = q.query_text_id
WHERE rsi.start_time >= DATEADD(day, -7, SYSUTCDATETIME())
GROUP BY OBJECT_NAME(q.object_id), qt.query_sql_text, p.plan_id
ORDER BY avg_duration_ms DESC, avg_logical_reads DESC;

This query is a candidate list, not a full before-and-after comparison. Capture a baseline window before releases, upgrades, or major maintenance. Compare the same workload window after the change: duration, CPU, reads, executions, and plans. Use Query Store output with waits, plans, and release timing. Query Store queries need database access and Query Store enabled in the user database.

How to monitor SQL Server backups, jobs, and DBCC

Monitoring that misses backups, restores, DBCC, and SQL Agent jobs is incomplete. The application can still work while the recovery path is already broken or scheduled work has stopped.

SQL Agent job monitoring should catch more than a red failed-job icon. It should show failed steps, retries, long runs, disabled schedules, missed expected runs, job history retention, notification settings, and whether Database Mail and operators can actually reach someone.

SQL Agent check	What to verify
Failed jobs	Check failed job steps, step output, error message, owner, proxy, credential, command path, and last successful run.
Long-running jobs	Compare current duration with normal duration for the same schedule window before treating it as an incident.
Disabled jobs or schedules	Review disabled jobs, disabled schedules, job owners, and whether the disabled state is documented.
Missed expected runs	Compare schedules with recent job history so a job that never started does not look healthy.
Job history retention	Keep enough msdb job history to review incidents, month-end work, and recurring failures.
Notifications	Test Database Mail, operators, job notification settings, SQL Agent alert system settings, and the mailbox people actually read.
Output detail	Store job-step output where possible so the alert leads to the failing command, file path, package, or stored procedure.
Response path	Write down who checks the job first, what they check, and when the problem escalates.

How to check backup monitoring coverage

Reads msdb backup history to show recent backup coverage by backup type.

WITH LastBackups AS (
    SELECT
        bs.database_name,
        bs.type,
        MAX(bs.backup_finish_date) AS last_backup_finish_date
    FROM msdb.dbo.backupset AS bs
    WHERE bs.type IN ('D', 'I', 'L')
    GROUP BY bs.database_name, bs.type
)
SELECT
    d.name AS database_name,
    d.recovery_model_desc,
    d.state_desc,
    MAX(CASE WHEN lb.type = 'D' THEN lb.last_backup_finish_date END) AS last_full_backup,
    MAX(CASE WHEN lb.type = 'I' THEN lb.last_backup_finish_date END) AS last_differential_backup,
    MAX(CASE WHEN lb.type = 'L' THEN lb.last_backup_finish_date END) AS last_log_backup
FROM sys.databases AS d
LEFT JOIN LastBackups AS lb
    ON lb.database_name = d.name
WHERE d.name <> N'tempdb'
GROUP BY d.name, d.recovery_model_desc, d.state_desc
ORDER BY d.name;

last_full_backup is the base recovery input. last_log_backup matters for databases in full or bulk-logged recovery model. A missing backup date needs review before trusting the monitoring setup. Backup history does not replace restore testing. The msdb backup tables require access to msdb history.

How to check SQL Agent failures

Reads recent SQL Agent history for failed, retrying, canceled, in-progress, or long-running job steps.

SELECT TOP (100)
    j.name AS job_name,
    SUSER_SNAME(j.owner_sid) AS job_owner,
    h.step_id,
    h.step_name,
    CASE h.run_status
        WHEN 0 THEN 'Failed'
        WHEN 1 THEN 'Succeeded'
        WHEN 2 THEN 'Retry'
        WHEN 3 THEN 'Canceled'
        WHEN 4 THEN 'In progress'
    END AS run_status,
    msdb.dbo.agent_datetime(h.run_date, h.run_time) AS run_datetime,
    ((h.run_duration / 10000) * 3600)
        + (((h.run_duration % 10000) / 100) * 60)
        + (h.run_duration % 100) AS duration_seconds,
    h.message
FROM msdb.dbo.sysjobhistory AS h
JOIN msdb.dbo.sysjobs AS j
    ON j.job_id = h.job_id
WHERE h.run_status <> 1
   OR h.run_duration >= 010000
ORDER BY run_datetime DESC;

run_status identifies failed, retrying, canceled, and in-progress work. duration_seconds helps catch jobs that now exceed the normal window. job_owner should be valid and intentional. Also check job steps, proxies, credentials, operators, alerts, Database Mail, CmdExec, PowerShell, SSIS, and output files.

How to find stale CHECKDB monitoring

Looks for DBCC CHECKDB mentions in SQL Agent history. Use this as a starting point, not a complete integrity-check audit.

SELECT
    d.name AS database_name,
    d.state_desc,
    d.user_access_desc,
    d.recovery_model_desc,
    MAX(CASE
        WHEN h.step_name LIKE N'%CHECKDB%'
          OR h.message LIKE N'%DBCC CHECKDB%'
        THEN msdb.dbo.agent_datetime(h.run_date, h.run_time)
    END) AS last_checkdb_seen_in_job_history
FROM sys.databases AS d
LEFT JOIN msdb.dbo.sysjobhistory AS h
    ON h.message LIKE N'%' + d.name + N'%'
WHERE d.name <> N'tempdb'
GROUP BY d.name, d.state_desc, d.user_access_desc, d.recovery_model_desc
ORDER BY d.name;

This depends on job history text and job retention, so it can miss custom logging. A better setup stores CHECKDB output in a table or log path with alerting. Monitor both stale CHECKDB and failed CHECKDB. Use the backup guide when restore testing and CHECKDB coverage need deeper review.

How to monitor SQL Server storage, tempdb, and capacity

Free space is not enough. Storage monitoring needs file latency, growth behavior, log growth, data and log placement, backup throughput, tempdb space use, version store, internal objects, and capacity trend.

How to check database file latency

Reads database file paths, sizes, reads/writes, and average I/O stall from sys.dm_io_virtual_file_stats.

SELECT
    DB_NAME(mf.database_id) AS database_name,
    mf.type_desc,
    mf.name AS logical_file_name,
    mf.physical_name,
    CONVERT(decimal(18,2), mf.size * 8.0 / 1024.0) AS size_mb,
    vfs.num_of_reads,
    vfs.num_of_writes,
    CONVERT(decimal(18,2), vfs.io_stall_read_ms * 1.0 / NULLIF(vfs.num_of_reads, 0)) AS avg_read_ms,
    CONVERT(decimal(18,2), vfs.io_stall_write_ms * 1.0 / NULLIF(vfs.num_of_writes, 0)) AS avg_write_ms,
    CONVERT(decimal(18,2), vfs.io_stall * 1.0 / NULLIF(vfs.num_of_reads + vfs.num_of_writes, 0)) AS avg_io_ms
FROM sys.master_files AS mf
CROSS APPLY sys.dm_io_virtual_file_stats(mf.database_id, mf.file_id) AS vfs
ORDER BY avg_io_ms DESC, database_name, mf.file_id;

physical_name shows where data and log files live. avg_read_ms and avg_write_ms are starting points for storage review. Log files with high write latency need separate attention. Compare this with workload timing before blaming storage. File latency DMVs usually require server-level performance permissions.

How to check tempdb usage

Run in tempdb to review file sizes, free space, user objects, internal objects, version store, and growth settings.

USE tempdb;

SELECT
    df.file_id,
    df.name AS logical_file_name,
    df.type_desc,
    df.physical_name,
    CONVERT(decimal(18,2), df.size * 8.0 / 1024.0) AS file_size_mb,
    CONVERT(decimal(18,2), fsu.unallocated_extent_page_count * 8.0 / 1024.0) AS free_space_mb,
    CONVERT(decimal(18,2), fsu.user_object_reserved_page_count * 8.0 / 1024.0) AS user_object_mb,
    CONVERT(decimal(18,2), fsu.internal_object_reserved_page_count * 8.0 / 1024.0) AS internal_object_mb,
    CONVERT(decimal(18,2), fsu.version_store_reserved_page_count * 8.0 / 1024.0) AS version_store_mb,
    CASE
        WHEN df.is_percent_growth = 1 THEN CONCAT(df.growth, N'%')
        ELSE CONCAT(CONVERT(decimal(18,2), df.growth * 8.0 / 1024.0), N' MB')
    END AS growth_setting
FROM tempdb.sys.database_files AS df
LEFT JOIN sys.dm_db_file_space_usage AS fsu
    ON fsu.file_id = df.file_id
ORDER BY df.file_id;

version_store_mb matters when row versioning or long transactions are active. internal_object_mb can rise with sorts, hashes, spools, and spills. growth_setting should be predictable and large enough for production use. Use the tempdb guide when pressure keeps returning.

How to monitor SQL Server HA, DR, and failover

HA monitoring should check whether failover can actually work. Replica health, synchronization state, queues, backup preference, jobs, listener behavior, and failover drill output all matter.

How to check Availability Group monitoring inputs

Shows AG primary, replica mode, failover mode, connection state, synchronization health, and queue size.

SELECT
    ag.name AS availability_group_name,
    ags.primary_replica,
    ar.replica_server_name,
    ars.role_desc,
    ar.availability_mode_desc,
    ar.failover_mode_desc,
    ars.connected_state_desc,
    ars.synchronization_health_desc,
    drs.database_id,
    DB_NAME(drs.database_id) AS database_name,
    drs.synchronization_state_desc,
    drs.synchronization_health_desc AS database_synchronization_health,
    drs.log_send_queue_size,
    drs.redo_queue_size
FROM sys.availability_groups AS ag
JOIN sys.dm_hadr_availability_group_states AS ags
    ON ags.group_id = ag.group_id
JOIN sys.availability_replicas AS ar
    ON ar.group_id = ag.group_id
LEFT JOIN sys.dm_hadr_availability_replica_states AS ars
    ON ars.replica_id = ar.replica_id
LEFT JOIN sys.dm_hadr_database_replica_states AS drs
    ON drs.replica_id = ar.replica_id
ORDER BY ag.name, ar.replica_server_name, database_name;

primary_replica and role_desc show the current AG layout. availability_mode_desc and failover_mode_desc show configured behavior. synchronization_state_desc and queue sizes need attention before relying on failover. No AG rows means this check does not apply to that instance.

How to read recent SQL Server error log entries

Reads recent SQL Server error log entries containing Error:. Filter further for AG, backup, login, memory, I/O, or severity patterns.

EXEC sys.xp_readerrorlog 0, 1, N'Error:', NULL, NULL, NULL, N'desc';

The error log is noisy, but it catches service-level problems other queries can miss. Monitor severity errors, login bursts, AG messages, backup failures, memory errors, I/O errors, and restart messages. Use targeted filters or Extended Events for recurring patterns. Some environments restrict xp_readerrorlog; use approved logging access where required.

How to build a SQL Server monitoring baseline

A threshold only makes sense when it is compared with normal behavior. Build baseline windows before deciding what should page someone.

Capture normal business hours, batch windows, month-end or reporting work, maintenance windows, and the period after releases, storage changes, VM changes, SQL patches, or compatibility-level changes.

How to collect baseline performance counters

Reads common SQL Server counters that can seed monitoring baselines. This is a starting point, not complete monitoring.

SELECT
    object_name,
    counter_name,
    instance_name,
    cntr_value,
    cntr_type
FROM sys.dm_os_performance_counters
WHERE counter_name IN (
    N'Batch Requests/sec',
    N'Page life expectancy',
    N'Lazy writes/sec',
    N'Checkpoint pages/sec',
    N'Log Flushes/sec',
    N'Log Flush Wait Time',
    N'User Connections',
    N'Processes blocked'
)
ORDER BY object_name, counter_name, instance_name;

Batch Requests/sec is workload rate, not health by itself. Page life expectancy needs local baseline context and memory size context. Some per-second counters are cumulative and need interval sampling. Use this with job schedules and business timing before setting thresholds.

Baseline window	Why it matters
Business hours	Shows normal user workload, concurrency, waits, and request volume.
Batch windows	Separates expected job pressure from unexpected production pressure.
Month-end or reporting	Prevents heavy but expected workload from looking like an incident.
Maintenance windows	Keeps backup, CHECKDB, index, statistics, and cleanup pressure separate from user workload.
After releases or changes	Helps catch changed query shape, job timing, resource use, or alert noise.

What not to do with SQL Server monitoring

Do not monitor only CPU, memory, and disk free space.

Do not buy a monitoring tool and leave alerts without a response path.

Do not alert on every transient wait or one-off CPU spike.

Do not ignore backups, jobs, DBCC, and error logs because the application is online.

Do not copy thresholds from another server without checking the local workload.

Do not give monitoring logins more permissions than they need.

Do not rely on screenshots when history and retention are needed.

Do not treat a green dashboard as restore, failover, or performance validation.

When to request a SQL Server monitoring review

A monitoring review makes sense when alerts are noisy, failures are missed, or incidents do not leave enough history to explain what happened.

Start with the SQL Server version, instance count, current monitoring tool, and the failures or alerts that made you question the setup. That is enough for a first look.

Request monitoring review

Next step

Use the SQL Server health audit when monitoring is one part of a wider SQL Server check.

Use monthly SQL Server DBA support when backups, jobs, alerts, performance checks, and planned changes need regular attention.

Next useful reads: the SQL Server backup tools comparison, the SSMS alternatives comparison, the SQL Server backup guide, the SQL Server blocking guide, the SQL Server waits guide, the SQL Server failover guide, and the SQL Server tempdb guide.