sql server hub / monitoring guide

SQL Servermonitoring guide

SQL Server monitoring should catch the problems that matter before a user report is the first alert.

A useful setup checks backups, jobs, DBCC, blocking, deadlocks, waits, file latency, tempdb, Query Store, HA and DR state, and the alert route people actually use.

Guide

Guide~11 min readUpdated 19 Apr 2026

Share

LinkedInXEmail
  1. 01What SQL Server monitoring should catch
  2. 02How to build SQL Server monitoring coverage
  3. 03SQL Server monitoring tools and where they fit
  4. 04Best third-party SQL Server monitoring tools
  5. 05How to choose a SQL Server monitoring tool
  6. 06What to monitor in SQL Server
  7. 07SQL Server monitoring alerts that need action
  8. 08How to monitor SQL Server performance and blocking
  9. 09How to monitor SQL Server backups, jobs, and DBCC
  10. 10How to monitor SQL Server storage, tempdb, and capacity
  11. 11How to monitor SQL Server HA, DR, and failover
  12. 12How to build a SQL Server monitoring baseline
  13. 13What not to do with SQL Server monitoring
  14. 14When to request a SQL Server monitoring review

What SQL Server monitoring should catch

Uptime is only one check. A SQL Server can accept connections while backups are stale, DBCC has not run, jobs are failing, tempdb is filling, and blocking is already hurting users.

Good monitoring watches the work SQL Server depends on, not only whether the service is online. It should notice when recovery work is missing, when scheduled work is failing, when users are waiting, when storage is becoming the bottleneck, and when HA or DR state has changed.

Monitoring is not a dashboard by itself. It needs collection, retention, thresholds, alert routing, first checks, and someone responsible for response. The useful question is not just whether there is a graph, but whether the graph, alert, or history would help someone act before the problem becomes a user report.

Monitoring areaWhat it should catchWhy it matters
RecoveryMissed backups, broken log backup patterns, stale restore testing, and missing backup alerts.Backup history is not useful if nobody notices gaps until a restore is needed.
MaintenanceFailed jobs, stale CHECKDB, cleanup failures, long-running maintenance, and schedule overlap.SQL Agent can look quiet while important work has stopped or drifted into business hours.
WorkloadBlocking chains, deadlocks, high waits, slow queries, plan regressions, and Query Store changes.Performance issues need timing, query text, plans, and session details, not only average CPU graphs.
StorageFile latency, free space, log growth, tempdb pressure, version-store growth, and abnormal file growth.Storage pressure can appear as slow queries, blocking, failed backups, or database growth incidents.
AvailabilityAG replica health, synchronization state, listener problems, failover events, and DR jobs that stopped running.HA and DR only help when the monitoring system sees role changes and broken dependencies.
ResponseAlerts that are routed, acknowledged, repeated when needed, and tied to a first-check note.A correct alert still fails if it goes to an abandoned mailbox or gives nobody enough context to start.

How to build SQL Server monitoring coverage

Build the monitoring setup from production failures backward. Decide which failures must be caught, which SQL Server data source can show them, how often to collect it, how long to keep it, and who gets the alert.

The order matters: critical failures first, data sources second, collection method third, then retention, thresholds, alert routing, response notes, and a regular review cycle.

Build stepWhat to decideWhy it matters
Failure listBackups, jobs, DBCC, blocking, deadlocks, storage, tempdb, Query Store, HA, and error-log events.Monitoring should catch failures that affect recovery, performance, or support work.
Data sourceDMV, Query Store, Extended Events, SQL Agent history, msdb, error log, Performance Monitor, or vendor collector.The source controls what can be trusted, retained, and alerted on.
CollectionPolling interval, collector location, permissions, repository, and retention.A one-time query is useful, but it is not monitoring.
Alert ruleThreshold, delay, repeat behavior, severity, and suppression rules.Bad thresholds create noise; weak thresholds miss incidents.
ResponseFirst check, escalation path, and the person responsible for action.An alert without a response path becomes another ignored email.
Review cycleMonthly or quarterly alert review, false-positive cleanup, and missed-failure review.The setup needs tuning as databases, jobs, workload, and HA topology change.

How to check SQL Server instance context

Reads server identity, edition, startup time, CPU, memory, clustered state, and HADR state.

SELECT
    SERVERPROPERTY('ServerName') AS server_name,
    SERVERPROPERTY('MachineName') AS machine_name,
    SERVERPROPERTY('Edition') AS edition,
    SERVERPROPERTY('ProductVersion') AS product_version,
    SERVERPROPERTY('ProductLevel') AS product_level,
    SERVERPROPERTY('ProductUpdateLevel') AS product_update_level,
    SERVERPROPERTY('IsClustered') AS is_clustered,
    SERVERPROPERTY('IsHadrEnabled') AS is_hadr_enabled,
    si.sqlserver_start_time,
    si.cpu_count,
    si.scheduler_count,
    si.physical_memory_kb / 1024 AS physical_memory_mb,
    si.committed_kb / 1024 AS sql_committed_mb,
    si.committed_target_kb / 1024 AS sql_target_mb
FROM sys.dm_os_sys_info AS si;

sqlserver_start_time tells you how long DMV counters have been accumulating. CPU and memory values give context for pressure checks. IsClustered and IsHadrEnabled show whether cluster or AG checks may apply. This query needs metadata access and may need VIEW SERVER STATE for DMV details.

SQL Server monitoring tools and where they fit

SQL Server already exposes useful monitoring data through DMVs, Query Store, Extended Events, SQL Agent history, msdb backup tables, the SQL Server error log, and Performance Monitor counters.

Performance Monitor is still useful when SQL Server and Windows counters need to be read together. Keep it focused; collecting too many counters adds overhead. Extended Events is the modern lightweight event framework for deadlocks, blocked-process reports, long queries, and targeted troubleshooting.

Tool modelUse it for
DMVsCurrent requests, waits, file latency, memory, schedulers, sessions, and AG state.
Query StoreQuery runtime history, plans, regressions, and release or upgrade comparisons.
Extended EventsDeadlocks, blocked-process reports, expensive statements, login failures, and targeted event capture.
SQL Agent and Database MailJob scheduling, job history, operators, alerts, and email delivery.
msdb historyBackups, restores, SQL Agent job output, maintenance jobs, and some CHECKDB history if jobs log it.
SQL Server error logSeverity errors, startup messages, backup problems, login problems, AG messages, and failed service-level events.
Performance MonitorWindows and SQL counters together: CPU, memory, disk I/O, buffer manager, and workload rate.
Custom SQL Agent or PowerShell checksTargeted backup, job, DBCC, file, blocking, or capacity checks missing from the main tool.
SQL-focused third-party toolsCentral alerting, longer history, query diagnosis, baselines, estate views, and DBA workflows.
APM or observability platformsSQL metrics and query data beside application traces, logs, services, and infrastructure.
Azure Monitor and database watcherAzure SQL Database and Managed Instance monitoring through supported Azure paths.
Recurring DBA reviewRegular review of backups, jobs, alerts, changes, and SQL Server health; not a replacement for live alerting.

Best third-party SQL Server monitoring tools

There is no single best SQL Server monitoring tool for every company. The better question is which tool catches the failures your people must act on, keeps enough history, and fits the platform you run.

Treat vendor feature lists as a shortlist, not the final decision. Trial backup alerts, SQL Agent coverage, stale DBCC detection, blocking capture, deadlocks, Query Store detail, file latency, tempdb, AG state, retention, permissions, and alert routing before trusting the tool.

Redgate Monitor

Strong fit for SQL Server and database estate monitoring, alerting, query diagnosis, deployment context, and practical DBA workflows.

Trial check: failed backups, SQL Agent failures, blocking, deadlocks, Query Store details, alert routing, and repository sizing.

Common gap: a good rollout still needs alert tuning and a clear response path for each production instance.

SolarWinds SQL Sentry

Strong fit for SQL Server and Azure SQL performance monitoring, Top SQL, blocking, deadlocks, Event Calendar, Always On, and forecasting.

Trial check: collection service placement, query capture settings, retention, deadlock and blocking capture, and alert integration.

Common gap: job, backup, DBCC, and support-process coverage still need checking against the local setup.

IDERA SQL Diagnostic Manager

Strong fit for SQL Server monitoring across on-prem, hybrid, and cloud setups with waits, blocking, deadlocks, tempdb, I/O, query, and root-cause workflows.

Trial check: permissions, repository growth, retention, wait and blocking history, tempdb, file latency, and useful alert defaults.

Common gap: noise control matters; a broad alert set can become background email if nobody tunes it.

Datadog SQL Server integration and Database Monitoring

Strong fit when Datadog already monitors applications, infrastructure, logs, and service health, and SQL Server metrics need to sit beside that data.

Trial check: agent setup, database monitoring configuration, query sampling, service correlation, cost, and retention.

Common gap: SQL Agent, backups, DBCC, restore checks, and DBA-specific runbooks may need custom checks.

New Relic Microsoft SQL monitoring

Strong fit when New Relic is already the application and infrastructure monitoring system and SQL instance metrics need to join that view.

Trial check: integration depth, query-level requirements, alert rules, retention, and incident workflow.

Common gap: DBA-specific checks may need custom collection when the main requirement is SQL operations rather than app observability.

Azure Monitor and database watcher

Strong fit for Azure SQL Database, Azure SQL Managed Instance, and Azure estates already using Azure alerts, logs, and dashboards.

Trial check: platform support, database watcher availability, data store choice, private connectivity, and alert routing.

Common gap: SQL Insights is retired, and SQL Server on Azure VM uses different monitoring options than Azure SQL Database or Managed Instance.

How to choose a SQL Server monitoring tool

Choose the tool from the failure list, not from the dashboard screenshots. The best SQL Server monitoring tool is the one that catches the failures people actually need to act on.

Trial checkWhat to verify
Failed backup detectionFull, differential, and log backup age by database, including failed backup jobs and stale databases.
SQL Agent failure detectionFailed, retried, disabled, long-running, and no-recent-run jobs.
Stale CHECKDB detectionLast successful integrity check by database, failed DBCC jobs, and output retention.
Blocking chain captureHead blocker, blocked sessions, wait resource, query text, host, login, and transaction age.
Deadlock graph captureDeadlock XML or graph, victim, process list, resources, query text, and application context.
Query text and plan captureTop SQL, plans, Query Store data, regressions, parameter patterns, and retention.
File latency and tempdb historyRead/write latency, file growth, log growth, tempdb space, version store, and spills.
AG and listener coverageReplica role, synchronization, queue sizes, failover state, listener behavior, and replica-specific jobs.
Retention and repository sizingEnough history for releases, incidents, month-end work, and normal baseline comparison.
Alert routing and escalationSeverity, first check, repeat behavior, suppression, responsible person, and escalation integration.
Permissions and collection overheadLeast-privilege collection, SQL Server 2022 permission changes, sampling cost, and repository growth.

What to monitor in SQL Server

A useful SQL Server monitoring setup checks recovery, workload, storage, maintenance, and topology together. CPU and memory are not enough.

AreaMonitor
AvailabilityInstance up, database state, application connection failures, SQL Server error log, service restarts, and recent failovers.
RecoveryFull, differential, and log backup age; backup failures; restore test age; restore job output; and log-chain gaps.
MaintenanceSQL Agent failures, long-running jobs, stale CHECKDB, index/statistics jobs, cleanup jobs, and Database Mail failures.
PerformanceWaits, active requests, blocking, deadlocks, Query Store regressions, CPU, memory, worker pressure, and batch rate.
StorageFree space, file growth, log growth, I/O latency, disk queues, backup throughput, and data/log placement.
tempdbFile sizes, growth, free space, internal objects, user objects, version store, spills, and contention symptoms.
HA and DRAG synchronization, redo/log send queues, replica health, backup preference, failover readiness, jobs on each replica, and listener behavior.
Security-relevant changesFailed login bursts, new sysadmins, changed service accounts, changed Agent proxies, and audit or compliance job failures where used.

SQL Server monitoring alerts that need action

Alerts should be routed by production impact and first action. Do not page people for informational events they cannot change.

Each alert needs a threshold, delay, repeat rule, responsible person, first check, and escalation route. Otherwise it becomes background noise.

AlertFirst check
Backup is stale or failedmsdb backup history, job output, storage path, backup target, and recent database state.
SQL Agent job failedFailed step, job owner, proxy or credential, command path, job duration, and last successful run.
DBCC CHECKDB is staleMaintenance job history, database list, runtime window, output table or log file, and last known success.
Blocking is above thresholdHead blocker, running statement, transaction age, wait resource, login, host, and application.
Deadlock burstDeadlock graph, application path, object/index names, isolation level, retry behavior, and release timing.
Data or log file is low on spaceGrowth setting, drive free space, log reuse wait, recent bulk work, and backup/log backup state.
AG replica is unhealthySynchronization state, queues, replica connection, failover mode, backup preference, and listener state.
High file latencyAffected files, read/write split, storage path, workload timing, backup activity, and VM/storage events.

How to monitor SQL Server performance and blocking

Monitoring should capture waits, active requests, blockers, query text, reads, writes, CPU, and duration while the issue is happening. After the incident, the useful details are usually gone unless the tool captured them.

How to read SQL Server wait stats

Lists top cumulative waits while excluding common background waits.

SELECT TOP (25)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms,
    wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms,
    CONVERT(decimal(18,2), wait_time_ms / 1000.0) AS wait_seconds
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (
    N'BROKER_EVENTHANDLER', N'BROKER_RECEIVE_WAITFOR', N'BROKER_TASK_STOP',
    N'BROKER_TO_FLUSH', N'BROKER_TRANSMITTER', N'CHECKPOINT_QUEUE',
    N'CHKPT', N'CLR_AUTO_EVENT', N'CLR_MANUAL_EVENT', N'LAZYWRITER_SLEEP',
    N'LOGMGR_QUEUE', N'REQUEST_FOR_DEADLOCK_SEARCH', N'RESOURCE_QUEUE',
    N'SERVER_IDLE_CHECK', N'SLEEP_TASK', N'SLEEP_SYSTEMTASK',
    N'SQLTRACE_BUFFER_FLUSH', N'WAITFOR', N'XE_DISPATCHER_WAIT',
    N'XE_TIMER_EVENT'
)
ORDER BY wait_time_ms DESC;

wait_seconds helps identify the largest cumulative waits since startup. signal_wait_time_ms can point toward scheduler pressure. High wait totals need workload timing and baseline context. Do not reset wait stats casually in production. Recent SQL Server versions can require VIEW SERVER PERFORMANCE STATE for some performance DMVs.

How to capture current waiting tasks

Shows sessions waiting right now, including wait type, duration, blocking session, resource, command, and database.

SELECT TOP (50)
    wt.session_id,
    wt.exec_context_id,
    wt.wait_type,
    wt.wait_duration_ms,
    wt.blocking_session_id,
    wt.resource_description,
    r.status,
    r.command,
    DB_NAME(r.database_id) AS database_name,
    r.cpu_time,
    r.logical_reads,
    r.reads,
    r.writes
FROM sys.dm_os_waiting_tasks AS wt
LEFT JOIN sys.dm_exec_requests AS r
    ON r.session_id = wt.session_id
ORDER BY wt.wait_duration_ms DESC;

wait_duration_ms shows which waits are active long enough to matter. blocking_session_id points to live blocking when the value is not zero. resource_description can identify locks, pages, latches, or other resources. Use this with active requests before changing server settings.

How to capture active requests and blockers

Shows current requests with blocking session, wait details, resource use, database, and running statement.

SELECT
    r.session_id,
    s.host_name,
    s.program_name,
    s.login_name,
    DB_NAME(r.database_id) AS database_name,
    r.status,
    r.command,
    r.blocking_session_id,
    r.wait_type,
    r.wait_time,
    r.wait_resource,
    r.cpu_time,
    r.logical_reads,
    r.reads,
    r.writes,
    SUBSTRING(
        t.text,
        (r.statement_start_offset / 2) + 1,
        (
            (CASE r.statement_end_offset
                WHEN -1 THEN DATALENGTH(t.text)
                ELSE r.statement_end_offset
             END - r.statement_start_offset) / 2
        ) + 1
    ) AS running_statement
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s
    ON s.session_id = r.session_id
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID
ORDER BY r.cpu_time DESC, r.logical_reads DESC;

blocking_session_id identifies live blocking chains. wait_type and wait_resource show what the request is waiting on right now. logical_reads, reads, writes, and cpu_time help prioritize active work. Use the blocking guide when blocking is active and the chain needs deeper review.

How to find Query Store workload candidates

Run inside a user database with Query Store enabled to find recent high-duration or high-read queries.

SELECT TOP (25)
    OBJECT_NAME(q.object_id) AS object_name,
    qt.query_sql_text,
    p.plan_id,
    SUM(rs.count_executions) AS executions,
    CONVERT(decimal(18,2), SUM(rs.avg_duration * rs.count_executions) / NULLIF(SUM(rs.count_executions), 0) / 1000.0) AS avg_duration_ms,
    CONVERT(decimal(18,2), SUM(rs.avg_cpu_time * rs.count_executions) / NULLIF(SUM(rs.count_executions), 0) / 1000.0) AS avg_cpu_ms,
    CONVERT(decimal(18,2), SUM(rs.avg_logical_io_reads * rs.count_executions) / NULLIF(SUM(rs.count_executions), 0)) AS avg_logical_reads,
    MAX(rsi.end_time) AS last_interval_end
FROM sys.query_store_runtime_stats AS rs
JOIN sys.query_store_runtime_stats_interval AS rsi
    ON rsi.runtime_stats_interval_id = rs.runtime_stats_interval_id
JOIN sys.query_store_plan AS p
    ON p.plan_id = rs.plan_id
JOIN sys.query_store_query AS q
    ON q.query_id = p.query_id
JOIN sys.query_store_query_text AS qt
    ON qt.query_text_id = q.query_text_id
WHERE rsi.start_time >= DATEADD(day, -7, SYSUTCDATETIME())
GROUP BY OBJECT_NAME(q.object_id), qt.query_sql_text, p.plan_id
ORDER BY avg_duration_ms DESC, avg_logical_reads DESC;

This query is a candidate list, not a full before-and-after comparison. Capture a baseline window before releases, upgrades, or major maintenance. Compare the same workload window after the change: duration, CPU, reads, executions, and plans. Use Query Store output with waits, plans, and release timing. Query Store queries need database access and Query Store enabled in the user database.

How to monitor SQL Server backups, jobs, and DBCC

Monitoring that misses backups, restores, DBCC, and SQL Agent jobs is incomplete. The application can still work while the recovery path is already broken.

How to check backup monitoring coverage

Reads msdb backup history to show recent backup coverage by backup type.

WITH LastBackups AS (
    SELECT
        bs.database_name,
        bs.type,
        MAX(bs.backup_finish_date) AS last_backup_finish_date
    FROM msdb.dbo.backupset AS bs
    WHERE bs.type IN ('D', 'I', 'L')
    GROUP BY bs.database_name, bs.type
)
SELECT
    d.name AS database_name,
    d.recovery_model_desc,
    d.state_desc,
    MAX(CASE WHEN lb.type = 'D' THEN lb.last_backup_finish_date END) AS last_full_backup,
    MAX(CASE WHEN lb.type = 'I' THEN lb.last_backup_finish_date END) AS last_differential_backup,
    MAX(CASE WHEN lb.type = 'L' THEN lb.last_backup_finish_date END) AS last_log_backup
FROM sys.databases AS d
LEFT JOIN LastBackups AS lb
    ON lb.database_name = d.name
WHERE d.name <> N'tempdb'
GROUP BY d.name, d.recovery_model_desc, d.state_desc
ORDER BY d.name;

last_full_backup is the base recovery input. last_log_backup matters for databases in full or bulk-logged recovery model. A missing backup date needs review before trusting the monitoring setup. Backup history does not replace restore testing. The msdb backup tables require access to msdb history.

How to check SQL Agent failures

Reads recent SQL Agent history for failed, retrying, canceled, in-progress, or long-running job steps.

SELECT TOP (100)
    j.name AS job_name,
    SUSER_SNAME(j.owner_sid) AS job_owner,
    h.step_id,
    h.step_name,
    CASE h.run_status
        WHEN 0 THEN 'Failed'
        WHEN 1 THEN 'Succeeded'
        WHEN 2 THEN 'Retry'
        WHEN 3 THEN 'Canceled'
        WHEN 4 THEN 'In progress'
    END AS run_status,
    msdb.dbo.agent_datetime(h.run_date, h.run_time) AS run_datetime,
    ((h.run_duration / 10000) * 3600)
        + (((h.run_duration % 10000) / 100) * 60)
        + (h.run_duration % 100) AS duration_seconds,
    h.message
FROM msdb.dbo.sysjobhistory AS h
JOIN msdb.dbo.sysjobs AS j
    ON j.job_id = h.job_id
WHERE h.run_status <> 1
   OR h.run_duration >= 010000
ORDER BY run_datetime DESC;

run_status identifies failed, retrying, canceled, and in-progress work. duration_seconds helps catch jobs that now exceed the normal window. job_owner should be valid and intentional. Also check job steps, proxies, credentials, operators, alerts, Database Mail, CmdExec, PowerShell, SSIS, and output files.

How to find stale CHECKDB monitoring

Looks for DBCC CHECKDB mentions in SQL Agent history. Use this as a starting point, not a complete integrity-check audit.

SELECT
    d.name AS database_name,
    d.state_desc,
    d.user_access_desc,
    d.recovery_model_desc,
    MAX(CASE
        WHEN h.step_name LIKE N'%CHECKDB%'
          OR h.message LIKE N'%DBCC CHECKDB%'
        THEN msdb.dbo.agent_datetime(h.run_date, h.run_time)
    END) AS last_checkdb_seen_in_job_history
FROM sys.databases AS d
LEFT JOIN msdb.dbo.sysjobhistory AS h
    ON h.message LIKE N'%' + d.name + N'%'
WHERE d.name <> N'tempdb'
GROUP BY d.name, d.state_desc, d.user_access_desc, d.recovery_model_desc
ORDER BY d.name;

This depends on job history text and job retention, so it can miss custom logging. A better setup stores CHECKDB output in a table or log path with alerting. Monitor both stale CHECKDB and failed CHECKDB. Use the backup guide when restore testing and CHECKDB coverage need deeper review.

How to monitor SQL Server storage, tempdb, and capacity

Free space is not enough. Storage monitoring needs file latency, growth behavior, log growth, data and log placement, backup throughput, tempdb space use, version store, internal objects, and capacity trend.

How to check database file latency

Reads database file paths, sizes, reads/writes, and average I/O stall from sys.dm_io_virtual_file_stats.

SELECT
    DB_NAME(mf.database_id) AS database_name,
    mf.type_desc,
    mf.name AS logical_file_name,
    mf.physical_name,
    CONVERT(decimal(18,2), mf.size * 8.0 / 1024.0) AS size_mb,
    vfs.num_of_reads,
    vfs.num_of_writes,
    CONVERT(decimal(18,2), vfs.io_stall_read_ms * 1.0 / NULLIF(vfs.num_of_reads, 0)) AS avg_read_ms,
    CONVERT(decimal(18,2), vfs.io_stall_write_ms * 1.0 / NULLIF(vfs.num_of_writes, 0)) AS avg_write_ms,
    CONVERT(decimal(18,2), vfs.io_stall * 1.0 / NULLIF(vfs.num_of_reads + vfs.num_of_writes, 0)) AS avg_io_ms
FROM sys.master_files AS mf
CROSS APPLY sys.dm_io_virtual_file_stats(mf.database_id, mf.file_id) AS vfs
ORDER BY avg_io_ms DESC, database_name, mf.file_id;

physical_name shows where data and log files live. avg_read_ms and avg_write_ms are starting points for storage review. Log files with high write latency need separate attention. Compare this with workload timing before blaming storage. File latency DMVs usually require server-level performance permissions.

How to check tempdb usage

Run in tempdb to review file sizes, free space, user objects, internal objects, version store, and growth settings.

USE tempdb;

SELECT
    df.file_id,
    df.name AS logical_file_name,
    df.type_desc,
    df.physical_name,
    CONVERT(decimal(18,2), df.size * 8.0 / 1024.0) AS file_size_mb,
    CONVERT(decimal(18,2), fsu.unallocated_extent_page_count * 8.0 / 1024.0) AS free_space_mb,
    CONVERT(decimal(18,2), fsu.user_object_reserved_page_count * 8.0 / 1024.0) AS user_object_mb,
    CONVERT(decimal(18,2), fsu.internal_object_reserved_page_count * 8.0 / 1024.0) AS internal_object_mb,
    CONVERT(decimal(18,2), fsu.version_store_reserved_page_count * 8.0 / 1024.0) AS version_store_mb,
    CASE
        WHEN df.is_percent_growth = 1 THEN CONCAT(df.growth, N'%')
        ELSE CONCAT(CONVERT(decimal(18,2), df.growth * 8.0 / 1024.0), N' MB')
    END AS growth_setting
FROM tempdb.sys.database_files AS df
LEFT JOIN sys.dm_db_file_space_usage AS fsu
    ON fsu.file_id = df.file_id
ORDER BY df.file_id;

version_store_mb matters when row versioning or long transactions are active. internal_object_mb can rise with sorts, hashes, spools, and spills. growth_setting should be predictable and large enough for production use. Use the tempdb guide when pressure keeps returning.

How to monitor SQL Server HA, DR, and failover

HA monitoring should check whether failover can actually work. Replica health, synchronization state, queues, backup preference, jobs, listener behavior, and failover drill output all matter.

How to check Availability Group monitoring inputs

Shows AG primary, replica mode, failover mode, connection state, synchronization health, and queue size.

SELECT
    ag.name AS availability_group_name,
    ags.primary_replica,
    ar.replica_server_name,
    ars.role_desc,
    ar.availability_mode_desc,
    ar.failover_mode_desc,
    ars.connected_state_desc,
    ars.synchronization_health_desc,
    drs.database_id,
    DB_NAME(drs.database_id) AS database_name,
    drs.synchronization_state_desc,
    drs.synchronization_health_desc AS database_synchronization_health,
    drs.log_send_queue_size,
    drs.redo_queue_size
FROM sys.availability_groups AS ag
JOIN sys.dm_hadr_availability_group_states AS ags
    ON ags.group_id = ag.group_id
JOIN sys.availability_replicas AS ar
    ON ar.group_id = ag.group_id
LEFT JOIN sys.dm_hadr_availability_replica_states AS ars
    ON ars.replica_id = ar.replica_id
LEFT JOIN sys.dm_hadr_database_replica_states AS drs
    ON drs.replica_id = ar.replica_id
ORDER BY ag.name, ar.replica_server_name, database_name;

primary_replica and role_desc show the current AG layout. availability_mode_desc and failover_mode_desc show configured behavior. synchronization_state_desc and queue sizes need attention before relying on failover. No AG rows means this check does not apply to that instance.

How to read recent SQL Server error log entries

Reads recent SQL Server error log entries containing Error:. Filter further for AG, backup, login, memory, I/O, or severity patterns.

EXEC sys.xp_readerrorlog 0, 1, N'Error:', NULL, NULL, NULL, N'desc';

The error log is noisy, but it catches service-level problems other queries can miss. Monitor severity errors, login bursts, AG messages, backup failures, memory errors, I/O errors, and restart messages. Use targeted filters or Extended Events for recurring patterns. Some environments restrict xp_readerrorlog; use approved logging access where required.

How to build a SQL Server monitoring baseline

A threshold only makes sense when it is compared with normal behavior. Build baseline windows before deciding what should page someone.

Capture normal business hours, batch windows, month-end or reporting work, maintenance windows, and the period after releases, storage changes, VM changes, SQL patches, or compatibility-level changes.

How to collect baseline performance counters

Reads common SQL Server counters that can seed monitoring baselines. This is a starting point, not complete monitoring.

SELECT
    object_name,
    counter_name,
    instance_name,
    cntr_value,
    cntr_type
FROM sys.dm_os_performance_counters
WHERE counter_name IN (
    N'Batch Requests/sec',
    N'Page life expectancy',
    N'Lazy writes/sec',
    N'Checkpoint pages/sec',
    N'Log Flushes/sec',
    N'Log Flush Wait Time',
    N'User Connections',
    N'Processes blocked'
)
ORDER BY object_name, counter_name, instance_name;

Batch Requests/sec is workload rate, not health by itself. Page life expectancy needs local baseline context and memory size context. Some per-second counters are cumulative and need interval sampling. Use this with job schedules and business timing before setting thresholds.

Baseline windowWhy it matters
Business hoursShows normal user workload, concurrency, waits, and request volume.
Batch windowsSeparates expected job pressure from unexpected production pressure.
Month-end or reportingPrevents heavy but expected workload from looking like an incident.
Maintenance windowsKeeps backup, CHECKDB, index, statistics, and cleanup pressure separate from user workload.
After releases or changesHelps catch changed query shape, job timing, resource use, or alert noise.

What not to do with SQL Server monitoring

1

Do not monitor only CPU, memory, and disk free space.

2

Do not buy a monitoring tool and leave alerts without a response path.

3

Do not alert on every transient wait or one-off CPU spike.

4

Do not ignore backups, jobs, DBCC, and error logs because the application is online.

5

Do not copy thresholds from another server without checking the local workload.

6

Do not give monitoring logins more permissions than they need.

7

Do not rely on screenshots when history and retention are needed.

8

Do not treat a green dashboard as restore, failover, or performance validation.

When to request a SQL Server monitoring review

Get help when alerts are noisy, failures are missed, incidents have too little history, or nobody can say which SQL Server checks matter most.

A one-time health audit can check the current setup. Monthly SQL Server DBA support makes more sense when backups, jobs, alerts, changes, and monitoring need regular review.

Send the SQL Server version, instance count, current monitoring tool, alert list, backup history, job history, DBCC history, wait stats, blocking or deadlock examples, Query Store output, storage and tempdb data, and AG or failover status where relevant.

Next step

Use the SQL Server health audit when monitoring is one part of a wider SQL Server check.

Use monthly SQL Server DBA support when backups, jobs, alerts, performance checks, and planned changes need regular attention.

Next useful reads: the SQL Server backup guide, the SQL Server blocking guide, the SQL Server waits guide, the SQL Server failover guide, and the SQL Server tempdb guide.