sql server / health check checklist

SQL Serverhealth check checklist

A practical DBA checklist for reviewing whether a SQL Server is safe enough to operate, change, or hand over.

Use it to check the basics that hurt production quietly: backups, restores, SQL Agent, monitoring, configuration, patch level, security, HA/DR, and workload pressure. For scripts and deeper DMV checks, use the SQL Server health check guide.

SQL Server health check guide / SQL Server backup guide / SQL Server maintenance plan guide / SQL Server monitoring gaps / SQL Server health audit

Guide

Checklist~11 min readUpdated 30 May 2026

LinkedIn X Email

On this page

1. How to use this SQL Server health check checklist
2. SQL Server health check checklist
3. SQL Server health check checklist item notes
4. What to do after the SQL Server health check checklist

Start here

How to use this SQL Server health check checklist

Work through the table in order. The first rows are deliberately recovery and ownership checks because those failures hurt fastest.

Use the notes below the table when a row is weak or unclear. They explain what the result usually means and what to do next.

Do not turn every row into immediate change work. Sort findings into fix now, schedule, review deeper, watch, or accept for now.

Master checklist

SQL Server health check checklist

Start with the items that can stop recovery or hide failed work: scope, ownership, backups, restores, and alerts. Then move into capacity, performance, security, HA/DR, and the actions worth taking after the review.

Priority	Checklist item	Area	Why it matters	What to look for
01	Production instance and database scope	Scope	You cannot review SQL Server health until the production instances, databases, and dependencies are clear.	Instance names, environments, critical databases, listeners, applications, reporting loads, vendor systems, and DR targets.
02	Business-critical databases	Scope	Not every database deserves the same attention. Critical databases need the tightest recovery, monitoring, and change checks.	Revenue, operations, customer-facing, identity, integration, reporting, and batch databases that would hurt the business if unavailable.
03	SQL Server ownership and escalation path	Ownership	Backups, alerts, jobs, and incidents fail slowly when nobody is clearly responsible for acting on them.	Named SQL owner, change owner, backup responder, incident contact, vendor boundary, and after-hours path.
04	Recent full backup history	Recovery	Full backups are the base of most restore plans. Missing or stale full backups make the rest of recovery harder.	Last full backup per important database, failed runs, copy-only surprises, duration changes, and skipped databases.
05	Differential and log backup coverage	Recovery	Differential and log backups usually decide how much data the company loses after a restore.	Recovery model, last differential backup, last log backup, broken log-chain patterns, and databases in FULL without log backups.
06	Restore test history	Recovery	Backup success does not mean restore success. Restore testing is where many backup plans stop looking good.	Last restore test, target server, restored database name, duration, CHECKDB after restore, and who ran it.
07	DBCC CHECKDB or integrity-check coverage	Integrity	Corruption is rare until it is not. Integrity checks tell you whether backups may contain damaged data.	Recent DBCC CHECKDB history, failures, skipped large databases, physical-only shortcuts, and where results are logged.
08	Backup destination, retention, and free space	Recovery	Backups can be technically successful and still useless if storage fills, cleanup deletes too early, or files never leave the server.	Backup path, free space, retention period, cleanup job, off-server copy, encryption, compression, and access control.
09	SQL Agent failed jobs	Jobs	Failed jobs often hide backup, ETL, reporting, maintenance, and cleanup problems.	Failed jobs, failed steps, retry loops, disabled jobs, long-running jobs, and jobs that stopped writing useful history.
10	SQL Agent notifications and operators	Jobs	A failed job is only useful if the right person finds out quickly enough.	SQL Agent operators, Database Mail, notification settings, alert recipients, stale email addresses, and failed mail delivery.
11	Critical maintenance jobs	Maintenance	Maintenance jobs can keep an environment stable, or they can waste I/O and hide stale assumptions.	Backup jobs, CHECKDB jobs, index/statistics jobs, cleanup jobs, history retention, schedule overlap, and runtime drift.
12	Disk free space and file growth	Capacity	Low disk space and poor growth settings turn normal workload changes into outages or long pauses.	Drive free space, database file sizes, growth increments, percent growth, max size, file placement, and recent autogrowth events.
13	Database log growth and reuse waits	Capacity	Transaction log problems usually point to backup gaps, long transactions, replication, AG queues, or workload changes.	Log size, VLF shape, log growth, log reuse wait reason, long transactions, and last log backup.
14	tempdb size, growth, and pressure	Capacity	tempdb pressure affects sorting, version store, spills, row versioning, and many production workloads.	File count, size, growth, free space, version store, spills, allocation waits, and sudden growth.
15	SQL Server error log warnings	Logs	The error log often shows the first clear signs of failed backups, failed logins, I/O issues, memory pressure, and startup problems.	Recent errors, warnings, login failures, stack dumps, I/O messages, memory messages, and repeated startup entries.
16	Severity errors and corruption messages	Logs	High-severity errors and corruption messages can mean damaged data, storage issues, or failing components.	Severity 17-25 messages, 823, 824, 825, CHECKDB failures, suspect pages, and stack dumps.
17	Blocking, deadlocks, and long transactions	Performance	Concurrency problems can make a healthy-looking server feel broken to users.	Head blockers, deadlock graphs, long transactions, lock waits, blocked sessions, and application timing.
18	Top waits and resource pressure	Performance	Waits help separate CPU, memory, I/O, locking, logging, and network pressure.	Top waits, signal wait time, resource waits, active requests, CPU pressure, memory grants, and I/O waits.
19	Query Store status and plan regressions	Performance	Query Store helps find plan changes and expensive queries without guessing from one live moment.	Query Store enabled state, read-write state, capture mode, size limits, top regressions, and forced plans.
20	Max server memory, MAXDOP, and cost threshold	Config	Core instance settings can create avoidable memory pressure or parallelism noise when left at poor defaults.	Max server memory, min server memory, MAXDOP, cost threshold for parallelism, edition limits, CPU count, and workload type.
21	Storage latency and I/O stalls	Storage	Slow storage affects backups, restores, queries, tempdb, logs, and maintenance windows.	Virtual file stats, read latency, write latency, log write waits, file placement, and storage alerts.
22	Patch level and support status	Security	Unsupported or badly outdated builds make troubleshooting, security, and upgrade planning harder.	SQL Server version, product level, CU/GDR level, edition, Windows version, support lifecycle, and pending patches.
23	Sysadmin access and elevated logins	Security	Too many sysadmins makes change control, incident review, and security cleanup much harder.	Sysadmin members, server roles, database owners, shared accounts, old admin logins, and application logins with elevated rights.
24	Service accounts, proxies, and credentials	Security	Automation often runs with more access than it needs, and broken credentials can quietly break jobs.	SQL Server service accounts, SQL Agent service account, proxies, credentials, job owners, and external resource access.
25	Unsafe enabled features and linked servers	Security	Optional features and linked servers can widen exposure if nobody remembers why they are enabled.	xp_cmdshell, Ole Automation, CLR, external scripts, ad hoc distributed queries, linked servers, and remote login mappings.
26	Monitoring alerts and routing	Monitoring	Monitoring only helps when alerts reach people who can act and include enough detail to start diagnosis.	Backup alerts, failed-job alerts, disk alerts, service alerts, severity alerts, deadlock alerts, and recipient routing.
27	Alert noise and ignored warnings	Monitoring	Too much noise teaches people to ignore the system, including the alerts that matter.	Muted alerts, repeated alerts, ignored warnings, unclear thresholds, dashboards nobody checks, and tickets closed without action.
28	HA/DR status and replica health	HA / DR	A green HA screen does not mean failover or recovery will work cleanly.	AG health, cluster health, replica synchronization, send queue, redo queue, failover mode, backup preference, and DR lag.
29	Failover readiness: jobs, logins, DNS, app behavior	HA / DR	Failover breaks in the boring places: missing jobs, missing logins, DNS, firewall rules, and application connection behavior.	Jobs on secondary nodes, logins, linked servers, credentials, operators, listener DNS, firewall rules, and application validation steps.
30	Final action list: fix now, schedule, review deeper, watch, accept	Output	A health check is useful only when it turns findings into the next work list.	Items that need immediate work, next maintenance window work, deeper review, trend watching, or documented acceptance.

Item notes

SQL Server health check checklist item notes

Use these notes after the table. Each one explains what a weak result usually means and what the next action should be.

Scope

Production instance and database scope

Start by deciding exactly which SQL Server instances and databases are in the review. This is partly technical and partly ownership work: production instances, reporting replicas, DR targets, vendor-managed systems, and application dependencies all need to be named before the health check means anything. If the scope is fuzzy, the later findings will be fuzzy too.

Scope

Business-critical databases

Use the inventory query as a helper, not as the final answer. SQL Server can show size, recovery model, owner, and state, but it cannot tell you which database runs payroll, identity, billing, production orders, or customer-facing traffic. Mark critical databases explicitly so backup, restore, monitoring, and maintenance checks are judged against the right level of risk.

Database inventory helper

Use this to list databases, size, owner, state, recovery model, and Query Store status before marking business-critical databases manually.

SELECT
    d.name AS database_name,
    d.state_desc,
    d.recovery_model_desc,
    d.compatibility_level,
    d.is_read_only,
    d.is_query_store_on,
    SUSER_SNAME(d.owner_sid) AS database_owner,
    CONVERT(decimal(18,2), SUM(mf.size) * 8.0 / 1024.0) AS size_mb,
    COUNT(mf.file_id) AS file_count
FROM sys.databases AS d
LEFT JOIN sys.master_files AS mf
    ON mf.database_id = d.database_id
WHERE d.name <> N'tempdb'
GROUP BY
    d.name,
    d.state_desc,
    d.recovery_model_desc,
    d.compatibility_level,
    d.is_read_only,
    d.is_query_store_on,
    d.owner_sid
ORDER BY size_mb DESC, d.name;

This query cannot identify business importance by itself. Use it as an inventory helper, then add application and business context.

Ownership

SQL Server ownership and escalation path

This item is deliberately not a SQL query. The important question is who acts when something fails: backups, jobs, alerts, storage, access, and incidents all need a named owner and a usable escalation path. If the answer is a shared mailbox, an old vendor address, or one person’s memory, write that down as a real finding.

Recovery

Recent full backup history

Full backup history is one of the first technical checks because it affects every restore path. Look for important databases with no recent full backup, copy-only backups being mistaken for the normal base, unusual duration changes, and backups that finish after the maintenance window. Missing full backups should be handled before lower-impact clean-up work.

Recent full backup history

Check the most recent full backups and spot missing, stale, copy-only, or unusually slow full backups.

SELECT
    d.name AS database_name,
    d.recovery_model_desc,
    d.state_desc,
    MAX(CASE WHEN bs.type = 'D' AND bs.is_copy_only = 0 THEN bs.backup_finish_date END) AS last_full_backup,
    MAX(CASE WHEN bs.type = 'D' AND bs.is_copy_only = 1 THEN bs.backup_finish_date END) AS last_copy_only_full_backup,
    MAX(CASE WHEN bs.type = 'D' THEN DATEDIFF(second, bs.backup_start_date, bs.backup_finish_date) END) AS last_full_duration_seconds
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS bs
    ON bs.database_name = d.name
   AND bs.type = 'D'
WHERE d.name <> N'tempdb'
GROUP BY d.name, d.recovery_model_desc, d.state_desc
ORDER BY last_full_backup, d.name;

A copy-only full backup is useful for ad hoc work, but it is not the normal differential base.

Recovery

Differential and log backup coverage

Differential and log backups show whether the backup plan matches the expected data-loss window. A database in FULL recovery without regular log backups is usually a recovery problem and sometimes a log-growth problem too. Check the pattern per database, not just whether the backup jobs exist.

Differential and log backup coverage

Review latest backup time by type and identify FULL or BULK_LOGGED databases without recent log backups.

WITH LastBackups AS (
    SELECT
        bs.database_name,
        bs.type,
        MAX(bs.backup_finish_date) AS last_backup_finish_date
    FROM msdb.dbo.backupset AS bs
    WHERE bs.type IN ('D', 'I', 'L')
    GROUP BY bs.database_name, bs.type
)
SELECT
    d.name AS database_name,
    d.recovery_model_desc,
    MAX(CASE WHEN lb.type = 'D' THEN lb.last_backup_finish_date END) AS last_full_backup,
    MAX(CASE WHEN lb.type = 'I' THEN lb.last_backup_finish_date END) AS last_differential_backup,
    MAX(CASE WHEN lb.type = 'L' THEN lb.last_backup_finish_date END) AS last_log_backup,
    CASE
        WHEN d.recovery_model_desc IN (N'FULL', N'BULK_LOGGED')
         AND MAX(CASE WHEN lb.type = 'L' THEN lb.last_backup_finish_date END) IS NULL
        THEN N'Missing log backup history'
        ELSE N'Check timing against RPO'
    END AS review_note
FROM sys.databases AS d
LEFT JOIN LastBackups AS lb
    ON lb.database_name = d.name
WHERE d.name <> N'tempdb'
GROUP BY d.name, d.recovery_model_desc
ORDER BY d.name;

This checks history in msdb. If backup history is purged aggressively, compare with your backup tool as well.

Recovery

Restore test history

Restore history is the quickest way to challenge a backup plan that looks fine on paper. A restore test should show the source database, destination database, timing, and whether the restored copy was checked afterward. If restore history is empty on the production server, ask where restore tests are performed and how the result is recorded.

Restore test history

Read restore history recorded in msdb so restore testing is not judged from backup status alone.

SELECT TOP (100)
    rh.destination_database_name,
    rh.restore_date,
    rh.restore_type,
    rh.recovery,
    rh.replace,
    bs.database_name AS source_database_name,
    bs.backup_finish_date,
    bmf.physical_device_name
FROM msdb.dbo.restorehistory AS rh
LEFT JOIN msdb.dbo.backupset AS bs
    ON bs.backup_set_id = rh.backup_set_id
LEFT JOIN msdb.dbo.backupmediafamily AS bmf
    ON bmf.media_set_id = bs.media_set_id
ORDER BY rh.restore_date DESC;

Restore tests may run on a different server. Empty local history is a prompt to find where restore tests are recorded.

Integrity

DBCC CHECKDB or integrity-check coverage

Integrity checks matter because backups can preserve corruption just as reliably as healthy data. This query looks for SQL Agent job history that mentions CHECKDB or integrity checks, which is only a practical starting point. If the environment logs CHECKDB results elsewhere, use that source too and confirm that important databases are not being skipped.

Integrity-check job history

Look for recent SQL Agent job steps that appear to run CHECKDB or integrity checks.

SELECT TOP (100)
    j.name AS job_name,
    h.step_id,
    h.step_name,
    h.run_date,
    h.run_time,
    h.run_duration,
    h.run_status,
    h.message
FROM msdb.dbo.sysjobhistory AS h
JOIN msdb.dbo.sysjobs AS j
    ON j.job_id = h.job_id
WHERE h.run_date >= CONVERT(int, CONVERT(char(8), DATEADD(day, -30, GETDATE()), 112))
  AND (
      h.step_name LIKE N'%CHECKDB%'
      OR h.message LIKE N'%CHECKDB%'
      OR h.step_name LIKE N'%integrity%'
      OR h.message LIKE N'%integrity%'
  )
ORDER BY h.instance_id DESC;

This reads job history only. If integrity checks are logged to tables or external tooling, review that source too.

Recovery

Backup destination, retention, and free space

Backup files need more than successful job status. Review where they land, whether compression and checksums are used, whether retention is long enough, and whether the destination has room to survive the next failure. If all backups live only on the SQL Server host, the recovery plan still has a major weakness.

Backup files and backup options

Review recent backup destination paths, sizes, checksums, compression, and copy-only status.

SELECT TOP (200)
    bs.database_name,
    CASE bs.type
        WHEN 'D' THEN 'Full'
        WHEN 'I' THEN 'Differential'
        WHEN 'L' THEN 'Log'
        ELSE bs.type
    END AS backup_type,
    bs.backup_finish_date,
    CONVERT(decimal(18,2), bs.backup_size / 1024.0 / 1024.0) AS backup_size_mb,
    CONVERT(decimal(18,2), bs.compressed_backup_size / 1024.0 / 1024.0) AS compressed_size_mb,
    bs.has_backup_checksums,
    bs.is_copy_only,
    bmf.physical_device_name
FROM msdb.dbo.backupset AS bs
LEFT JOIN msdb.dbo.backupmediafamily AS bmf
    ON bmf.media_set_id = bs.media_set_id
WHERE bs.backup_finish_date >= DATEADD(day, -14, SYSDATETIME())
ORDER BY bs.backup_finish_date DESC;

This does not verify that files still exist on disk. Use it to check where SQL Server wrote the backups and which options were recorded.

Jobs

SQL Agent failed jobs

Failed SQL Agent jobs often explain health issues before any dashboard does. Focus on backup, integrity, ETL, reporting, cleanup, and business-critical jobs first. Repeated failures should become fix work, not a daily habit of acknowledging the same red status.

Recent SQL Agent failures

Find recent failed SQL Agent job steps and messages.

SELECT TOP (100)
    j.name AS job_name,
    h.step_id,
    h.step_name,
    h.run_date,
    h.run_time,
    h.run_duration,
    h.run_status,
    h.sql_severity,
    h.message
FROM msdb.dbo.sysjobhistory AS h
JOIN msdb.dbo.sysjobs AS j
    ON j.job_id = h.job_id
WHERE h.run_status = 0
  AND h.run_date >= CONVERT(int, CONVERT(char(8), DATEADD(day, -14, GETDATE()), 112))
ORDER BY h.instance_id DESC;

run_status 0 means failed. Review step-level failures, not only the job summary row.

Jobs

SQL Agent notifications and operators

Job notifications are only useful when they reach someone who can act. Check operators, notification settings, and Database Mail status together. A job can be correctly configured and still fail quietly if the operator email is stale or mail delivery is broken.

SQL Agent operators and notifications

Check operators, job notification settings, and Database Mail profile configuration.

SELECT
    o.name AS operator_name,
    o.enabled,
    o.email_address,
    o.last_email_date,
    o.last_email_time
FROM msdb.dbo.sysoperators AS o
ORDER BY o.enabled DESC, o.name;

SELECT
    j.name AS job_name,
    j.enabled AS job_enabled,
    j.notify_level_email,
    o.name AS email_operator_name
FROM msdb.dbo.sysjobs AS j
LEFT JOIN msdb.dbo.sysoperators AS o
    ON o.id = j.notify_email_operator_id
WHERE j.notify_level_email <> 0
   OR j.notify_email_operator_id <> 0
ORDER BY j.name;

SELECT
    p.name AS mail_profile_name,
    p.description
FROM msdb.dbo.sysmail_profile AS p
ORDER BY p.name;

This shows configuration. Send a controlled test through your normal process before trusting alert delivery.

Maintenance

Critical maintenance jobs

Maintenance jobs should match the current workload, not the server as it existed years ago. Look for schedule overlap, long runtimes, old job owners, broad index jobs, cleanup jobs that do not clean enough, and maintenance that competes with business workloads.

Maintenance job timing and schedules

Review maintenance-like jobs, schedules, enabled state, and recent runtime.

SELECT
    j.name AS job_name,
    j.enabled AS job_enabled,
    s.name AS schedule_name,
    s.enabled AS schedule_enabled,
    h.run_date,
    h.run_time,
    h.run_duration,
    h.run_status
FROM msdb.dbo.sysjobs AS j
LEFT JOIN msdb.dbo.sysjobschedules AS js
    ON js.job_id = j.job_id
LEFT JOIN msdb.dbo.sysschedules AS s
    ON s.schedule_id = js.schedule_id
LEFT JOIN msdb.dbo.sysjobhistory AS h
    ON h.job_id = j.job_id
   AND h.step_id = 0
   AND h.run_date >= CONVERT(int, CONVERT(char(8), DATEADD(day, -14, GETDATE()), 112))
WHERE j.name LIKE N'%backup%'
   OR j.name LIKE N'%check%'
   OR j.name LIKE N'%integrity%'
   OR j.name LIKE N'%index%'
   OR j.name LIKE N'%stat%'
   OR j.name LIKE N'%cleanup%'
ORDER BY j.name, h.instance_id DESC;

Job names are not a perfect classifier. Include any local maintenance naming conventions in the review.

Capacity

Disk free space and file growth

File size and growth settings show whether the server can absorb normal workload changes without avoidable pauses. Percent growth, tiny growth increments, unlimited files on nearly full drives, and uneven file placement are all worth flagging before they become incidents.

Database files and growth settings

Review database files, sizes, growth settings, max size, and percent growth.

SELECT
    DB_NAME(mf.database_id) AS database_name,
    mf.type_desc,
    mf.name AS logical_file_name,
    mf.physical_name,
    CONVERT(decimal(18,2), mf.size * 8.0 / 1024.0) AS size_mb,
    CASE
        WHEN mf.is_percent_growth = 1 THEN CONCAT(mf.growth, N'%')
        ELSE CONCAT(CONVERT(decimal(18,2), mf.growth * 8.0 / 1024.0), N' MB')
    END AS growth_setting,
    CASE
        WHEN mf.max_size = -1 THEN N'Unlimited'
        WHEN mf.max_size = 0 THEN N'No growth'
        ELSE CONCAT(CONVERT(decimal(18,2), mf.max_size * 8.0 / 1024.0), N' MB')
    END AS max_size_setting,
    mf.is_percent_growth
FROM sys.master_files AS mf
WHERE DB_NAME(mf.database_id) <> N'tempdb'
ORDER BY database_name, mf.type_desc, mf.file_id;

SQL Server does not expose drive free space through a simple safe catalog view. Pair this with OS or monitoring data.

Capacity

Database log growth and reuse waits

Transaction log growth usually has a reason. The log may be waiting for a log backup, an open transaction, replication, availability group movement, or another process. Find the reuse wait before shrinking files or changing recovery model, because those reactions can hide the cause without fixing it.

Log size and reuse waits

Check log file size and the current reason SQL Server cannot reuse log space.

SELECT
    d.name AS database_name,
    d.recovery_model_desc,
    d.log_reuse_wait_desc,
    CONVERT(decimal(18,2), SUM(CASE WHEN mf.type_desc = N'LOG' THEN mf.size ELSE 0 END) * 8.0 / 1024.0) AS log_size_mb,
    COUNT(CASE WHEN mf.type_desc = N'LOG' THEN 1 END) AS log_file_count
FROM sys.databases AS d
LEFT JOIN sys.master_files AS mf
    ON mf.database_id = d.database_id
WHERE d.name <> N'tempdb'
GROUP BY d.name, d.recovery_model_desc, d.log_reuse_wait_desc
ORDER BY log_size_mb DESC, d.name;

The reuse wait is a current state, not a full history. Check it again during the problem window if log growth is intermittent.

Capacity

tempdb size, growth, and pressure

tempdb supports many ordinary SQL Server operations, so pressure there can look like several different problems at once. Review file sizing, growth, free space, version store, user objects, and internal objects. If tempdb keeps growing, connect the timing to workload, row versioning, spills, or maintenance.

tempdb files and space usage

Review tempdb files, growth settings, free space, user objects, internal objects, and version store usage.

SELECT
    df.file_id,
    df.name AS logical_file_name,
    df.type_desc,
    df.physical_name,
    CONVERT(decimal(18,2), df.size * 8.0 / 1024.0) AS file_size_mb,
    CASE
        WHEN df.is_percent_growth = 1 THEN CONCAT(df.growth, N'%')
        ELSE CONCAT(CONVERT(decimal(18,2), df.growth * 8.0 / 1024.0), N' MB')
    END AS growth_setting,
    CONVERT(decimal(18,2), fsu.unallocated_extent_page_count * 8.0 / 1024.0) AS free_space_mb,
    CONVERT(decimal(18,2), fsu.user_object_reserved_page_count * 8.0 / 1024.0) AS user_object_mb,
    CONVERT(decimal(18,2), fsu.internal_object_reserved_page_count * 8.0 / 1024.0) AS internal_object_mb,
    CONVERT(decimal(18,2), fsu.version_store_reserved_page_count * 8.0 / 1024.0) AS version_store_mb
FROM tempdb.sys.database_files AS df
LEFT JOIN sys.dm_db_file_space_usage AS fsu
    ON fsu.file_id = df.file_id
ORDER BY df.file_id;

This is a current snapshot. Capture it during the pressure window if tempdb problems are intermittent.

Logs

SQL Server error log warnings

Direct error-log reading usually uses server procedures, so this page uses the system_health session instead. It is not a full replacement for reviewing SQL Server error logs, but it gives a safe read-only look at recent reported errors captured by Extended Events.

Recent reported errors from system_health

Read recent error_reported events captured by the system_health Extended Events session.

WITH RingBuffer AS (
    SELECT CAST(t.target_data AS xml) AS target_data
    FROM sys.dm_xe_sessions AS s
    JOIN sys.dm_xe_session_targets AS t
        ON t.event_session_address = s.address
    WHERE s.name = N'system_health'
      AND t.target_name = N'ring_buffer'
),
Events AS (
    SELECT
        DATEADD(hour, DATEDIFF(hour, SYSUTCDATETIME(), SYSDATETIME()), xed.value(N'@timestamp', N'datetime2')) AS event_time,
        xed.value(N'(data[@name="error_number"]/value)[1]', N'int') AS error_number,
        xed.value(N'(data[@name="severity"]/value)[1]', N'int') AS severity,
        xed.value(N'(data[@name="message"]/value)[1]', N'nvarchar(4000)') AS message
    FROM RingBuffer
    CROSS APPLY target_data.nodes(N'//RingBufferTarget/event[@name="error_reported"]') AS XEventData(xed)
)
SELECT TOP (100)
    event_time,
    error_number,
    severity,
    message
FROM Events
ORDER BY event_time DESC;

This reads system_health, not the full SQL Server error log. Use it as a safe starting point.

Logs

Severity errors and corruption messages

High-severity errors, corruption-related messages, and suspect pages deserve a faster response than normal hygiene findings. Use the query to look for serious error events and suspect page records, then review backups, storage, CHECKDB output, and restore options before making changes.

High-severity errors and suspect pages

Check system_health for serious error reports and msdb suspect page records.

WITH RingBuffer AS (
    SELECT CAST(t.target_data AS xml) AS target_data
    FROM sys.dm_xe_sessions AS s
    JOIN sys.dm_xe_session_targets AS t
        ON t.event_session_address = s.address
    WHERE s.name = N'system_health'
      AND t.target_name = N'ring_buffer'
),
Events AS (
    SELECT
        DATEADD(hour, DATEDIFF(hour, SYSUTCDATETIME(), SYSDATETIME()), xed.value(N'@timestamp', N'datetime2')) AS event_time,
        xed.value(N'(data[@name="error_number"]/value)[1]', N'int') AS error_number,
        xed.value(N'(data[@name="severity"]/value)[1]', N'int') AS severity,
        xed.value(N'(data[@name="message"]/value)[1]', N'nvarchar(4000)') AS message
    FROM RingBuffer
    CROSS APPLY target_data.nodes(N'//RingBufferTarget/event[@name="error_reported"]') AS XEventData(xed)
)
SELECT TOP (100)
    event_time,
    error_number,
    severity,
    message
FROM Events
WHERE severity >= 17
   OR error_number IN (823, 824, 825)
ORDER BY event_time DESC;

SELECT
    DB_NAME(database_id) AS database_name,
    file_id,
    page_id,
    event_type,
    error_count,
    last_update_date
FROM msdb.dbo.suspect_pages
ORDER BY last_update_date DESC;

Investigate any rows carefully. Serious errors should be checked against storage, backups, and integrity-check output.

Performance

Blocking, deadlocks, and long transactions

Blocking and long transactions are health-check items because they often expose application behavior, missing indexes, transaction scope, or workload timing. Capture the sessions, waits, commands, and transaction age before killing anything. The goal is to understand the pattern, not just clear the symptom.

Blocking, active requests, and open transactions

Capture active blocking, waits, request details, and transaction age.

SELECT
    r.session_id,
    r.blocking_session_id,
    r.status,
    r.command,
    r.wait_type,
    r.wait_time,
    r.wait_resource,
    r.open_transaction_count,
    DB_NAME(r.database_id) AS database_name,
    s.host_name,
    s.program_name,
    s.login_name
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s
    ON s.session_id = r.session_id
WHERE r.session_id <> @@SPID
  AND (r.blocking_session_id <> 0 OR r.wait_type IS NOT NULL)
ORDER BY r.wait_time DESC;

SELECT
    st.session_id,
    at.transaction_begin_time,
    DATEDIFF(second, at.transaction_begin_time, SYSDATETIME()) AS transaction_age_seconds,
    at.transaction_type,
    at.transaction_state
FROM sys.dm_tran_active_transactions AS at
JOIN sys.dm_tran_session_transactions AS st
    ON st.transaction_id = at.transaction_id
WHERE st.session_id <> @@SPID
ORDER BY transaction_age_seconds DESC;

Capture this while the issue is active. After blocking clears, the most useful details may be gone.

Performance

Top waits and resource pressure

Wait stats are useful direction, not a diagnosis by themselves. Use them to see whether the instance is spending time on CPU scheduling, I/O, locks, log writes, memory grants, or other pressure. Then compare that direction with current requests, workload timing, and recent incidents.

Top wait stats

Review high-level wait categories after filtering common idle waits.

SELECT TOP (25)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms,
    wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms,
    CONVERT(decimal(18,2), wait_time_ms / 1000.0) AS wait_seconds
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (
    N'BROKER_EVENTHANDLER', N'BROKER_RECEIVE_WAITFOR', N'BROKER_TASK_STOP',
    N'BROKER_TO_FLUSH', N'BROKER_TRANSMITTER', N'CHECKPOINT_QUEUE',
    N'CHKPT', N'CLR_AUTO_EVENT', N'CLR_MANUAL_EVENT', N'LAZYWRITER_SLEEP',
    N'LOGMGR_QUEUE', N'REQUEST_FOR_DEADLOCK_SEARCH', N'RESOURCE_QUEUE',
    N'SERVER_IDLE_CHECK', N'SLEEP_TASK', N'SLEEP_SYSTEMTASK',
    N'SQLTRACE_BUFFER_FLUSH', N'WAITFOR', N'XE_DISPATCHER_WAIT',
    N'XE_TIMER_EVENT'
)
ORDER BY wait_time_ms DESC;

Wait stats accumulate since last restart or manual clear. Compare with uptime and current symptoms.

Performance

Query Store status and plan regressions

Query Store can make performance review much less speculative, but only when it is enabled, writable, and sized sensibly. Check Query Store state first, then look for queries where recent duration is materially worse than the longer baseline.

Query Store state and recent regressions

Check Query Store state in the current database and compare recent duration with longer history.

SELECT
    desired_state_desc,
    actual_state_desc,
    readonly_reason,
    current_storage_size_mb,
    max_storage_size_mb,
    query_capture_mode_desc
FROM sys.database_query_store_options;

WITH RuntimeStats AS (
    SELECT
        q.query_id,
        qt.query_sql_text,
        rs.last_execution_time,
        rs.avg_duration,
        AVG(rs.avg_duration) OVER (PARTITION BY q.query_id) AS overall_avg_duration
    FROM sys.query_store_query AS q
    JOIN sys.query_store_query_text AS qt
        ON qt.query_text_id = q.query_text_id
    JOIN sys.query_store_plan AS p
        ON p.query_id = q.query_id
    JOIN sys.query_store_runtime_stats AS rs
        ON rs.plan_id = p.plan_id
    WHERE rs.last_execution_time >= DATEADD(day, -7, SYSDATETIME())
)
SELECT TOP (25)
    query_id,
    last_execution_time,
    CONVERT(decimal(18,2), avg_duration / 1000.0) AS avg_duration_ms,
    CONVERT(decimal(18,2), overall_avg_duration / 1000.0) AS overall_avg_duration_ms,
    LEFT(query_sql_text, 600) AS query_text_sample
FROM RuntimeStats
WHERE avg_duration > overall_avg_duration * 2
ORDER BY avg_duration DESC;

Query Store views are database-scoped. Run this in each important user database where Query Store is enabled.

Config

Max server memory, MAXDOP, and cost threshold

Core settings should be reviewed carefully because bad defaults can hurt, but casual changes can hurt too. Check memory, MAXDOP, cost threshold, backup compression default, and ad hoc workload handling against the actual server shape and workload before scheduling any change.

Core instance configuration

Review core settings that commonly affect memory, parallelism, backups, and ad hoc workloads.

SELECT
    name,
    value,
    value_in_use,
    is_dynamic,
    is_advanced,
    description
FROM sys.configurations
WHERE name IN (
    N'max server memory (MB)',
    N'min server memory (MB)',
    N'max degree of parallelism',
    N'cost threshold for parallelism',
    N'backup compression default',
    N'optimize for ad hoc workloads',
    N'remote admin connections',
    N'xp_cmdshell'
)
ORDER BY name;

This only reads configuration. Review workload and server shape before changing values.

Storage

Storage latency and I/O stalls

Storage latency affects backups, restores, data files, log files, tempdb, and query response time. The file latency query helps separate read pressure, write pressure, and log pressure by database file. Treat the numbers as a starting point and compare them with workload windows.

File latency by database file

Use virtual file stats to compare read, write, and overall latency by file.

SELECT
    DB_NAME(mf.database_id) AS database_name,
    mf.type_desc,
    mf.name AS logical_file_name,
    mf.physical_name,
    vfs.num_of_reads,
    vfs.num_of_writes,
    CONVERT(decimal(18,2), vfs.io_stall_read_ms * 1.0 / NULLIF(vfs.num_of_reads, 0)) AS avg_read_ms,
    CONVERT(decimal(18,2), vfs.io_stall_write_ms * 1.0 / NULLIF(vfs.num_of_writes, 0)) AS avg_write_ms,
    CONVERT(decimal(18,2), vfs.io_stall * 1.0 / NULLIF(vfs.num_of_reads + vfs.num_of_writes, 0)) AS avg_io_ms
FROM sys.master_files AS mf
CROSS APPLY sys.dm_io_virtual_file_stats(mf.database_id, mf.file_id) AS vfs
WHERE DB_NAME(mf.database_id) <> N'tempdb'
ORDER BY avg_io_ms DESC, database_name, mf.file_id;

These counters accumulate since file open. Compare with uptime and workload timing.

Security

Patch level and support status

Patch level and support status shape the next maintenance decision. The query gives the local SQL Server build, edition, clustering, HADR state, and host shape; compare the build to Microsoft’s current support and CU information before scheduling an update.

SQL Server build and host shape

Collect version, edition, update level, clustering, HADR, CPU, memory, and startup time.

SELECT
    SERVERPROPERTY('ServerName') AS server_name,
    SERVERPROPERTY('MachineName') AS machine_name,
    SERVERPROPERTY('Edition') AS edition,
    SERVERPROPERTY('ProductVersion') AS product_version,
    SERVERPROPERTY('ProductLevel') AS product_level,
    SERVERPROPERTY('ProductUpdateLevel') AS product_update_level,
    SERVERPROPERTY('IsClustered') AS is_clustered,
    SERVERPROPERTY('IsHadrEnabled') AS is_hadr_enabled,
    si.sqlserver_start_time,
    si.cpu_count,
    si.scheduler_count,
    si.physical_memory_kb / 1024 AS physical_memory_mb
FROM sys.dm_os_sys_info AS si;

Compare the build with Microsoft’s current SQL Server update information before scheduling patch work.

Security

Sysadmin access and elevated logins

Elevated access should be boring and explainable. Review sysadmin membership, server roles, disabled logins, and old accounts. Remove access through a controlled change, especially when shared accounts or application logins have more rights than they need.

Server roles and elevated logins

List server role membership, including sysadmin and other elevated roles.

SELECT
    roles.name AS server_role,
    members.name AS login_name,
    members.type_desc AS login_type,
    members.is_disabled
FROM sys.server_role_members AS srm
JOIN sys.server_principals AS roles
    ON roles.principal_id = srm.role_principal_id
JOIN sys.server_principals AS members
    ON members.principal_id = srm.member_principal_id
ORDER BY roles.name, members.name;

Review shared accounts and application logins carefully before removing access.

Security

Service accounts, proxies, and credentials

Service accounts and SQL Agent proxies are where automation risk often hides. Check which accounts run SQL Server services, which proxies and credentials exist, and which jobs depend on them. Document usage before changing passwords or permissions.

Service accounts, proxies, and credentials

Review SQL Server services, SQL Agent proxies, credentials, and proxy subsystem mappings.

SELECT
    servicename,
    startup_type_desc,
    status_desc,
    service_account,
    last_startup_time
FROM sys.dm_server_services
ORDER BY servicename;

SELECT
    c.name AS credential_name,
    c.credential_identity,
    p.name AS proxy_name,
    p.enabled AS proxy_enabled,
    s.subsystem
FROM sys.credentials AS c
LEFT JOIN msdb.dbo.sysproxies AS p
    ON p.credential_id = c.credential_id
LEFT JOIN msdb.dbo.sysproxysubsystem AS ps
    ON ps.proxy_id = p.proxy_id
LEFT JOIN msdb.dbo.syssubsystems AS s
    ON s.subsystem_id = ps.subsystem_id
ORDER BY c.name, p.name, s.subsystem;

Credential metadata is visible, but secret values are not. Document usage before changing accounts.

Security

Unsafe enabled features and linked servers

Optional features and linked servers should have a current reason to exist. Some are legitimate in specific environments, but they should not be enabled just because an old job once needed them. Review enabled features, linked-server targets, and remote login mappings.

Enabled features and linked servers

Check selected surface-area features and linked-server configuration.

SELECT
    name,
    value_in_use,
    description
FROM sys.configurations
WHERE name IN (
    N'xp_cmdshell',
    N'Ole Automation Procedures',
    N'clr enabled',
    N'external scripts enabled',
    N'Ad Hoc Distributed Queries'
)
ORDER BY name;

SELECT
    name AS linked_server_name,
    product,
    provider,
    data_source,
    is_linked,
    is_remote_login_enabled,
    is_rpc_out_enabled
FROM sys.servers
WHERE server_id <> 0
ORDER BY name;

Some features are legitimate when controlled. The finding is an enabled feature without a current reason or owner.

Monitoring

Monitoring alerts and routing

SQL Server Agent alerts and operators are not the whole monitoring setup, but they are a useful local check. Confirm that important alerts exist, notifications point to current operators, and Database Mail is configured. Then compare this with the external monitoring system if one exists.

SQL Agent alerts and routing

Review SQL Agent alerts, configured operators, and notification routing.

SELECT
    a.name AS alert_name,
    a.enabled,
    a.message_id,
    a.severity,
    a.delay_between_responses,
    a.last_occurrence_date,
    a.last_occurrence_time
FROM msdb.dbo.sysalerts AS a
ORDER BY a.enabled DESC, a.name;

SELECT
    a.name AS alert_name,
    o.name AS operator_name,
    o.email_address,
    n.notification_method
FROM msdb.dbo.sysnotifications AS n
JOIN msdb.dbo.sysalerts AS a
    ON a.id = n.alert_id
JOIN msdb.dbo.sysoperators AS o
    ON o.id = n.operator_id
ORDER BY a.name, o.name;

This is local SQL Agent alerting only. Compare it with your central monitoring system.

Monitoring

Alert noise and ignored warnings

Noisy alerts are a process problem more than a SQL metadata problem. Review repeated warnings, muted monitors, ignored tickets, and thresholds that no longer match the workload. If people have learned to ignore an alert, either tune it or turn it into real fix work.

HA / DR

HA/DR status and replica health

HA/DR health needs more than a green dashboard. Check availability group, replica, synchronization, send queue, redo queue, and failover mode where Always On is used. If the server does not use availability groups, record the actual DR mechanism instead of forcing this query to matter.

Always On availability group status

Review availability group and replica state where Always On is used.

SELECT
    ag.name AS availability_group_name,
    ar.replica_server_name,
    ars.role_desc,
    ars.connected_state_desc,
    ars.synchronization_health_desc,
    ar.availability_mode_desc,
    ar.failover_mode_desc
FROM sys.availability_groups AS ag
JOIN sys.availability_replicas AS ar
    ON ar.group_id = ag.group_id
LEFT JOIN sys.dm_hadr_availability_replica_states AS ars
    ON ars.replica_id = ar.replica_id
ORDER BY ag.name, ar.replica_server_name;

SELECT
    DB_NAME(drs.database_id) AS database_name,
    drs.synchronization_state_desc,
    drs.synchronization_health_desc,
    drs.database_state_desc,
    drs.log_send_queue_size,
    drs.redo_queue_size
FROM sys.dm_hadr_database_replica_states AS drs
ORDER BY database_name;

If the instance does not use availability groups, this should return no rows. Review the actual DR mechanism instead.

HA / DR

Failover readiness: jobs, logins, DNS, app behavior

Failover readiness is partly SQL Server and partly application behavior. Jobs, logins, linked servers, credentials, DNS, firewall rules, listeners, and application validation all matter. The safest answer is a recent failover test that checked the application, not only the SQL role change.

Output

Final action list: fix now, schedule, review deeper, watch, accept

End the review by turning findings into work. Separate items that need action now from items that belong in the next maintenance window, deeper review, trend watching, or documented acceptance. A health check that ends as a neutral list is not finished.

What to do after the SQL Server health check checklist

Do not change MAXDOP, memory, tempdb, backup schedules, index maintenance, or HA settings just because a checklist row looks suspicious. A checklist finds candidates. It does not replace workload review.

A useful review turns each finding into a clear next step: confirm the workload impact, understand the failure mode, choose the action, and schedule the change only when it still makes sense.

Decide based on the checks

Fix now

Fix in the next maintenance window

Review deeper

Watch for trend

Accept for now

Next step

Use the SQL Server health check guide when you need the read-only scripts and deeper DBA checks behind this checklist.

Use SQL Server health audit when backups, jobs, monitoring, configuration, and ownership all need review at the same time.