sql server hub / health check guide

SQL Serverhealth check guide

A SQL Server health check should show what is healthy, what is weak, what is missing, and what should be fixed first.

Use this guide to check version, configuration, files, tempdb, backups, restores, jobs, waits, blocking, error logs, access, HA, and monitoring with concrete SQL Server output.

Guide

Guide~12 min readUpdated 19 Apr 2026

Share

LinkedInXEmail
  1. 01What a SQL Server health check should cover
  2. 02When to run a SQL Server health check
  3. 03How to start a SQL Server health check
  4. 04SQL Server health check for version, uptime, and configuration
  5. 05SQL Server health check for databases and files
  6. 06SQL Server health check for tempdb and storage latency
  7. 07SQL Server health check for backups and restore history
  8. 08SQL Server health check for SQL Agent jobs and maintenance
  9. 09SQL Server health check for waits, blocking, and active requests
  10. 10SQL Server health check for error logs and corruption signals
  11. 11SQL Server health check for access, security, and ownership
  12. 12SQL Server health check for HA, DR, and monitoring coverage
  13. 13What not to change after a SQL Server health check
  14. 14When to request a SQL Server health audit

What a SQL Server health check should cover

A SQL Server health check should move from context to risk. First understand the instance, then check recovery and routine operations, then read workload symptoms, then decide what needs action.

It is not a generic score, and it is not a full performance tuning engagement. The useful result is a short findings list backed by current SQL Server output, with enough context to decide what is healthy, what is weak, what is missing, and what needs a planned change.

Baseline state

Start with the facts that make every later reading useful: SQL Server version, uptime, edition, configuration, databases, file layout, growth settings, and tempdb shape.

Recovery and routine operations

Check whether backups, restore history, SQL Agent jobs, CHECKDB, maintenance, cleanup, and monitoring are actually running in a way production can rely on.

Current workload and failure clues

Use waits, active requests, blocking, file latency, tempdb use, error logs, and suspect pages to see where the server is currently under pressure or quietly failing.

Access, HA, and follow-up

Finish by checking privileged access, ownership, HA/DR state, alert coverage, and what should happen next. The output should be a short list of findings, not a pile of raw DMV results.

When to run a SQL Server health check

Run a health check when the SQL Server state matters for a decision, not only when something is already broken. The best timing is usually before a risky change, after a meaningful change, or when nobody can explain the current setup cleanly.

Some checks are ad hoc: a one-time review before a migration, patch, upgrade, failover drill, handover, or incident review. Others should be regular: monthly or quarterly checks for backups, jobs, CHECKDB, file growth, waits, access, patch level, and monitoring gaps.

New or handed-over environment

Use a health check before trusting old jobs, backup history, access, monitoring, or HA notes.

Migration or upgrade planning

Check version, compatibility, databases, jobs, linked dependencies, backups, restore path, and rollback risk before the work starts.

Patching and cumulative updates

Run a smaller pre-check before patching and a post-check after patching to catch failed jobs, service changes, errors, or AG issues.

After configuration or infrastructure changes

Review files, storage latency, tempdb, waits, error logs, jobs, and monitoring after moving storage, changing memory, changing service accounts, or modifying HA.

After incidents or repeated slow periods

Use the health check to separate current symptoms from older drift: failed jobs, stale backups, high waits, blocking, errors, or missing alerts.

Regular operations review

Run a lighter monthly or quarterly check so backup, CHECKDB, job, storage, access, patch, and monitoring problems do not wait for the next emergency.

How to start a SQL Server health check

Start by collecting read-only facts before changing anything. A health check needs enough context to separate configuration drift, missing operations, current workload pressure, and design work that belongs in a larger project.

Run these checks from an account with enough metadata access. Some DMVs require `VIEW SERVER STATE`, SQL Server 2022 and newer can require `VIEW SERVER PERFORMANCE STATE` or `VIEW SERVER SECURITY STATE`, error-log access can require `VIEW ANY ERROR LOG`, and `msdb` history depends on permissions. Low-permission output can look cleaner than the server really is.

StepWhat to do
1. Set the reasonDecide whether this is regular review, pre-change, post-change, incident follow-up, migration, patching, or a handover check.
2. Collect the baselineCapture version, uptime, configuration, databases, files, tempdb, and storage readings before interpreting symptoms.
3. Check recovery and operationsRead backup history, restore history, SQL Agent jobs, CHECKDB, maintenance, cleanup, and monitoring gaps.
4. Read the workloadReview waits, active requests, blocking, file latency, tempdb use, error logs, and suspect pages with timing context.
5. Finish with control pointsCheck privileged access, ownership, HA/DR state, alert coverage, and what needs a quick fix or planned change.

SQL Server health check for version, uptime, and configuration

Version, edition, uptime, CPU count, memory, cluster state, and HADR state frame the rest of the review. Configuration checks should focus on resource control and operational behavior, not changing settings because a generic report says so.

How to check SQL Server version and uptime

Reads server identity, version, edition, startup time, CPU, memory, clustered state, and HADR state.

SELECT
    SERVERPROPERTY('ServerName') AS server_name,
    SERVERPROPERTY('MachineName') AS machine_name,
    SERVERPROPERTY('Edition') AS edition,
    SERVERPROPERTY('ProductVersion') AS product_version,
    SERVERPROPERTY('ProductLevel') AS product_level,
    SERVERPROPERTY('ProductUpdateLevel') AS product_update_level,
    SERVERPROPERTY('IsClustered') AS is_clustered,
    SERVERPROPERTY('IsHadrEnabled') AS is_hadr_enabled,
    SERVERPROPERTY('HadrManagerStatus') AS hadr_manager_status,
    si.sqlserver_start_time,
    si.cpu_count,
    si.scheduler_count,
    si.physical_memory_kb / 1024 AS physical_memory_mb,
    si.committed_kb / 1024 AS sql_committed_mb,
    si.committed_target_kb / 1024 AS sql_target_mb
FROM sys.dm_os_sys_info AS si;

sqlserver_start_time tells you how long DMV counters have been accumulating. cpu_count, scheduler_count, and memory values help interpret pressure checks. IsClustered and IsHadrEnabled show whether cluster or AG checks may apply. Use version and update level when assessing support and patch posture.

How to check SQL Server configuration

Reads common server settings and whether configured values match running values.

SELECT
    name,
    value,
    value_in_use,
    is_dynamic,
    is_advanced,
    description
FROM sys.configurations
WHERE name IN (
    N'max server memory (MB)',
    N'min server memory (MB)',
    N'max degree of parallelism',
    N'cost threshold for parallelism',
    N'backup compression default',
    N'optimize for ad hoc workloads',
    N'remote admin connections',
    N'xp_cmdshell'
)
ORDER BY name;

value and value_in_use can differ when a setting has not taken effect. max server memory should leave room for the operating system and other services. MAXDOP and cost threshold need workload context before changing. xp_cmdshell and remote admin connections deserve a deliberate access decision.

SQL Server health check for databases and files

Database state, recovery model, compatibility level, Query Store, ownership, size, file layout, and growth settings show what the instance is responsible for. They also decide which recovery, capacity, and security checks matter most.

How to inventory databases

Lists databases with state, recovery model, compatibility level, Query Store flag, owner, size, and file count.

SELECT
    d.name AS database_name,
    d.state_desc,
    d.recovery_model_desc,
    d.compatibility_level,
    d.is_read_only,
    d.is_query_store_on,
    SUSER_SNAME(d.owner_sid) AS database_owner,
    CONVERT(decimal(18,2), SUM(mf.size) * 8.0 / 1024.0) AS size_mb,
    COUNT(mf.file_id) AS file_count
FROM sys.databases AS d
JOIN sys.master_files AS mf
    ON mf.database_id = d.database_id
WHERE d.name <> N'tempdb'
GROUP BY
    d.name,
    d.state_desc,
    d.recovery_model_desc,
    d.compatibility_level,
    d.is_read_only,
    d.is_query_store_on,
    d.owner_sid
ORDER BY size_mb DESC;

state_desc should be online for normal production databases. recovery_model_desc affects backup and restore expectations. compatibility_level can affect query behavior and upgrade planning. Large or critical databases deserve separate recovery and performance checks.

How to review database files and growth settings

Lists database file paths, sizes, human-readable max size, and growth settings.

SELECT
    DB_NAME(mf.database_id) AS database_name,
    mf.type_desc,
    mf.name AS logical_file_name,
    mf.physical_name,
    CONVERT(decimal(18,2), mf.size * 8.0 / 1024.0) AS size_mb,
    CASE
        WHEN mf.is_percent_growth = 1 THEN CONCAT(mf.growth, N'%')
        ELSE CONCAT(CONVERT(decimal(18,2), mf.growth * 8.0 / 1024.0), N' MB')
    END AS growth_setting,
    CASE
        WHEN mf.max_size = -1 THEN N'Unlimited'
        WHEN mf.max_size = 0 THEN N'No growth'
        WHEN mf.type_desc = N'LOG' AND mf.max_size = 268435456 THEN N'2 TB'
        ELSE CONCAT(CONVERT(decimal(18,2), mf.max_size * 8.0 / 1024.0), N' MB')
    END AS max_size_setting,
    mf.is_percent_growth
FROM sys.master_files AS mf
WHERE DB_NAME(mf.database_id) <> N'tempdb'
ORDER BY database_name, mf.type_desc, mf.file_id;

physical_name shows where data and log files actually live. max_size_setting converts SQL Server page counts into readable limits. Percent growth can create unpredictable growth events on large files. Data and log files usually need different review questions. File changes need free-space, backup, maintenance, and rollback context.

SQL Server health check for tempdb and storage latency

File latency and tempdb usage show whether storage and temporary workspace are part of the current health picture. Treat these readings as starting points, then compare them with workload timing, job windows, spills, version store use, and recent incidents.

How to check file latency

Reads file-level reads, writes, and average I/O stalls from sys.dm_io_virtual_file_stats.

SELECT
    DB_NAME(mf.database_id) AS database_name,
    mf.type_desc,
    mf.name AS logical_file_name,
    mf.physical_name,
    CONVERT(decimal(18,2), mf.size * 8.0 / 1024.0) AS size_mb,
    vfs.num_of_reads,
    vfs.num_of_writes,
    CONVERT(decimal(18,2), vfs.io_stall_read_ms * 1.0 / NULLIF(vfs.num_of_reads, 0)) AS avg_read_ms,
    CONVERT(decimal(18,2), vfs.io_stall_write_ms * 1.0 / NULLIF(vfs.num_of_writes, 0)) AS avg_write_ms,
    CONVERT(decimal(18,2), vfs.io_stall * 1.0 / NULLIF(vfs.num_of_reads + vfs.num_of_writes, 0)) AS avg_io_ms
FROM sys.master_files AS mf
CROSS APPLY sys.dm_io_virtual_file_stats(mf.database_id, mf.file_id) AS vfs
WHERE DB_NAME(mf.database_id) <> N'tempdb'
ORDER BY avg_io_ms DESC, database_name, mf.file_id;

avg_read_ms and avg_write_ms are starting points for storage review. Log files with high write latency need separate attention. High latency during backup or maintenance windows can mislead a one-time check. A single snapshot needs comparison with workload timing before changing storage.

How to check tempdb space usage

Run in tempdb to review file sizes, free space, user objects, internal objects, version store, and growth settings.

USE tempdb;

SELECT
    df.file_id,
    df.name AS logical_file_name,
    df.type_desc,
    df.physical_name,
    CONVERT(decimal(18,2), df.size * 8.0 / 1024.0) AS file_size_mb,
    CONVERT(decimal(18,2), fsu.unallocated_extent_page_count * 8.0 / 1024.0) AS free_space_mb,
    CONVERT(decimal(18,2), fsu.user_object_reserved_page_count * 8.0 / 1024.0) AS user_object_mb,
    CONVERT(decimal(18,2), fsu.internal_object_reserved_page_count * 8.0 / 1024.0) AS internal_object_mb,
    CONVERT(decimal(18,2), fsu.version_store_reserved_page_count * 8.0 / 1024.0) AS version_store_mb,
    CASE
        WHEN df.is_percent_growth = 1 THEN CONCAT(df.growth, N'%')
        ELSE CONCAT(CONVERT(decimal(18,2), df.growth * 8.0 / 1024.0), N' MB')
    END AS growth_setting
FROM tempdb.sys.database_files AS df
LEFT JOIN sys.dm_db_file_space_usage AS fsu
    ON fsu.file_id = df.file_id
ORDER BY df.file_id;

version_store_mb matters when row versioning or long transactions are active. internal_object_mb can rise with sorts, hashes, spools, and spills. growth_setting should be predictable and large enough for production use. Use this with the tempdb guide when pressure is recurring.

SQL Server health check for backups and restore history

A health check should read backup history beside restore history. Successful jobs matter, but they do not answer whether the files, log chain, permissions, keys, and restore timing are good enough.

How to check recent backups

Shows the latest non-copy-only full, latest copy-only full, differential, and log backup per database from msdb backup history.

WITH LastBackups AS (
    SELECT
        bs.database_name,
        bs.type,
        MAX(bs.backup_finish_date) AS last_backup_finish_date,
        MAX(CASE WHEN bs.is_copy_only = 0 THEN bs.backup_finish_date END) AS last_non_copy_only_finish_date,
        MAX(CASE WHEN bs.is_copy_only = 1 THEN bs.backup_finish_date END) AS last_copy_only_finish_date
    FROM msdb.dbo.backupset AS bs
    WHERE bs.type IN ('D', 'I', 'L')
    GROUP BY bs.database_name, bs.type
)
SELECT
    d.name AS database_name,
    d.recovery_model_desc,
    d.state_desc,
    MAX(CASE WHEN lb.type = 'D' THEN lb.last_non_copy_only_finish_date END) AS last_full_backup,
    MAX(CASE WHEN lb.type = 'D' THEN lb.last_copy_only_finish_date END) AS last_copy_only_full_backup,
    MAX(CASE WHEN lb.type = 'I' THEN lb.last_backup_finish_date END) AS last_differential_backup,
    MAX(CASE WHEN lb.type = 'L' THEN lb.last_backup_finish_date END) AS last_log_backup
FROM sys.databases AS d
LEFT JOIN LastBackups AS lb
    ON lb.database_name = d.name
WHERE d.name <> N'tempdb'
GROUP BY d.name, d.recovery_model_desc, d.state_desc
ORDER BY d.name;

last_full_backup excludes copy-only full backups so the normal backup chain is clearer. last_copy_only_full_backup can explain ad hoc backups that should not be mistaken for the usual differential base. last_log_backup matters for databases in full or bulk-logged recovery model. Backup history does not replace a real restore test.

How to check restore history

Shows recent restore operations, source backup timing, recovery flag, and backup file path.

SELECT TOP (100)
    rh.destination_database_name,
    rh.restore_date,
    rh.restore_type,
    rh.replace,
    rh.recovery,
    bs.database_name AS source_database_name,
    bs.backup_finish_date,
    bmf.physical_device_name
FROM msdb.dbo.restorehistory AS rh
LEFT JOIN msdb.dbo.backupset AS bs
    ON bs.backup_set_id = rh.backup_set_id
LEFT JOIN msdb.dbo.backupmediafamily AS bmf
    ON bmf.media_set_id = bs.media_set_id
ORDER BY rh.restore_date DESC;

restore_date shows whether restore testing has happened recently. destination_database_name separates test restores from production restores. physical_device_name helps identify which backup source was used. No restore history may mean the restore path has not been validated on this instance.

SQL Server health check for SQL Agent jobs and maintenance

SQL Agent history often shows the parts of SQL Server nobody is watching: failed backup jobs, stale CHECKDB, long maintenance windows, cleanup problems, and old job owners. Health checks should include this before recommending changes.

How to check SQL Agent job failures

Reads recent SQL Agent history for failed, long-running, backup, integrity, and maintenance jobs.

SELECT TOP (120)
    j.name AS job_name,
    j.enabled,
    SUSER_SNAME(j.owner_sid) AS job_owner,
    h.step_id,
    h.step_name,
    h.run_status,
    DATETIMEFROMPARTS(
        h.run_date / 10000,
        (h.run_date % 10000) / 100,
        h.run_date % 100,
        h.run_time / 10000,
        (h.run_time % 10000) / 100,
        h.run_time % 100,
        0
    ) AS run_start_time,
    ((h.run_duration / 10000) * 3600)
      + (((h.run_duration % 10000) / 100) * 60)
      + (h.run_duration % 100) AS run_duration_seconds,
    h.message
FROM msdb.dbo.sysjobs AS j
LEFT JOIN msdb.dbo.sysjobhistory AS h
    ON h.job_id = j.job_id
WHERE h.run_date >= CONVERT(int, CONVERT(char(8), DATEADD(day, -14, SYSDATETIME()), 112))
  AND (
      h.run_status <> 1
      OR h.run_duration >= 10000
      OR j.name LIKE N'%backup%'
      OR j.name LIKE N'%checkdb%'
      OR j.name LIKE N'%maintenance%'
  )
ORDER BY run_start_time DESC, j.name, h.step_id;

run_status identifies failed or retrying job steps. job_owner should be valid and intentional. Long runtime can matter when jobs overlap business hours or maintenance windows. Job failures outside the main application can still affect recovery and stability.

How to check CHECKDB history

Looks for recent CHECKDB or integrity-check job history in SQL Agent.

SELECT TOP (100)
    j.name AS job_name,
    j.enabled,
    h.step_id,
    h.step_name,
    h.run_status,
    DATETIMEFROMPARTS(
        h.run_date / 10000,
        (h.run_date % 10000) / 100,
        h.run_date % 100,
        h.run_time / 10000,
        (h.run_time % 10000) / 100,
        h.run_time % 100,
        0
    ) AS run_start_time,
    ((h.run_duration / 10000) * 3600)
      + (((h.run_duration % 10000) / 100) * 60)
      + (h.run_duration % 100) AS run_duration_seconds,
    h.message
FROM msdb.dbo.sysjobs AS j
JOIN msdb.dbo.sysjobhistory AS h
    ON h.job_id = j.job_id
WHERE h.run_date >= CONVERT(int, CONVERT(char(8), DATEADD(day, -30, SYSDATETIME()), 112))
  AND (
      j.name LIKE N'%checkdb%'
      OR j.name LIKE N'%integrity%'
      OR h.step_name LIKE N'%checkdb%'
      OR h.message LIKE N'%DBCC CHECKDB%'
  )
ORDER BY run_start_time DESC, j.name, h.step_id;

Job naming varies, so this is a starting point rather than a complete guarantee. run_status and message show whether the integrity job completed or reported errors. No rows may mean the job has another name, history was cleaned, or integrity checks are missing. Use job output files or monitoring data when SQL Agent history is incomplete.

SQL Server health check for waits, blocking, and active requests

Wait stats and active requests show where SQL Server has been spending time and what is happening now. They should point to the next check, not become a verdict by themselves.

How to check top wait stats

Lists top cumulative waits while excluding common background waits.

SELECT TOP (25)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms,
    wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms,
    CONVERT(decimal(18,2), wait_time_ms / 1000.0) AS wait_seconds
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (
    N'BROKER_EVENTHANDLER', N'BROKER_RECEIVE_WAITFOR', N'BROKER_TASK_STOP',
    N'BROKER_TO_FLUSH', N'BROKER_TRANSMITTER', N'CHECKPOINT_QUEUE',
    N'CHKPT', N'CLR_AUTO_EVENT', N'CLR_MANUAL_EVENT', N'LAZYWRITER_SLEEP',
    N'LOGMGR_QUEUE', N'REQUEST_FOR_DEADLOCK_SEARCH', N'RESOURCE_QUEUE',
    N'SERVER_IDLE_CHECK', N'SLEEP_TASK', N'SLEEP_SYSTEMTASK',
    N'SQLTRACE_BUFFER_FLUSH', N'WAITFOR', N'XE_DISPATCHER_WAIT',
    N'XE_TIMER_EVENT'
)
ORDER BY wait_time_ms DESC;

wait_seconds helps identify the largest cumulative waits since startup. signal_wait_time_ms can point toward scheduler pressure. High wait totals need workload timing and baseline context. Do not reset wait stats only to make a report look cleaner.

How to check active requests and blocking

Shows current requests with blocking session, wait details, resource use, database, and statement text.

SELECT
    r.session_id,
    s.login_name,
    s.host_name,
    s.program_name,
    r.status,
    r.command,
    r.blocking_session_id,
    r.wait_type,
    r.wait_time,
    r.wait_resource,
    r.cpu_time,
    r.logical_reads,
    r.reads,
    r.writes,
    DB_NAME(r.database_id) AS database_name,
    SUBSTRING(
        st.text,
        (r.statement_start_offset / 2) + 1,
        ((CASE r.statement_end_offset
            WHEN -1 THEN DATALENGTH(st.text)
            ELSE r.statement_end_offset
          END - r.statement_start_offset) / 2) + 1
    ) AS current_statement
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s
    ON s.session_id = r.session_id
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
WHERE r.session_id <> @@SPID
ORDER BY r.blocking_session_id DESC, r.cpu_time DESC, r.logical_reads DESC;

blocking_session_id identifies live blocking chains. wait_type and wait_resource show what the request is waiting on right now. logical_reads, reads, writes, and cpu_time help prioritize active work. Use this with the blocking guide when blocking is active.

SQL Server health check for error logs and corruption signals

Error logs and suspect pages catch issues that do not always show up in a simple dashboard: I/O errors, login failures, failover messages, corruption warnings, failed recovery, and repeated service-level problems.

How to read recent SQL Server error log entries

Reads the current SQL Server error log for recent entries. Filter the output for errors, I/O, corruption, login failures, AG, failover, and startup messages.

EXEC sp_readerrorlog 0, 1;
EXEC sp_readerrorlog 0, 1, N'error';
EXEC sp_readerrorlog 0, 1, N'fail';
EXEC sp_readerrorlog 0, 1, N'I/O';
EXEC sp_readerrorlog 0, 1, N'checksum';

sp_readerrorlog reads SQL Server error logs; access may be restricted. Review severity and repetition, not only the newest line. Look for I/O, checksum, stack dump, login failure, recovery, AG, and failover patterns. A noisy log needs filtering, but do not ignore repeated infrastructure messages.

How to check suspect pages

Reads msdb suspect page history for recorded page-level problems.

SELECT
    DB_NAME(database_id) AS database_name,
    file_id,
    page_id,
    event_type,
    CASE event_type
        WHEN 1 THEN '823 error caused by an operating system CRC error or torn page'
        WHEN 2 THEN 'Bad checksum'
        WHEN 3 THEN 'Torn page'
        WHEN 4 THEN 'Restored'
        WHEN 5 THEN 'Repaired'
        WHEN 7 THEN 'Deallocated by DBCC'
    END AS event_type_desc,
    error_count,
    last_update_date
FROM msdb.dbo.suspect_pages
ORDER BY last_update_date DESC, database_name, file_id, page_id;

Rows in suspect_pages deserve immediate backup, CHECKDB, storage, and error-log review. event_type_desc gives the broad reason SQL Server recorded the page. Old rows may have been repaired or restored, but still explain past incidents. No rows does not replace CHECKDB history or storage monitoring.

SQL Server health check for access, security, and ownership

A health check is not a full hardening audit, but it should catch broad privilege, disabled login, sysadmin, service account, job owner, and ownership problems that affect production support.

How to review privileged access

Lists SQL, Windows, group, and external logins with disabled state, default database, and server role membership without using SQL Server 2017-only aggregation.

SELECT
    sp.name AS login_name,
    sp.type_desc,
    sp.is_disabled,
    sp.default_database_name,
    sp.create_date,
    sp.modify_date,
    role_principal.name AS server_role
FROM sys.server_principals AS sp
LEFT JOIN sys.server_role_members AS srm
    ON srm.member_principal_id = sp.principal_id
LEFT JOIN sys.server_principals AS role_principal
    ON role_principal.principal_id = srm.role_principal_id
WHERE sp.type IN ('S', 'U', 'G', 'E', 'X')
  AND sp.name NOT LIKE N'##MS_%'
ORDER BY
    CASE WHEN role_principal.name = N'sysadmin' THEN 0 ELSE 1 END,
    sp.is_disabled,
    sp.name,
    role_principal.name;

sysadmin membership should be short and explainable. External login and group principals can matter in newer or cloud-connected environments. Disabled logins can still indicate stale access management. Default databases should exist and be reachable. Service accounts and automation logins need ownership and purpose.

SQL Server health check for HA, DR, and monitoring coverage

HA and DR health depends on tested behavior: replica health, failover readiness, restore sequence, monitoring alerts, SQL Agent behavior, application validation, and ownership during an incident.

How to check Availability Group health

Shows AG primary, replica mode, failover mode, connection state, recovery health, and synchronization health.

SELECT
    ag.name AS availability_group_name,
    ags.primary_replica,
    ags.primary_recovery_health_desc,
    ags.secondary_recovery_health_desc,
    ags.synchronization_health_desc,
    ar.replica_server_name,
    ar.availability_mode_desc,
    ar.failover_mode_desc,
    ars.role_desc,
    ars.connected_state_desc,
    ars.recovery_health_desc,
    ars.synchronization_health_desc AS replica_synchronization_health_desc
FROM sys.availability_groups AS ag
LEFT JOIN sys.dm_hadr_availability_group_states AS ags
    ON ags.group_id = ag.group_id
LEFT JOIN sys.availability_replicas AS ar
    ON ar.group_id = ag.group_id
LEFT JOIN sys.dm_hadr_availability_replica_states AS ars
    ON ars.replica_id = ar.replica_id
ORDER BY ag.name, ar.replica_server_name;

primary_replica and role_desc show the current AG layout. availability_mode_desc and failover_mode_desc show what failover behavior is configured. connected_state_desc and synchronization health need review before relying on HA. No AG rows means this check does not apply to that instance.

Coverage checkWhat to confirm
MonitoringBackup failures, job failures, disk growth, error log severity, blocking, AG health, and service state generate alerts that someone reads.
RecoveryRestore sequence, file access, encryption keys, and timing have been tested recently enough for the business need.
HAFailover mode, synchronization health, listener behavior, jobs, backups, and application validation are understood.
OwnershipThe people responsible for alerts, access changes, maintenance, and incident decisions are known.

What not to change after a SQL Server health check

1

Do not treat a health check as a generic score.

2

Do not change server settings from one snapshot alone.

3

Do not reset wait stats casually in production.

4

Do not assume successful backup jobs mean recovery works.

5

Do not drop indexes or change maintenance from one report.

6

Do not ignore SQL Agent failures that look unrelated to the main application.

7

Do not call HA or DR healthy without failover or restore validation.

8

Do not turn a health check into a large refactor before ranking the risks.

When to request a SQL Server health audit

A SQL Server health audit makes sense when the server is changing, support ownership is unclear, backups have not been restored recently, incidents repeat, monitoring does not explain enough, or the next upgrade or migration needs a cleaner findings list.

Send version output, database inventory, configuration output, file and tempdb checks, backup history, restore history, job history, CHECKDB history, wait stats, blocking examples, error-log details, login list, AG health, monitoring screenshots, and the main production concern.

Next step

If the review needs findings, priorities, and a follow-up path, use the SQL Server health audit page or request the audit above.

Next useful reads: the SQL Server monitoring guide for ongoing checks, the SQL Server backup guide for recovery inputs, the SQL Server recovery guide for restore validation, the SQL Server hardening guide for security posture, and the SQL Server failover guide for HA checks.