sql server hub / failover guide

SQL Server failover guide

The role change is the small part of SQL Server failover. The real test is whether data, connections, jobs, backups, and monitoring behave correctly on the other side.

Use this guide to check quorum, replicas, synchronization, listener behavior, jobs, backups, and validation output before a planned drill or incident decision.

Guide

Guide~10 min readUpdated 9 Jul 2026

LinkedIn X Email

01What SQL Server failover means
02SQL Server failover types and when to use them
03SQL Server failover cluster check
04What to check before SQL Server failover
05How to check Availability Group failover readiness
06How to check database synchronization before failover
07How to validate listener and application cutover
08What to verify after SQL Server failover
09How to run a planned SQL Server failover drill
10When forced failover is an incident decision
11When to request a SQL Server recovery readiness review

What SQL Server failover means

SQL Server failover means service ownership moves from one SQL Server node or replica to another. The useful question is not only whether the role changed. The useful question is whether users can keep working through the expected connection path after the change.

A good failover guide therefore follows the same order as the event: confirm the failover path, check the cluster and data state before the switch, validate application access after the switch, then confirm jobs, backups, monitoring, and the stay/failback decision.

Failover phase	What it has to prove
Before failover	Quorum, node visibility, replica role, synchronization health, queues, and failover readiness are acceptable.
During failover	The intended AG replica or FCI node becomes the owner, and the decision is recorded.
After failover	Listener, DNS, drivers, application workflows, jobs, backups, and monitoring work from the new owner.
After the drill	Timing, errors, manual steps, stay/failback decision, and follow-up fixes are documented.

SQL Server failover types and when to use them

The checks depend on the kind of failover. A planned AG failover, automatic AG failover, FCI node move, forced AG failover, and restore-based recovery have different triggers, validation steps, and risks.

Restore-based recovery is included here as a boundary, not as a failover type. It is the fallback when live failover is not available, not healthy, or not the safest recovery path.

Planned Availability Group failover

Trigger: A maintenance window, patch, hardware change, or drill where the target secondary is synchronous and ready.
What changes: The Availability Group role moves to the target replica. The listener should follow the new primary.
Check after: Confirm replica role, synchronization health, listener connectivity, application login, jobs, backups, and monitoring alerts.
What can go wrong: The role switch can succeed while applications still fail because of DNS, drivers, jobs, permissions, or validation steps.

Automatic Availability Group failover

Trigger: A qualifying failure on a synchronous automatic-failover pair, with WSFC quorum and AG health detection agreeing that failover is allowed.
What changes: WSFC and the Availability Group configuration allow failover to a synchronized automatic-failover partner without a manual `ALTER AVAILABILITY GROUP` command.
Check after: Confirm which replica became primary, whether all databases came online, whether the listener moved, and whether alerts fired correctly.
What can go wrong: Automatic failover is only as good as the health checks, quorum design, application retry behavior, and post-role-change operations.

Failover Cluster Instance node failover

Trigger: Planned node maintenance or an FCI owner-node failure.
What changes: WSFC moves the SQL Server instance resource group to another node. The instance restarts on the new owner and keeps the same instance identity.
Check after: Confirm SQL Server service state, clustered resources, storage, SQL Agent, logins, linked paths, jobs, and application connection tests.
What can go wrong: The instance can come online while storage paths, service accounts, jobs, or node-specific dependencies still behave differently.

Forced failover with possible data loss

Trigger: An incident where the primary cannot be recovered quickly and the business accepts possible data loss.
What changes: A secondary is forced into the primary role even if not all log records are hardened.
Check after: Confirm the chosen replica, estimated data-loss exposure, application impact, old-primary isolation, and reconciliation plan.
What can go wrong: This can lose committed work from the old primary and can create split-brain or reconciliation problems if the old primary returns unmanaged.

Restore-based recovery instead of HA failover

Trigger: No working AG or FCI target exists, HA is not healthy, or the safer path is restore from backups to another server.
What changes: Service is recovered from full, differential, and log backups instead of a live role switch.
Check after: Confirm backup chain, restore sequence, log tail availability, connection cutover, jobs, monitoring, and data-loss timing.
What can go wrong: Recovery time is usually longer, and the real limit is backup quality, restore test run, DNS/client cutover, and application validation.

SQL Server failover cluster check

A SQL Server failover cluster check depends on the design. For an FCI, check the WSFC role, possible owner nodes, shared storage, virtual network name, IP resources, SQL Server service state, SQL Agent service state, and client connection path. For an Availability Group, check the WSFC health plus AG replicas, synchronization, listeners, backup preference, and database readiness.

Do not stop at “the cluster is up.” A planned failover needs the cluster, SQL Server, storage or replica state, DNS, listeners, jobs, backups, alerts, and application smoke tests to agree with the failover plan.

Check	FCI	Availability Group
Cluster state	Cluster service, quorum, node status, SQL Server role owner, possible owners, and resource dependencies.	Cluster service, quorum, node status, AG resource health, and HADR manager state.
Data path	Shared storage online, disk ownership, mount points, file paths, and storage latency after move.	Replica synchronization state, log send queue, redo queue, and database failover readiness.
Connection path	FCI virtual network name, virtual IP, DNS, port, driver behavior, and application smoke test.	AG listener, listener IPs, DNS, port, routing behavior, and application smoke test.
Operations	SQL Agent jobs, backups, Database Mail, monitoring, maintenance, and alerts on the new owner.	Backup preference, replica-local jobs, alerts, monitoring, maintenance, and runbook notes.

What to check before SQL Server failover

Before a planned failover, first check whether SQL Server can see the cluster, quorum, and nodes it depends on. On Windows, Availability Groups and FCIs depend on Windows Server Failover Clustering, so quorum or node problems make every later check less trustworthy.

How to check SQL Server HADR and cluster state

Checks AG/HADR state, clustered instance state, quorum, visible nodes, and cluster properties.

SELECT
    SERVERPROPERTY('ServerName') AS server_name,
    SERVERPROPERTY('MachineName') AS machine_name,
    SERVERPROPERTY('IsClustered') AS is_clustered,
    SERVERPROPERTY('IsHadrEnabled') AS is_hadr_enabled,
    SERVERPROPERTY('HadrManagerStatus') AS hadr_manager_status;

SELECT
    cluster_name,
    quorum_type_desc,
    quorum_state_desc
FROM sys.dm_hadr_cluster;

SELECT
    NodeName,
    status_description
FROM sys.dm_os_cluster_nodes;

SELECT *
FROM sys.dm_os_cluster_properties;

IsHadrEnabled and HadrManagerStatus show whether AG functionality is enabled and running. quorum_state_desc should be healthy before planned AG failover. dm_os_cluster_nodes is useful for FCI node visibility, but FCI failover readiness also needs WSFC role, resource, storage, and service checks outside these AG DMVs. For an Always On-enabled instance, no rows from dm_hadr_cluster can mean the local WSFC node does not currently have quorum.

How to check Availability Group failover readiness

For Availability Groups, failover readiness starts with the replica configuration. Confirm the current primary, the intended target, the availability mode, failover mode, connection state, and synchronization health before treating the target as a candidate.

How to inspect Availability Group configuration

Lists AG settings, primary replica, replica modes, failover mode, session timeout, and backup preference.

SELECT
    ag.name AS availability_group_name,
    ag.automated_backup_preference_desc,
    ag.failure_condition_level,
    ag.health_check_timeout,
    ags.primary_replica,
    ags.primary_recovery_health_desc,
    ags.secondary_recovery_health_desc,
    ags.synchronization_health_desc,
    ar.replica_server_name,
    ar.availability_mode_desc,
    ar.failover_mode_desc,
    ar.session_timeout,
    ar.backup_priority,
    ar.primary_role_allow_connections_desc,
    ar.secondary_role_allow_connections_desc
FROM sys.availability_groups AS ag
LEFT JOIN sys.dm_hadr_availability_group_states AS ags
    ON ags.group_id = ag.group_id
LEFT JOIN sys.availability_replicas AS ar
    ON ar.group_id = ag.group_id
ORDER BY ag.name, ar.replica_server_name;

primary_replica shows the current owner. availability_mode_desc and failover_mode_desc show whether automatic failover is even possible. automated_backup_preference_desc affects where backup jobs should run. health_check_timeout and failure_condition_level explain part of automatic failover behavior.

How to check replica role and health

Shows local/remote replica role, operational state, connection state, recovery health, and synchronization health.

SELECT
    ag.name AS availability_group_name,
    ar.replica_server_name,
    ars.is_local,
    ars.role_desc,
    ars.operational_state_desc,
    ars.connected_state_desc,
    ars.recovery_health_desc,
    ars.synchronization_health_desc,
    ars.last_connect_error_number,
    ars.last_connect_error_description,
    ars.last_connect_error_timestamp
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = ars.replica_id
JOIN sys.availability_groups AS ag
    ON ag.group_id = ar.group_id
ORDER BY ag.name, ar.replica_server_name;

role_desc should match the expected primary and secondary layout. connected_state_desc should be CONNECTED before planned failover. synchronization_health_desc should be healthy for normal drills. last_connect_error fields help explain recent replica communication problems.

How to check database synchronization before failover

Database-level synchronization is where failover risk becomes concrete. A replica can be connected while one database is suspended, behind on redo, or not ready for failover.

How to check database synchronization queues

Shows synchronization state, health, database state, suspend reason, log send queue, and redo queue.

SELECT
    ag.name AS availability_group_name,
    ar.replica_server_name,
    COALESCE(adc.database_name, DB_NAME(drs.database_id)) AS database_name,
    drs.is_local,
    drs.synchronization_state_desc,
    drs.synchronization_health_desc,
    drs.database_state_desc,
    drs.is_suspended,
    drs.suspend_reason_desc,
    drs.log_send_queue_size,
    drs.log_send_rate,
    drs.redo_queue_size,
    drs.redo_rate,
    drs.last_commit_time
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id
JOIN sys.availability_groups AS ag
    ON ag.group_id = ar.group_id
LEFT JOIN sys.availability_databases_cluster AS adc
    ON adc.group_database_id = drs.group_database_id
ORDER BY ag.name, database_name, ar.replica_server_name;

synchronization_state_desc should fit the planned failover mode. is_suspended or a suspend reason needs investigation before failover. Large log_send_queue_size or redo_queue_size can affect data-loss exposure and recovery time. last_commit_time helps compare activity timing across replicas.

How to check failover readiness by database

Uses cluster replica state to show whether each AG database is ready for failover.

SELECT
    ag.name AS availability_group_name,
    ar.replica_server_name,
    dcs.database_name,
    dcs.is_failover_ready,
    dcs.is_pending_secondary_suspend,
    dcs.is_database_joined,
    dcs.recovery_lsn,
    dcs.truncation_lsn
FROM sys.dm_hadr_database_replica_cluster_states AS dcs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = dcs.replica_id
JOIN sys.availability_groups AS ag
    ON ag.group_id = ar.group_id
ORDER BY ag.name, dcs.database_name, ar.replica_server_name;

is_failover_ready should be 1 for normal planned failover candidates. is_database_joined confirms whether the database is joined on the replica. Pending secondary suspend needs review before role change. Use this before failover drills and before incident decisions.

How to validate listener and application cutover

A successful role change is not enough if applications cannot connect afterward. Validate the listener path, DNS behavior, driver settings, connection strings, read-only routing, and real application workflows through the same path users use.

How to inspect listener state

Shows AG listener DNS name, port, IP state, TCP listener state, and cluster IP configuration.

SELECT
    ag.name AS availability_group_name,
    agl.dns_name,
    agl.port,
    agl.is_conformant,
    agl.ip_configuration_string_from_cluster,
    ip.ip_address,
    ip.state_desc AS ip_state_desc,
    tls.type_desc AS listener_type_desc,
    tls.state_desc AS tcp_listener_state_desc,
    tls.start_time
FROM sys.availability_group_listeners AS agl
JOIN sys.availability_groups AS ag
    ON ag.group_id = agl.group_id
LEFT JOIN sys.availability_group_listener_ip_addresses AS ip
    ON ip.listener_id = agl.listener_id
LEFT JOIN sys.dm_tcp_listener_states AS tls
    ON tls.listener_id = agl.listener_id
ORDER BY ag.name, agl.dns_name, ip.ip_address;

dns_name and port should match application connection strings. ip_state_desc and tcp_listener_state_desc should be online for active listener paths. Multi-subnet listeners need matching client driver and connection-string behavior. Validate applications through the listener after a drill.

Field	What to check
quorum_type_desc	The witness/quorum model. Confirm it matches the expected cluster design.
quorum_state_desc	Whether quorum is healthy enough for planned failover work.
availability_mode_desc	Whether the replica is synchronous or asynchronous.
failover_mode_desc	Whether failover is automatic or manual.
synchronization_state_desc	Whether each database is synchronized, synchronizing, or not synchronized.
synchronization_health_desc	Whether SQL Server considers replica or database synchronization healthy.
log_send_queue_size	How much log still has to be sent to the secondary.
redo_queue_size	How much received log still has to be redone on the secondary.
is_failover_ready	Whether the database is ready for normal AG failover.
dns_name	The listener name applications should use.
ip_state_desc	Whether listener IP resources are online.
tcp_listener_state_desc	Whether SQL Server is listening on the expected listener path.
is_preferred_backup_replica	Whether this replica should run backup jobs under AG backup preference rules.

What to verify after SQL Server failover

After failover, the new primary may need to run work that previously ran somewhere else. Backup jobs, maintenance, monitoring, reporting, and alerts must either follow the role or deliberately stay tied to a node.

How to check preferred backup replica

Checks whether the current replica is preferred for backups for each AG database.

SELECT
    ag.name AS availability_group_name,
    adc.database_name,
    ag.automated_backup_preference_desc,
    sys.fn_hadr_backup_is_preferred_replica(adc.database_name) AS is_preferred_backup_replica
FROM sys.availability_groups AS ag
JOIN sys.availability_databases_cluster AS adc
    ON adc.group_id = ag.group_id
ORDER BY ag.name, adc.database_name;

is_preferred_backup_replica returns 1 when the current replica is the preferred backup location. Backup jobs should check this when AG backup preference is used. A failover can change which replica should run backups. Do not assume backups are healthy after role change; verify job output.

How to review SQL Agent jobs around failover

Reads SQL Agent history for jobs likely to matter around failover.

SELECT TOP (120)
    j.name AS job_name,
    j.enabled,
    h.step_id,
    h.step_name,
    h.run_status,
    DATETIMEFROMPARTS(
        h.run_date / 10000,
        (h.run_date % 10000) / 100,
        h.run_date % 100,
        h.run_time / 10000,
        (h.run_time % 10000) / 100,
        h.run_time % 100,
        0
    ) AS run_start_time,
    ((h.run_duration / 10000) * 3600)
      + (((h.run_duration % 10000) / 100) * 60)
      + (h.run_duration % 100) AS run_duration_seconds,
    h.message
FROM msdb.dbo.sysjobs AS j
LEFT JOIN msdb.dbo.sysjobhistory AS h
    ON h.job_id = j.job_id
WHERE h.run_date >= CONVERT(int, CONVERT(char(8), DATEADD(day, -14, SYSDATETIME()), 112))
  AND (
      j.name LIKE N'%backup%'
      OR j.name LIKE N'%log%'
      OR j.name LIKE N'%monitor%'
      OR j.name LIKE N'%alert%'
      OR j.name LIKE N'%maintenance%'
      OR j.name LIKE N'%checkdb%'
      OR h.step_name LIKE N'%backup%'
      OR h.step_name LIKE N'%monitor%'
  )
ORDER BY run_start_time DESC, j.name, h.step_id;

run_status identifies failed or retrying job steps. message often shows path, permission, replica, or preferred-backup problems. run_duration_seconds helps compare drill timing and operational load. Run this before and after failover drills.

How to run a planned SQL Server failover drill

A failover drill should measure the role change, listener behavior, application validation, job behavior, backup behavior, monitoring alerts, and the decision to stay or fail back.

How to run a planned AG failover

Non-executing template for a planned AG failover after readiness checks pass.

-- Template only. Use during a planned drill after readiness checks pass.
-- Run on the target synchronous secondary replica.

ALTER AVAILABILITY GROUP [YourAvailabilityGroup] FAILOVER;

Run this only on the intended target synchronous secondary. Confirm synchronization health and failover readiness first. Application owners should be ready to validate immediately afterward. Record timing and errors during the drill.

How to decide on forced failover

Decision checklist for incidents where forced failover is being considered.

Forced failover is an incident decision, not a normal test step.

Before considering it:
1. Confirm the primary is unavailable or cannot be trusted.
2. Confirm which secondary has the most recent hardened data.
3. Confirm data-loss exposure with the business owner.
4. Record the decision, target replica, and validation plan.
5. Reconcile or rebuild the old primary after service is restored.

Forced failover can lose data. Use it as an incident decision, not a normal test. The target replica choice should be based on hardened data and business impact. The old primary needs reconciliation before it can safely return.

How to document a failover drill

Plain checklist for the failover drill, validation, and stay/failback decision.

Failover drill checklist

1. Announce the drill window and expected impact.
2. Confirm quorum, replica health, synchronization state, and failover readiness.
3. Confirm backup jobs, monitoring, and application owners are ready.
4. Fail over to the target replica or node.
5. Verify listener, DNS, connection strings, and application workflows.
6. Check SQL Agent jobs, backups, alerts, and monitoring after role change.
7. Decide whether to stay on the new primary or fail back.
8. Record timing, errors, manual steps, and follow-up fixes.

Each step should have an owner and expected duration. Application validation should happen through the same path users use. Backup and monitoring checks belong in the drill, not later. Follow-up fixes are part of the drill output.

When forced failover is an incident decision

Forced failover belongs in incident handling, not routine testing. Use it only when the current primary cannot be trusted or recovered quickly enough, and the business accepts the data-loss and reconciliation risk.

Do not fail over before checking synchronization health and business impact.

Do not treat AG or FCI as a backup replacement.

Do not assume automatic failover is safe because it is configured.

Do not ignore quorum, witness, listener, DNS, and client driver behavior.

Do not patch, reboot, or fail over without knowing which jobs and backups run on each replica.

Do not force failover unless the data-loss and recovery consequences are understood.

Do not leave backup jobs, monitoring, or maintenance pointing only at the old primary.

Do not call HA ready until application validation passes after a drill.

When to request a SQL Server recovery readiness review

A recovery readiness review makes sense when the AG or FCI exists but failover has not been tested, quorum behavior is unclear, listener behavior is fragile, backups depend on replica role, or the application validation path is not written down.

Send AG and FCI configuration, quorum output, replica health, synchronization state, listener state, job history, backup behavior, monitoring alerts, failover drill output, and the main business target for recovery time and data loss.

Request recovery readiness review Read the recovery guide

Next step

If failover behavior, backup preference, listener routing, or validation is not clear yet, use the SQL Server recovery readiness page or request HA review above.

Next useful reads: the SQL Server recovery guide for restore and rollback, the SQL Server monitoring guide for ongoing checks, the SQL Server backup guide for backup behavior, and the SQL Server sizing guide for workload fit.