sql server hub / failover guide

SQL Serverfailover guide

The role change is the small part of SQL Server failover. The real test is whether data, connections, jobs, backups, and monitoring behave correctly on the other side.

Use this guide to check quorum, replicas, synchronization, listener behavior, jobs, backups, and validation output before a planned drill or incident decision.

Guide

Guide~10 min readUpdated 19 Apr 2026

Share

LinkedInXEmail
  1. 01What SQL Server failover means
  2. 02SQL Server failover types and when to use them
  3. 03What to check before SQL Server failover
  4. 04How to check Availability Group failover readiness
  5. 05How to check database synchronization before failover
  6. 06How to validate listener and application cutover
  7. 07What to verify after SQL Server failover
  8. 08How to run a planned SQL Server failover drill
  9. 09When forced failover is an incident decision
  10. 10When to request a SQL Server recovery readiness review

What SQL Server failover means

SQL Server failover means service ownership moves from one SQL Server node or replica to another. The useful question is not only whether the role changed. The useful question is whether users can keep working through the expected connection path after the change.

A good failover guide therefore follows the same order as the event: confirm the failover path, check the cluster and data state before the switch, validate application access after the switch, then confirm jobs, backups, monitoring, and the stay/failback decision.

Failover phaseWhat it has to prove
Before failoverQuorum, node visibility, replica role, synchronization health, queues, and failover readiness are acceptable.
During failoverThe intended AG replica or FCI node becomes the owner, and the decision is recorded.
After failoverListener, DNS, drivers, application workflows, jobs, backups, and monitoring work from the new owner.
After the drillTiming, errors, manual steps, stay/failback decision, and follow-up fixes are documented.

SQL Server failover types and when to use them

The checks depend on the kind of failover. A planned AG failover, automatic AG failover, FCI node move, forced AG failover, and restore-based recovery have different triggers, validation steps, and risks.

Restore-based recovery is included here as a boundary, not as a failover type. It is the fallback when live failover is not available, not healthy, or not the safest recovery path.

Planned Availability Group failover

Trigger
A maintenance window, patch, hardware change, or drill where the target secondary is synchronous and ready.
What changes
The Availability Group role moves to the target replica. The listener should follow the new primary.
Check after
Confirm replica role, synchronization health, listener connectivity, application login, jobs, backups, and monitoring alerts.
What can go wrong
The role switch can succeed while applications still fail because of DNS, drivers, jobs, permissions, or validation steps.

Automatic Availability Group failover

Trigger
A qualifying failure on a synchronous automatic-failover pair, with WSFC quorum and AG health detection agreeing that failover is allowed.
What changes
WSFC and the Availability Group configuration allow failover to a synchronized automatic-failover partner without a manual `ALTER AVAILABILITY GROUP` command.
Check after
Confirm which replica became primary, whether all databases came online, whether the listener moved, and whether alerts fired correctly.
What can go wrong
Automatic failover is only as good as the health checks, quorum design, application retry behavior, and post-role-change operations.

Failover Cluster Instance node failover

Trigger
Planned node maintenance or an FCI owner-node failure.
What changes
WSFC moves the SQL Server instance resource group to another node. The instance restarts on the new owner and keeps the same instance identity.
Check after
Confirm SQL Server service state, clustered resources, storage, SQL Agent, logins, linked paths, jobs, and application connection tests.
What can go wrong
The instance can come online while storage paths, service accounts, jobs, or node-specific dependencies still behave differently.

Forced failover with possible data loss

Trigger
An incident where the primary cannot be recovered quickly and the business accepts possible data loss.
What changes
A secondary is forced into the primary role even if not all log records are hardened.
Check after
Confirm the chosen replica, estimated data-loss exposure, application impact, old-primary isolation, and reconciliation plan.
What can go wrong
This can lose committed work from the old primary and can create split-brain or reconciliation problems if the old primary returns unmanaged.

Restore-based recovery instead of HA failover

Trigger
No working AG or FCI target exists, HA is not healthy, or the safer path is restore from backups to another server.
What changes
Service is recovered from full, differential, and log backups instead of a live role switch.
Check after
Confirm backup chain, restore sequence, log tail availability, connection cutover, jobs, monitoring, and data-loss timing.
What can go wrong
Recovery time is usually longer, and the real limit is backup quality, restore test run, DNS/client cutover, and application validation.

What to check before SQL Server failover

Before a planned failover, first check whether SQL Server can see the cluster, quorum, and nodes it depends on. On Windows, Availability Groups and FCIs depend on Windows Server Failover Clustering, so quorum or node problems make every later check less trustworthy.

How to check SQL Server HADR and cluster state

Checks AG/HADR state, clustered instance state, quorum, visible nodes, and cluster properties.

SELECT
    SERVERPROPERTY('ServerName') AS server_name,
    SERVERPROPERTY('MachineName') AS machine_name,
    SERVERPROPERTY('IsClustered') AS is_clustered,
    SERVERPROPERTY('IsHadrEnabled') AS is_hadr_enabled,
    SERVERPROPERTY('HadrManagerStatus') AS hadr_manager_status;

SELECT
    cluster_name,
    quorum_type_desc,
    quorum_state_desc
FROM sys.dm_hadr_cluster;

SELECT
    NodeName,
    status_description
FROM sys.dm_os_cluster_nodes;

SELECT *
FROM sys.dm_os_cluster_properties;

IsHadrEnabled and HadrManagerStatus show whether AG functionality is enabled and running. quorum_state_desc should be healthy before planned AG failover. dm_os_cluster_nodes is useful for FCI node visibility, but FCI failover readiness also needs WSFC role, resource, storage, and service checks outside these AG DMVs. For an Always On-enabled instance, no rows from dm_hadr_cluster can mean the local WSFC node does not currently have quorum.

How to check Availability Group failover readiness

For Availability Groups, failover readiness starts with the replica configuration. Confirm the current primary, the intended target, the availability mode, failover mode, connection state, and synchronization health before treating the target as a candidate.

How to inspect Availability Group configuration

Lists AG settings, primary replica, replica modes, failover mode, session timeout, and backup preference.

SELECT
    ag.name AS availability_group_name,
    ag.automated_backup_preference_desc,
    ag.failure_condition_level,
    ag.health_check_timeout,
    ags.primary_replica,
    ags.primary_recovery_health_desc,
    ags.secondary_recovery_health_desc,
    ags.synchronization_health_desc,
    ar.replica_server_name,
    ar.availability_mode_desc,
    ar.failover_mode_desc,
    ar.session_timeout,
    ar.backup_priority,
    ar.primary_role_allow_connections_desc,
    ar.secondary_role_allow_connections_desc
FROM sys.availability_groups AS ag
LEFT JOIN sys.dm_hadr_availability_group_states AS ags
    ON ags.group_id = ag.group_id
LEFT JOIN sys.availability_replicas AS ar
    ON ar.group_id = ag.group_id
ORDER BY ag.name, ar.replica_server_name;

primary_replica shows the current owner. availability_mode_desc and failover_mode_desc show whether automatic failover is even possible. automated_backup_preference_desc affects where backup jobs should run. health_check_timeout and failure_condition_level explain part of automatic failover behavior.

How to check replica role and health

Shows local/remote replica role, operational state, connection state, recovery health, and synchronization health.

SELECT
    ag.name AS availability_group_name,
    ar.replica_server_name,
    ars.is_local,
    ars.role_desc,
    ars.operational_state_desc,
    ars.connected_state_desc,
    ars.recovery_health_desc,
    ars.synchronization_health_desc,
    ars.last_connect_error_number,
    ars.last_connect_error_description,
    ars.last_connect_error_timestamp
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = ars.replica_id
JOIN sys.availability_groups AS ag
    ON ag.group_id = ar.group_id
ORDER BY ag.name, ar.replica_server_name;

role_desc should match the expected primary and secondary layout. connected_state_desc should be CONNECTED before planned failover. synchronization_health_desc should be healthy for normal drills. last_connect_error fields help explain recent replica communication problems.

How to check database synchronization before failover

Database-level synchronization is where failover risk becomes concrete. A replica can be connected while one database is suspended, behind on redo, or not ready for failover.

How to check database synchronization queues

Shows synchronization state, health, database state, suspend reason, log send queue, and redo queue.

SELECT
    ag.name AS availability_group_name,
    ar.replica_server_name,
    COALESCE(adc.database_name, DB_NAME(drs.database_id)) AS database_name,
    drs.is_local,
    drs.synchronization_state_desc,
    drs.synchronization_health_desc,
    drs.database_state_desc,
    drs.is_suspended,
    drs.suspend_reason_desc,
    drs.log_send_queue_size,
    drs.log_send_rate,
    drs.redo_queue_size,
    drs.redo_rate,
    drs.last_commit_time
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id
JOIN sys.availability_groups AS ag
    ON ag.group_id = ar.group_id
LEFT JOIN sys.availability_databases_cluster AS adc
    ON adc.group_database_id = drs.group_database_id
ORDER BY ag.name, database_name, ar.replica_server_name;

synchronization_state_desc should fit the planned failover mode. is_suspended or a suspend reason needs investigation before failover. Large log_send_queue_size or redo_queue_size can affect data-loss exposure and recovery time. last_commit_time helps compare activity timing across replicas.

How to check failover readiness by database

Uses cluster replica state to show whether each AG database is ready for failover.

SELECT
    ag.name AS availability_group_name,
    ar.replica_server_name,
    dcs.database_name,
    dcs.is_failover_ready,
    dcs.is_pending_secondary_suspend,
    dcs.is_database_joined,
    dcs.recovery_lsn,
    dcs.truncation_lsn
FROM sys.dm_hadr_database_replica_cluster_states AS dcs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = dcs.replica_id
JOIN sys.availability_groups AS ag
    ON ag.group_id = ar.group_id
ORDER BY ag.name, dcs.database_name, ar.replica_server_name;

is_failover_ready should be 1 for normal planned failover candidates. is_database_joined confirms whether the database is joined on the replica. Pending secondary suspend needs review before role change. Use this before failover drills and before incident decisions.

How to validate listener and application cutover

A successful role change is not enough if applications cannot connect afterward. Validate the listener path, DNS behavior, driver settings, connection strings, read-only routing, and real application workflows through the same path users use.

How to inspect listener state

Shows AG listener DNS name, port, IP state, TCP listener state, and cluster IP configuration.

SELECT
    ag.name AS availability_group_name,
    agl.dns_name,
    agl.port,
    agl.is_conformant,
    agl.ip_configuration_string_from_cluster,
    ip.ip_address,
    ip.state_desc AS ip_state_desc,
    tls.type_desc AS listener_type_desc,
    tls.state_desc AS tcp_listener_state_desc,
    tls.start_time
FROM sys.availability_group_listeners AS agl
JOIN sys.availability_groups AS ag
    ON ag.group_id = agl.group_id
LEFT JOIN sys.availability_group_listener_ip_addresses AS ip
    ON ip.listener_id = agl.listener_id
LEFT JOIN sys.dm_tcp_listener_states AS tls
    ON tls.listener_id = agl.listener_id
ORDER BY ag.name, agl.dns_name, ip.ip_address;

dns_name and port should match application connection strings. ip_state_desc and tcp_listener_state_desc should be online for active listener paths. Multi-subnet listeners need matching client driver and connection-string behavior. Validate applications through the listener after a drill.

FieldWhat to check
quorum_type_descThe witness/quorum model. Confirm it matches the expected cluster design.
quorum_state_descWhether quorum is healthy enough for planned failover work.
availability_mode_descWhether the replica is synchronous or asynchronous.
failover_mode_descWhether failover is automatic or manual.
synchronization_state_descWhether each database is synchronized, synchronizing, or not synchronized.
synchronization_health_descWhether SQL Server considers replica or database synchronization healthy.
log_send_queue_sizeHow much log still has to be sent to the secondary.
redo_queue_sizeHow much received log still has to be redone on the secondary.
is_failover_readyWhether the database is ready for normal AG failover.
dns_nameThe listener name applications should use.
ip_state_descWhether listener IP resources are online.
tcp_listener_state_descWhether SQL Server is listening on the expected listener path.
is_preferred_backup_replicaWhether this replica should run backup jobs under AG backup preference rules.

What to verify after SQL Server failover

After failover, the new primary may need to run work that previously ran somewhere else. Backup jobs, maintenance, monitoring, reporting, and alerts must either follow the role or deliberately stay tied to a node.

How to check preferred backup replica

Checks whether the current replica is preferred for backups for each AG database.

SELECT
    ag.name AS availability_group_name,
    adc.database_name,
    ag.automated_backup_preference_desc,
    sys.fn_hadr_backup_is_preferred_replica(adc.database_name) AS is_preferred_backup_replica
FROM sys.availability_groups AS ag
JOIN sys.availability_databases_cluster AS adc
    ON adc.group_id = ag.group_id
ORDER BY ag.name, adc.database_name;

is_preferred_backup_replica returns 1 when the current replica is the preferred backup location. Backup jobs should check this when AG backup preference is used. A failover can change which replica should run backups. Do not assume backups are healthy after role change; verify job output.

How to review SQL Agent jobs around failover

Reads SQL Agent history for jobs likely to matter around failover.

SELECT TOP (120)
    j.name AS job_name,
    j.enabled,
    h.step_id,
    h.step_name,
    h.run_status,
    DATETIMEFROMPARTS(
        h.run_date / 10000,
        (h.run_date % 10000) / 100,
        h.run_date % 100,
        h.run_time / 10000,
        (h.run_time % 10000) / 100,
        h.run_time % 100,
        0
    ) AS run_start_time,
    ((h.run_duration / 10000) * 3600)
      + (((h.run_duration % 10000) / 100) * 60)
      + (h.run_duration % 100) AS run_duration_seconds,
    h.message
FROM msdb.dbo.sysjobs AS j
LEFT JOIN msdb.dbo.sysjobhistory AS h
    ON h.job_id = j.job_id
WHERE h.run_date >= CONVERT(int, CONVERT(char(8), DATEADD(day, -14, SYSDATETIME()), 112))
  AND (
      j.name LIKE N'%backup%'
      OR j.name LIKE N'%log%'
      OR j.name LIKE N'%monitor%'
      OR j.name LIKE N'%alert%'
      OR j.name LIKE N'%maintenance%'
      OR j.name LIKE N'%checkdb%'
      OR h.step_name LIKE N'%backup%'
      OR h.step_name LIKE N'%monitor%'
  )
ORDER BY run_start_time DESC, j.name, h.step_id;

run_status identifies failed or retrying job steps. message often shows path, permission, replica, or preferred-backup problems. run_duration_seconds helps compare drill timing and operational load. Run this before and after failover drills.

How to run a planned SQL Server failover drill

A failover drill should measure the role change, listener behavior, application validation, job behavior, backup behavior, monitoring alerts, and the decision to stay or fail back.

How to run a planned AG failover

Non-executing template for a planned AG failover after readiness checks pass.

-- Template only. Use during a planned drill after readiness checks pass.
-- Run on the target synchronous secondary replica.

ALTER AVAILABILITY GROUP [YourAvailabilityGroup] FAILOVER;

Run this only on the intended target synchronous secondary. Confirm synchronization health and failover readiness first. Application owners should be ready to validate immediately afterward. Record timing and errors during the drill.

How to decide on forced failover

Decision checklist for incidents where forced failover is being considered.

Forced failover is an incident decision, not a normal test step.

Before considering it:
1. Confirm the primary is unavailable or cannot be trusted.
2. Confirm which secondary has the most recent hardened data.
3. Confirm data-loss exposure with the business owner.
4. Record the decision, target replica, and validation plan.
5. Reconcile or rebuild the old primary after service is restored.

Forced failover can lose data. Use it as an incident decision, not a normal test. The target replica choice should be based on hardened data and business impact. The old primary needs reconciliation before it can safely return.

How to document a failover drill

Plain checklist for the failover drill, validation, and stay/failback decision.

Failover drill checklist

1. Announce the drill window and expected impact.
2. Confirm quorum, replica health, synchronization state, and failover readiness.
3. Confirm backup jobs, monitoring, and application owners are ready.
4. Fail over to the target replica or node.
5. Verify listener, DNS, connection strings, and application workflows.
6. Check SQL Agent jobs, backups, alerts, and monitoring after role change.
7. Decide whether to stay on the new primary or fail back.
8. Record timing, errors, manual steps, and follow-up fixes.

Each step should have an owner and expected duration. Application validation should happen through the same path users use. Backup and monitoring checks belong in the drill, not later. Follow-up fixes are part of the drill output.

When forced failover is an incident decision

Forced failover belongs in incident handling, not routine testing. Use it only when the current primary cannot be trusted or recovered quickly enough, and the business accepts the data-loss and reconciliation risk.

1

Do not fail over before checking synchronization health and business impact.

2

Do not treat AG or FCI as a backup replacement.

3

Do not assume automatic failover is safe because it is configured.

4

Do not ignore quorum, witness, listener, DNS, and client driver behavior.

5

Do not patch, reboot, or fail over without knowing which jobs and backups run on each replica.

6

Do not force failover unless the data-loss and recovery consequences are understood.

7

Do not leave backup jobs, monitoring, or maintenance pointing only at the old primary.

8

Do not call HA ready until application validation passes after a drill.

When to request a SQL Server recovery readiness review

A recovery readiness review makes sense when the AG or FCI exists but failover has not been tested, quorum behavior is unclear, listener behavior is fragile, backups depend on replica role, or the application validation path is not written down.

Send AG and FCI configuration, quorum output, replica health, synchronization state, listener state, job history, backup behavior, monitoring alerts, failover drill output, and the main business target for recovery time and data loss.

Next step

If failover behavior, backup preference, listener routing, or validation is not clear yet, use the SQL Server recovery readiness page or request HA review above.

Next useful reads: the SQL Server recovery guide for restore and rollback, the SQL Server monitoring guide for ongoing checks, the SQL Server backup guide for backup behavior, and the SQL Server sizing guide for workload fit.