

SQL Server health audit.

This is for estates that still run, but no longer feel understood.

Usually that means inherited jobs, partial monitoring, vague backup confidence, old tempdb choices, and a team that can feel the risk without naming it cleanly. The point of the audit is not to produce another health score. The point is to turn a messy estate into a practical fix order.

Related

Use the SQL Server health check guide for the wider review logic, the SQL Server maintenance plan guide for inherited maintenance drift, the SQL Server monitoring guide for visibility gaps, and the monitoring gaps problem page when the main issue is weak evidence rather than one confirmed root cause.

Good fit

  • You inherited a SQL Server estate and do not trust the current operational state.
  • Monitoring, maintenance, tempdb, backup, restore, and security questions are all starting to overlap.
  • The environment works most days, but the team has no clean view of what would fail first under change or pressure.
  • An upgrade, migration, handover, audit, or client escalation is coming and the team needs a sharper fix order.

What you get

  • A clear findings list tied to operational risk instead of a generic best-practice scorecard.
  • A practical separation between immediate fixes, short follow-up work, and wider project items.
  • A clearer view of whether the estate mainly needs health remediation, recovery work, performance review, or planned-change support.

What the problem usually looks like

Most health-audit work starts with operational uncertainty, not one dramatic outage

A lot of estates needing a health audit are technically up. Users can work. Backups run. Monitoring exists in some form. That is exactly why these environments stay untouched for too long. They are not clearly broken enough to force a redesign, but they are not controlled enough to trust during real pressure.

The usual story is some mix of inherited ownership, missing documentation, maintenance jobs nobody wants to touch, partial alerting, old configuration assumptions, and a vague sense that production would become messy quickly if an upgrade, restore, or performance incident landed tomorrow.

That is the point where a health audit helps. It turns the estate from a pile of half-known assumptions into a concrete review of what is stable, what is drifting, and what is already operating on luck.

What we review

The useful review areas are the ones that explain operational risk, not the ones that fill a long checklist

A useful SQL Server health audit is not an excuse to list every setting in the instance. The real job is to review the parts of the estate that most often create false confidence: configuration that looks deliberate but is inherited, maintenance that runs but is incomplete, backups without restore proof, alerting that says little, and operational ownership that is split badly enough to hide risk.

That usually means reading the estate as a working system. How many instances matter. Which databases are critical. What the recovery expectations really are. Whether tempdb pressure is isolated or part of a wider workload problem. Whether the monitoring proves anything useful during trouble. Whether change work is being planned on top of assumptions nobody has checked in years.

Typical review areas

  • Instance and host setup, including obvious configuration drift and inherited default choices.
  • Tempdb layout, growth behavior, and whether tempdb symptoms are really wider workload problems.
  • Maintenance quality: integrity checks, backup jobs, cleanup, index and statistics work, and failure handling.
  • Restore confidence, recovery assumptions, and whether backup success actually means recovery readiness.
  • Monitoring quality, alerting gaps, and whether the team can prove what is happening during trouble.
  • Security and ownership basics where weak access discipline or abandoned responsibility is already part of the risk.
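As an illustration of the restore-confidence point above, here is a minimal Python sketch of separating "backup job succeeded" from "recovery is proven". Everything in it is an assumption made for the example: the `BackupRecord` shape, the `restore_confidence` function, and the one-day and 90-day thresholds are hypothetical and would need to match the real recovery expectations of the estate. It works on plain records, so the data could come from whatever backup history and restore-test log the team actually keeps.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class BackupRecord:
    """Hypothetical per-database summary; not a SQL Server object."""
    database: str
    last_backup: Optional[datetime]        # last successful backup, if any
    last_restore_test: Optional[datetime]  # last verified restore, if any


def restore_confidence(
    rec: BackupRecord,
    now: datetime,
    max_backup_age: timedelta = timedelta(days=1),    # assumed threshold
    max_restore_age: timedelta = timedelta(days=90),  # assumed threshold
) -> str:
    """Classify one database by how much recovery evidence exists."""
    # No backup, or one that is too old: the most basic gap.
    if rec.last_backup is None or now - rec.last_backup > max_backup_age:
        return "no recent backup"
    # A backup exists, but nobody has ever restored from it.
    if rec.last_restore_test is None:
        return "backup unproven"
    # A restore was tested once, but too long ago to trust.
    if now - rec.last_restore_test > max_restore_age:
        return "restore proof stale"
    return "ok"
```

A database that reports "backup unproven" is the case the review cares about most: the job history looks healthy, yet nothing shows the backup can be restored.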

Deliverables

Good output looks like a fix order, not a pile of disconnected findings

Teams usually do not need more raw findings. They need a cleaner answer on what matters now, what can wait, and which problems are actually the same problem wearing different clothes. Maintenance drift, restore weakness, weak monitoring, and ownership gaps often show up together. If the output treats them as isolated trivia, the audit has not done enough.

The useful outcome is a report and discussion that separates immediate operational risk from medium-term cleanup and from wider project work. Sometimes that leads into recovery-readiness review. Sometimes it becomes performance work once visibility improves. Sometimes it becomes planned-change support because the estate is about to move before it is really understood.

Output | What it should answer | Why it matters
Risk-ranked findings | What is actually brittle, misleading, or already unsafe to ignore. | This stops the team treating every finding as equally urgent.
Fix order | What to do now, what to schedule next, and what belongs to wider project work. | Without sequencing, the audit becomes shelfware.
Escalation path | Whether the estate next needs health remediation, recovery review, performance diagnosis, or planned change support. | This gives the review a real next step instead of a dead end.
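
The idea of a fix order, as opposed to a flat findings list, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Finding` fields, the 1 to 5 risk and effort scales, and the cut-offs in `tier_for` are hypothetical, not a scoring method used in the audit itself.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, List


class Tier(IntEnum):
    NOW = 1      # immediate operational risk
    NEXT = 2     # short follow-up work
    PROJECT = 3  # wider project items


@dataclass
class Finding:
    title: str
    risk: int    # assumed scale: 1 (low) to 5 (high)
    effort: int  # assumed scale: 1 (small) to 5 (large)


def tier_for(finding: Finding) -> Tier:
    """Assign a tier using illustrative cut-offs."""
    if finding.risk >= 4:
        return Tier.NOW
    if finding.effort >= 4:
        return Tier.PROJECT
    return Tier.NEXT


def fix_order(findings: List[Finding]) -> Dict[Tier, List[Finding]]:
    """Group findings by tier, highest risk and lowest effort first."""
    order: Dict[Tier, List[Finding]] = {tier: [] for tier in Tier}
    for finding in sorted(findings, key=lambda f: (-f.risk, f.effort)):
        order[tier_for(finding)].append(finding)
    return order
```

The sketch shows the same separation the output aims for: every finding lands in exactly one of three buckets, and within a bucket the riskiest, cheapest items come first.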

When this is not the right first step

  • A live blocking incident that needs immediate production triage first.
  • A narrow one-query tuning exercise where the rest of the estate is already well understood.
  • A pure upgrade rollout engagement where readiness and cutover planning are already the main concern.

When outside help makes sense

Outside review usually makes sense when the team already knows the estate is drifting but does not have the time, neutrality, or deep SQL ownership to turn that into a clean fix order. It also helps when the environment is politically awkward: handovers, client estates, vendor-owned applications, or teams where SQL responsibility is shared just enough that nobody owns the review properly.

If the main need is to reduce uncertainty before the next change, that is the point of this service. If the estate has already tipped into one specific live failure mode, then it often makes more sense to start with the narrow incident path and come back to the wider audit once production is calmer.

Next step

If the estate needs a practical review before the next upgrade, migration, outage, or handover, get in touch through the contact page and describe the estate, the current concern, and whether the main pressure is review, change readiness, or incident risk.

If you want the wider technical framing first, read the SQL Server health check guide, the SQL Server maintenance plan guide, and the SQL Server tempdb guide.

If the main issue is already restore confidence rather than general estate health, the better next page is SQL Server recovery readiness.