In an era where data is widely regarded as the lifeblood of modern enterprise, the health and integrity of your organisation’s database infrastructure cannot be left to chance. A poorly maintained database environment introduces a cascade of risks — from degraded query performance and unplanned downtime to regulatory non-compliance under the UK General Data Protection Regulation (UK GDPR) and the Data Protection Act 2018.
A database health check is not a one-off event. It is a structured, repeatable discipline that should be embedded into your IT governance framework alongside your broader Business Continuity Planning (BCP) and Disaster Recovery (DR) strategies. Whether you are managing a relational database management system (RDBMS) such as Microsoft SQL Server, Oracle Database, or PostgreSQL, or operating across hybrid cloud architectures incorporating managed services like Amazon RDS or Azure SQL Managed Instance, this checklist provides a comprehensive framework for assessing, monitoring, and hardening your database estate.
Section 1: Performance & Resource Utilisation
Performance degradation is one of the most common — and most insidious — symptoms of a deteriorating database environment. Problems typically manifest gradually, and by the time end users notice latency, the root cause may have compounded considerably.
1.1 Query Performance Analysis
Begin by reviewing execution plan analysis to surface long-running queries and identify those suffering from plan regression — where the query optimiser has cached a suboptimal execution plan, often as a result of parameter sniffing. This occurs when the engine compiles a plan based on atypical initial parameter values, producing execution strategies that are efficient in isolation but catastrophically inefficient under normal workloads.
Missing index recommendations should be audited regularly. Whilst adding indexes improves read performance, each index introduces overhead on write operations, so unused or redundant indexes must be identified and removed with equal diligence. Query wait statistics are particularly instructive — they reveal whether bottlenecks are rooted in input/output contention, lock contention between concurrent transactions, or memory pressure preventing query compilation.
1.2 CPU, Memory & I/O
CPU utilisation should be monitored over rolling daily and weekly windows. Sustained CPU saturation above 80% on a database server is rarely a hardware problem in isolation; it is almost always a symptom of inefficient query workloads, missing indexes, or inappropriate parallelism configuration.
The buffer pool hit ratio is one of the most telling indicators of memory health. For Online Transaction Processing (OLTP) workloads, a hit ratio below 98% suggests that the active working data set exceeds the available RAM, forcing the engine to repeatedly read pages from disk rather than serving them from memory. In such scenarios, the choice is typically between adding RAM, partitioning the workload, or archiving historical data to reduce the active footprint.
Input/output latency on data files and transaction log volumes warrants close attention. In high-transaction environments, elevated latency — above 20 milliseconds for OLTP — has a compounding effect on throughput and can cause cascading timeouts across dependent application tiers. Tiered storage strategies and NVMe-based volumes are increasingly common remedies in both on-premises and cloud-hosted deployments.
Temporary database contention is a frequently overlooked bottleneck, particularly in SQL Server environments. The temporary workspace used for sorting, hashing, and intermediate result storage should be distributed across multiple data files and housed on fast local storage to prevent allocation contention under concurrent workloads.
1.3 Capacity Planning
Capacity planning must be grounded in historical data growth trends rather than assumptions. Review how quickly data volumes are expanding across your most active databases and project storage requirements over a 12- to 24-month horizon. Autogrowth events — where database files expand dynamically because available space has been exhausted — are disruptive and should be viewed as a planning failure rather than a feature.
Virtual Log File (VLF) proliferation within transaction logs is another area that frequently escapes attention until it causes a performance incident. Excessive VLF counts degrade log backup performance and recovery times. When identified, they should be corrected through a controlled shrink and regrowth cycle during a maintenance window.
Section 2: Availability & Reliability
2.1 High Availability Configuration
The operational status of your High Availability (HA) topology should be verified routinely, not solely when an incident is suspected. Whether your environment relies on Always On Availability Groups, Oracle Data Guard, PostgreSQL Streaming Replication, or a cloud-native equivalent, the synchronisation state between primary and secondary replicas must remain within the Recovery Point Objective (RPO) thresholds defined in your Service Level Agreement.
Quorum configuration in Windows Server Failover Clustering environments deserves particular scrutiny. A misconfigured or unreachable witness node can prevent automatic failover or, worse, result in a split-brain condition where two nodes simultaneously assume the primary role — with potentially catastrophic data integrity consequences.
Automatic failover must be tested, not assumed. Failover testing should be a scheduled, documented activity — ideally quarterly — rather than an ad-hoc exercise conducted under the pressure of an actual outage. A failover that works in theory but fails in practice during an emergency is arguably more damaging than having no failover capability at all, because it creates false confidence.
2.2 Backup & Recovery Validation
Backup configuration is arguably the single most critical component of any database health check. A backup that has never been tested is not a backup — it is a liability.
Full, differential, and transaction log backups must be completing successfully and on schedule. Verification should extend beyond confirming that a job ran without error; backup checksums should be enabled so that page-level corruption is detected at the point the backup is written, rather than discovered during a restore attempt under pressure. Backup integrity is one area where discovering a problem early is infinitely preferable to discovering it late.
Test restores to an isolated environment must be conducted regularly. For UK GDPR compliance, any restore environment containing personal data must be subject to access controls equivalent to production and must not retain that data beyond operational necessity. The restore process itself should be timed against your Recovery Time Objective (RTO) — the maximum tolerable period of downtime — to confirm that your architecture can realistically achieve the recovery commitments documented in your SLA.
Offsite and off-region backup storage is non-negotiable for disaster recovery readiness. In cloud-hosted environments, this means ensuring backup data is replicated to a geographically separate region in accordance with your data residency obligations — a consideration that intersects directly with UK GDPR’s requirements around international data transfers and storage location.
2.3 Replication Health
For environments using transactional replication, distributor latency should be reviewed regularly to confirm that replicated commands are being applied to subscriber databases within acceptable windows. Accumulating undistributed commands are a leading indicator of replication failure and, if left unresolved, can result in subscriber databases diverging significantly from the publisher.
IT managers should also be aware that legacy database mirroring — a feature deprecated in older SQL Server releases and removed entirely in more recent versions — is no longer a supported high-availability mechanism. Environments still relying on this architecture should have a migration to Always On Availability Groups planned and resourced as a priority.
Section 3: Security & Compliance
Database security is both a technical and a regulatory obligation. Under the UK GDPR, organisations acting as data controllers or processors must implement appropriate technical and organisational measures to protect personal data. Failure to do so can result in enforcement action by the Information Commissioner’s Office (ICO), including fines of up to £17.5 million or 4% of global annual turnover — whichever is higher.
3.1 Access Controls & Privilege Management
The principle of least privilege (PoLP) must be enforced consistently across all database accounts. Service accounts and application accounts should hold only the permissions required to perform their specific functions — granting elevated administrative roles to these accounts is an extremely common misconfiguration that significantly amplifies the blast radius of any compromise.
Default administrative accounts should be disabled or renamed. Separation of Duties (SoD) must be implemented to ensure that database administrators do not have unrestricted, unaudited access to production data containing personally identifiable information (PII) without a formal authorisation workflow. This is not merely a security best practice — in heavily regulated sectors such as financial services and healthcare, it is a compliance requirement.
Service account credentials and connection string passwords must be audited and rotated on a defined schedule. Where cloud infrastructure is involved, the use of Managed Identities or IAM Roles to eliminate static credentials entirely represents a significant security improvement and should be adopted where operationally feasible. Orphaned database users — principals that exist at the database level but are no longer mapped to a valid server login — should be identified and removed to reduce the attack surface.
3.2 Encryption
Transparent Data Encryption (TDE) should be enabled for all databases holding sensitive or regulated data. TDE protects data at rest — including backup files and transaction logs — ensuring that physical theft or unauthorised access to storage media does not result in readable data.
Encryption in transit is equally important. TLS 1.2 should be enforced as the minimum standard for all client-to-server database connections, with TLS 1.3 preferred where supported. Legacy protocols including SSL 3.0 and earlier versions of TLS must be explicitly disabled, as they are subject to well-documented cryptographic vulnerabilities.
For highly sensitive data fields — National Insurance numbers, financial account identifiers, protected health information — column-level encryption provides an additional layer of protection that operates independently of TDE. This is particularly relevant in cloud environments where database administrators may have broad infrastructure access, and where restricting visibility of specific data fields at the encryption layer is the most robust control available.
Encryption key management must be treated as seriously as the encryption itself. Keys stored alongside the data they protect offer limited security benefit. Centralised key management services, such as Azure Key Vault or AWS Key Management Service, should be used to hold encryption keys separately from backup media and database volumes.
3.3 Auditing & Logging
Native database auditing must be enabled and configured to capture login events, privilege escalation activities, schema changes, and access to tables holding sensitive data. Relying on application-layer logging alone is insufficient — it does not capture direct database access by DBAs, developers, or compromised service accounts.
Audit logs should be centralised into a Security Information and Event Management (SIEM) platform. Centralisation enables correlation across multiple systems and supports the detection of anomalous patterns — such as bulk data extraction outside business hours, schema modifications by non-administrative accounts, or repeated failed authentication attempts that may indicate a brute-force attack or credential stuffing campaign.
Retention of audit logs must align with your data retention policy and applicable legal obligations. The UK GDPR’s accountability principle requires organisations to be able to demonstrate compliance — and in the event of an ICO investigation following a data breach, comprehensive audit trails are among the most important evidence an organisation can produce.
3.4 Vulnerability Assessment & Patching
Database environments should be subject to regular vulnerability assessments using recognised tooling aligned with industry benchmarks such as those published by the Centre for Internet Security (CIS). Assessment findings should be documented, risk-rated, and tracked through to remediation or formal risk acceptance.
Cumulative updates and security patches must be applied in a timely manner. Database vendors release security updates on predictable schedules, and organisations that fall significantly behind the current patch level are knowingly operating with publicly disclosed vulnerabilities. A defined patching policy — specifying maximum acceptable lag from vendor release to production deployment — is an essential governance control.
Section 4: Integrity & Data Quality
4.1 Database Consistency Checks
Database consistency checks are the mechanism by which the database engine verifies the physical and logical integrity of every page, allocation structure, and object within a database. They should be scheduled and verified regularly — at minimum weekly for production databases — with completion status reviewed after every run.
For Very Large Databases (VLDBs) where a full consistency check cannot be completed within a maintenance window, a rolling approach is appropriate: lighter physical-only checks can be run nightly, with comprehensive logical checks performed on a weekly or monthly cycle. The key principle is that no database should go for an extended period without any form of integrity validation. Page corruption events, when they occur, can go undetected for months if consistency checks are absent — and the longer corruption goes undetected, the greater the likelihood that it has propagated into backups.
4.2 Index Maintenance
Index fragmentation accumulates naturally as data is inserted, updated, and deleted over time. Fragmented indexes cause the query engine to perform significantly more input/output operations than necessary, degrading read performance in a manner that is often gradual and therefore easily misattributed to other causes.
Index maintenance should be scheduled during low-activity windows and calibrated based on measured fragmentation levels rather than run uniformly across all objects regardless of need. Online index rebuild operations are available for production environments that cannot tolerate blocking, allowing index maintenance to proceed concurrently with live workloads — though they do consume additional I/O and CPU during execution.
4.3 Statistics Management
The query optimiser’s ability to generate efficient execution plans is fundamentally dependent on the accuracy of column and index statistics. Stale statistics — where the statistical information held by the engine no longer accurately reflects the actual distribution of data — can lead the optimiser to select dramatically suboptimal execution strategies, resulting in poor query performance that is notoriously difficult to diagnose without examining the statistics directly.
Automatic statistics updates should be enabled, but they are not infallible. Following bulk data loads, statistics on heavily modified tables should be updated manually, as the automatic threshold may not trigger promptly enough to prevent suboptimal plan compilation against the freshly loaded data.
Section 5: Configuration & Architecture Review
5.1 Database-Level Configuration
Database compatibility level controls which query optimiser behaviours and cardinality estimation models are active. Databases running at an outdated compatibility level miss the benefit of optimiser improvements introduced in more recent engine versions. Conversely, raising the compatibility level without adequate testing can cause plan regressions if existing queries relied on legacy optimiser behaviour. Compatibility level changes should always be preceded by thorough regression testing in a non-production environment.
Database collation must be consistent across databases that interact with one another through cross-database queries or linked server connections. Collation mismatches force implicit data type conversions that prevent the engine from using indexes efficiently, introducing hidden performance penalties that are difficult to identify without examining execution plans in detail.
The recovery model configured for each database must align with the backup strategy. Databases configured for full recovery must have transaction log backups running; without them, the log will grow unbounded until disk space is exhausted. The auto-shrink setting must be disabled across all production databases — it is one of the most damaging default configurations that can exist in a production environment, causing index fragmentation and significant I/O overhead each time it fires.
5.2 Instance-Level Configuration
Maximum server memory must be explicitly configured on every SQL Server instance. The default setting permits the database engine to consume virtually unlimited host memory, which can destabilise the operating system and any co-hosted services by starving them of RAM. The correct value depends on the total host memory and whether other workloads share the server, but leaving it at the default is never appropriate in a production environment.
Parallelism configuration — specifically the maximum degree of parallelism and the cost threshold at which parallel query plans are considered — should be reviewed against Microsoft’s current recommendations and calibrated to the server’s NUMA architecture. Poorly configured parallelism is a significant source of unnecessary CPU consumption and can cause thread contention on busy systems.
Linked server connections should be audited and those no longer in active use should be removed. Beyond the performance implications of stale linked server definitions, each active linked server represents a potential lateral movement path in the event of a security compromise.
5.3 Cloud & Hybrid Architecture
For cloud-hosted database services such as Azure SQL Database or Amazon RDS, resource tier sizing should be reviewed against actual utilisation metrics. Both persistent over-provisioning — which inflates operating costs unnecessarily — and under-provisioning — which degrades performance and stability — are common findings in environments that were sized at initial deployment and never subsequently reviewed.
Network access configuration deserves particular scrutiny. Databases accessible over the public internet without strict IP allowlisting or private endpoint enforcement represent a critical security risk and are likely to constitute a failure to implement appropriate technical measures under the UK GDPR’s data security obligations. Private endpoint configuration and Virtual Network integration should be the baseline expectation for any production cloud database, not an optional enhancement.
Section 6: Monitoring, Alerting & Observability
6.1 Monitoring Coverage
Effective database monitoring must go beyond simple availability checks. The metrics that matter most for a complete picture of database health include CPU utilisation, memory pressure expressed through Page Life Expectancy, input/output latency, active connection counts, blocking chain duration, deadlock frequency, and failed authentication events.
Blocking and deadlocking deserve special attention. Short-lived blocking is a normal characteristic of concurrent database workloads; persistent blocking chains that extend across seconds or minutes indicate deeper problems with transaction design, indexing strategy, or isolation level configuration. Deadlock events — where two or more transactions permanently block one another — should be captured, logged, and reviewed regularly, as their presence typically reflects a structural flaw that will recur and worsen as workload grows.
6.2 Alerting & Escalation
Alert thresholds must be configured to provide meaningful early warning without generating alert fatigue. Critically, all database alerts should be routed into the organisation’s IT Service Management (ITSM) platform — whether ServiceNow, Jira Service Management, or an equivalent — to ensure they are tracked against SLA, assigned to appropriate owners, and resolved in a manner that creates an auditable record.
Alerts that fire and are silently acknowledged without investigation or resolution are arguably worse than no alerts at all. They normalise alert noise, erode team responsiveness, and — when one of those alerts eventually signals a genuine, serious issue — risk being dismissed until it is too late.
Section 7: Documentation & Change Management
A technically sound database environment without adequate documentation is fragile. The operational knowledge needed to manage a database estate effectively — understanding its architecture, its dependencies, its maintenance schedules, and its recovery procedures — must not reside solely in the minds of individual team members.
7.1 Database Inventory & Documentation
A current and accurate database inventory is the foundation of effective governance. It should capture, at minimum: instance names and versions, hosting location, data classification, regulatory scope, business owner, and primary DBA contact. This inventory must be treated as a living document and reviewed whenever the database estate changes.
Scheduled maintenance jobs — covering backup schedules, consistency checks, index maintenance, and statistics updates — should each have documented ownership, expected completion windows, and defined escalation paths for failures. Runbooks for common operational procedures — failover, backup restore, emergency log file management — must be written, tested, and stored in a location accessible to on-call personnel at any hour.
7.2 Change Management
All changes to production database environments must pass through a formal change control process. For organisations operating under ITIL or equivalent frameworks, this means submission to and approval by a Change Advisory Board (CAB) before any work is carried out. Emergency changes are inevitable in any live environment, but they must be retrospectively documented with root cause analysis and impact assessment to prevent the same emergency from recurring.
The discipline of change management in database environments is not bureaucratic overhead — it is the primary mechanism by which organisations protect the stability of their most critical data assets and maintain the audit trail required to demonstrate operational governance to internal stakeholders, regulators, and auditors alike.
Recommended Health Check Cadence
Backup completion and job failure reviews should be conducted daily without exception. Disk space and input/output latency trends benefit from daily monitoring to catch gradual degradation before it becomes critical.
Index fragmentation analysis, database consistency checks, and query performance reviews are well-suited to a weekly cadence in most environments. Security audit activities — including privilege reviews and access control validation — should be performed monthly, alongside a comprehensive consistency check pass. Disaster recovery failover testing, vulnerability assessments, and patch compliance reviews should occur quarterly. A full architecture and capacity review, incorporating growth projections and infrastructure alignment, should be conducted at least annually.
Conclusion
A robust database health check programme is not merely a technical exercise — it is a critical component of organisational resilience, regulatory compliance, and service quality assurance. IT managers who implement this checklist systematically will be better positioned to pre-empt performance degradation, demonstrate compliance to the ICO, minimise data breach exposure, and deliver against the availability commitments enshrined in their service level agreements.
The databases you manage are not passive repositories. They are dynamic, interconnected systems that degrade under neglect and thrive under disciplined stewardship. Invest in their governance as seriously as you invest in the applications they support — because when a database fails, so does everything built upon it.