SQLFrontline Case Study: Failing Backups

Over the course of 20 years dealing with SQL server, I’ve come across failing backups more times than I’d like to recall. (And that’s not counting the times there were no backups setup in the first place!)

In the case of failing backups, backups were set up, checked to be working, and then at a later date subsequently failed to notify of backup failures for several reasons (non-exhaustive):

Configuration on the SMTP server changed, such as the allowed IP white list for forwarding, permissions changed, or the actual SMTP server changed.
Virus scanner configuration changed preventing emails to be sent.
AD group permissions changed or SQL server service identities changed.

In all these cases, backups were failing but no one was being alerted and no one was periodically checking the SQL agent logs.

SQLFrontline checks for no backups in the last 7 days, backups done without compression (compressed backups take up less space obviously, but are also faster to backup and restore), backups done without verifying page checksums, and backups done without encryption (if your version of SQL Server supports it). It also checks that you are periodically running DBCC CHECKDB to maintain database integrity, and whether any data corruption has been detected (automatically repaired or otherwise).

SQLFrontline currently performs 62 Reliability checks on each SQL Server you monitor, with 300+ checks performed across the categories of Reliability, Performance, Security, Configuration and Database Design.

Side Note: Another thing to consider is, do you delete older backups BEFORE making sure the latest backup succeeded? If so, you might end up with no recent local backups at all when your backup job starts failing… If you use Ola Hallegren’s maintenance solution scripts, this check is performed correctly for you.