SQLFrontline Case Study: Failing Backups

Over the course of 20 years dealing with SQL server, I’ve come across failing backups more times than I’d like to recall. (And that’s not counting the times there were no backups setup in the first place!)

In the case of failing backups, backups were set up, checked to be working, and then at a later date subsequently failed to notify of backup failures for several reasons (non-exhaustive):

  1. Configuration on the SMTP server changed, such as the allowed IP white list for forwarding, permissions changed, or the actual SMTP server changed.
  2. Virus scanner configuration changed preventing emails to be sent.
  3. AD group permissions changed or SQL server service identities changed.

In all these cases, backups were failing but no one was being alerted and no one was periodically checking the SQL agent logs.

SQLFrontline checks for no backups in the last 7 days, backups done without compression (compressed backups take up less space obviously, but are also faster to backup and restore), backups done without verifying page checksums, and backups done without encryption (if your version of SQL Server supports it). It also checks that you are periodically running DBCC CHECKDB to maintain database integrity, and whether any data corruption has been detected (automatically repaired or otherwise).

SQLFrontline currently performs 62 Reliability checks on each SQL Server you monitor, with 300+ checks performed across the categories of Reliability, Performance, Security, Configuration and Database Design.

Side Note: Another thing to consider is, do you delete older backups BEFORE making sure the latest backup succeeded? If so, you might end up with no recent local backups at all when your backup job starts failing… If you use Ola Hallegren’s maintenance solution scripts, this check is performed correctly for you.

Set Up SQL Server Agent Alerts

SQL Server alerts are a free and easy way to get notified of corruption, agent job failures, or major failures before you get angry phone calls from users.

You should have already set up Database Mail and an alert operator to receive notifications. TIP: Always use an email distribution list for your operator notification email. If you use a person’s email, they might change jobs, or be on holiday…

The script below sets up the standard SQL Server Agent alerts for severity 17 through 25 as well as specific alerts for 823, 824 and 825 errors, and a bunch of others:


USE [msdb];
GO
 
SET NOCOUNT ON;

BEGIN TRANSACTION
 
 -- Change these as required...
DECLARE @Normal_Delay_Between_Responses int = 900   -- in seconds
DECLARE @Medium_Delay_Between_Responses int = 3600 
DECLARE @Long_Delay_Between_Responses int   = 14400 
DECLARE @OperatorName sysname = N'SQLDBAGroup'
DECLARE @CategoryName sysname = N'SQL Server Agent Alerts';

DECLARE @DelayBetweenResponses int
    
DECLARE @TotalAlerts int
DECLARE @Row int = 1
DECLARE @AlertName sysname
DECLARE @SeverityNo int
DECLARE @ErrorNo int
 
DECLARE @Alerts TABLE
(
    Id int IDENTITY(1,1) NOT NULL
    ,SeverityNo int NULL
    ,ErrorNo int NULL
    ,AlertName sysname NOT NULL
    ,DelayBetweenResponses int NOT NULL
);
 
INSERT @Alerts (SeverityNo, AlertName, DelayBetweenResponses) 
VALUES
     (17, 'Alert - Sev 17 - Insufficient Resources', @Normal_DELAY_BETWEEN_RESPONSES)
    ,(18, 'Alert - Sev 18 - Nonfatal Internal Error', @Normal_DELAY_BETWEEN_RESPONSES)
    ,(19, 'Alert - Sev 19 - Fatal Error in Resource', @Normal_Delay_Between_Responses)
    ,(20, 'Alert - Sev 20 - Fatal Error in Current Process', @Normal_Delay_Between_Responses)
    ,(21, 'Alert - Sev 21 - Fatal Error in Database Process', @Normal_Delay_Between_Responses)
    ,(22, 'Alert - Sev 22 - Fatal Error: Table Integrity Suspect', @Normal_Delay_Between_Responses)
    ,(23, 'Alert - Sev 23 - Fatal Error: Database Integrity Suspect', @Normal_Delay_Between_Responses)
    ,(24, 'Alert - Sev 24 - Fatal Error: Hardware Error', @Normal_Delay_Between_Responses)
    ,(25, 'Alert - Sev 25 - Fatal Error', @Normal_Delay_Between_Responses);
 
INSERT @Alerts (ErrorNo, AlertName, DelayBetweenResponses) 
VALUES
     (601,  'Alert - Error 601 - NOLOCK scan aborted due to data movement', @Long_Delay_Between_Responses)
    ,(674,  'Alert - Error 674 - Exception occurred in destructor', @Long_Delay_Between_Responses)
    ,(708,  'Alert - Error 708 - Low virtual address space or low virtual memory', @Medium_Delay_Between_Responses)
    ,(806,  'Alert - Error 806 - Audit failure: page read from disk failed basic integrity checks', @Medium_Delay_Between_Responses)

    ,(823,  'Alert - Error 823 - I/O Error: http://support.microsoft.com/kb/2015755', @Normal_Delay_Between_Responses)
    ,(824,  'Alert - Error 824 - Consistency-based I/O error: http://support.microsoft.com/kb/2015756', @Normal_Delay_Between_Responses)
    ,(825,  'Alert - Error 825 - File Read Retry: http://support.microsoft.com/kb/2015757', @Normal_Delay_Between_Responses)
    ,(832,  'Alert - Error 832 - Constant page has changed: http://support.microsoft.com/kb/2015759', @Normal_Delay_Between_Responses)
    ,(833,  'Alert - Error 833 - Long I/O request', @Long_Delay_Between_Responses)

    ,(855,  'Alert - Error 855 - Uncorrectable hardware memory corruption detected', @Normal_Delay_Between_Responses)
    ,(856,  'Alert - Error 856 - SQL Server has detected hardware memory corruption, but has recovered the page', @Normal_Delay_Between_Responses)

    ,(1205, 'Alert - Error 1205 - Transaction Deadlock arbitrated', @Long_Delay_Between_Responses)

    ,(3401, 'Alert - Error 3401 - Errors occurred during recovery while rolling back a transaction', @Normal_Delay_Between_Responses)
    ,(3410, 'Alert - Error 3410 - Data in filegroup is offline', @Medium_Delay_Between_Responses)
    ,(3414, 'Alert - Error 3414 - Recovery Error', @Long_Delay_Between_Responses)
    ,(3422, 'Alert - Error 3422 - Database was shutdown', @Long_Delay_Between_Responses)
    ,(3452, 'Alert - Error 3452 - Recovery inconsistency', @Long_Delay_Between_Responses)
    ,(3619, 'Alert - Error 3619 - Could not write a checkpoint because the log is out of space', @Medium_Delay_Between_Responses)
    ,(3620, 'Alert - Error 3620 - Automatic checkpointing is disabled because the log is out of space', @Medium_Delay_Between_Responses)
    ,(3959, 'Alert - Error 3959 - Version store is full', @Normal_Delay_Between_Responses)
    ,(5029, 'Alert - Error 5029 - Warning: Log has been rebuilt', @Long_Delay_Between_Responses)
    ,(5144, 'Alert - Error 5144 - Autogrow of file was cancelled by user or timed out', @Long_Delay_Between_Responses)
    ,(5145, 'Alert - Error 5145 - Long Autogrow of file', @Medium_Delay_Between_Responses)
    ,(5182, 'Alert - Error 5182 - New log file created', @Long_Delay_Between_Responses)
    ,(9001, 'Alert - Error 9001 - Transaction log not available', @Normal_Delay_Between_Responses)
    ,(9002, 'Alert - Error 9002 - Transaction log full', @Normal_Delay_Between_Responses)
    ,(17173, 'Alert - Error 17173 - Ignored trace flag', @Long_Delay_Between_Responses)
    ,(17883, 'Alert - Error 17883 - Non-yielding Worker on Scheduler', @Medium_Delay_Between_Responses)
    ,(17884, 'Alert - Error 17884 - New queries assigned to process have not been picked up by a worker thread', @Medium_Delay_Between_Responses)
    ,(17887, 'Alert - Error 17887 - Worker appears to be non-yielding on Node', @Medium_Delay_Between_Responses)
    ,(17888, 'Alert - Error 17888 - All schedulers on Node appear deadlocked', @Medium_Delay_Between_Responses)
    ,(17890, 'Alert - Error 17890 - A significant part of sql server process memory has been paged out', @Long_Delay_Between_Responses)
    ,(17891, 'Alert - Error 17891 - Worker appears to be non-yielding on Node', @Medium_Delay_Between_Responses)

SELECT @TotalAlerts = COUNT(*) FROM @Alerts

IF NOT EXISTS(SELECT * FROM msdb.dbo.sysoperators WHERE name = @OperatorName)
BEGIN
    RAISERROR ('SQL Operator %s does not exist', 18, 16, @OperatorName);
    RETURN;
END
 
IF NOT EXISTS (SELECT * FROM msdb.dbo.syscategories
               WHERE category_class = 2  -- ALERT
               AND category_type = 3 AND name = @CategoryName)
BEGIN
    EXEC dbo.sp_add_category @class = N'ALERT', @type = N'NONE', @name = @CategoryName;
END
 
BEGIN TRY
 
    WHILE @Row <= @TotalAlerts 
    BEGIN
 
        SELECT
             @AlertName = @@SERVERNAME + ' - ' + AlertName 
            ,@SeverityNo = SeverityNo
            ,@ErrorNo = ErrorNo
            ,@DelayBetweenResponses = DelayBetweenResponses
        FROM
            @Alerts
        WHERE
            Id = @Row
 
        IF EXISTS (SELECT * FROM msdb.dbo.sysalerts WHERE [name] = @AlertName) 
        BEGIN
            EXEC msdb.dbo.sp_delete_alert @name = @AlertName
        END
 
        IF @SeverityNo IS NOT NULL 
        BEGIN
            EXEC msdb.dbo.sp_add_alert 
                @name = @AlertName,
                @message_id = 0, 
                @severity = @SeverityNo, 
                @enabled = 1, 
                @Delay_Between_Responses = @DelayBetweenResponses, 
                @include_event_description_in = 1, 
                @category_name = @CategoryName, 
                @job_id = N'00000000-0000-0000-0000-000000000000'
        END
 
        IF @ErrorNo IS NOT NULL 
        BEGIN
            -- Errors 855 and 856 require SQL Server 2012+ and Enterprise Edition
            -- [Also need Windows Server 2012+, and hardware that supports memory error correction]
            IF @ErrorNo NOT IN (855, 856) 
               OR (LEFT(CONVERT(CHAR(2),SERVERPROPERTY('ProductVersion')), 2) >= '11' AND SERVERPROPERTY('EngineEdition') = 3)
            BEGIN
                EXEC msdb.dbo.sp_add_alert 
                    @name = @AlertName,
                    @message_id = @ErrorNo, 
                    @severity = 0, 
                    @enabled = 1, 
                    @Delay_Between_Responses = @DelayBetweenResponses, 
                    @include_event_description_in = 1, 
                    @category_name = @CategoryName, 
                    @job_id = N'00000000-0000-0000-0000-000000000000'
            END
        END
 
        EXEC msdb.dbo.sp_add_notification @Alert_Name = @AlertName, @Operator_Name = @OperatorName, @notification_method = 1
 
        SELECT @Row = @Row + 1

    END

END TRY

BEGIN CATCH
    PRINT ERROR_MESSAGE()
 
    IF @@TRANCOUNT > 0 
    BEGIN
        ROLLBACK
    END
END CATCH
 
IF @@TRANCOUNT > 0 
BEGIN
    COMMIT
END