Finding Columns with Skewed Data

Queries with parameter-sensitive plans can perform poorly when an inappropriate cached plan is reused.

Even if your statistics are up to date, parameter-sensitive plans can be caused by skewed data,
so performing a data skew analysis can identify which filter columns might be involved in poor query plans.
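
As a minimal sketch of the problem (the table, column and values here are hypothetical): suppose dbo.Orders has 10 million rows, 99% of which have Status = 'Closed'. Whichever @Status value is sniffed at compile time determines the cached plan, and that plan is then reused for the other value:

-- Hypothetical example: the plan shape depends on the sniffed parameter value
CREATE OR ALTER PROCEDURE dbo.GetOrdersByStatus (@Status varchar(20))
AS
BEGIN
    SELECT OrderId, CustomerId, OrderDate
    FROM dbo.Orders
    WHERE Status = @Status;  -- seek + lookups suit 'Open'; a scan suits 'Closed'
END;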

I’ve adapted the code linked in the header comments below into a SQL Server stored procedure that can be run across an entire database, a schema, a single table, or just a single column.

It should be relatively easy to convert this to other RDBMS.

Here’s the stored procedure; I’ve run it against the Stack Overflow 2013 downloadable database (approximately 50 GB):

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- Note: does not use Welford's numerically stable one-pass algorithm.
-- This should not be a problem in the majority of cases.
--
-- Can be run for all tables' columns:
-- exec dbo.CalculateSkewness NULL, NULL, NULL, 50, 0
--
-- or all tables' columns in a schema:
-- exec dbo.CalculateSkewness 'dbo', NULL, NULL, 50, 0
--
-- or all columns in a single table,
-- exec dbo.CalculateSkewness 'dbo', 'MyTable', NULL, 50, 0
--
-- or just a single column.
-- exec dbo.CalculateSkewness 'dbo', 'MyTable', 'MyColumn', 50, 0
--
-- Might take a while on a large database.
--
-- Adapted from:
-- https://gitlab.com/swasheck/statistics-scripts/blob/master/table%20analysis.sql
--
-- Refs:
-- https://swasheck.wordpress.com/2016/04/06/skewed-data-finding-the-columns/
-- https://brownmath.com/stat/shape.htm
-- https://learnsql.com/blog/high-performance-statistical-queries-skewness-kurtosis/
--
--
CREATE OR ALTER PROCEDURE dbo.CalculateSkewness
(
@pSchemaName sysname = NULL,
@pTableName sysname = NULL,
@pColumnName sysname = NULL,
@SamplePercent int = -1,
@Debug bit = 1
)
AS
BEGIN
SET NOCOUNT ON;
SET XACT_ABORT ON;
-- Clamp @SamplePercent: negative values become -1, values above 100 become 100 (-1 and 0 have special meanings; see below)
if @SamplePercent < 0
SET @SamplePercent = -1;
else
if @SamplePercent > 100
SET @SamplePercent = 100;
declare @TableRowCount bigint;
declare @tallysql nvarchar(max);
declare @sql nvarchar(max);
drop table if exists #tmp_tally_table;
drop table if exists #table_analysis;
create table #tmp_tally_table
(
[key] nvarchar(max),
x decimal(30,0),
f decimal(30,0)
);
create table #table_analysis
(
schema_name sysname,
table_name sysname,
column_name sysname,
b1 decimal(30,5),
G1 decimal(30,5),
ses decimal(30,5),
distinct_values decimal(30,0),
total_values decimal(30,0),
distinct_ratio decimal(20,10),
table_rows decimal(30,0),
distinct_ratio_null decimal(20,10),
Zg1 decimal(30,5),
tallysql nvarchar(max)
);
DECLARE @SchemaName sysname, @TableName sysname, @ColumnName sysname;
declare colcur cursor local fast_forward for
SELECT
sc.name,
t.name,
c.name
FROM
sys.tables t
JOIN sys.schemas sc ON sc.schema_id = t.schema_id
JOIN sys.columns c ON c.object_id = t.object_id
JOIN sys.types ty ON ty.user_type_id = c.user_type_id
WHERE
sc.name != 'cdc'
AND sc.name != 'history'
AND t.name not like '#%' and t.name != 'awsdms_truncation_safeguard'
AND c.is_identity = 0
AND c.is_computed = 0
AND c.is_filestream = 0
AND c.is_xml_document = 0
AND t.is_memory_optimized = 0
AND c.max_length != -1 -- skip (n)varchar(max) columns
AND c.max_length < 200 -- skip long text columns
AND ty.user_type_id NOT IN (34, 35, 99)
AND OBJECTPROPERTY(t.object_id, 'IsTable') = 1
AND (@pSchemaName IS NULL OR sc.name = @pSchemaName)
AND (@pTableName IS NULL OR t.name = @pTableName)
AND (@pColumnName IS NULL OR c.name = @pColumnName)
ORDER BY
sc.name, t.name, c.name
OPTION(RECOMPILE);
open colcur
fetch next from colcur into @SchemaName, @TableName, @ColumnName
while @@fetch_status = 0
begin
if @Debug = 1
print '--' + quotename(@SchemaName) + '.' + quotename(@TableName) + '.' + quotename(@ColumnName)
select
@TableRowCount = sum(rows)
from
sys.partitions p
join sys.tables t on p.object_id = t.object_id
join sys.schemas sc on sc.schema_id = t.schema_id
where
sc.name = @SchemaName
and t.name = @TableName;
/*
SamplePercent:
if the parameter is -1, use the average of the sample ratio of all stats on the table
if the parameter is 0, use the max sample ratio of all stats on the table
otherwise use the sample percent that was supplied
*/
select
@SamplePercent =
case
when @SamplePercent = -1 then isnull(nullif(cast(round(avg((100. * sp.rows_sampled) / sp.unfiltered_rows),0) as int),0),1)
when @SamplePercent = 0 then isnull(nullif(cast(round(max((100. * sp.rows_sampled) / sp.unfiltered_rows),0) as int),0),1)
else @SamplePercent
end
from sys.tables t
join sys.stats s on t.object_id = s.object_id
join sys.columns c on t.object_id = c.object_id
join sys.schemas sc on sc.schema_id = t.schema_id
cross apply sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
where
s.has_filter = 0
and t.name = @TableName
and sc.name = @SchemaName
and c.name = @ColumnName
and index_col(sc.name + '.' + t.name, s.stats_id, 1) = @ColumnName
-- Generate the base distribution histogram: value, freq(value)
set @tallysql =
'select
[key] = ' + quotename(@ColumnName) + ',
x = dense_rank() over (order by ' + quotename(@ColumnName) + '),
f = count_big(1)
from ' + quotename(@SchemaName) + '.' + quotename(@TableName) + '
tablesample (' + cast(@SamplePercent as nvarchar(3)) + ' percent)
where ' + quotename(@ColumnName) + ' is not null
group by ' + quotename(@ColumnName) + ';'
set @sql = 'insert into #tmp_tally_table ' + @tallysql;
if @Debug = 1
print @sql;
else
exec sp_executesql @sql;
/*
Calculate skewness stats:
b1 -> population skewness
G1 -> sample skewness (corrects for sample bias)
ses -> standard error of skew. This is a standard deviation calculation on the G1
Zg1 -> test statistic which provides more meaning to the skewness calculation
*/
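-- For reference, the formulas implemented below (see the Refs links above),
-- where mean = sxf/n, m2 = SUM(f*(x - mean)^2)/n and m3 = SUM(f*(x - mean)^3)/n:
--   b1  = m3 / m2^(3/2)
--   G1  = b1 * SQRT(n*(n-1)) / (n-2)
--   ses = SQRT( 6*n*(n-1) / ((n-2)*(n+1)*(n+3)) )
--   Zg1 = G1 / ses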
set @sql =
'insert #table_analysis
(
schema_name,
table_name,
column_name,
b1,
G1,
ses,
distinct_values,
total_values,
distinct_ratio,
table_rows,
distinct_ratio_null,
Zg1,
tallysql
)
select
''' + @SchemaName + ''',
''' + @TableName + ''',
''' + @ColumnName + ''',
sk.b1,
sk.G1,
sk.ses,
d,
n,
(100. * d)/n,
' + cast(@TableRowCount as nvarchar(255)) + ',
(100.*d)/' + cast(@TableRowCount as nvarchar(255)) + ',
zG1 = case sk.ses when 0 then sk.G1 else sk.G1/sk.ses end,
''' + @tallysql + '''
from (
select
b1 =
case
when (power(m2,1.5)) = 0 then 0
else m3 / (power(m2,1.5))
end,
G1 =
case
when n <= 2 then 0
else
(sqrt(1.*(n*(n-1)))/(n-2)) *
case
when (power(m2,1.5)) = 0 then 0
else m3 / (power(m2,1.5))
end
end,
ses = case
when ((n-2.)*(n+1.)*(n+3.)) = 0 then 0
else sqrt(((1.*(6.*n)*(n-1.)))/((n-2.)*(n+1.)*(n+3.)))
end,
d,n
from (
select
n,
d,
m2 = sum(power((x-(sxf/n)),2)*f)/n,
m3 = sum(power((x-(sxf/n)),3)*f)/n
from (
select
x,
f,
sxf = 1. * sum(x*f) over(),
n = sum(f) over(),
d = count(x) over()
from #tmp_tally_table
) base
group by n,d
) agg
) sk';
if @Debug = 1
print @sql;
else
exec sp_executesql @sql;
set @sql = 'truncate table #tmp_tally_table;';
if @Debug = 1
print @sql;
else
exec sp_executesql @sql;
fetch next from colcur into @SchemaName, @TableName, @ColumnName
end
close colcur;
deallocate colcur;
drop table if exists #tmp_tally_table;
-- The critical value of Zg1 is approximately 2 (This is a two-tailed test of skewness ≠ 0 at roughly the 0.05 significance level)
-- If Zg1 < −2, the population is very likely skewed negatively (though you don’t know by how much).
-- If Zg1 is between −2 and +2, you can’t reach any conclusion about the skewness of the population: it might be symmetric, or it might be skewed in either direction.
-- If Zg1 > 2, the population is very likely skewed positively (though you don’t know by how much).
select
*
from
#table_analysis
where
abs(Zg1) > 2
and distinct_ratio <= 40
order by
abs(Zg1) desc;
--drop table #table_analysis
END

SQL Server: Compressing a Table and/or Indexes

I always forget whether the first syntax compresses the nonclustered indexes as well, so I’m posting it here so I don’t forget again!

This compresses just the clustered index (i.e. the table data):

-- Just clustered index
ALTER TABLE dbo.Table
REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = PAGE);

This compresses all indexes including the clustered index:

-- All indexes including clustered index
ALTER INDEX ALL ON dbo.Table
REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = PAGE);
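
If you’re not sure what compression is already in place, a quick way to check (a simple sketch querying the catalog views) is:

-- Current compression setting for every index and partition of user tables
SELECT
    [table_name] = o.name,
    [index_name] = i.name,
    p.partition_number,
    p.data_compression_desc
FROM sys.partitions p
JOIN sys.objects o ON o.object_id = p.object_id
JOIN sys.indexes i ON i.object_id = p.object_id AND i.index_id = p.index_id
WHERE o.is_ms_shipped = 0
ORDER BY o.name, i.index_id, p.partition_number;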

Tuning SQL Server Queries 101

First, get an actual query execution plan. Look for warnings in the query plan.

Basic

  • First look for large scans and lookups: these can often be resolved by creating a new index or extending an existing one with additional included columns. Seeks are usually preferable to scans.
  • Then look for significant variance between Actual and Estimated row counts: you can provide the optimiser with more accurate information by updating statistics, creating new statistics objects, adding statistics on computed columns, or by breaking the query up into simpler parts. The variance might also be caused by ‘parameter sniffing’ (see the sketch after this list).
  • Then look for expensive operators in the query plan, especially those that consume memory such as sorts and hashes. Sorting can sometimes be avoided by altering/adding indexes.
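
For the statistics point above, a couple of illustrative commands (the table and column names are placeholders, not a prescription):

-- Refresh existing statistics on a table with a full scan
UPDATE STATISTICS dbo.Orders WITH FULLSCAN;

-- Create a new single-column statistics object on a filter column
CREATE STATISTICS st_Orders_Status ON dbo.Orders (Status) WITH FULLSCAN;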

More Advanced

Joe Sack has an excellent walkthrough here: The Case of the Cardinality Estimate Red Herring

SQLFrontline Freemium Edition

I’ve recently changed what I was calling the ‘Demo’ version of SQLFrontline to a ‘Freemium’ model. The demo version only displayed one recommendation result in each of the four severity categories (Critical, High, Medium, Info).

The free version obviously does not include all the features of the paid premium version, but it still provides useful recommendations, covering 40 checks.

Both use the same lightweight metadata collection.

The Free Version:

  • Performs 40 checks (out of 350+), but doesn’t show all affected objects if a check results in a recommendation
  • Deletes all collected metadata after collection
  • No reminder list is shown
  • Does not display a list of issues that are no longer present since last collection
  • Sends a single email for all servers
  • No database specific recommendations are made
  • Can only collect metadata for a limited number of servers

The Premium Version:

  • 350+ checks performed across the categories of Configuration, Reliability, Performance, Security, Server Info, Table/Index Design
  • New checks are constantly being added
  • Reminder list of recommendations that have been made previously and not yet fixed
  • List of issues fixed compared to the last collection
  • Can choose how long to store collected metadata, so that point in time reports can be made, along with automated estimates of DB growth over time
  • Can send 1 email per server or a single email for all servers
  • Ability to ‘mute’ recommendations on an individual basis, or entire check (for non-critical checks)
  • No practical limit on the number of servers

If you want to try it out, click this link to request a free access token.

Once you have an access token, here’s how to run it: How to run SQLFrontline

Don’t Embed SQL into SSRS Reports

Reasons not to embed SQL in SSRS reports (.rdl) and create stored procedures instead:

  • Easier to version control in SCC
  • Better visibility in the database
  • Easier to debug
  • Easier to fix any performance problems (and to hot fix, if necessary)
  • Easier to re-use functionality (which is often lengthy to produce) with other systems/APIs etc
  • Stored Procedures get given a name that accurately describes what they do
  • Easier to gather execution statistics
  • Easy to see what parameters are used to filter
  • Can be secured with explicit permissions
  • Easier to write automated testing against a stored procedure

All seem fairly obvious, but it’s surprising how many people still embed SQL into SSRS reports.

Date and Time Dimension

Almost every fact table in a data warehouse uses a date (or calendar) dimension, because most measurements are defined at specific points in time. A flexible calendar date dimension is at the heart of most data warehouse systems; it provides easy navigation of a fact table through familiar dates, such as weeks, months, fiscal periods and special days (today, weekends, holidays, etc.).

I’ve created a date dimension generator, available here on GitHub.

It targets SQL Server, but should be easy to convert to other RDBMS.

It features:

  • User defined start and end dates
  • Computed Easter dates (for years 1901 to 2099)
  • Computed Chinese New year dates for years 1971 to 2099.
  • Computed public holidays for US, UK, Canada, Ireland, Malta, Philippines, Australia (with state specific for WA, NSW, QLD, SA, VIC).
  • Date labels in US, UK and ISO formats.

Things to Note:

  1. The [TodayFlag] needs to be updated once per day by a scheduled task (timezone dependent: it might need a flag for each timezone). A minimal sketch of this update appears after this list.

  2. If you use an unusual Fiscal year (say 5-4-4), it will need to be loaded from an external source (such as an Excel/Google spreadsheet).

  3. The precise start date of the month of Ramadan is set by proclamation, so these dates need to be added year by year. It is possible to calculate them, but the result can be a day out and can vary by region.

    https://travel.stackexchange.com/questions/46148/how-to-calculate-when-ramadan-finishes

    https://en.wikipedia.org/wiki/Ramadan_%28calendar_month%29
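
For point 1, here’s a minimal sketch of the daily update, assuming the generated dimension table is dbo.DimDate with DateValue and TodayFlag columns (adjust the names to match the generator’s output):

-- Clear yesterday's flag, then set today's (uses the server's local date)
UPDATE dbo.DimDate SET TodayFlag = 0 WHERE TodayFlag = 1;
UPDATE dbo.DimDate SET TodayFlag = 1 WHERE DateValue = CAST(GETDATE() AS date);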

Babelfish for PostgreSQL

This has the capacity to be huge:

Babelfish for PostgreSQL is an Apache-2.0 open source project that adds a Microsoft SQL Server-compatible end-point to PostgreSQL to enable your PostgreSQL database to understand the SQL Server wire protocol and commonly used SQL Server commands. With Babelfish, applications that were originally built for SQL Server can work directly with PostgreSQL, with little to no code changes, and without changing database drivers.

Do You Name All Your SQL Server Database Constraints?

If you define a constraint without explicitly giving it a name, SQL Server will generate one for you.
You know the ones; they look something like this: PK__MY_TABLE__3213E83FA7739BB4.

Why might that be a bad thing? It makes writing deployment scripts harder because you won’t know up front the names of constraints you might want to refer to.
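
The fix is simply to name constraints explicitly when you create them, for example (the table and column names are illustrative):

ALTER TABLE dbo.MyTable
    ADD CONSTRAINT PK_MyTable PRIMARY KEY CLUSTERED (Id);

ALTER TABLE dbo.MyTable
    ADD CONSTRAINT DF_MyTable_CreatedDate DEFAULT (GETDATE()) FOR CreatedDate;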

Michael J Swart describes a query to discover the system-generated names in your databases (shown here with a small modification):

SELECT
    [Schema] = SCHEMA_NAME(o.schema_id),
    [System Generated Name] = OBJECT_NAME(o.object_id),
    [Parent Name] = OBJECT_NAME(o.parent_object_id),
    [Object Type] = o.type_desc
FROM
    sys.objects o
    JOIN sys.sysconstraints c ON o.object_id = c.constid
WHERE
    (status & 0x20000) > 0
    and o.is_ms_shipped = 0

According to the sys.sysconstraints documentation page:

This SQL Server 2000 system table is included as a view for backward compatibility. We recommend that you use the current SQL Server system views instead. To find the equivalent system view or views, see Mapping System Tables to System Views (Transact-SQL). This feature will be removed in a future version of Microsoft SQL Server. Avoid using this feature in new development work, and plan to modify applications that currently use this feature.

You can query the same information by using the individual views unioned together:

SELECT
    [Schema] = SCHEMA_NAME(schema_id),
    [System Generated Name] = OBJECT_NAME(object_id),
    [Parent Name] = OBJECT_NAME(parent_object_id),
    [Object Type] = type_desc
FROM sys.check_constraints
WHERE is_system_named = 1
 
UNION ALL
 
SELECT
    [Schema] = SCHEMA_NAME(schema_id),
    [System Generated Name] = OBJECT_NAME(object_id),
    [Parent Name] = OBJECT_NAME(parent_object_id),
    [Object Type] = type_desc
FROM sys.default_constraints
WHERE is_system_named = 1
 
UNION ALL
 
SELECT
    [Schema] = SCHEMA_NAME(schema_id),
    [System Generated Name] = OBJECT_NAME(object_id),
    [Parent Name] = OBJECT_NAME(parent_object_id),
    [Object Type] = type_desc
FROM sys.key_constraints
WHERE is_system_named = 1
 
UNION ALL
 
SELECT
    [Schema] = SCHEMA_NAME(schema_id),
    [System Generated Name] = OBJECT_NAME(object_id),
    [Parent Name] = OBJECT_NAME(parent_object_id),
    [Object Type] = type_desc
FROM sys.foreign_keys 
WHERE is_system_named = 1

SQLFrontline: Server Overview

Have you just been given a bunch of SQL Servers that you’re now responsible for? Do you want a really fast overview of each server’s hardware, SQL Server version and service pack level, configuration, database sizes and usage, performance problems, and weak passwords?

Want the results emailed to you in a prioritised, easy to read format?

SQLFrontline can do this with just a few commands. Behind the scenes it runs hundreds of lightweight metadata collection queries against all the specified servers/databases (no user data is collected). SQLFrontline currently performs 300+ checks across the categories of Reliability, Performance, Configuration, Security and Database Design.

SQLFrontline Case Study: Failing Backups

Over the course of 20 years dealing with SQL Server, I’ve come across failing backups more times than I’d like to recall. (And that’s not counting the times there were no backups set up in the first place!)

In these cases, backups were set up and checked to be working, but at a later date backup failures went unreported for several reasons (non-exhaustive):

  1. Configuration on the SMTP server changed: for example, the allowed IP whitelist for forwarding or its permissions changed, or the SMTP server itself was replaced.
  2. Virus scanner configuration changed, preventing emails from being sent.
  3. AD group permissions changed, or SQL Server service identities changed.

In all these cases, backups were failing but no one was being alerted, and no one was periodically checking the SQL Agent logs.

SQLFrontline checks for no backups in the last 7 days, backups done without compression (compressed backups take up less space obviously, but are also faster to backup and restore), backups done without verifying page checksums, and backups done without encryption (if your version of SQL Server supports it). It also checks that you are periodically running DBCC CHECKDB to maintain database integrity, and whether any data corruption has been detected (automatically repaired or otherwise).
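
As an aside, a typical backup command covering the compression and checksum points above might look like this (the database name and path are placeholders):

BACKUP DATABASE MyDatabase
TO DISK = N'D:\Backups\MyDatabase_Full.bak'
WITH COMPRESSION, CHECKSUM, STATS = 10;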

SQLFrontline currently performs 62 Reliability checks on each SQL Server you monitor, with 300+ checks performed across the categories of Reliability, Performance, Security, Configuration and Database Design.

Side Note: Another thing to consider is, do you delete older backups BEFORE making sure the latest backup succeeded? If so, you might end up with no recent local backups at all when your backup job starts failing… If you use Ola Hallengren’s maintenance solution scripts, this check is performed correctly for you.