Finding Columns with Skewed Data

Queries with parameter-sensitive plans can perform poorly when an inappropriate cached plan is reused.

Even if your statistics are up to date, parameter-sensitive plans can be caused by skewed data,
so performing a data skew analysis can identify which filter columns might be involved in poor query plans.
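
As a minimal sketch of the problem (the table, column and values here are hypothetical): suppose dbo.Orders has 10 million rows, 99% of which have Status = 'Closed'. Whichever @Status value is sniffed at compile time determines the cached plan, and that plan is then reused for the other value:

-- Hypothetical example: the plan shape depends on the sniffed parameter value
CREATE OR ALTER PROCEDURE dbo.GetOrdersByStatus (@Status varchar(20))
AS
BEGIN
    SELECT OrderId, CustomerId, OrderDate
    FROM dbo.Orders
    WHERE Status = @Status;  -- seek + lookups suit 'Open'; a scan suits 'Closed'
END;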

I’ve adapted the code linked in the header comments below into a SQL Server stored procedure that can be run across an entire database, a schema, a single table, or just a single column.

It should be relatively easy to convert this to other RDBMS.

Here’s the stored procedure; I’ve run it against the Stack Overflow 2013 downloadable database (approximately 50 GB):

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- Note: does not use Welford's numerically stable one-pass algorithm.
-- This should not be a problem in the majority of cases.
--
-- Can be run for all tables' columns:
-- exec dbo.CalculateSkewness NULL, NULL, NULL, 50, 0
--
-- or all tables' columns in a schema:
-- exec dbo.CalculateSkewness 'dbo', NULL, NULL, 50, 0
--
-- or all columns in a single table,
-- exec dbo.CalculateSkewness 'dbo', 'MyTable', NULL, 50, 0
--
-- or just a single column.
-- exec dbo.CalculateSkewness 'dbo', 'MyTable', 'MyColumn', 50, 0
--
-- Might take a while on a large database.
--
-- Adapted from:
-- https://gitlab.com/swasheck/statistics-scripts/blob/master/table%20analysis.sql
--
-- Refs:
-- https://swasheck.wordpress.com/2016/04/06/skewed-data-finding-the-columns/
-- https://brownmath.com/stat/shape.htm
-- https://learnsql.com/blog/high-performance-statistical-queries-skewness-kurtosis/
--
--
CREATE OR ALTER PROCEDURE dbo.CalculateSkewness
(
@pSchemaName sysname = NULL,
@pTableName sysname = NULL,
@pColumnName sysname = NULL,
@SamplePercent int = -1,
@Debug bit = 1
)
AS
BEGIN
SET NOCOUNT ON;
SET XACT_ABORT ON;
-- Clamp @SamplePercent: negative values become -1, values above 100 become 100 (-1 and 0 have special meanings; see below)
if @SamplePercent < 0
SET @SamplePercent = -1;
else
if @SamplePercent > 100
SET @SamplePercent = 100;
declare @TableRowCount bigint;
declare @tallysql nvarchar(max);
declare @sql nvarchar(max);
drop table if exists #tmp_tally_table;
drop table if exists #table_analysis;
create table #tmp_tally_table
(
[key] nvarchar(max),
x decimal(30,0),
f decimal(30,0)
);
create table #table_analysis
(
schema_name sysname,
table_name sysname,
column_name sysname,
b1 decimal(30,5),
G1 decimal(30,5),
ses decimal(30,5),
distinct_values decimal(30,0),
total_values decimal(30,0),
distinct_ratio decimal(20,10),
table_rows decimal(30,0),
distinct_ratio_null decimal(20,10),
Zg1 decimal(30,5),
tallysql nvarchar(max)
);
DECLARE @SchemaName sysname, @TableName sysname, @ColumnName sysname;
declare colcur cursor local fast_forward for
SELECT
sc.name,
t.name,
c.name
FROM
sys.tables t
JOIN sys.schemas sc ON sc.schema_id = t.schema_id
JOIN sys.columns c ON c.object_id = t.object_id
JOIN sys.types ty ON ty.user_type_id = c.user_type_id
WHERE
sc.name != 'cdc'
AND sc.name != 'history'
AND t.name not like '#%' and t.name != 'awsdms_truncation_safeguard'
AND c.is_identity = 0
AND c.is_computed = 0
AND c.is_filestream = 0
AND c.is_xml_document = 0
AND t.is_memory_optimized = 0
AND c.max_length != -1 -- skip (n)varchar(max) columns
AND c.max_length < 200 -- skip long text columns
AND ty.user_type_id NOT IN (34, 35, 99)
AND OBJECTPROPERTY(t.object_id, 'IsTable') = 1
AND (@pSchemaName IS NULL OR sc.name = @pSchemaName)
AND (@pTableName IS NULL OR t.name = @pTableName)
AND (@pColumnName IS NULL OR c.name = @pColumnName)
ORDER BY
sc.name, t.name, c.name
OPTION(RECOMPILE);
open colcur
fetch next from colcur into @SchemaName, @TableName, @ColumnName
while @@fetch_status = 0
begin
if @Debug = 1
print '--' + quotename(@SchemaName) + '.' + quotename(@TableName) + '.' + quotename(@ColumnName)
select
@TableRowCount = sum(rows)
from
sys.partitions p
join sys.tables t on p.object_id = t.object_id
join sys.schemas sc on sc.schema_id = t.schema_id
where
sc.name = @SchemaName
and t.name = @TableName;
/*
SamplePercent:
if the parameter is -1, use the average of the sample ratio of all stats on the table
if the parameter is 0, use the max sample ratio of all stats on the table
otherwise use the sample percent that was supplied
*/
select
@SamplePercent =
case
when @SamplePercent = -1 then isnull(nullif(cast(round(avg((100. * sp.rows_sampled) / sp.unfiltered_rows),0) as int),0),1)
when @SamplePercent = 0 then isnull(nullif(cast(round(max((100. * sp.rows_sampled) / sp.unfiltered_rows),0) as int),0),1)
else @SamplePercent
end
from sys.tables t
join sys.stats s on t.object_id = s.object_id
join sys.columns c on t.object_id = c.object_id
join sys.schemas sc on sc.schema_id = t.schema_id
cross apply sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
where
s.has_filter = 0
and t.name = @TableName
and sc.name = @SchemaName
and c.name = @ColumnName
and index_col(sc.name + '.' + t.name, s.stats_id, 1) = @ColumnName
-- Generate the base distribution histogram: value, freq(value)
set @tallysql =
'select
[key] = ' + quotename(@ColumnName) + ',
x = dense_rank() over (order by ' + quotename(@ColumnName) + '),
f = count_big(1)
from ' + quotename(@SchemaName) + '.' + quotename(@TableName) + '
tablesample (' + cast(@SamplePercent as nvarchar(3)) + ' percent)
where ' + quotename(@ColumnName) + ' is not null
group by ' + quotename(@ColumnName) + ';'
set @sql = 'insert into #tmp_tally_table ' + @tallysql;
if @Debug = 1
print @sql;
else
exec sp_executesql @sql;
/*
Calculate skewness stats:
b1 -> population skewness
G1 -> sample skewness (corrects for sample bias)
ses -> standard error of skew. This is a standard deviation calculation on the G1
Zg1 -> test statistic which provides more meaning to the skewness calculation
*/
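-- For reference, the formulas implemented below (see the Refs links above),
-- where mean = sxf/n, m2 = SUM(f*(x - mean)^2)/n and m3 = SUM(f*(x - mean)^3)/n:
--   b1  = m3 / m2^(3/2)
--   G1  = b1 * SQRT(n*(n-1)) / (n-2)
--   ses = SQRT( 6*n*(n-1) / ((n-2)*(n+1)*(n+3)) )
--   Zg1 = G1 / ses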
set @sql =
'insert #table_analysis
(
schema_name,
table_name,
column_name,
b1,
G1,
ses,
distinct_values,
total_values,
distinct_ratio,
table_rows,
distinct_ratio_null,
Zg1,
tallysql
)
select
''' + @SchemaName + ''',
''' + @TableName + ''',
''' + @ColumnName + ''',
sk.b1,
sk.G1,
sk.ses,
d,
n,
(100. * d)/n,
' + cast(@TableRowCount as nvarchar(255)) + ',
(100.*d)/' + cast(@TableRowCount as nvarchar(255)) + ',
zG1 = case sk.ses when 0 then sk.G1 else sk.G1/sk.ses end,
''' + @tallysql + '''
from (
select
b1 =
case
when (power(m2,1.5)) = 0 then 0
else m3 / (power(m2,1.5))
end,
G1 =
case
when n <= 2 then 0
else
(sqrt(1.*(n*(n-1)))/(n-2)) *
case
when (power(m2,1.5)) = 0 then 0
else m3 / (power(m2,1.5))
end
end,
ses = case
when ((n-2.)*(n+1.)*(n+3.)) = 0 then 0
else sqrt(((1.*(6.*n)*(n-1.)))/((n-2.)*(n+1.)*(n+3.)))
end,
d,n
from (
select
n,
d,
m2 = sum(power((x-(sxf/n)),2)*f)/n,
m3 = sum(power((x-(sxf/n)),3)*f)/n
from (
select
x,
f,
sxf = 1. * sum(x*f) over(),
n = sum(f) over(),
d = count(x) over()
from #tmp_tally_table
) base
group by n,d
) agg
) sk';
if @Debug = 1
print @sql;
else
exec sp_executesql @sql;
set @sql = 'truncate table #tmp_tally_table;';
if @Debug = 1
print @sql;
else
exec sp_executesql @sql;
fetch next from colcur into @SchemaName, @TableName, @ColumnName
end
close colcur;
deallocate colcur;
drop table if exists #tmp_tally_table;
-- The critical value of Zg1 is approximately 2 (This is a two-tailed test of skewness ≠ 0 at roughly the 0.05 significance level)
-- If Zg1 < −2, the population is very likely skewed negatively (though you don’t know by how much).
-- If Zg1 is between −2 and +2, you can’t reach any conclusion about the skewness of the population: it might be symmetric, or it might be skewed in either direction.
-- If Zg1 > 2, the population is very likely skewed positively (though you don’t know by how much).
select
*
from
#table_analysis
where
abs(Zg1) > 2
and distinct_ratio <= 40
order by
abs(Zg1) desc;
--drop table #table_analysis
END

SQL Server: Compressing a Table and/or Indexes

I always forget whether the first syntax compresses the nonclustered indexes as well, so I’m posting it here so I don’t forget again!

This compresses just the clustered index (i.e. the table data):

-- Just clustered index
ALTER TABLE dbo.Table
REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = PAGE);

This compresses all indexes including the clustered index:

-- All indexes including clustered index
ALTER INDEX ALL ON dbo.Table
REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = PAGE);
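
If you’re not sure what compression is already in place, a quick way to check (a simple sketch querying the catalog views) is:

-- Current compression setting for every index and partition of user tables
SELECT
    [table_name] = o.name,
    [index_name] = i.name,
    p.partition_number,
    p.data_compression_desc
FROM sys.partitions p
JOIN sys.objects o ON o.object_id = p.object_id
JOIN sys.indexes i ON i.object_id = p.object_id AND i.index_id = p.index_id
WHERE o.is_ms_shipped = 0
ORDER BY o.name, i.index_id, p.partition_number;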

Tuning SQL Server Queries 101

First, get an actual query execution plan. Look for warnings in the query plan.

Basic

  • First look for large scans and lookups: these can often be resolved by creating a new index or extending an existing one with additional included columns. Seeks are usually preferable to scans.
  • Then look for significant variance between Actual and Estimated row counts: you can provide the optimiser with more accurate information by updating statistics, creating new statistics objects, adding statistics on computed columns, or by breaking the query up into simpler parts. The variance might also be caused by ‘parameter sniffing’ (see the sketch after this list).
  • Then look for expensive operators in the query plan, especially those that consume memory such as sorts and hashes. Sorting can sometimes be avoided by altering/adding indexes.
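
For the statistics point above, a couple of illustrative commands (the table and column names are placeholders, not a prescription):

-- Refresh existing statistics on a table with a full scan
UPDATE STATISTICS dbo.Orders WITH FULLSCAN;

-- Create a new single-column statistics object on a filter column
CREATE STATISTICS st_Orders_Status ON dbo.Orders (Status) WITH FULLSCAN;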

More Advanced

Joe Sack has an excellent walkthrough here: The Case of the Cardinality Estimate Red Herring

SQLFrontline Freemium Edition

I’ve recently changed what I was calling the ‘Demo’ version of SQLFrontline to a ‘Freemium’ model. The demo version only displayed one recommendation result in each of the four severity categories (Critical, High, Medium, Info).

The free version obviously does not include all the features of the paid premium version, but it still provides useful recommendations, covering 40 checks.

Both use the same lightweight metadata collection.

The Free Version:

  • Performs 40 checks (out of 350+), but doesn’t show all affected objects if a check results in a recommendation
  • Deletes all collected metadata after collection
  • No reminder list is shown
  • Does not display a list of issues that are no longer present since last collection
  • Sends a single email for all servers
  • No database specific recommendations are made
  • Can only collect metadata for a limited number of servers

The Premium Version:

  • 350+ checks performed across the categories of Configuration, Reliability, Performance, Security, Server Info, Table/Index Design
  • New checks are constantly being added
  • Reminder list of recommendations that have been made previously and not yet fixed
  • List of issues fixed compared to the last collection
  • Can choose how long to store collected metadata, so that point in time reports can be made, along with automated estimates of DB growth over time
  • Can send 1 email per server or a single email for all servers
  • Ability to ‘mute’ recommendations on an individual basis, or entire check (for non-critical checks)
  • No practical limit on the number of servers

If you want to try it out, click this link to request a free access token.

Once you have an access token, here’s how to run it: How to run SQLFrontline

Don’t Embed SQL into SSRS Reports

Reasons not to embed SQL in SSRS reports (.rdl) and create stored procedures instead:

  • Easier to version control in SCC
  • Better visibility in the database
  • Easier to debug
  • Easier to fix any performance problems (and to hot fix, if necessary)
  • Easier to re-use functionality (which is often lengthy to produce) with other systems/APIs etc
  • Stored Procedures get given a name that accurately describes what they do
  • Easier to gather execution statistics
  • Easy to see what parameters are used to filter
  • Can be secured with explicit permissions
  • Easier to write automated testing against a stored procedure

All seem fairly obvious, but it’s surprising how many people still embed SQL into SSRS reports.

Date and Time Dimension

Almost every fact table in a data warehouse uses a date (or calendar) dimension, because most measurements are defined at specific points in time. A flexible calendar date dimension is at the heart of most data warehouse systems; it provides easy navigation of a fact table through familiar dates, such as weeks, months, fiscal periods and special days (today, weekends, holidays, etc.).

I’ve created a date dimension generator, available here on GitHub.

It targets SQL Server, but should be easy to convert to other RDBMS.

It features:

  • User defined start and end dates
  • Computed Easter dates (for years 1901 to 2099)
  • Computed Chinese New year dates for years 1971 to 2099.
  • Computed public holidays for US, UK, Canada, Ireland, Malta, Philippines, Australia (with state specific for WA, NSW, QLD, SA, VIC).
  • Date labels in US, UK and ISO formats.

Things to Note:

  1. The [TodayFlag] needs to be updated once per day by a scheduled task (timezone dependent: it might need a flag for each timezone). A minimal sketch of this update appears after this list.

  2. If you use an unusual Fiscal year (say 5-4-4), it will need to be loaded from an external source (such as an Excel/Google spreadsheet).

  3. The precise start date of the month of Ramadan is set by proclamation, so these dates need to be added year by year. It is possible to calculate them, but the result can be a day out and can vary by region.

    https://travel.stackexchange.com/questions/46148/how-to-calculate-when-ramadan-finishes

    https://en.wikipedia.org/wiki/Ramadan_%28calendar_month%29
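
For point 1, here’s a minimal sketch of the daily update, assuming the generated dimension table is dbo.DimDate with DateValue and TodayFlag columns (adjust the names to match the generator’s output):

-- Clear yesterday's flag, then set today's (uses the server's local date)
UPDATE dbo.DimDate SET TodayFlag = 0 WHERE TodayFlag = 1;
UPDATE dbo.DimDate SET TodayFlag = 1 WHERE DateValue = CAST(GETDATE() AS date);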

Babelfish for PostgreSQL

This has the capacity to be huge:

Babelfish for PostgreSQL is an Apache-2.0 open source project that adds a Microsoft SQL Server-compatible end-point to PostgreSQL to enable your PostgreSQL database to understand the SQL Server wire protocol and commonly used SQL Server commands. With Babelfish, applications that were originally built for SQL Server can work directly with PostgreSQL, with little to no code changes, and without changing database drivers.

Do You Name All Your SQL Server Database Constraints?

If you define a constraint without explicitly giving it a name, SQL Server will generate one for you.
You know the ones; they look something like this: PK__MY_TABLE__3213E83FA7739BB4.

Why might that be a bad thing? It makes writing deployment scripts harder because you won’t know up front the names of constraints you might want to refer to.
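
The fix is simply to name constraints explicitly when you create them, for example (the table and column names are illustrative):

ALTER TABLE dbo.MyTable
    ADD CONSTRAINT PK_MyTable PRIMARY KEY CLUSTERED (Id);

ALTER TABLE dbo.MyTable
    ADD CONSTRAINT DF_MyTable_CreatedDate DEFAULT (GETDATE()) FOR CreatedDate;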

Michael J Swart describes a query to discover the system-generated names in your databases (shown here with a small modification):

SELECT
    [Schema] = SCHEMA_NAME(o.schema_id),
    [System Generated Name] = OBJECT_NAME(o.object_id),
    [Parent Name] = OBJECT_NAME(o.parent_object_id),
    [Object Type] = o.type_desc
FROM
    sys.objects o
    JOIN sys.sysconstraints c ON o.object_id = c.constid
WHERE
    (status & 0x20000) > 0
    and o.is_ms_shipped = 0

According to the sys.sysconstraints documentation page:

This SQL Server 2000 system table is included as a view for backward compatibility. We recommend that you use the current SQL Server system views instead. To find the equivalent system view or views, see Mapping System Tables to System Views (Transact-SQL). This feature will be removed in a future version of Microsoft SQL Server. Avoid using this feature in new development work, and plan to modify applications that currently use this feature.

You can query the same information by using the individual views unioned together:

SELECT
    [Schema] = SCHEMA_NAME(schema_id),
    [System Generated Name] = OBJECT_NAME(object_id),
    [Parent Name] = OBJECT_NAME(parent_object_id),
    [Object Type] = type_desc
FROM sys.check_constraints
WHERE is_system_named = 1
 
UNION ALL
 
SELECT
    [Schema] = SCHEMA_NAME(schema_id),
    [System Generated Name] = OBJECT_NAME(object_id),
    [Parent Name] = OBJECT_NAME(parent_object_id),
    [Object Type] = type_desc
FROM sys.default_constraints
WHERE is_system_named = 1
 
UNION ALL
 
SELECT
    [Schema] = SCHEMA_NAME(schema_id),
    [System Generated Name] = OBJECT_NAME(object_id),
    [Parent Name] = OBJECT_NAME(parent_object_id),
    [Object Type] = type_desc
FROM sys.key_constraints
WHERE is_system_named = 1
 
UNION ALL
 
SELECT
    [Schema] = SCHEMA_NAME(schema_id),
    [System Generated Name] = OBJECT_NAME(object_id),
    [Parent Name] = OBJECT_NAME(parent_object_id),
    [Object Type] = type_desc
FROM sys.foreign_keys 
WHERE is_system_named = 1

SQLFrontline: Server Overview

Have you just been given a bunch of SQL Servers that you’re now responsible for? Do you want a really fast overview of each server’s hardware, SQL Server version and service pack level, configuration, database sizes and usage, performance problems, and weak passwords?

Want the results emailed to you in a prioritised, easy to read format?

SQLFrontline can do this with just a few commands. Behind the scenes it runs hundreds of lightweight metadata collection queries against all the specified servers/databases (no user data is collected). SQLFrontline currently performs 300+ checks across the categories of Reliability, Performance, Configuration, Security and Database Design.

SQLFrontline Case Study: Failing Backups

Over the course of 20 years dealing with SQL Server, I’ve come across failing backups more times than I’d like to recall. (And that’s not counting the times there were no backups set up in the first place!)

In these cases, backups were set up and checked to be working, but at a later date backup failures went unreported for several reasons (non-exhaustive):

  1. Configuration on the SMTP server changed: for example, the allowed IP whitelist for forwarding or its permissions changed, or the SMTP server itself was replaced.
  2. Virus scanner configuration changed, preventing emails from being sent.
  3. AD group permissions changed, or SQL Server service identities changed.

In all these cases, backups were failing but no one was being alerted, and no one was periodically checking the SQL Agent logs.

SQLFrontline checks for no backups in the last 7 days, backups done without compression (compressed backups take up less space obviously, but are also faster to backup and restore), backups done without verifying page checksums, and backups done without encryption (if your version of SQL Server supports it). It also checks that you are periodically running DBCC CHECKDB to maintain database integrity, and whether any data corruption has been detected (automatically repaired or otherwise).
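
As an aside, a typical backup command covering the compression and checksum points above might look like this (the database name and path are placeholders):

BACKUP DATABASE MyDatabase
TO DISK = N'D:\Backups\MyDatabase_Full.bak'
WITH COMPRESSION, CHECKSUM, STATS = 10;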

SQLFrontline currently performs 62 Reliability checks on each SQL Server you monitor, with 300+ checks performed across the categories of Reliability, Performance, Security, Configuration and Database Design.

Side Note: Another thing to consider is, do you delete older backups BEFORE making sure the latest backup succeeded? If so, you might end up with no recent local backups at all when your backup job starts failing… If you use Ola Hallengren’s maintenance solution scripts, this check is performed correctly for you.