Finding Columns with Skewed Data

Queries with parameter-sensitive plans can perform poorly when an inappropriate cached query plan is reused.

Even if your statistics are up to date, parameter-sensitive plans can be caused by skewed data,
so performing a data-skew analysis can identify which filter columns might be involved in poor query plans.
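
To see why skew matters, here's a minimal hypothetical sketch (dbo.Orders, dbo.GetOrdersByStatus and the Status values are invented for illustration). A plan compiled for a rare parameter value can be cached and then reused for a common value, where it performs badly:

-- Hypothetical: suppose 99% of dbo.Orders rows have Status = 'Complete'
-- and only 1% have Status = 'Pending'.
CREATE OR ALTER PROCEDURE dbo.GetOrdersByStatus
(
    @Status varchar(20)
)
AS
BEGIN
    -- If the plan is first compiled for 'Pending' (few rows), the cached
    -- index-seek plan may be reused for 'Complete' (most rows), where a
    -- scan would have been the better choice.
    SELECT OrderId, OrderDate, Status
    FROM dbo.Orders
    WHERE Status = @Status;
END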

I’ve adapted the code found here (see the ‘Adapted from’ link in the header comments below) into a SQL Server stored procedure that can be run across an entire database, a schema, a single table, or just a single column.

It should be relatively easy to convert this to other RDBMSs.

I’ve run it against the StackOverflow 2013 downloadable database (approximately 50 GB). Here’s the stored procedure:

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- Note: does not use Welford's numerically stable one-pass algorithm;
-- this should not be a problem in the majority of cases.
--
-- Can be run for all tables' columns:
-- exec dbo.CalculateSkewness NULL, NULL, NULL, 50, 0
--
-- or all tables' columns in a schema:
-- exec dbo.CalculateSkewness 'dbo', NULL, NULL, 50, 0
--
-- or all columns in a single table:
-- exec dbo.CalculateSkewness 'dbo', 'MyTable', NULL, 50, 0
--
-- or just a single column:
-- exec dbo.CalculateSkewness 'dbo', 'MyTable', 'MyColumn', 50, 0
--
-- Might take a while on a large database.
--
-- Adapted from:
-- https://gitlab.com/swasheck/statistics-scripts/blob/master/table%20analysis.sql
--
-- Refs:
-- https://swasheck.wordpress.com/2016/04/06/skewed-data-finding-the-columns/
-- https://brownmath.com/stat/shape.htm
-- https://learnsql.com/blog/high-performance-statistical-queries-skewness-kurtosis/
--
--
CREATE OR ALTER PROCEDURE dbo.CalculateSkewness
(
@pSchemaName sysname = NULL,
@pTableName sysname = NULL,
@pColumnName sysname = NULL,
@SamplePercent int = -1,
@Debug bit = 1
)
AS
BEGIN
SET NOCOUNT ON;
SET XACT_ABORT ON;
-- Clamp @SamplePercent to the range -1 to 100 (-1 and 0 have special meanings, resolved per table below)
if @SamplePercent < 0
SET @SamplePercent = -1;
else
if @SamplePercent > 100
SET @SamplePercent = 100;
declare @TableRowCount bigint;
declare @tallysql nvarchar(max);
declare @sql nvarchar(max);
declare @ResolvedSamplePercent int; -- the sample percent actually used for the current column
drop table if exists #tmp_tally_table;
drop table if exists #table_analysis;
create table #tmp_tally_table
(
[key] nvarchar(max),
x decimal(30,0),
f decimal(30,0)
);
create table #table_analysis
(
schema_name sysname,
table_name sysname,
column_name sysname,
b1 decimal(30,5),
G1 decimal(30,5),
ses decimal(30,5),
distinct_values decimal(30,0),
total_values decimal(30,0),
distinct_ratio decimal(20,10),
table_rows decimal(30,0),
distinct_ratio_null decimal(20,10),
Zg1 decimal(30,5),
tallysql nvarchar(max)
);
DECLARE @SchemaName sysname, @TableName sysname, @ColumnName sysname;
declare colcur cursor local fast_forward for
SELECT
sc.name,
t.name,
c.name
FROM
sys.tables t
JOIN sys.schemas sc ON sc.schema_id = t.schema_id
JOIN sys.columns c ON c.object_id = t.object_id
JOIN sys.types ty ON ty.user_type_id = c.user_type_id
WHERE
sc.name != 'cdc'
AND sc.name != 'history'
AND t.name not like '#%' and t.name != 'awsdms_truncation_safeguard'
AND c.is_identity = 0
AND c.is_computed = 0
AND c.is_filestream = 0
AND c.is_xml_document = 0
AND t.is_memory_optimized = 0
AND ty.max_length != -1 -- skip (n)varchar(max) columns
AND ty.max_length < 200 -- skip long text columns
AND ty.user_type_id NOT IN (34, 35, 99) -- skip image, text, ntext
AND OBJECTPROPERTY(t.object_id, 'IsTable') = 1
AND (@pSchemaName IS NULL OR sc.name = @pSchemaName)
AND (@pTableName IS NULL OR t.name = @pTableName)
AND (@pColumnName IS NULL OR c.name = @pColumnName)
ORDER BY
sc.name, t.name, c.name
OPTION(RECOMPILE);
open colcur
fetch next from colcur into @SchemaName, @TableName, @ColumnName
while @@fetch_status = 0
begin
if @Debug = 1
print '--' + quotename(@SchemaName) + '.' + quotename(@TableName) + '.' + quotename(@ColumnName)
select
@TableRowCount = sum(rows)
from
sys.partitions p
join sys.tables t on p.object_id = t.object_id
join sys.schemas sc on sc.schema_id = t.schema_id
where
sc.name = @SchemaName
and t.name = @TableName;
/*
SamplePercent:
if the parameter is -1, use the average sample ratio of all stats on the table
if the parameter is 0, use the max sample ratio of all stats on the table
otherwise use the sample percent that was supplied
(resolved into @ResolvedSamplePercent so that -1 and 0 are re-evaluated for every
column, rather than being overwritten on the first cursor iteration)
*/
select
@ResolvedSamplePercent =
case
when @SamplePercent = -1 then isnull(nullif(cast(round(avg((100. * sp.rows_sampled) / sp.unfiltered_rows),0) as int),0),1)
when @SamplePercent = 0 then isnull(nullif(cast(round(max((100. * sp.rows_sampled) / sp.unfiltered_rows),0) as int),0),1)
else @SamplePercent
end
from sys.tables t
join sys.stats s on t.object_id = s.object_id
join sys.columns c on t.object_id = c.object_id
join sys.schemas sc on sc.schema_id = t.schema_id
cross apply sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
where
s.has_filter = 0
and t.name = @TableName
and sc.name = @SchemaName
and c.name = @ColumnName
and index_col(sc.name + '.' + t.name, s.stats_id, 1) = @ColumnName
-- Generate the base distribution histogram: value, rank(value), freq(value)
set @tallysql =
'select
[key] = ' + quotename(@ColumnName) + ',
x = dense_rank() over (order by ' + quotename(@ColumnName) + '),
f = count_big(1)
from ' + quotename(@SchemaName) + '.' + quotename(@TableName) + '
tablesample (' + cast(@ResolvedSamplePercent as nvarchar(3)) + ' percent)
where ' + quotename(@ColumnName) + ' is not null
group by ' + quotename(@ColumnName) + ';'
set @sql = 'insert into #tmp_tally_table ' + @tallysql;
if @Debug = 1
print @sql;
else
exec sp_executesql @sql;
/*
Calculate skewness stats:
b1 -> population skewness
G1 -> sample skewness (corrects for sample bias)
ses -> standard error of skew. This is a standard deviation calculation on the G1
Zg1 -> test statistic which provides more meaning to the skewness calculation
*/
set @sql =
'insert #table_analysis
(
schema_name,
table_name,
column_name,
b1,
G1,
ses,
distinct_values,
total_values,
distinct_ratio,
table_rows,
distinct_ratio_null,
Zg1,
tallysql
)
select
''' + @SchemaName + ''',
''' + @TableName + ''',
''' + @ColumnName + ''',
sk.b1,
sk.G1,
sk.ses,
d,
n,
(100. * d)/n,
' + cast(@TableRowCount as nvarchar(255)) + ',
(100.*d)/' + cast(@TableRowCount as nvarchar(255)) + ',
zG1 = case sk.ses when 0 then sk.G1 else sk.G1/sk.ses end,
''' + @tallysql + '''
from (
select
b1 =
case
when (power(m2,1.5)) = 0 then 0
else m3 / (power(m2,1.5))
end,
G1 =
case
when n <= 2 then 0
else
(sqrt(1.*(n*(n-1)))/(n-2)) *
case
when (power(m2,1.5)) = 0 then 0
else m3 / (power(m2,1.5))
end
end,
ses = case
when ((n-2.)*(n+1.)*(n+3.)) = 0 then 0
else sqrt(((1.*(6.*n)*(n-1.)))/((n-2.)*(n+1.)*(n+3.)))
end,
d,n
from (
select
n,
d,
m2 = sum(power((x-(sxf/n)),2)*f)/n,
m3 = sum(power((x-(sxf/n)),3)*f)/n
from (
select
x,
f,
sxf = 1. * sum(x*f) over(),
n = sum(f) over(),
d = count(x) over()
from #tmp_tally_table
) base
group by n,d
) agg
) sk';
if @Debug = 1
print @sql;
else
exec sp_executesql @sql;
set @sql = 'truncate table #tmp_tally_table;';
if @Debug = 1
print @sql;
else
exec sp_executesql @sql;
fetch next from colcur into @SchemaName, @TableName, @ColumnName
end
close colcur;
deallocate colcur;
drop table if exists #tmp_tally_table;
-- The critical value of Zg1 is approximately 2 (This is a two-tailed test of skewness ≠ 0 at roughly the 0.05 significance level)
-- If Zg1 < −2, the population is very likely skewed negatively (though you don’t know by how much).
-- If Zg1 is between −2 and +2, you can’t reach any conclusion about the skewness of the population: it might be symmetric, or it might be skewed in either direction.
-- If Zg1 > 2, the population is very likely skewed positively (though you don’t know by how much).
select
*
from
#table_analysis
where
abs(Zg1) > 2
and distinct_ratio <= 40
order by
abs(Zg1) desc;
--drop table #table_analysis
END
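
For example, to analyse every column of a single table at a fixed 50 percent sample rate (dbo.Posts is a table in the StackOverflow schema; substitute your own schema, table and sample rate):

-- Print the generated SQL without executing it (@Debug = 1):
EXEC dbo.CalculateSkewness 'dbo', 'Posts', NULL, 50, 1;

-- Run the analysis and list the skewed columns (@Debug = 0):
EXEC dbo.CalculateSkewness 'dbo', 'Posts', NULL, 50, 0;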

Solving the Monty Hall Problem using Bayes’ Theorem

The ‘Monty Hall’ problem is best described by the Wikipedia entry:

The Monty Hall problem is a probability puzzle, loosely based on the American television game show Let’s Make a Deal and named after its original host, Monty Hall. The problem was originally posed (and solved) in a letter by Steve Selvin to the American Statistician in 1975. It became famous as a question from reader Craig F. Whitaker’s letter quoted in Marilyn vos Savant’s “Ask Marilyn” column in Parade magazine in 1990.

Suppose you’re on a game show, and you’re given the choice of three doors A, B and C.

Behind one door is a car; behind the other two are goats. You pick a door, say door A, and the host, who knows what’s behind the doors, opens another door, say door B, which has a goat. He then says to you, “Do you want to change your choice to door C?”

Is it to your advantage to switch? (It is!)

Many readers of vos Savant’s column refused to believe switching is beneficial and rejected her explanation. After the problem appeared in Parade, approximately 10,000 readers, including nearly 1,000 with PhDs, wrote to the magazine, most of them calling vos Savant wrong.[4] Even when given explanations, simulations, and formal mathematical proofs, many people still did not accept that switching is the best strategy.[5] Paul Erdős, one of the most prolific mathematicians in history, remained unconvinced until he was shown a computer simulation demonstrating vos Savant’s predicted result.[6]

First we need to define some notation:

A, B – events
P(A) – the probability of event A occurring
P(B) – the probability of event B occurring
P(A | B) – the probability of event A occurring, given that event B has already occurred
P(B | A) – the probability of event B occurring, given that event A has already occurred

Bayes’ Theorem is defined as:

P(A | B) = P(B | A)P(A) / P(B)

The following reasoning is from Julian Havil’s book “Impossible?”

Assign symbols to the events:

A – the event “car is behind door A”
B – the event “car is behind door B”
C – the event “car is behind door C”
MA – the event “Monty opens door A” … and similarly for MB, MC

Assume door A is chosen initially by the player, so Monty can open door B or C. If the car is behind A, Monty picks B or C at random; if it is behind B, he cannot open B; if it is behind C, he must open B:

P(MB | A) = ½,   P(MB | B) = 0,   P(MB | C) = 1

So, since A, B and C are mutually exclusive and exhaustive events, by the law of total probability:

P(MB) = P(MB | A)P(A) + P(MB | B)P(B) + P(MB | C)P(C) = ½ x ⅓ + 0 x ⅓ + 1 x ⅓ = ½ 

Now, the player can stick or change. If they stick with door A, their probability of winning the car is:

P(A | MB) = P(MB | A)P(A) / P(MB) = (½ x ⅓) / ½ = ⅓

If they switch to door C, their probability of winning the car is:

P(C | MB) = P(MB | C)P(C) / P(MB) = (1 x ⅓) / ½ = ⅔

So switching doubles your chance of winning the car, from ⅓ to ⅔.
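
Like Erdős, you may find a simulation more convincing. Here’s a minimal Monte Carlo sketch in T-SQL (the door numbering and trial count are arbitrary choices). The key observation is that if the player always picks door 1, sticking wins exactly when the car is behind door 1 and switching wins otherwise, so Monty’s choice of door doesn’t need to be simulated:

SET NOCOUNT ON;
DECLARE @trials int = 100000;

WITH trials AS
(
    -- one row per game, with the car behind a random door (1, 2 or 3)
    SELECT TOP (@trials)
        car = ABS(CHECKSUM(NEWID())) % 3 + 1
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
SELECT
    stick_win_pct  = AVG(CASE WHEN car = 1 THEN 100. ELSE 0 END),   -- ~33.3
    switch_win_pct = AVG(CASE WHEN car <> 1 THEN 100. ELSE 0 END)   -- ~66.7
FROM trials;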

.NET: Disable Insecure TLS protocols

TLS 1.1 and TLS 1.0 (and lower) protocols are insecure and should no longer be used.

For .NET 4.7 or later, you do not need to set System.Net.ServicePointManager.SecurityProtocol. The default value (SecurityProtocolType.SystemDefault) allows the operating system to use whatever versions it has been configured for, including any new versions that may not have existed at the time your application was created.

If you want to code this explicitly in .NET then, rather than specifying the allowed protocols, disable the disallowed protocols before making any connections:

// TLS must be 1.2 or greater. Disable SSL3, TLS1.0 and TLS1.1 [Note: this is the default behaviour for .NET 4.7 or later]
ServicePointManager.SecurityProtocol &= (~SecurityProtocolType.Ssl3 & ~SecurityProtocolType.Tls & ~SecurityProtocolType.Tls11);


Installing TensorFlow with GPU support on Windows 10

If you have a high-end NVIDIA graphics card and you’re investigating data science with Keras + TensorFlow, then you obviously want TensorFlow to take advantage of your GPU (training times for deep neural networks can be 10 to 15 times faster, even when compared to the latest CPUs).

Getting it all working can be tricky: I found this guide that explains the steps: Installing TensorFlow with GPU on Windows 10

Here’s another: How to run TensorFlow with GPU on Windows 10 in a Jupyter Notebook

SQLDiagCmd Updated

I’ve updated SQLDiagCmd, my standalone executable for running any or all of Glenn Berry’s excellent SQL Server DMV diagnostic scripts.

As well as being able to target multiple servers and multiple databases, it now also has the option to exclude specified queries from being executed (such as those that might take some time to execute on very large databases or busy server instances).

The source code is available on GitHub and you can download the executable directly from these links:

SQLDiagCmd.zip

SQLDiagUI.zip

A recursive C# function

I was searching through email today looking for a LINQPad snippet that a colleague, James Miles, wrote some time ago; one we used to generate the scripts for a point-in-time restore (database plus transaction logs) of a production SQL Server database after IT had a little SAN mishap!

In doing so, I came across this gem from James: Solving Puzzles in C#: Poker Hands, which is not just a great example of writing a recursive function but of problem solving in general. [Where I used to work, we often had a Friday puzzle, for which I tried to come up with or find puzzles that wouldn’t be easy to solve by brute force. This was one of the many times I was thwarted by James and others!]