Finding Columns with Skewed Data

Queries with parameter-sensitive plans can perform poorly when an inappropriate query plan is used.

Even if your statistics are up to date, parameter-sensitive plans can be caused by skewed data,
so performing a data skew analysis can identify which filter columns might be involved in poor query plans.

I’ve adapted the code found here into a SQL Server stored procedure that can be run across an entire database, a schema, a single table or just a single column.

It should be relatively easy to convert this to other RDBMSs.

Here’s the stored procedure, which I’ve run against the StackOverflow 2013 downloadable database (approximately 50GB):

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- Note: does not use Welford's numerically stable one-pass algorithm.
-- This should not be a problem in the majority of cases.
--
-- Can be run for all tables' columns:
-- exec dbo.CalculateSkew NULL, NULL, NULL, 50, 0
--
-- or all tables' columns in a schema:
-- exec dbo.CalculateSkew 'dbo', NULL, NULL, 50, 0
--
-- or all columns in a single table:
-- exec dbo.CalculateSkew 'dbo', 'MyTable', NULL, 50, 0
--
-- or just a single column:
-- exec dbo.CalculateSkew 'dbo', 'MyTable', 'MyColumn', 50, 0
--
-- Might take a while on a large database.
--
-- Adapted from:
-- https://gitlab.com/swasheck/statistics-scripts/blob/master/table%20analysis.sql
--
-- Refs:
-- https://swasheck.wordpress.com/2016/04/06/skewed-data-finding-the-columns/
-- https://brownmath.com/stat/shape.htm
-- https://learnsql.com/blog/high-performance-statistical-queries-skewness-kurtosis/
--
--
CREATE OR ALTER PROCEDURE dbo.CalculateSkew
(
@pSchemaName sysname = NULL,
@pTableName sysname = NULL,
@pColumnName sysname = NULL,
@SamplePercent int = -1,
@Debug bit = 1
)
AS
BEGIN
SET NOCOUNT ON;
SET XACT_ABORT ON;
-- Clamp @SamplePercent to -1..100 (-1 and 0 have special meanings; see below)
if @SamplePercent < 0
SET @SamplePercent = -1;
else
if @SamplePercent > 100
SET @SamplePercent = 100;
declare @TableRowCount bigint;
declare @tallysql nvarchar(max);
declare @sql nvarchar(max);
drop table if exists #tmp_tally_table;
drop table if exists #table_analysis;
create table #tmp_tally_table
(
[key] nvarchar(max),
x decimal(30,0),
f decimal(30,0)
);
create table #table_analysis
(
schema_name sysname,
table_name sysname,
column_name sysname,
b1 decimal(30,5),
G1 decimal(30,5),
ses decimal(30,5),
distinct_values decimal(30,0),
total_values decimal(30,0),
distinct_ratio decimal(20,10),
table_rows decimal(30,0),
distinct_ratio_null decimal(20,10),
Zg1 decimal(30,5),
tallysql nvarchar(max)
);
DECLARE @SchemaName sysname, @TableName sysname, @ColumnName sysname;
declare colcur cursor local fast_forward for
SELECT
sc.name,
t.name,
c.name
FROM
sys.tables t
JOIN sys.schemas sc ON sc.schema_id = t.schema_id
JOIN sys.columns c ON c.object_id = t.object_id
JOIN sys.types ty ON ty.user_type_id = c.user_type_id
WHERE
sc.name != 'cdc'
AND sc.name != 'history'
AND t.name not like '#%' and t.name != 'awsdms_truncation_safeguard'
AND c.is_identity = 0
AND c.is_computed = 0
AND c.is_filestream = 0
AND c.is_xml_document = 0
AND t.is_memory_optimized = 0
AND ty.max_length != -1 -- skip (n)varchar(max) columns
AND ty.max_length < 200 -- skip long text columns
AND ty.user_type_id NOT IN (34, 35, 99)
AND OBJECTPROPERTY(t.object_id, 'IsTable') = 1
AND (@pSchemaName IS NULL OR sc.name = @pSchemaName)
AND (@pTableName IS NULL OR t.name = @pTableName)
AND (@pColumnName IS NULL OR c.name = @pColumnName)
ORDER BY
sc.name, t.name, c.name
OPTION(RECOMPILE);
open colcur
fetch next from colcur into @SchemaName, @TableName, @ColumnName
while @@fetch_status = 0
begin
if @Debug = 1
print '--' + quotename(@SchemaName) + '.' + quotename(@TableName) + '.' + quotename(@ColumnName)
select
@TableRowCount = sum(rows)
from
sys.partitions p
join sys.tables t on p.object_id = t.object_id
join sys.schemas sc on sc.schema_id = t.schema_id
where
sc.name = @SchemaName
and t.name = @TableName;
/*
SamplePercent:
if the parameter is -1, use the average of the sample ratio of all stats on the table
if the parameter is 0, use the max sample ratio of all stats on the table
otherwise use the sample percent that was supplied
*/
declare @EffectiveSamplePercent int; -- computed per column, so the -1 / 0 modes work for every table, not just the first
set @EffectiveSamplePercent = null;
select
@EffectiveSamplePercent =
case
when @SamplePercent = -1 then isnull(nullif(cast(round(avg((100. * sp.rows_sampled) / sp.unfiltered_rows),0) as int),0),1)
when @SamplePercent = 0 then isnull(nullif(cast(round(max((100. * sp.rows_sampled) / sp.unfiltered_rows),0) as int),0),1)
else @SamplePercent
end
from sys.tables t
join sys.stats s on t.object_id = s.object_id
join sys.columns c on t.object_id = c.object_id
join sys.schemas sc on sc.schema_id = t.schema_id
cross apply sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
where
s.has_filter = 0
and t.name = @TableName
and sc.name = @SchemaName
and c.name = @ColumnName
and index_col(sc.name + '.' + t.name, s.stats_id, 1) = @ColumnName;
-- Fall back to a full sample if there are no suitable statistics on the column
if @EffectiveSamplePercent is null
set @EffectiveSamplePercent = 100;
-- Generate the base distribution histogram: value, freq(value)
set @tallysql =
'select
[key] = ' + quotename(@ColumnName) + ',
x = dense_rank() over (order by ' + quotename(@ColumnName) + '),
f = count_big(1)
from ' + quotename(@SchemaName) + '.' + quotename(@TableName) + '
tablesample (' + cast(@EffectiveSamplePercent as nvarchar(3)) + ' percent)
where ' + quotename(@ColumnName) + ' is not null
group by ' + quotename(@ColumnName) + ';'
set @sql = 'insert into #tmp_tally_table ' + @tallysql;
if @Debug = 1
print @sql;
else
exec sp_executesql @sql;
/*
Calculate skewness stats:
b1 -> population skewness
G1 -> sample skewness (corrects for sample bias)
ses -> standard error of skew. This is a standard deviation calculation on the G1
Zg1 -> test statistic which provides more meaning to the skewness calculation
*/
set @sql =
'insert #table_analysis
(
schema_name,
table_name,
column_name,
b1,
G1,
ses,
distinct_values,
total_values,
distinct_ratio,
table_rows,
distinct_ratio_null,
Zg1,
tallysql
)
select
''' + @SchemaName + ''',
''' + @TableName + ''',
''' + @ColumnName + ''',
sk.b1,
sk.G1,
sk.ses,
d,
n,
(100. * d)/n,
' + cast(@TableRowCount as nvarchar(255)) + ',
(100.*d)/' + cast(@TableRowCount as nvarchar(255)) + ',
zG1 = case sk.ses when 0 then sk.G1 else sk.G1/sk.ses end,
''' + @tallysql + '''
from (
select
b1 =
case
when (power(m2,1.5)) = 0 then 0
else m3 / (power(m2,1.5))
end,
G1 =
case
when n <= 2 then 0
else
(sqrt(1.*(n*(n-1)))/(n-2)) *
case
when (power(m2,1.5)) = 0 then 0
else m3 / (power(m2,1.5))
end
end,
ses = case
when ((n-2.)*(n+1.)*(n+3.)) = 0 then 0
else sqrt(((1.*(6.*n)*(n-1.)))/((n-2.)*(n+1.)*(n+3.)))
end,
d,n
from (
select
n,
d,
m2 = sum(power((x-(sxf/n)),2)*f)/n,
m3 = sum(power((x-(sxf/n)),3)*f)/n
from (
select
x,
f,
sxf = 1. * sum(x*f) over(),
n = sum(f) over(),
d = count(x) over()
from #tmp_tally_table
) base
group by n,d
) agg
) sk';
if @Debug = 1
print @sql;
else
exec sp_executesql @sql;
set @sql = 'truncate table #tmp_tally_table;';
if @Debug = 1
print @sql;
else
exec sp_executesql @sql;
fetch next from colcur into @SchemaName, @TableName, @ColumnName
end
close colcur;
deallocate colcur;
drop table if exists #tmp_tally_table;
-- The critical value of Zg1 is approximately 2 (This is a two-tailed test of skewness ≠ 0 at roughly the 0.05 significance level)
-- If Zg1 < −2, the population is very likely skewed negatively (though you don’t know by how much).
-- If Zg1 is between −2 and +2, you can’t reach any conclusion about the skewness of the population: it might be symmetric, or it might be skewed in either direction.
-- If Zg1 > 2, the population is very likely skewed positively (though you don’t know by how much).
select
*
from
#table_analysis
where
abs(Zg1) > 2
and distinct_ratio <= 40
order by
abs(Zg1) desc;
--drop table #table_analysis
END
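As a cross-check, the moment calculations inside the dynamic SQL can be sketched in Python. This helper is not part of the procedure, just an illustration operating on the same (value, frequency) pairs the tally table holds:

```python
import math

def skewness_stats(pairs):
    """Mirror of the procedure's maths over (value, frequency) pairs:
    returns (b1, G1, ses, Zg1)."""
    n = sum(f for _, f in pairs)                              # total values
    mean = sum(x * f for x, f in pairs) / n
    m2 = sum((x - mean) ** 2 * f for x, f in pairs) / n       # 2nd central moment
    m3 = sum((x - mean) ** 3 * f for x, f in pairs) / n       # 3rd central moment
    b1 = 0.0 if m2 == 0 else m3 / m2 ** 1.5                   # population skewness
    g1 = 0.0 if n <= 2 else math.sqrt(n * (n - 1)) / (n - 2) * b1  # sample skewness
    denom = (n - 2) * (n + 1) * (n + 3)
    ses = 0.0 if denom == 0 else math.sqrt(6.0 * n * (n - 1) / denom)  # std error of skewness
    zg1 = g1 if ses == 0 else g1 / ses                        # test statistic
    return b1, g1, ses, zg1
```

A symmetric distribution gives b1 = 0, while a heavy right tail pushes Zg1 well above the critical value of 2 used in the final select.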

SQL Server: Script out all indexes in a database

Kendra Little has a gist to script out all indexes, How to Script Out Indexes from SQL Server, but it didn’t include XML or columnstore indexes, so I’ve forked her gist and added a few things to it. I changed the FOR XML/STUFF trick to STRING_AGG() (which is SQL Server 2017 onwards) for no other reason than that I’m not working with any instance versions older than that.

The updated gist is here.

/*****************************************************************************
MIT License, http://www.opensource.org/licenses/mit-license.php
Contact: help@sqlworkbooks.com
Copyright (c) 2018 SQL Workbooks LLC
Copyright (c) 2024 Mitch Wheat
Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the "Software"), to deal in the Software without
restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or
sell copies of the Software, and to permit persons to whom
the Software is furnished to do so, subject to the following
conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
*****************************************************************************/
-- Replaced the FOR XML 'trick' used to concatenate comma-separated values with STRING_AGG()
-- STRING_AGG() is SQL Server 2017 onwards...
-- Added:
-- is_autocreated
-- xml indexes
-- columnstore indexes
-- Currently not showing:
--
-- 1. xml compression by partition (SQL Server 2022 onwards)
-- 2. spatial indexes
-- 3. selective xml indexes
-- 4. NONCLUSTERED HASH indexes
SELECT
DB_NAME() AS [database],
sc.name + N'.' + t.name AS [table],
si.name AS [index],
si.type_desc AS indextype,
CASE si.index_id
WHEN 0 THEN N'/* Heap */'
ELSE
CASE WHEN si.is_primary_key = 1 THEN
N'ALTER TABLE ' + QUOTENAME(sc.name) + N'.' + QUOTENAME(t.name) + N' ADD CONSTRAINT ' + QUOTENAME(si.name) + N' PRIMARY KEY ' +
CASE
WHEN si.index_id > 1 THEN N'NON'
ELSE N''
END + N'CLUSTERED '
ELSE N'CREATE ' +
CASE WHEN si.is_unique = 1 then N'UNIQUE ' ELSE N'' END +
CASE WHEN xi.using_xml_index_id IS NULL AND (si.type = 3) THEN N'PRIMARY ' ELSE N'' END +
CASE
WHEN si.type = 1 THEN N'CLUSTERED'
WHEN si.type = 2 THEN N'NONCLUSTERED'
WHEN si.type = 3 THEN N'XML'
WHEN si.type = 5 THEN N'CLUSTERED COLUMNSTORE'
WHEN si.type = 6 THEN N'NONCLUSTERED COLUMNSTORE'
--WHEN si.type = 4 THEN N'SPATIAL'
--WHEN si.type = 7 THEN N'NONCLUSTERED HASH'
ELSE N''
END +
N' INDEX ' + QUOTENAME(si.name) + N' ON ' + QUOTENAME(sc.name) + N'.' + QUOTENAME(t.name) + N''
END +
CASE
WHEN key_column_list IS NOT NULL AND (si.type != 5) THEN '(' + key_column_list + N')'
WHEN xmlkey_column_list IS NOT NULL AND (si.type= 3) THEN '(' + xmlkey_column_list + N')'
WHEN include_column_list IS NOT NULL AND (si.type = 6) THEN '(' + include_column_list + N')'
ELSE N''
END +
CASE WHEN primary_name IS NOT NULL THEN N' USING XML INDEX ' + primary_name + N' FOR ' + xi.secondary_type_desc COLLATE Latin1_General_CI_AS ELSE N'' END +
CASE WHEN include_column_list IS NOT NULL AND (si.type != 5 AND si.type != 6) THEN N' INCLUDE (' + include_column_list + N')' ELSE N'' END +
CASE WHEN si.filter_definition IS NOT NULL THEN N' WHERE ' + si.filter_definition ELSE N'' END +
/* with clause */
CASE WHEN row_compression_partition_list IS NOT NULL OR page_compression_partition_list IS NOT NULL OR columnstore_compression_partition_list IS NOT NULL
THEN N' WITH (' +
CASE
WHEN row_compression_partition_list IS NOT NULL THEN
N'DATA_COMPRESSION = ROW ' + CASE WHEN psc.name IS NULL THEN N'' ELSE + N' ON PARTITIONS (' + row_compression_partition_list + N')' END
ELSE N''
END +
CASE
WHEN row_compression_partition_list IS NOT NULL AND page_compression_partition_list IS NOT NULL THEN N', '
ELSE N''
END +
CASE
WHEN page_compression_partition_list IS NOT NULL THEN N'DATA_COMPRESSION = PAGE' +
CASE WHEN psc.name IS NULL THEN N'' ELSE + N' ON PARTITIONS (' + page_compression_partition_list + N')' END
ELSE N''
END +
CASE
WHEN row_compression_partition_list IS NOT NULL AND page_compression_partition_list IS NOT NULL AND columnstore_compression_partition_list IS NOT NULL THEN N', '
ELSE N''
END +
CASE
WHEN columnstore_compression_partition_list IS NOT NULL THEN N'DATA_COMPRESSION = COLUMNSTORE' +
CASE WHEN psc.name IS NULL THEN N'' ELSE + N' ON PARTITIONS (' + columnstore_compression_partition_list + N')' END
ELSE N''
END +
CASE
WHEN si.compression_delay IS NOT NULL THEN N', COMPRESSION_DELAY = ' + CAST(si.compression_delay AS NVARCHAR(16))
ELSE N''
END
+ N')'
ELSE N''
END +
CASE
WHEN si.type != 3 THEN
/* ON where? filegroup? partition scheme? */
' ON ' + CASE WHEN psc.name is null
THEN ISNULL(QUOTENAME(fg.name), N'')
ELSE psc.name + N' (' + partitioning_column.column_name + N')'
END
ELSE N''
END
+ N';'
END AS create_statement,
partition_sums.row_count,
si.is_primary_key,
si.fill_factor,
si.is_unique,
si.auto_created AS is_auto_created,
si.is_hypothetical,
stat.user_seeks,
stat.user_scans,
stat.user_lookups,
user_updates,
(SELECT MAX(user_reads) FROM (VALUES (last_user_seek), (last_user_scan), (last_user_lookup)) AS value(user_reads)) AS last_user_read,
last_user_update,
partition_sums.reserved_in_row_GB,
partition_sums.reserved_LOB_GB,
si.allow_page_locks,
si.allow_row_locks,
partition_sums.partition_count,
ISNULL(pf.name, '/* Not partitioned */') AS partition_function,
ISNULL(psc.name, fg.name) AS partition_scheme_or_filegroup,
t.create_date AS table_created_date,
t.modify_date AS table_modify_date
FROM
sys.indexes si
JOIN sys.tables t ON si.object_id = t.object_id
JOIN sys.schemas sc ON t.schema_id = sc.schema_id
LEFT JOIN sys.dm_db_index_usage_stats stat ON stat.database_id = DB_ID() and si.object_id = stat.object_id and si.index_id = stat.index_id
LEFT JOIN sys.partition_schemes psc ON si.data_space_id = psc.data_space_id
LEFT JOIN sys.partition_functions pf ON psc.function_id = pf.function_id
LEFT JOIN sys.filegroups fg ON si.data_space_id = fg.data_space_id
LEFT JOIN sys.xml_indexes xi ON xi.object_id = si.object_id and xi.index_id = si.index_id
LEFT JOIN
(
SELECT
i2.object_id,
i2.index_id,
i2.name as primary_name
FROM sys.xml_indexes i2
) sxi ON sxi.object_id = xi.object_id AND xi.using_xml_index_id = sxi.index_id
/* Key list */
OUTER APPLY (
SELECT STRING_AGG(QUOTENAME(c.name) + CASE ic.is_descending_key WHEN 1 then N' DESC' ELSE N'' END, N', ') WITHIN GROUP (ORDER BY ic.key_ordinal)
FROM sys.index_columns ic
JOIN sys.columns AS c ON ic.column_id = c.column_id and ic.object_id = c.object_id
WHERE ic.object_id = si.object_id
and ic.index_id = si.index_id
and (ic.key_ordinal > 0)
) AS keys ( key_column_list )
/* XML Key list */
OUTER APPLY (
SELECT STRING_AGG(QUOTENAME(c.name) + CASE ic.is_descending_key WHEN 1 then N' DESC' ELSE N'' END, N', ') WITHIN GROUP (ORDER BY ic.key_ordinal)
FROM sys.index_columns ic
JOIN sys.columns AS c ON ic.column_id = c.column_id and ic.object_id = c.object_id
WHERE ic.object_id = si.object_id
and ic.index_id = si.index_id
and (ic.key_ordinal = 0 and is_included_column = 0)
) AS xmlkeys ( xmlkey_column_list )
/* Partitioning Column */
OUTER APPLY (
SELECT MAX(QUOTENAME(c.name)) AS column_name
FROM sys.index_columns ic
JOIN sys.columns AS c ON ic.column_id = c.column_id and ic.object_id = c.object_id
WHERE ic.object_id = si.object_id
and ic.index_id = si.index_id
and ic.partition_ordinal = 1) AS partitioning_column
/* Include list */
OUTER APPLY (
SELECT STRING_AGG(QUOTENAME(c.name), N', ') WITHIN GROUP (ORDER BY c.name)
FROM sys.index_columns ic
JOIN sys.columns AS c ON ic.column_id = c.column_id and ic.object_id = c.object_id
WHERE ic.object_id = si.object_id
and ic.index_id = si.index_id
and ic.is_included_column = 1
) AS includes ( include_column_list )
/* Partitions */
OUTER APPLY (
SELECT
COUNT(*) AS partition_count,
CAST(SUM(ps.in_row_reserved_page_count)*8./1024./1024. AS NUMERIC(32,1)) AS reserved_in_row_GB,
CAST(SUM(ps.lob_reserved_page_count)*8./1024./1024. AS NUMERIC(32,1)) AS reserved_LOB_GB,
SUM(ps.row_count) AS row_count
FROM sys.partitions p
JOIN sys.dm_db_partition_stats AS ps ON p.partition_id = ps.partition_id
WHERE p.object_id = si.object_id
and p.index_id = si.index_id
) AS partition_sums
/* row compression list by partition */
OUTER APPLY (
SELECT STRING_AGG(CAST(p.partition_number AS NVARCHAR(16)), N', ') WITHIN GROUP (ORDER BY p.partition_number)
FROM sys.partitions p
WHERE
p.object_id = si.object_id
and p.index_id = si.index_id
and p.data_compression = 1
) AS row_compression ( row_compression_partition_list )
/* page compression list by partition */
OUTER APPLY (
SELECT STRING_AGG(CAST(p.partition_number AS NVARCHAR(16)), N', ') WITHIN GROUP (ORDER BY p.partition_number)
FROM sys.partitions p
WHERE
p.object_id = si.object_id
and p.index_id = si.index_id
and p.data_compression = 2
) AS page_compression ( page_compression_partition_list )
/* columnstore compression list by partition */
OUTER APPLY (
SELECT STRING_AGG(CAST(p.partition_number AS NVARCHAR(16)), N', ') WITHIN GROUP (ORDER BY p.partition_number)
FROM sys.partitions p
WHERE
p.object_id = si.object_id
and p.index_id = si.index_id
and p.data_compression = 3
) AS columnstore_compression ( columnstore_compression_partition_list )
--/* XML compression list by partition: SQL Server 2022 onwards */
--OUTER APPLY (
-- SELECT STRING_AGG(CAST(p.partition_number AS NVARCHAR(16)), N', ') WITHIN GROUP (ORDER BY p.partition_number)
-- FROM sys.partitions p
-- WHERE
-- p.object_id = si.object_id
-- and p.index_id = si.index_id
-- and p.xml_compression = 1 -- 2022 onwards
-- ) AS xml_compression ( xml_compression_partition_list )
WHERE
si.type IN (0, 1, 2, 3, 5, 6)
ORDER BY
sc.name, t.name, si.index_id
OPTION(RECOMPILE);
GO

SQL Server: Compressing a Table and/or Indexes

I always forget whether the first syntax compresses the NC indexes as well, so posting here so I don’t forget again!

This compresses just the clustered index (i.e. the table data):

-- Just clustered index
ALTER TABLE dbo.MyTable
REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = PAGE);

This compresses all indexes including the clustered index:

-- All indexes including clustered index
ALTER INDEX ALL ON dbo.MyTable
REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = PAGE);

Solving the Monty Hall Problem using Bayes Theorem

The ‘Monty Hall’ problem is best described by the Wikipedia entry:

The Monty Hall problem is a probability puzzle, loosely based on the American television game show Let’s Make a Deal and named after its original host, Monty Hall. The problem was originally posed (and solved) in a letter by Steve Selvin to the American Statistician in 1975. It became famous as a question from reader Craig F. Whitaker’s letter quoted in Marilyn vos Savant‘s “Ask Marilyn” column in Parade magazine in 1990.

Suppose you’re on a game show, and you’re given the choice of three doors A, B and C.

Behind one door is a car; behind the other two are goats. You pick a door, say door A, and the host, who knows what’s behind the doors, opens another door, say door B, which has a goat. He then says to you, “Do you want to change your choice to door C?”

Is it to your advantage to switch? (It is!)

Many readers of vos Savant’s column refused to believe switching is beneficial and rejected her explanation. After the problem appeared in Parade, approximately 10,000 readers, including nearly 1,000 with PhDs, wrote to the magazine, most of them calling vos Savant wrong.[4] Even when given explanations, simulations, and formal mathematical proofs, many people still did not accept that switching is the best strategy.[5] Paul Erdős, one of the most prolific mathematicians in history, remained unconvinced until he was shown a computer simulation demonstrating vos Savant’s predicted result.[6]

First we need to define some notation:

A, B – events
P(A) – the probability of event A occurring
P(B) – the probability of event B occurring
P(A | B) – the probability of event A occurring, given that event B has already occurred
P(B | A) – the probability of event B occurring, given that event A has already occurred

Bayes Theorem is defined as:

P(A | B) = P(B | A)P(A) / P(B)

The following reasoning is from Julian Havil’s book “Impossible?”

Assign symbols to the events:

A – the event “car is behind door A”
B – the event “car is behind door B”
C – the event “car is behind door C”
MA – the event “Monty opens door A” … similarly for MB , MC

Assume door A is chosen initially by the player, so Monty can open door B or C:

P(MB | A) = ½,   P(MB | B) = 0,   P(MB | C) = 1

So, since A, B and C are mutually exclusive events:

P(MB) = P(MB | A)P(A) + P(MB | B)P(B) + P(MB | C)P(C) = ½ x ⅓ + 0 x ⅓ + 1 x ⅓ = ½ 

Now, the player can stick or change. If they stick with door A, their probability of winning the car is:

P(A | MB) = P(MB | A)P(A) / P(MB) = (½ x ⅓) / ½ = ⅓

If they switch to door C, their probability of winning the car is:

P(C | MB) = P(MB | C)P(C) / P(MB) = (1 x ⅓) / ½ = ⅔
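The ⅓ versus ⅔ split is easy to verify empirically. Here is a quick Monte Carlo sketch in Python (an illustration, not part of the original derivation):

```python
import random

def monty_hall(trials: int, switch: bool) -> float:
    """Estimate the win probability when the player sticks or switches."""
    wins = 0
    for _ in range(trials):
        doors = ['A', 'B', 'C']
        car = random.choice(doors)
        pick = 'A'  # the player initially picks door A, as in the derivation
        # Monty opens a door the player didn't pick that hides a goat
        monty = random.choice([d for d in doors if d != pick and d != car])
        if switch:
            # switch to the one remaining unopened door
            pick = next(d for d in doors if d != pick and d != monty)
        wins += (pick == car)
    return wins / trials
```

Sticking converges on about 0.33 and switching on about 0.67, matching P(A | MB) and P(C | MB).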

Performance Improvements in .NET 6

There have been a large number of performance improvements in .NET 6, as evidenced by Stephen Toub’s blog post. Most of the time we don’t care about assembly-level performance optimisations, because the bottleneck is usually accessing some external resource, such as a database or web service.

If you need to benchmark .NET code, take a look at a great tool, BenchmarkDotNet, and also at the book referenced there, Pro .NET Benchmarking, as getting benchmarking correct can sometimes be quite tricky.

If you’re not yet on .NET 6, and you have code in a large loop that you want to squeeze some performance out of (and it won’t make the code harder to understand and maintain!), here are a couple of simple tips:

  • Testing if n is even: replace (n % 2 == 0) with ((n & 1) == 0)
  • Dividing by 2: replace n / 2 by n >> 1 (but be aware of the unsigned / signed behaviour of right shift)
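Both identities are easy to sanity-check. A Python sketch for illustration (the same bit operations exist in C#, but note that Python’s right shift and // both floor, while C#’s integer division truncates toward zero):

```python
def is_even(n: int) -> bool:
    # The lowest bit of an even number is 0, so no division is needed
    return (n & 1) == 0

def half(n: int) -> int:
    # n >> 1 equals n // 2 for non-negative n; for negative n a right
    # shift floors (-5 >> 1 == -3) while C#'s -5 / 2 truncates to -2
    return n >> 1
```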

LINQPad script to Generate SQL Server Database Restore Script from Ola Hallengren’s Backup Solution

Unless you perform regular restores of your database backups, you don’t know that you actually have a valid backup. In a career spanning over 30 years, I’ve seen two occasions where a company was performing backups (or so they thought!) and sending tapes offsite, assuming they were good when in fact the tapes were blank!

The majority of SQL Server installations use Ola Hallengren’s maintenance solution (and certainly all the ones I’ve had anything to do with).

If you are doing regular (5 minutes or less) transaction log backups, a restore might involve applying quite a few transaction logs.

I’ve written a short LINQPad script here which will generate the T-SQL to perform a database restore, either to a point in time or to the latest available, based upon the default locations and naming conventions used by Ola’s backups. It’s differential-backup aware, and creates the multiple transaction log restore statements. It also takes into account backups that are split into separate backup files (which is quite common). You specify the server name, the database name, the root folder where the backups are stored, and either a point in time or the latest.

Disclaimer: Use at your own risk AND test thoroughly!

Example output:

USE [master]
 
RESTORE DATABASE [AdventureWorks] FROM
   DISK = N'C:\temp\Backup\K7\AdventureWorks\FULL\K7_AdventureWorks_FULL_20211118_151558.bak'
 WITH NORECOVERY, REPLACE
 
RESTORE DATABASE [AdventureWorks] FROM
   DISK = N'C:\temp\Backup\K7\AdventureWorks\DIFF\K7_AdventureWorks_DIFF_20211118_152101.bak'
 WITH NORECOVERY
 
RESTORE DATABASE [AdventureWorks] FROM
   DISK = N'C:\temp\Backup\K7\AdventureWorks\LOG\K7_AdventureWorks_LOG_20211118_152226.trn'
 WITH NORECOVERY, STOPAT = '2021-11-21 17:07:22'
 
RESTORE DATABASE [AdventureWorks] WITH RECOVERY
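The heart of the script is choosing which files to restore and in what order. That selection logic can be sketched like this (Python for illustration; the regex and helper are my assumptions based on Ola’s naming convention shown above, and the actual script is a C# LINQPad query):

```python
import re

# File-name pattern following Ola Hallengren's convention, as in the example
# output above: <server>_<db>_<TYPE>_<yyyymmdd>_<hhmmss>.<ext>
PATTERN = re.compile(r'_(FULL|DIFF|LOG)_(\d{8}_\d{6})\.(?:bak|trn)$')

def restore_order(files):
    """Return backup files in restore order: the latest FULL, then the
    latest DIFF taken after it (if any), then every LOG after that."""
    parsed = []
    for f in files:
        m = PATTERN.search(f)
        if m:
            parsed.append((m.group(2), m.group(1), f))  # (timestamp, type, name)
    full = max(p for p in parsed if p[1] == 'FULL')     # timestamps sort lexicographically
    diffs = [p for p in parsed if p[1] == 'DIFF' and p[0] > full[0]]
    base = max(diffs) if diffs else full
    logs = sorted(p for p in parsed if p[1] == 'LOG' and p[0] > base[0])
    return [full[2]] + ([base[2]] if diffs else []) + [p[2] for p in logs]
```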

.NET: Disable Insecure TLS protocols

TLS1.1 and TLS1.0 (and lower) protocols are insecure and should no longer be used.

For .NET 4.7 or later, you do not need to set System.Net.ServicePointManager.SecurityProtocol. The default value (SecurityProtocolType.SystemDefault) allows the operating system to use whatever versions it has been configured for, including any new versions that may not have existed at the time your application was created.

If you want to explicitly code this in .NET, rather than specify the allowed protocols, disable the disallowed protocols before making any connections:

// TLS must be 1.2 or greater. Disable SSL3, TLS1.0 and TLS1.1 [Note: this is the default behaviour for .NET 4.7 or later]
ServicePointManager.SecurityProtocol &= (~SecurityProtocolType.Ssl3 & ~SecurityProtocolType.Tls & ~SecurityProtocolType.Tls11);
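To see what that bit-twiddling leaves enabled, here’s a small Python sketch. The flag values 48, 192, 768 and 3072 mirror the real .NET SecurityProtocolType members; the Protocol class itself is illustrative:

```python
from enum import IntFlag

class Protocol(IntFlag):
    # Mirrors the relevant .NET SecurityProtocolType values
    Ssl3 = 48
    Tls = 192
    Tls11 = 768
    Tls12 = 3072

allowed = Protocol.Ssl3 | Protocol.Tls | Protocol.Tls11 | Protocol.Tls12
# Same pattern as the C# above: AND with the complements clears the insecure bits
allowed &= ~Protocol.Ssl3 & ~Protocol.Tls & ~Protocol.Tls11
```

After the masking, only TLS 1.2 (and any later protocols the flags permit) remains in the allowed set.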


Tuning SQL Server Queries 101

First, get an actual query execution plan. Look for warnings in the query plan.

Basic

  • First look for large scans and lookups: these can often be resolved by creating a new index or extending an existing one with additional included columns. Seeks are usually preferable to scans.
  • Then look for significant variance of Actual versus Estimated row counts: you can provide the optimiser with more accurate information by updating statistics, creating new statistics objects, adding statistics on computed columns, or by breaking the query up into simpler parts. Such variance might also be caused by ‘parameter sniffing’.
  • Then look for expensive operators in the query plan, especially those that consume memory such as sorts and hashes. Sorting can sometimes be avoided by altering/adding indexes.

More Advanced

Joe Sack has an excellent walkthrough here: The Case of the Cardinality Estimate Red Herring