I wanted a way to automatically identify all poorly clustered tables across an entire Snowflake account. SYSTEM$CLUSTERING_INFORMATION has no built-in way to run against all clustered tables, but it's straightforward to create a stored procedure to automate this. If you have a lot of clustered tables across your databases, it could take 4 to 5 minutes to run on an X-SMALL warehouse.
To use SYSTEM$CLUSTERING_INFORMATION, a role must have the following privilege:
USAGE on the database and schema containing the table.
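Part 2 develops the procedure fully, but here is a minimal sketch of the idea, assuming a role with access to SNOWFLAKE.ACCOUNT_USAGE; the procedure and results table names are illustrative, and identifier quoting and error handling are left as an exercise:
-- Sketch: run SYSTEM$CLUSTERING_INFORMATION against every clustered table.
-- Note: ACCOUNT_USAGE.TABLES can lag real time by an hour or more.
CREATE OR REPLACE PROCEDURE check_all_clustering()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
  c1 CURSOR FOR
    SELECT table_catalog || '.' || table_schema || '.' || table_name AS fq_name
    FROM snowflake.account_usage.tables
    WHERE clustering_key IS NOT NULL
      AND deleted IS NULL;
BEGIN
  CREATE OR REPLACE TABLE clustering_results (full_table_name VARCHAR, clustering_info VARIANT);
  FOR rec IN c1 DO
    -- SYSTEM$CLUSTERING_INFORMATION takes the table name as a string, so build the INSERT dynamically.
    LET stmt VARCHAR := 'INSERT INTO clustering_results SELECT ''' || rec.fq_name ||
                        ''', PARSE_JSON(SYSTEM$CLUSTERING_INFORMATION(''' || rec.fq_name || '''))';
    EXECUTE IMMEDIATE stmt;
  END FOR;
  RETURN 'Results written to clustering_results';
END;
$$;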
(This is Part 1 with a quick overview of the basics of clustering; Part 2 shows how to automatically detect poorly clustered tables.)
Clustering is probably one of the most misunderstood concepts in Snowflake. Snowflake clustering optimizes query performance on large tables (> 1 terabyte) by organizing data into partitions based on specific keys, enabling efficient partition pruning.
Clustering Key Recommendations:
Prioritize Filtering Columns: Choose columns used in WHERE clauses.
Optimal Cardinality: Aim for a balance. Too few distinct values (e.g., 3 or 4) or too many (e.g., unique IDs such as UUIDs) won't prune effectively and will make maintenance costly.
Date Truncation: Use DATE_TRUNC('DAY', …) on timestamps to group data efficiently rather than clustering to the minute or second (see the example after this list).
Limit Key Length: Use a maximum of 3 or 4 columns per key to avoid high maintenance costs.
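For example, a key combining a day-truncated timestamp with a commonly filtered, moderate-cardinality column (table and column names are illustrative):
-- Illustrative: cluster a large event table by day, then by a frequently filtered column.
ALTER TABLE events CLUSTER BY (DATE_TRUNC('DAY', event_ts), event_type);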
Clustering keys are not intended for all tables due to the costs of initially clustering the data and maintaining the clustering.
When NOT to Cluster:
Small tables (< 1TB) that fit into a few micro-partitions.
Tables without a clear filter pattern.
Tables that are naturally inserted into in date order.
Tables with very high update/delete volume (high maintenance overhead). [You should consider Hybrid tables for these.]
VARCHAR columns: Snowflake's current implementation of clustering for VARCHAR columns only uses the first 5 to 6 bytes of the string. If the initial characters of your strings are identical (e.g., all starting with "HTTPS://"), clustering on the raw column will be ineffective for pruning (a workaround is sketched after this list).
Columns with too few distinct values (such as a BOOLEAN column).
Columns containing highly skewed data.
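One workaround for the VARCHAR limitation above is to cluster on an expression that skips the shared prefix; a sketch with made-up names:
-- 'HTTPS://' is 8 characters, so start the clustering value at character 9.
ALTER TABLE page_visits CLUSTER BY (SUBSTR(url, 9));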
Clustering is justified when either:
You require the fastest possible response times, regardless of cost.
Your improved query performance offsets the credits required to cluster and maintain the table.
If a large table has been clustered with a poor choice of clustering key(s), you could be using extra compute (and therefore credits) in two ways: first, queries scan many more partitions than necessary (effectively a full table scan); second, if Automatic Clustering is enabled, you pay to continually re-cluster the table. You can track Automatic Clustering costs using the AUTOMATIC_CLUSTERING_HISTORY view, to check that the query performance benefits outweigh them.
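For example, something along these lines (using the ACCOUNT_USAGE copy of the view) shows the credits each table consumed on re-clustering over the last 30 days:
SELECT database_name, schema_name, table_name,
       SUM(credits_used) AS credits_used
FROM snowflake.account_usage.automatic_clustering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY database_name, schema_name, table_name
ORDER BY credits_used DESC;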
SYSTEM$CLUSTERING_INFORMATION returns a JSON document containing total_partition_count, average_overlaps, average_depth and a partition_depth_histogram; on a table clustered on UUID columns (a poor choice of key), the histogram is dominated by the high buckets.
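The call itself looks like this; the second argument is optional and defaults to the table's defined clustering keys (table and column names are illustrative):
SELECT SYSTEM$CLUSTERING_INFORMATION('mydb.myschema.orders', '(order_uuid)');
A well-clustered table's histogram shows: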
Right-Skewed: The vast majority of partitions are concentrated in the lower buckets, ideally between 0 and 16, with 00000 or 00001 as the largest bucket; this indicates most partitions have little to no overlap with others.
Low average depth: The average depth is close to 1.
Minimal Tail (not left-skewed): Very little data in the higher-numbered buckets. If buckets 00032, 00064 and 00128+ have high counts, the table is not well clustered.
High Constant Partition Count: a large number of partitions are in a “constant” state (meaning the data range is unique to that partition and does not overlap with others).
Low average overlaps: The average number of overlapping partitions is low.
A poorly clustered table, in contrast, will have a high percentage of partitions in the higher-numbered buckets (e.g., 00128 or higher), i.e. it is left-skewed.
I always end up writing this again as I forget to save it!
It generates SQL that you then need to execute; be cautious running it against very large tables.
-- Generate SQL to find the max length of all of a table's text columns:
SELECT
    'SELECT ' ||
    LISTAGG('MAX(LEN(' || column_name || ')) AS max_' || column_name, ', ')
        WITHIN GROUP (ORDER BY ordinal_position) ||
    ' FROM ' || table_schema || '.' || table_name
FROM
    SVV_COLUMNS
WHERE
    table_schema = 'mySchema'
    AND table_name = 'myTable'
    AND data_type IN ('character', 'character varying')
GROUP BY
    table_schema,
    table_name
;
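With the LISTAGG above, the generated statement for a hypothetical myTable with two text columns would look like:
SELECT MAX(LEN(col1)) AS max_col1, MAX(LEN(col2)) AS max_col2 FROM mySchema.myTable;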
Queries with parameter-sensitive plans can perform poorly when an inappropriate query plan is used.
Even if your statistics are up to date, parameter-sensitive plans can be caused by skewed data, so performing a data skew analysis can identify which filter columns might be involved in poor query plans.
I’ve adapted the code found here into a SQL Server stored procedure that can be run across an entire database, a schema, a single table or just a single column.
It should be relatively easy to convert this to other RDBMS.
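The heart of the analysis is simple value-frequency counting per column; here is a minimal T-SQL sketch of that core idea (using a table and column from the StackOverflow sample database purely as an example):
-- How concentrated are the values in one candidate filter column?
-- A single value owning a large share of rows is a warning sign for parameter-sensitive plans.
SELECT TOP (10)
    Location,
    COUNT(*) AS row_count,
    100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS pct_of_table
FROM dbo.Users
GROUP BY Location
ORDER BY row_count DESC;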
Here’s an example of the output when run on the Stackoverflow 2013 downloadable database (approximately 50GB):
Kendra Little has a gist to script out all indexes (HOW TO SCRIPT OUT INDEXES FROM SQL SERVER), but it didn't include XML or columnstore indexes, so I've forked her gist and added a few things to it. I also changed the FOR XML/STUFF trick to STRING_AGG() (SQL Server 2017 onwards), for no other reason than I'm not working with any instance versions below that.
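For anyone unfamiliar with the change, the two idioms look like this (a simplified illustration listing each table's columns, not the actual gist code):
-- Pre-2017: FOR XML PATH + STUFF
SELECT t.name,
       STUFF((SELECT ', ' + c.name
              FROM sys.columns AS c
              WHERE c.object_id = t.object_id
              FOR XML PATH('')), 1, 2, '') AS column_list
FROM sys.tables AS t;

-- SQL Server 2017+: STRING_AGG
SELECT t.name,
       STRING_AGG(c.name, ', ') WITHIN GROUP (ORDER BY c.column_id) AS column_list
FROM sys.tables AS t
JOIN sys.columns AS c ON c.object_id = t.object_id
GROUP BY t.name;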
The 'Monty Hall' problem is best described by the Wikipedia entry:
The Monty Hall problem is a probability puzzle, loosely based on the American television game show Let's Make a Deal and named after its original host, Monty Hall. The problem was originally posed (and solved) in a letter by Steve Selvin to the American Statistician in 1975. It became famous as a question from reader Craig F. Whitaker's letter quoted in Marilyn vos Savant's "Ask Marilyn" column in Parade magazine in 1990.
Suppose you’re on a game show, and you’re given the choice of three doors A, B and C.
Behind one door is a car; behind the other two are goats. You pick a door, say door A, and the host, who knows what’s behind the doors, opens another door, say door B, which has a goat. He then says to you, “Do you want to change your choice to door C?”
Is it to your advantage to switch? (It is!)
Many readers of vos Savant's column refused to believe switching is beneficial and rejected her explanation. After the problem appeared in Parade, approximately 10,000 readers, including nearly 1,000 with PhDs, wrote to the magazine, most of them calling vos Savant wrong.[4] Even when given explanations, simulations, and formal mathematical proofs, many people still did not accept that switching is the best strategy.[5] Paul Erdős, one of the most prolific mathematicians in history, remained unconvinced until he was shown a computer simulation demonstrating vos Savant's predicted result.[6]
First we need to define some notation:
A, B – events
P(A) – the probability of event A occurring
P(B) – the probability of event B occurring
P(A | B) – the probability of event A occurring, given that event B has already occurred
P(B | A) – the probability of event B occurring, given that event A has already occurred
Bayes Theorem is defined as:
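P(A | B) = P(B | A) P(A) / P(B)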
The following reasoning is from Julian Havil’s book “Impossible?”
Assign symbols to the events:
A – the event "car is behind door A"
B – the event "car is behind door B"
C – the event "car is behind door C"
MA – the event "Monty opens door A" (and similarly MB, MC)
Assume door A is chosen initially by the player, so Monty can open door B or C:
P(MB | A) = ½, P(MB | B) = 0, P(MB | C) = 1
So, since A, B and C are mutually exclusive events:
P(MB) = P(MB | A)P(A) + P(MB | B)P(B) + P(MB | C)P(C) = ½ x ⅓ + 0 x ⅓ + 1 x ⅓ = ½
Now, the player can stick or change. If they stick with door A, their probability of winning the car is:
P(A | MB) = P(MB | A)P(A) / P(MB) = (½ x ⅓) / ½ = ⅓
If they switch to door C, their probability of winning the car is:
P(C | MB) = P(MB | C)P(C) / P(MB) = (1 x ⅓) / ½ = ⅔
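As a sanity check of those two numbers, here is a quick simulation sketch (written in Snowflake SQL, though any engine with a random number generator will do). It uses the fact that a player who switches wins exactly when their initial pick was wrong:
-- Simulate 1,000,000 games; the player always picks door 1.
-- Sticking wins when the car is behind door 1 (~1/3); switching wins otherwise (~2/3).
SELECT AVG(IFF(car = 1, 1, 0))  AS p_win_stick,
       AVG(IFF(car <> 1, 1, 0)) AS p_win_switch
FROM (
    SELECT UNIFORM(1, 3, RANDOM()) AS car
    FROM TABLE(GENERATOR(ROWCOUNT => 1000000))
);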
There have been a large number of Performance Improvements in .NET 6 as evidenced in Stephen Toub’s blog post. Most of the time we don’t care about assembly level performance optimisations because the bottleneck is usually accessing some external resource, such as a database or web service.
If you need to benchmark .NET code, take a look at the great tool BenchmarkDotNet, and also at the book referenced there, Pro .NET Benchmarking, as getting benchmarking right can sometimes be quite tricky.
If you're not yet on .NET 6, and you have code in a large loop that you want to squeeze some performance out of (and it won't make the code hard to understand and maintain!), here are a couple of simple tips:
Testing if n is even: replace (n % 2 == 0) with ((n & 1) == 0)
Dividing by 2: replace n / 2 by n >> 1 (but be aware of the unsigned / signed behaviour of right shift)
Unless you perform regular restores of your database backups, you don’t know that you actually have a valid backup. In a career spanning over 30 years, I’ve seen two occasions where a company was performing backups (or so they thought!) and sending tapes offsite, assuming they were good when in fact the tapes were blank!
If you are doing regular (5 minutes or less) transaction log backups, a restore might involve applying quite a few transaction logs.
I've written a short LINQPad script here which will generate the TSQL to perform a database restore, either to a point in time or to the latest available, based upon the default locations and naming conventions used by Ola's backups. It is differential-backup aware, creates the multiple transaction log restore statements, and takes into account backups that are split into separate files (which is quite common). You specify the server name, the database name, the root folder where the backups are stored, and either a point in time or the latest.
Disclaimer: Use at your own risk AND test thoroughly!
Example output:
USE [master]
RESTORE DATABASE [AdventureWorks] FROM
DISK = N'C:\temp\Backup\K7\AdventureWorks\FULL\K7_AdventureWorks_FULL_20211118_151558.bak'
WITH NORECOVERY, REPLACE
RESTORE DATABASE [AdventureWorks] FROM
DISK = N'C:\temp\Backup\K7\AdventureWorks\DIFF\K7_AdventureWorks_DIFF_20211118_152101.bak'
WITH NORECOVERY
RESTORE LOG [AdventureWorks] FROM
DISK = N'C:\temp\Backup\K7\AdventureWorks\LOG\K7_AdventureWorks_LOG_20211118_152226.trn'
WITH NORECOVERY, STOPAT = '2021-11-21 17:07:22'
RESTORE DATABASE [AdventureWorks] WITH RECOVERY
TLS 1.1 and TLS 1.0 (and lower) protocols are insecure and should no longer be used.
For .NET 4.7 or later, you do not need to set System.Net.ServicePointManager.SecurityProtocol. The default value (SecurityProtocolType.SystemDefault) allows the operating system to use whatever versions it has been configured for, including any new versions that may not have existed at the time your application was created.
If you want to explicitly code this in .NET, rather than specify the allowed protocols, disable the disallowed protocols before making any connections:
// TLS must be 1.2 or greater. Disable SSL3, TLS1.0 and TLS1.1 [Note: this is the default behaviour for .NET 4.7 or later]
ServicePointManager.SecurityProtocol &= (~SecurityProtocolType.Ssl3 & ~SecurityProtocolType.Tls & ~SecurityProtocolType.Tls11);