I wanted a way to automatically identify all poorly clustered tables across an entire Snowflake account. ‘SYSTEM$CLUSTERING_INFORMATION’ doesn’t have a built-in way to run against all clustered tables, but it’s straightforward to create a stored procedure to automate this. If you have a lot of clustered tables across your databases, this could take 4–5 minutes to run on an X-SMALL warehouse.
To use ‘SYSTEM$CLUSTERING_INFORMATION’, a role must have the following privileges:
USAGE on the database and schema containing the table.
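For a single table, the function can be called directly. A minimal sketch (the table and key names here are hypothetical):

```sql
-- Clustering quality of an existing clustered table
SELECT SYSTEM$CLUSTERING_INFORMATION('mydb.myschema.sales');

-- The two-argument form lets you evaluate candidate clustering keys
-- that are not (yet) defined on the table
SELECT SYSTEM$CLUSTERING_INFORMATION('mydb.myschema.sales', '(order_date, region)');
```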
(This is Part 1 with a quick overview of the basics of clustering; Part 2 shows how to automatically detect poorly clustered tables.)
Clustering is probably one of the most misunderstood concepts in Snowflake. Snowflake clustering optimizes query performance on large tables (> 1 terabyte) by organizing data into partitions based on specific keys, enabling efficient partition pruning.
Clustering Key Recommendations:
Prioritize Filtering Columns: Choose columns used in WHERE clauses.
Optimal Cardinality: Aim for a balance. Too few values (e.g., 3-4) or too many (e.g., unique IDs such as UUIDs) won’t prune effectively, and make maintenance costly.
Date Truncation: Use DATE_TRUNC('DAY', …) on timestamps to group data efficiently, rather than clustering by the minute or second.
Limit Key Length: Use a maximum of 3 or 4 columns per key to avoid high maintenance costs.
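Putting those recommendations together, defining a key might look like this (table and column names are hypothetical):

```sql
-- A truncated date plus one moderate-cardinality filter column:
-- a short key that prunes well on typical WHERE clauses
ALTER TABLE mydb.myschema.sales
  CLUSTER BY (DATE_TRUNC('DAY', event_ts), region);
```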
Clustering keys are not intended for all tables due to the costs of initially clustering the data and maintaining the clustering.
When NOT to Cluster:
Small tables (< 1TB) that fit into a few micro-partitions.
Tables without a clear filter pattern.
Tables that are naturally inserted into in date order.
Tables with very high update/delete volume (high maintenance overhead). [You should consider Hybrid tables for these.]
VARCHAR columns: Snowflake’s current implementation of clustering for VARCHAR columns only uses the first 5 to 6 bytes of the string. If the initial characters of your strings are identical (e.g., all starting with “HTTPS://”), clustering on the raw column will be ineffective for pruning.
Columns with too few values (Such as a BOOLEAN column)
Columns containing highly skewed data.
Clustering is justified when either:
You require the fastest possible response times, regardless of cost.
Your improved query performance offsets the credits required to cluster and maintain the table.
If a large table has been clustered with a poor choice of clustering key(s), you could be paying for extra compute (and therefore credits) in two ways: first, queries scan many more partitions than necessary (effectively a full table scan); second, if Automatic Clustering is turned on, you incur extra costs re-clustering the table. Automatic Clustering costs can be tracked using the AUTOMATIC_CLUSTERING_HISTORY view, to ensure the query performance benefits outweigh them.
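As a sketch, tracking those re-clustering credits per table over the last 30 days could look like this (the ACCOUNT_USAGE view requires access to the SNOWFLAKE database):

```sql
-- Automatic Clustering credits consumed per table, last 30 days
SELECT database_name,
       schema_name,
       table_name,
       SUM(credits_used) AS credits_used
FROM snowflake.account_usage.automatic_clustering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY database_name, schema_name, table_name
ORDER BY credits_used DESC;
```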
SYSTEM$CLUSTERING_INFORMATION produces output similar to this example of a poorly clustered table, which uses UUID columns as clustering keys. For comparison, a well-clustered table has the following characteristics:
Right-Skewed: The vast majority of partitions should be concentrated in the lower-end buckets, ideally between 0 and 16. 00000 or 00001 is the largest bucket: this indicates most partitions have little to no overlap with others.
Low average depth: The average depth is close to 1.
Minimal Tail (not left skewed): Very little data in the higher-number buckets. If the buckets 00032, 00064, 00128+ have high counts, the table is not well-clustered.
High Constant Partition Count: a large number of partitions are in a “constant” state (meaning the data range is unique to that partition and does not overlap with others).
Low average overlaps: The average number of overlapping partitions is low.
A poorly clustered table, in contrast, will have a high percentage of partitions in the higher-numbered buckets (e.g., 00128 or higher), i.e. it is left-skewed.
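For reference, the function returns a JSON document along these lines. The field names match Snowflake's documented output, but the values below are invented to illustrate a badly clustered table:

```json
{
  "cluster_by_keys": "LINEAR(uuid_col)",
  "total_partition_count": 10000,
  "total_constant_partition_count": 0,
  "average_overlaps": 9990.0,
  "average_depth": 9985.5,
  "partition_depth_histogram": {
    "00000": 0,
    "00001": 0,
    "00002": 0,
    "00004": 0,
    "00008": 0,
    "00016": 0,
    "00032": 0,
    "00064": 0,
    "00128": 10000
  }
}
```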
I always end up writing this again as I forget to save it!
It generates the SQL that you then need to execute; use caution on very large tables.
-- Generate SQL to find Max length of all of a table's text columns:
SELECT
    'SELECT ' ||
    LISTAGG('MAX(LEN(' || column_name || ')) AS max_' || column_name, ', ')
        WITHIN GROUP (ORDER BY ordinal_position) ||
    ' FROM ' || table_schema || '.' || table_name
FROM
    SVV_COLUMNS
WHERE
    table_schema = 'mySchema'
    AND table_name = 'myTable'
    AND data_type IN ('character', 'character varying')
GROUP BY
    table_schema,
    table_name
;
Queries with parameter sensitive plans can perform poorly when an inappropriate query plan is used.
Even if your statistics are up to date, parameter sensitive plans can be caused by skewed data, so performing a data skew analysis can identify which filter columns might be involved in poor query plans.
I’ve adapted the code found here into a SQL Server stored procedure that can be run across an entire database, a schema, a single table or just a single column.
It should be relatively easy to convert this to other RDBMS.
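As a sketch of the underlying idea, a per-column skew analysis is essentially a value-distribution query like this (the table and column are from the StackOverflow sample database; adapt the names as needed):

```sql
-- How unevenly are values distributed in a candidate filter column?
-- A few values dominating the row count indicates skew.
SELECT PostTypeId,
       COUNT(*) AS row_count,
       100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS pct_of_table
FROM dbo.Posts
GROUP BY PostTypeId
ORDER BY row_count DESC;
```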
Here’s an example of the output when run on the Stackoverflow 2013 downloadable database (approximately 50GB):
First, get an actual query execution plan. Look for warnings in the query plan.
Basic
First look for large scans and lookups: these can often be resolved by creating a new index or extending an existing one with additional included columns. Seeks are usually preferable to scans.
Then look for significant variance of Actual versus Estimated row counts: you can provide the optimiser with more accurate information by updating statistics, creating new statistics objects, adding statistics on computed columns, or by breaking the query up into simpler parts. Might be caused by ‘parameter sniffing’.
Then look for expensive operators in the query plan, especially those that consume memory such as sorts and hashes. Sorting can sometimes be avoided by altering/adding indexes.
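When parameter sniffing is suspected, a common first mitigation is to request a fresh plan per execution. A sketch (the query and parameter are hypothetical, and recompiling has its own CPU cost):

```sql
-- OPTION (RECOMPILE) compiles a plan for the actual parameter
-- value on every execution, avoiding a cached plan built for
-- an unrepresentative value
SELECT Id, Title, CreationDate
FROM dbo.Posts
WHERE OwnerUserId = @UserId
OPTION (RECOMPILE);
```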
I’ve recently changed what I was calling the ‘Demo’ version of SQLFrontline to a ‘Freemium’ model. The demo version only displayed one recommendation result in each of the four severity categories (Critical, High, Medium, Info).
The free version obviously does not include all the features of the paid premium version, but it still provides useful recommendations, with advice from 40 checks.
Both use the same lightweight metadata collection.
The Free Version:
Performs 40 checks (out of 350+), but doesn’t show all affected objects if a check results in a recommendation
Deletes all collected metadata after collection
No reminder list is shown
Does not display a list of issues that are no longer present since last collection
Sends a single email for all servers
No database specific recommendations are made
Can only collect metadata for a limited number of servers
The Premium Version:
350+ checks performed across the categories of Configuration, Reliability, Performance, Security, Server Info, Table/Index Design
New checks are constantly being added
Reminder list of recommendations that have been made previously and not yet fixed
List of issues fixed compared to the last collection
Can choose how long to store collected metadata, so that point in time reports can be made, along with automated estimates of DB growth over time
Can send 1 email per server or a single email for all servers
Ability to ‘mute’ recommendations on an individual basis, or entire check (for non-critical checks)
No practical limit on the number of servers
If you want to try it out, click this link to request a free access token.
Almost every fact table in a data warehouse uses a date (or calendar) dimension, because most measurements are defined at specific points in time. A flexible calendar date dimension is at the heart of most data warehouse systems; it provides easy navigation of a fact table through user familiar dates, such as weeks, months, fiscal periods and special days (today, weekends, holidays etc.).
I’ve created a date dimension generator here at GitHub.
It targets SQL Server, but should be easy to convert to other RDBMS.
It features:
User defined start and end dates
Computed Easter dates (for years 1901 to 2099)
Computed Chinese New year dates for years 1971 to 2099.
Computed public holidays for US, UK, Canada, Ireland, Malta, Philippines, Australia (with state specific for WA, NSW, QLD, SA, VIC).
Date labels in US, UK and ISO formats.
Things to Note:
The [TodayFlag] needs to be updated once per day by a scheduled task (timezone dependent: might need a flag for each timezone).
If you use an unusual Fiscal year (say 5-4-4), it will need to be loaded from an external source (such as an Excel/Google spreadsheet).
The precise start date of the month of Ramadan is by proclamation, so these need to be added, year by year. It is possible to calculate but can be a day out, and can vary by region.
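The [TodayFlag] update mentioned above can be a single scheduled statement. A sketch, assuming the dimension table is dbo.DimDate with a [Date] column (adjust the names to the generator's output):

```sql
-- Run once per day (per relevant timezone) from SQL Server Agent
UPDATE dbo.DimDate
SET [TodayFlag] = CASE WHEN [Date] = CAST(SYSDATETIME() AS date)
                       THEN 1 ELSE 0 END;
```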
Babelfish for PostgreSQL is an Apache-2.0 open source project that adds a Microsoft SQL Server-compatible endpoint to PostgreSQL, enabling your PostgreSQL database to understand the SQL Server wire protocol and commonly used SQL Server commands. With Babelfish, applications that were originally built for SQL Server can work directly with PostgreSQL, with little to no code changes, and without changing database drivers.