Snowflake: Find All Poorly Clustered Tables

(Part 1 gave an overview of Snowflake clustering)

I wanted a way to automatically identify all poorly clustered tables across an entire Snowflake account. 'SYSTEM$CLUSTERING_INFORMATION' doesn't have a built-in way to run against all clustered tables, but it's straightforward to create a stored procedure to automate this. If you have a lot of clustered tables across all your databases, this could take 4–5 minutes to run on an X-SMALL warehouse.

To use 'SYSTEM$CLUSTERING_INFORMATION', a role must have the following privileges:

  • USAGE on the database and schema containing the table.
  • SELECT on the table itself.

You can get the code here: https://github.com/Mitch-Wheat/Snowflake-PoorlyClusteredTables/blob/main/Find%20poorly%20clustered%20tables.sql
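For illustration, here is a minimal Python sketch of the query-generation step such a procedure performs: given the clustered tables you have enumerated (e.g., from SNOWFLAKE.ACCOUNT_USAGE.TABLES where CLUSTERING_KEY is not null), emit one SYSTEM$CLUSTERING_INFORMATION call per table. The function name and the sample table are illustrative, not taken from the linked repository.

```python
def clustering_info_queries(tables):
    """Given (database, schema, table) tuples for clustered tables,
    build one SYSTEM$CLUSTERING_INFORMATION call per table."""
    return [
        f"SELECT SYSTEM$CLUSTERING_INFORMATION('{db}.{schema}.{table}');"
        for (db, schema, table) in tables
    ]

# Hypothetical example table, just to show the generated SQL.
queries = clustering_info_queries([("SALES_DB", "PUBLIC", "ORDERS")])
print(queries[0])
```

The real stored procedure in the linked repository additionally loops over databases and schemas, executes each statement, and collects the JSON results for analysis.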

Snowflake: Clustered Tables

(This is Part 1 with a quick overview of the basics of clustering; Part 2 shows how to automatically detect poorly clustered tables.)

Clustering is probably one of the most misunderstood concepts in Snowflake. Snowflake clustering optimizes query performance on large tables (> 1 terabyte) by organizing data into partitions based on specific keys, enabling efficient partition pruning.

Clustering Key Recommendations:

  • Prioritize Filtering Columns: Choose columns used in WHERE clauses.
  • Optimal Cardinality: Aim for a balance. Too few values (e.g., 3-4) or too many (e.g., unique IDs such as UUIDs) won’t prune effectively, and make maintenance costly.
  • Date Truncation: Use DATE_TRUNC('DAY', ...) on timestamps to group data efficiently rather than clustering by the minute or second.
  • Limit Key Length: Use a maximum of 3 or 4 columns per key to avoid high maintenance costs.
  • Clustering keys are not intended for all tables due to the costs of initially clustering the data and maintaining the clustering.

When NOT to Cluster:

  • Small tables (< 1TB) that fit into a few micro-partitions.
  • Tables without a clear filter pattern.
  • Tables that are naturally inserted into in date order.
  • Tables with very high update/delete volume (high maintenance overhead).
    [You should consider Hybrid tables for these.]

Snowflake ref.: https://docs.snowflake.com/en/user-guide/tables-clustering-keys

Poor Choices for Clustering Keys:

  • GUIDs or any column containing unique values.
  • VARCHAR columns: Snowflake’s current implementation of clustering for VARCHAR columns only uses the first 5 to 6 bytes of the string. If the initial characters of your strings are identical (e.g., all starting with “HTTPS://”), clustering on the raw column will be ineffective for pruning.
  • Columns with too few values (such as a BOOLEAN column).
  • Columns containing highly skewed data.
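To see why identical prefixes defeat VARCHAR clustering, here is a small Python illustration. The 6-byte prefix length is an approximation of the behaviour described above, not an exact Snowflake constant:

```python
# URLs that differ only after a shared scheme prefix (illustrative data).
urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://other.net/x",
]

PREFIX_BYTES = 6  # approximate number of leading bytes considered for VARCHAR clustering

# All values collapse to the same prefix, so min/max prefix metadata
# cannot distinguish one micro-partition from another.
prefixes = {u[:PREFIX_BYTES] for u in urls}
print(prefixes)
```

Clustering on a derived expression that strips the common prefix (or a hash of the column) avoids this, at the cost of a less intuitive key.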

Clustering is justified when either:

  • You require the fastest possible response times, regardless of cost.
  • Your improved query performance offsets the credits required to cluster and maintain the table.

If a large table has been clustered with a poor choice of clustering key(s), you could be using extra compute (and therefore credits) in two ways: first, queries scan many more partitions than necessary (essentially a full table scan); second, if Automatic Clustering is enabled, you pay ongoing costs to re-cluster the table. You can track Automatic Clustering costs using the AUTOMATIC_CLUSTERING_HISTORY view to make sure the query performance benefits outweigh them.

SYSTEM$CLUSTERING_INFORMATION produces output similar to this example of a poorly clustered table, which uses UUID columns as clustering keys:

{
"cluster_by_keys" : "LINEAR(COL1_GUID, COL2_GUID)",
"total_partition_count" : 88327,
"total_constant_partition_count" : 116,
"average_overlaps" : 162.4355,
"average_depth" : 86.8738,
"partition_depth_histogram" : {
  "00000" : 0,
  "00001" : 116,
  "00002" : 0,
  "00003" : 0,
  "00004" : 0,
  "00005" : 0,
  "00006" : 0,
  "00007" : 1,
  "00008" : 0,
  "00009" : 102,
  "00010" : 0,
  "00011" : 0,
  "00012" : 1,
  "00013" : 2,
  "00014" : 1,
  "00015" : 0,
  "00016" : 3,
  "00032" : 44,
  "00064" : 4803,
  "00128" : 83359
 }
}
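As a quick sanity check, you can load this output with Python's standard json module and measure how much of the table sits in the deep buckets (the histogram below is abridged to the non-zero buckets from the example above):

```python
import json

# The SYSTEM$CLUSTERING_INFORMATION output shown above, abridged to the
# non-zero histogram buckets.
info = json.loads("""
{
  "cluster_by_keys": "LINEAR(COL1_GUID, COL2_GUID)",
  "total_partition_count": 88327,
  "total_constant_partition_count": 116,
  "average_overlaps": 162.4355,
  "average_depth": 86.8738,
  "partition_depth_histogram": {
    "00001": 116, "00007": 1, "00009": 102, "00012": 1, "00013": 2,
    "00014": 1, "00016": 3, "00032": 44, "00064": 4803, "00128": 83359
  }
}
""")

hist = {int(k): v for k, v in info["partition_depth_histogram"].items()}
total = sum(hist.values())
deep = sum(count for depth, count in hist.items() if depth >= 32)
print(f"{deep / total:.1%} of partitions sit in buckets 00032 and above")
```

Here over 99% of partitions land in the three deepest buckets, which is about as bad as clustering gets.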

Characteristics of a Well-Clustered Histogram

  • Right-Skewed: The vast majority of partitions should be concentrated in the lower-end buckets, ideally between 0 and 16. 00000 or 00001 is the largest bucket: this indicates most partitions have little to no overlap with others.
  • Low average depth: The average depth is close to 1.
  • Minimal Tail (not left skewed): Very little data in the higher-number buckets. If the buckets 00032, 00064, 00128+ have high counts, the table is not well-clustered.
  • High Constant Partition Count: A large number of partitions are in a “constant” state (meaning the data range is unique to that partition and does not overlap with others).
  • Low average overlaps: The average number of overlapping partitions is low.

A poorly clustered table, in contrast, will have a high percentage of partitions in the higher-numbered buckets (e.g., 00128 or higher), i.e. left-skewed.
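The characteristics above can be folded into a rough screening heuristic. Here is a sketch in Python; the threshold values are my illustrative assumptions, not cut-offs defined by Snowflake:

```python
def looks_poorly_clustered(info, max_avg_depth=8.0, max_deep_fraction=0.2):
    """Rough screen based on the histogram characteristics above.
    The default thresholds are illustrative, not Snowflake-defined."""
    hist = {int(k): v for k, v in info["partition_depth_histogram"].items()}
    total = sum(hist.values()) or 1
    # Fraction of partitions in the deep buckets (00032 and above).
    deep_fraction = sum(v for d, v in hist.items() if d >= 32) / total
    return info["average_depth"] > max_avg_depth or deep_fraction > max_deep_fraction

# Abridged figures from the GUID-keyed example above: fails both checks.
bad = {
    "average_depth": 86.8738,
    "partition_depth_histogram": {"00001": 116, "00064": 4803, "00128": 83359},
}
print(looks_poorly_clustered(bad))  # True
```

A table with most partitions in buckets 00000–00016 and an average depth near 1 would pass this screen, matching the well-clustered profile described above.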