(This is Part 1 with a quick overview of the basics of clustering; Part 2 shows how to automatically detect poorly clustered tables.)
Clustering is probably one of the most misunderstood concepts in Snowflake. It optimizes query performance on large tables (> 1 terabyte) by organizing data into partitions based on specific keys, enabling efficient partition pruning.
Clustering Key Recommendations:
- Prioritize Filtering Columns: Choose columns used in WHERE clauses.
- Optimal Cardinality: Aim for a balance. Too few distinct values (e.g., 3-4) or too many (e.g., unique IDs such as UUIDs) won't prune effectively and make maintenance costly.
- Date Truncation: Use DATE_TRUNC('DAY', …) on timestamps to group data efficiently rather than clustering by the minute or second.
- Limit Key Length: Use a maximum of 3 or 4 columns per key to avoid high maintenance costs.
- Clustering keys are not intended for all tables due to the costs of initially clustering the data and maintaining the clustering.
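Putting these recommendations together, a clustering key definition might look like the following sketch (the table and column names are hypothetical; Snowflake allows expressions such as DATE_TRUNC in a clustering key):

```sql
-- Cluster a large event table on a truncated timestamp plus one
-- moderate-cardinality column: short key, day-level granularity
ALTER TABLE events
  CLUSTER BY (DATE_TRUNC('DAY', event_ts), customer_region);
```

Defining the key this way keeps the key short (two columns) and avoids clustering on second-level timestamps, which would behave like a near-unique key.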
When NOT to Cluster:
- Small tables (< 1TB) that fit into a few micro-partitions.
- Tables without a clear filter pattern.
- Tables whose data is naturally inserted in date order (the data is effectively already clustered on ingestion).
- Tables with very high update/delete volume (high maintenance overhead).
[For tables with heavy update/delete workloads, consider Hybrid Tables instead.]
Snowflake reference: https://docs.snowflake.com/en/user-guide/tables-clustering-keys
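As a starting point for finding tables that are even worth evaluating, you can list tables above the ~1 TB threshold from the ACCOUNT_USAGE share. The threshold and filters below are illustrative, not a rule:

```sql
-- Tables with more than ~1 TB of active storage: candidate clustering targets
SELECT table_catalog,
       table_schema,
       table_name,
       ROUND(active_bytes / POWER(1024, 4), 2) AS active_tb
FROM snowflake.account_usage.table_storage_metrics
WHERE active_bytes > POWER(1024, 4)   -- > 1 TB
  AND deleted = FALSE
ORDER BY active_bytes DESC;
```

A table appearing in this list is only a candidate; it still needs a clear, repeated filter pattern before clustering is justified.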
Poor Choices for Clustering Keys:
- GUIDs or any column containing unique values.
- VARCHAR columns: Snowflake’s current implementation of clustering for VARCHAR columns only uses the first 5 to 6 bytes of the string. If the initial characters of your strings are identical (e.g., all starting with “HTTPS://”), clustering on the raw column will be ineffective for pruning.
- Columns with too few distinct values (such as a BOOLEAN column).
- Columns containing highly skewed data.
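The VARCHAR limitation above can sometimes be worked around by clustering on an expression that skips the shared prefix. This is a sketch with hypothetical table/column names and an assumed fixed-length 8-character prefix (e.g., 'https://'):

```sql
-- Skip the identical 8-character prefix so the bytes Snowflake examines
-- actually differ between values
ALTER TABLE page_views
  CLUSTER BY (SUBSTRING(url, 9, 10));
```

Clustering on the expression rather than the raw column means the first few bytes Snowflake compares carry real information, restoring pruning effectiveness.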
Clustering is justified when either:
- You require the fastest possible response times, regardless of cost.
- Your improved query performance offsets the credits required to cluster and maintain the table.
If a large table has been clustered with a poor choice of clustering key(s), you could be using extra compute (and therefore credits) in two ways: first, queries scan many more partitions than necessary (effectively a full table scan); second, if Automatic Clustering is enabled, you pay to continually re-cluster the table. You can track Automatic Clustering costs using the AUTOMATIC_CLUSTERING_HISTORY view to confirm that the query performance benefits outweigh them.
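A query along these lines (the 30-day window is an arbitrary choice) surfaces which tables are consuming the most Automatic Clustering credits:

```sql
-- Credits consumed by Automatic Clustering per table, last 30 days
SELECT database_name,
       schema_name,
       table_name,
       SUM(credits_used) AS credits_used
FROM snowflake.account_usage.automatic_clustering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY 1, 2, 3
ORDER BY credits_used DESC;
```

Tables near the top of this list that also show poor clustering metrics are the first candidates for changing or dropping the clustering key.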
The SYSTEM$CLUSTERING_INFORMATION function reports how well a table is clustered. Here is example output for a poorly clustered table that uses UUID columns as its clustering keys:
{
  "cluster_by_keys" : "LINEAR(COL1_GUID, COL2_GUID)",
  "total_partition_count" : 88327,
  "total_constant_partition_count" : 116,
  "average_overlaps" : 162.4355,
  "average_depth" : 86.8738,
  "partition_depth_histogram" : {
    "00000" : 0,
    "00001" : 116,
    "00002" : 0,
    "00003" : 0,
    "00004" : 0,
    "00005" : 0,
    "00006" : 0,
    "00007" : 1,
    "00008" : 0,
    "00009" : 102,
    "00010" : 0,
    "00011" : 0,
    "00012" : 1,
    "00013" : 2,
    "00014" : 1,
    "00015" : 0,
    "00016" : 3,
    "00032" : 44,
    "00064" : 4803,
    "00128" : 83359
  }
}
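Output like the above comes from calling SYSTEM$CLUSTERING_INFORMATION; the table name and candidate key below are hypothetical:

```sql
-- Inspect clustering quality using the table's defined clustering key
SELECT SYSTEM$CLUSTERING_INFORMATION('MY_DB.MY_SCHEMA.MY_TABLE');

-- Or evaluate a candidate key before defining it on the table
SELECT SYSTEM$CLUSTERING_INFORMATION('MY_DB.MY_SCHEMA.MY_TABLE',
                                     '(DATE_TRUNC(''DAY'', event_ts))');
```

The two-argument form is useful because it lets you compare candidate keys without paying to re-cluster the table first.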
Characteristics of a Well-Clustered Histogram
- Right-Skewed: The vast majority of partitions are concentrated in the lower-end buckets, ideally between 00000 and 00016, with 00000 or 00001 as the largest bucket; this indicates most partitions have little to no overlap with others.
- Low average depth: The average depth is close to 1.
- Minimal Tail (not left-skewed): Very little data in the higher-numbered buckets. If buckets 00032, 00064, or 00128+ have high counts, the table is not well-clustered.
- High Constant Partition Count: a large number of partitions are in a “constant” state (meaning the data range is unique to that partition and does not overlap with others).
- Low average overlaps: The average number of overlapping partitions is low.
A poorly clustered table, in contrast, will have a high percentage of partitions in the higher-numbered buckets (e.g., 00128 or higher), i.e., the histogram is left-skewed.
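When the metrics confirm a poor key, the remedy is to replace or remove it. Both statements below use a hypothetical table and column; which one applies depends on whether clustering is justified at all:

```sql
-- Replace the poor key: Automatic Clustering (if enabled) will
-- reorganize the table on the new key over time
ALTER TABLE my_table
  CLUSTER BY (DATE_TRUNC('DAY', created_at));

-- Or remove clustering entirely if the cost is not justified
ALTER TABLE my_table DROP CLUSTERING KEY;
```

Note that replacing the key triggers re-clustering work (and credits), so check the expected query savings against AUTOMATIC_CLUSTERING_HISTORY costs before switching keys on a very large table.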
