(This is Part 1 with a quick overview of the basics of clustering; Part 2 shows how to automatically detect poorly clustered tables.)
Clustering is probably one of the most misunderstood concepts in Snowflake. It optimizes query performance on large tables (> 1 terabyte) by organizing data into partitions based on specific keys, enabling efficient partition pruning.
Clustering Key Recommendations:
- Prioritize Filtering Columns: Choose columns used in WHERE clauses.
- Optimal Cardinality: Aim for a balance. Too few distinct values (e.g., 3-4) or too many (e.g., unique IDs such as UUIDs) won't prune effectively and make maintenance costly.
- Date Truncation: Use DATE_TRUNC('DAY', …) on timestamps to group data efficiently rather than clustering by the minute or second.
- Limit Key Length: Use a maximum of 3 or 4 columns per key to avoid high maintenance costs.
- Clustering keys are not intended for all tables due to the costs of initially clustering the data and maintaining the clustering.
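Putting these recommendations together, a clustering key definition might look like the following sketch (the table and column names are hypothetical; Snowflake allows expressions such as DATE_TRUNC in a clustering key):

```sql
-- Cluster a large event table on a truncated timestamp plus one
-- moderate-cardinality column: short key, day-level granularity
ALTER TABLE events
  CLUSTER BY (DATE_TRUNC('DAY', event_ts), customer_region);
```

Defining the key this way keeps the key short (two columns) and avoids clustering on second-level timestamps, which would behave like a near-unique key.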
When NOT to Cluster:
- Small tables (< 1TB) that fit into a few micro-partitions.
- Tables without a clear filter pattern.
- Tables whose data is naturally inserted in date order (the data is effectively already clustered on ingestion).
- Tables with very high update/delete volume (high maintenance overhead).
[For tables with heavy update/delete workloads, consider Hybrid Tables instead.]
Snowflake reference: https://docs.snowflake.com/en/user-guide/tables-clustering-keys
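As a starting point for finding tables that are even worth evaluating, you can list tables above the ~1 TB threshold from the ACCOUNT_USAGE share. The threshold and filters below are illustrative, not a rule:

```sql
-- Tables with more than ~1 TB of active storage: candidate clustering targets
SELECT table_catalog,
       table_schema,
       table_name,
       ROUND(active_bytes / POWER(1024, 4), 2) AS active_tb
FROM snowflake.account_usage.table_storage_metrics
WHERE active_bytes > POWER(1024, 4)   -- > 1 TB
  AND deleted = FALSE
ORDER BY active_bytes DESC;
```

A table appearing in this list is only a candidate; it still needs a clear, repeated filter pattern before clustering is justified.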
Poor Choices for Clustering Keys:
- GUIDs or any column containing unique values.
- VARCHAR columns: Snowflake’s current implementation of clustering for VARCHAR columns only uses the first 5 to 6 bytes of the string. If the initial characters of your strings are identical (e.g., all starting with “HTTPS://”), clustering on the raw column will be ineffective for pruning.
- Columns with too few distinct values (such as a BOOLEAN column).
- Columns containing highly skewed data.
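The VARCHAR limitation above can sometimes be worked around by clustering on an expression that skips the shared prefix. This is a sketch with hypothetical table/column names and an assumed fixed-length 8-character prefix (e.g., 'https://'):

```sql
-- Skip the identical 8-character prefix so the bytes Snowflake examines
-- actually differ between values
ALTER TABLE page_views
  CLUSTER BY (SUBSTRING(url, 9, 10));
```

Clustering on the expression rather than the raw column means the first few bytes Snowflake compares carry real information, restoring pruning effectiveness.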
Clustering is justified when either:
- You require the fastest possible response times, regardless of cost.
- Your improved query performance offsets the credits required to cluster and maintain the table.
If a large table has been clustered with a poor choice of clustering key(s), you could be using extra compute (and therefore credits) in two ways: first, queries scan many more partitions than necessary (effectively a full table scan); second, if Automatic Clustering is enabled, you pay to continually re-cluster the table. You can track Automatic Clustering costs using the AUTOMATIC_CLUSTERING_HISTORY view to confirm that the query performance benefits outweigh them.
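A query along these lines (the 30-day window is an arbitrary choice) surfaces which tables are consuming the most Automatic Clustering credits:

```sql
-- Credits consumed by Automatic Clustering per table, last 30 days
SELECT database_name,
       schema_name,
       table_name,
       SUM(credits_used) AS credits_used
FROM snowflake.account_usage.automatic_clustering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY 1, 2, 3
ORDER BY credits_used DESC;
```

Tables near the top of this list that also show poor clustering metrics are the first candidates for changing or dropping the clustering key.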
The SYSTEM$CLUSTERING_INFORMATION function reports how well a table is clustered. Here is example output for a poorly clustered table that uses UUID columns as its clustering keys:
{
  "cluster_by_keys" : "LINEAR(COL1_GUID, COL2_GUID)",
  "total_partition_count" : 88327,
  "total_constant_partition_count" : 116,
  "average_overlaps" : 162.4355,
  "average_depth" : 86.8738,
  "partition_depth_histogram" : {
    "00000" : 0,
    "00001" : 116,
    "00002" : 0,
    "00003" : 0,
    "00004" : 0,
    "00005" : 0,
    "00006" : 0,
    "00007" : 1,
    "00008" : 0,
    "00009" : 102,
    "00010" : 0,
    "00011" : 0,
    "00012" : 1,
    "00013" : 2,
    "00014" : 1,
    "00015" : 0,
    "00016" : 3,
    "00032" : 44,
    "00064" : 4803,
    "00128" : 83359
  }
}
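Output like the above comes from calling SYSTEM$CLUSTERING_INFORMATION; the table name and candidate key below are hypothetical:

```sql
-- Inspect clustering quality using the table's defined clustering key
SELECT SYSTEM$CLUSTERING_INFORMATION('MY_DB.MY_SCHEMA.MY_TABLE');

-- Or evaluate a candidate key before defining it on the table
SELECT SYSTEM$CLUSTERING_INFORMATION('MY_DB.MY_SCHEMA.MY_TABLE',
                                     '(DATE_TRUNC(''DAY'', event_ts))');
```

The two-argument form is useful because it lets you compare candidate keys without paying to re-cluster the table first.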
Characteristics of a Well-Clustered Histogram
- Right-Skewed: The vast majority of partitions are concentrated in the lower-end buckets, ideally between 00000 and 00016, with 00000 or 00001 as the largest bucket; this indicates most partitions have little to no overlap with others.
- Low average depth: The average depth is close to 1.
- Minimal Tail (not left-skewed): Very little data in the higher-numbered buckets. If buckets 00032, 00064, or 00128+ have high counts, the table is not well-clustered.
- High Constant Partition Count: a large number of partitions are in a “constant” state (meaning the data range is unique to that partition and does not overlap with others).
- Low average overlaps: The average number of overlapping partitions is low.
A poorly clustered table, in contrast, will have a high percentage of partitions in the higher-numbered buckets (e.g., 00128 or higher), i.e., the histogram is left-skewed.
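When the metrics confirm a poor key, the remedy is to replace or remove it. Both statements below use a hypothetical table and column; which one applies depends on whether clustering is justified at all:

```sql
-- Replace the poor key: Automatic Clustering (if enabled) will
-- reorganize the table on the new key over time
ALTER TABLE my_table
  CLUSTER BY (DATE_TRUNC('DAY', created_at));

-- Or remove clustering entirely if the cost is not justified
ALTER TABLE my_table DROP CLUSTERING KEY;
```

Note that replacing the key triggers re-clustering work (and credits), so check the expected query savings against AUTOMATIC_CLUSTERING_HISTORY costs before switching keys on a very large table.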
