Snowflake: Find All Poorly Clustered Tables

(Part 1 gave an overview of Snowflake clustering)

I wanted a way to automatically identify all poorly clustered tables across the entire Snowflake account. ‘SYSTEM$CLUSTERING_INFORMATION’ doesn’t have a built-in way to run against all clustered tables, but it’s straightforward to create a stored procedure to automate this. If you have a lot of clustered tables across all your databases, this could take 4–5 minutes to run on an X-SMALL warehouse.

To use ‘SYSTEM$CLUSTERING_INFORMATION’, a role must have the following privileges:

  • USAGE on the database and schema containing the table.
  • SELECT on the table itself.

You can get the code here: https://github.com/Mitch-Wheat/Snowflake-PoorlyClusteredTables/blob/main/Find%20poorly%20clustered%20tables.sql
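The core of the automation is just building and running one SYSTEM$CLUSTERING_INFORMATION call per clustered table. As a rough Python sketch of the query-building step (the table names below are hypothetical; in practice you would fetch them from a metadata view such as SNOWFLAKE.ACCOUNT_USAGE.TABLES where CLUSTERING_KEY is not null):

```python
# Sketch: build the SYSTEM$CLUSTERING_INFORMATION query text for each
# clustered table. Assumes (database, schema, table) tuples have already
# been fetched from account metadata; the examples here are hypothetical.

def clustering_info_sql(database: str, schema: str, table: str) -> str:
    """Return the query text that reports clustering stats for one table."""
    fqn = f'"{database}"."{schema}"."{table}"'
    return f"SELECT SYSTEM$CLUSTERING_INFORMATION('{fqn}')"

clustered_tables = [
    ("SALES_DB", "PUBLIC", "ORDERS"),      # hypothetical
    ("SALES_DB", "PUBLIC", "LINE_ITEMS"),  # hypothetical
]

for db, sch, tbl in clustered_tables:
    print(clustering_info_sql(db, sch, tbl))
```

The stored procedure in the linked script does the equivalent work server-side, so the whole scan runs in a single call.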

Snowflake: Clustered Tables

(This is Part 1 with a quick overview of the basics of clustering; Part 2 shows how to automatically detect poorly clustered tables.)

Clustering is probably one of the most misunderstood concepts in Snowflake. Snowflake clustering optimizes query performance on large tables (> 1 terabyte) by organizing data into partitions based on specific keys, enabling efficient partition pruning.

Clustering Key Recommendations:

  • Prioritize Filtering Columns: Choose columns used in WHERE clauses.
  • Optimal Cardinality: Aim for a balance. Too few values (e.g., 3-4) or too many (e.g., unique IDs such as UUIDs) won’t prune effectively, and make maintenance costly.
  • Date Truncation: Use DATE_TRUNC('DAY', …) on timestamps to group data efficiently rather than clustering by the minute or second.
  • Limit Key Length: Use a maximum of 3 or 4 columns per key to avoid high maintenance costs.
  • Clustering keys are not intended for all tables due to the costs of initially clustering the data and maintaining the clustering.

When NOT to Cluster:

  • Small tables (< 1TB) that fit into a few micro-partitions.
  • Tables without a clear filter pattern.
  • Tables that are naturally inserted into in date order.
  • Tables with very high update/delete volume (high maintenance overhead).
    [You should consider Hybrid tables for these.]

Snowflake ref.: https://docs.snowflake.com/en/user-guide/tables-clustering-keys

Poor Choices for Clustering Keys:

  • GUIDs or any column containing unique values.
  • VARCHAR columns: Snowflake’s current implementation of clustering for VARCHAR columns only uses the first 5 to 6 bytes of the string. If the initial characters of your strings are identical (e.g., all starting with “HTTPS://”), clustering on the raw column will be ineffective for pruning.
  • Columns with too few distinct values (such as a BOOLEAN column).
  • Columns containing highly skewed data.

Clustering is justified when either:

  • You require the fastest possible response times, regardless of cost.
  • Your improved query performance offsets the credits required to cluster and maintain the table.

If a large table has been clustered with a poor choice of clustering key(s), you could be using extra compute (and therefore credits) in two ways: first, queries scan many more partitions than necessary (effectively a full table scan); second, if Automatic Clustering is enabled, you pay extra credits re-clustering the table. You can track Automatic Clustering costs using the AUTOMATIC_CLUSTERING_HISTORY view to confirm that the query performance benefits outweigh them.

Running SYSTEM$CLUSTERING_INFORMATION produces output similar to this example of a poorly clustered table, which uses UUID columns as clustering keys:

{
"cluster_by_keys" : "LINEAR(COL1_GUID, COL2_GUID)",
"total_partition_count" : 88327,
"total_constant_partition_count" : 116,
"average_overlaps" : 162.4355,
"average_depth" : 86.8738,
"partition_depth_histogram" : {
  "00000" : 0,
  "00001" : 116,
  "00002" : 0,
  "00003" : 0,
  "00004" : 0,
  "00005" : 0,
  "00006" : 0,
  "00007" : 1,
  "00008" : 0,
  "00009" : 102,
  "00010" : 0,
  "00011" : 0,
  "00012" : 1,
  "00013" : 2,
  "00014" : 1,
  "00015" : 0,
  "00016" : 3,
  "00032" : 44,
  "00064" : 4803,
  "00128" : 83359
 }
}

Characteristics of a Well-Clustered Histogram

  • Right-Skewed: The vast majority of partitions should be concentrated in the lower-end buckets, ideally between 0 and 16. 00000 or 00001 is the largest bucket: this indicates most partitions have little to no overlap with others.
  • Low average depth: The average depth is close to 1.
  • Minimal Tail (not left-skewed): Very little data in the higher-numbered buckets. If the buckets 00032, 00064, 00128+ have high counts, the table is not well-clustered.
  • High Constant Partition Count: a large number of partitions are in a “constant” state (meaning the data range is unique to that partition and does not overlap with others).
  • Low average overlaps: The average number of overlapping partitions is low.

A poorly clustered table, in contrast, will have a high percentage of partitions in the higher-numbered buckets (e.g., 00128 or higher), i.e., it is left-skewed.
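These characteristics are easy to check programmatically once you have the JSON in hand. A minimal sketch that parses an abridged copy of the example output above and measures how much of the table sits in the deep buckets (the depth-32 cutoff is my own arbitrary threshold, not a Snowflake rule):

```python
import json

# Abridged SYSTEM$CLUSTERING_INFORMATION output (zero-count buckets omitted).
info = json.loads("""{
  "average_depth": 86.8738,
  "partition_depth_histogram": {
    "00001": 116, "00007": 1, "00009": 102, "00012": 1, "00013": 2,
    "00014": 1, "00016": 3, "00032": 44, "00064": 4803, "00128": 83359
  }
}""")

hist = {int(k): v for k, v in info["partition_depth_histogram"].items()}
total = sum(hist.values())

# Fraction of partitions in the deep buckets (depth >= 32): a high value
# means most partitions overlap heavily, i.e. the table is poorly clustered.
deep_fraction = sum(v for depth, v in hist.items() if depth >= 32) / total

print(f"avg depth {info['average_depth']}, {deep_fraction:.1%} of partitions at depth >= 32")
```

For this example well over 99% of partitions land at depth 32 or more, which matches the average depth of ~87: almost every query touching this table scans a large fraction of its 88,000+ partitions.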

Windows File Splitter

Unlike Linux systems, Windows doesn’t have a built-in command line file splitter. Splitting large files into smaller chunks is something you often want to do when bulk loading into data warehouses (such as Snowflake), so that multiple threads can be used for better bulk loading performance.

I saw a post by Greg Low, “SDU_FileSplit – Free utility for splitting CSV and other text files in Windows”, where he had created one for Windows, but it wasn’t open-source.

I decided to create an open-source file splitter for Windows.

It’s a standalone .NET 5.0 executable; it supports wildcards, can recurse sub-folders (careful with that option!), and can automatically gzip-compress the output files.

Example use:

FileSplitter.exe -i c:\temp\*.csv -m 100000 -c -o c:\temp\TestFileSplitter

This splits every file with extension .csv in the folder c:\temp into chunks of at most 100,000 lines each, gzip-compresses the output (-c), and stores the output files in the folder c:\temp\TestFileSplitter.
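If you just need a quick, dependency-free splitter, the same idea is easy to sketch in Python with the standard library (the chunk naming scheme here is my own convention, not FileSplitter’s):

```python
import gzip
import itertools
from pathlib import Path

def split_file(path, max_lines, out_dir, compress=False):
    """Split a text file into numbered chunks of at most max_lines lines each.

    Returns the list of chunk paths. Chunks are named name_0001.ext, with a
    .gz suffix added when compress=True.
    """
    src, out = Path(path), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    open_out = gzip.open if compress else open  # gzip.open supports text mode
    parts, n = [], 0
    with open(src) as f:
        while True:
            # islice reads the next max_lines lines without loading the file.
            chunk = list(itertools.islice(f, max_lines))
            if not chunk:
                break
            n += 1
            name = out / (f"{src.stem}_{n:04d}{src.suffix}"
                          + (".gz" if compress else ""))
            with open_out(name, "wt") as o:
                o.writelines(chunk)
            parts.append(name)
    return parts
```

A call roughly equivalent to the FileSplitter example (for a single file) would be `split_file(r"c:\temp\data.csv", 100_000, r"c:\temp\TestFileSplitter", compress=True)`; wildcard and recursion support would need to be layered on top with `Path.glob`.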