Snowflake SQL Code Syntax Shortcuts

Snowflake SQL includes several syntax extensions that can simplify your SQL code.

SELECT * EXCLUDE / REPLACE

Enables excluding a column list or transforming target columns when querying wide tables. This is useful on wide tables – stops you having to enumerate 30 columns to leave out or mask one. When combining these keywords, EXCLUDE must always be placed before REPLACE. Ref.

-- Drops sensitive columns while retrieving all others   
SELECT * EXCLUDE (ssn, credit_card_number)
FROM customers;

-- Replace a column value inline, keep everything else
SELECT * REPLACE (price * 1.1 AS price)
FROM products;

-- Combine both
SELECT * EXCLUDE (raw_json) REPLACE (upper(name) AS name)
FROM customers;

QUALIFY Clause

QUALIFY eliminates the most common subquery bloat pattern in analytics SQL. Filters window function results directly, eliminating the need to wrap code inside a nested Common Table Expression (CTE) or subquery in order to filter it. Ref.

Instead of writing:

SELECT 
    * 
FROM 
(
    SELECT 
        employee, 
        department, 
        salary,
        ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
    FROM 
        employees
) WHERE rn = 1;

use this:

SELECT 
   employee_id, 
   department, 
   salary,
   ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
FROM 
    employees
QUALIFY rn = 1;

RESULT_SCAN – reuse cached query results

Query the result set of a previous statement by its query ID – no re-execution needed. This is a common idiom in Snowflake SQL. Ref.

-- Run an expensive query once
SELECT customer_id, SUM(revenue) AS ltv
FROM orders
GROUP BY customer_id;

-- Immediately reuse its result (free — uses cache)
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
WHERE ltv > 10000
ORDER BY ltv DESC
LIMIT 100;   

GROUP BY ALL

Eliminates the need to manually list out every non-aggregated column in your select clause. Ref.

SELECT country, city, department, SUM(sales) 
FROM sales_table 
GROUP BY ALL;

Note*: selecting and grouping by all columns is sometimes a code smell.


SELECT * ILIKE

Case-insensitive pattern matching. ILIKE ANY matches against multiple patterns in one expression. Ref.

-- Case-insensitive LIKE
SELECT * FROM products
WHERE category ILIKE '%widget%';

-- Match against multiple patterns at once
SELECT * FROM products
WHERE category ILIKE ANY ('%widget%', '%thingy%', '%gizmo%');

MIN_BY() and MAX_BY()

Finds the value of a column based on the minimum or maximum value of a completely different column without requiring a self-join. Ref.

-- Finds the name of the employee who has the absolute highest salary
SELECT MAX_BY(employee_name, salary) FROM employees;

ANY_VALUE Aggregate Function

Use ANY_VALUE() to pull an arbitrary value from a non-grouped column. Ref.

SELECT 
    user_id, 
    SUM(purchase_amount) AS total_spent,
    ANY_VALUE(latest_device_used) AS last_device
FROM 
    transactions
GROUP BY 
    user_id;

ARRAY_AGG and OBJECT_AGG

Can be used to simplify string concatenations or unnesting tables into multiple rows. Ref.

  • ARRAY_AGG: Groups multiple values into a single array per row

  • OBJECT_AGG: Pivots row-level attributes into a single JSON/Variant object

    -- Returns a JSON object like: {"Electronics": 1500, "Clothing": 400}
    SELECT 
        store_id, 
        OBJECT_AGG(category_name, total_sales) AS grp
    FROM 
        sales_data
    GROUP BY 
        store_id;

RATIO_TO_REPORT

Calculate each row’s proportion of a partition total without a self-join or subquery. Ref.

SELECT
    region,
    product,
    revenue,
    RATIO_TO_REPORT(revenue) OVER (PARTITION BY region) AS percent_of_region
FROM 
    sales;

Note*: The ORDER BY clause within the OVER clause is allowed in this function for syntactic consistency with other window functions but does not affect the calculation. Snowflake recommends not including the ORDER BY clause when using this function.


SAMPLE / TABLESAMPLE – fast approximate queries

Run queries on a random sample of rows or blocks for fast exploratory analysis. Ref.

-- Row-level sampling (Bernoulli) — ~10% of rows
SELECT * FROM orders SAMPLE (10);

-- Block-level sampling (system) — faster on large tables
SELECT * FROM orders SAMPLE SYSTEM (5);

-- Repeatable sample with seed
SELECT * FROM orders SAMPLE (10) SEED (97);

Note*: For SYSTEM or BLOCK sampling, the sample might be biased, in particular for small tables.


ASOF JOIN – nearest-match join

Join tables on the nearest (as-of) timestamp without expensive range join workarounds. For every row in the left table, it pairs with exactly one row from the right table that has the closest preceding or following timestamp. Ref.

-- Get the exchange rate in effect at the time of each order
SELECT
    o.order_id,
    o.order_ts,
    o.amount_usd,
    r.rate,
    o.amount_usd * r.rate AS amount_eur
FROM 
    orders o
ASOF JOIN fx_rates r
    MATCH_CONDITION (o.order_ts >= r.rate_ts) ON o.currency = r.currency;

INSERT OVERWRITE

Truncate-and-reload a table (or partition) atomically in one statement. Ref.

-- Atomically replace entire table
INSERT OVERWRITE INTO daily_summary
SELECT 
    date_trunc('day', order_ts) AS order_date,
    SUM(revenue) AS total_revenue
FROM 
    orders
GROUP BY 1;  

CREATE OR REPLACE + SWAP

Atomic table replacement: SWAP WITH lets you swap two tables instantly (blue/green deployments). Ref.

-- Rebuild without downtime
CREATE OR REPLACE TABLE orders_new AS
SELECT * FROM orders_staging WHERE is_valid = true;

-- Atomic swap — zero downtime
ALTER TABLE orders_new SWAP WITH orders;

-- Drop the old table (now named orders_new after swap)
DROP TABLE orders_new;

CLONE – zero-copy clone

Instantly clone tables, schemas, or entire databases. Uses no extra storage until data diverges. Ref.

-- Clone a table (instant, no storage cost)
CREATE TABLE orders_backup CLONE orders;

-- Clone to a point in time (time travel)
CREATE TABLE orders_yesterday CLONE orders
    AT (TIMESTAMP => DATEADD('day', -1, CURRENT_TIMESTAMP()));

-- Clone an entire schema for a dev environment
CREATE SCHEMA dev_schema CLONE prod_schema;

-- Clone an entire database
CREATE DATABASE mytestdb_dev_clone CLONE mytestdb_prod;

DIV0 and DIV0NULL

Performs safe division where divide-by-zero errors return 0 or NULL instead of throwing an error. Ref.

SELECT DIV0(10, 0); -- Returns 0

COUNT_IF

Returns the number of records that satisfy a condition or NULL if no records satisfy the condition. Ref.

SELECT 
    COUNT_IF(status = 'completed') AS num_orders_completed
FROM 
    orders

Rather than:

SELECT 
    SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) AS num_orders_completed
FROM 
    orders

Snowflake LLM Assisted Query Monitoring

If you don’t have a Snowflake DBA or you’re a Snowflake DBA that wants to get some assistence across all the databases in your account, you can leverage Snowflake’s built-in LLM functionality to look for and analyze expensive or poorly performing queries.

It can save you time by producing a documented starting point for identifying and addressing problems that might be costing more money than necessary.

Things often fall through the cracks because it’s no one’s job to constantly monitor and improve the query workload. This can result in higher than necessary compute costs and slow queries.

In Snowflake, this might fall into one or more categories (non-exhaustive):

  1. Poorly designed schema
  2. Queries using inefficient anti-patterns (such as NOT IN, or accidental JOIN explosions, …)
  3. Inefficient data loading patterns and long running ELT/ETL processes
  4. Poorly clustered tables and insufficient partition pruning (see Snowflake: Clustered Tables)

I built a project, in GitHub with a MIT License, that leverages Snowflake’s hosted LLM models to analyze your Snowflake query workload. It sends an HTML formatted email to an email address/email distribution list (preferrable).

When using Snowflake Cortex, your data never leaves Snowflake’s security boundaries. Customer data is strictly isolated within your account boundary and is never used to train or fine-tune third-party large language models (LLMs). See Snowflake AI Trust and Safety FAQs

Overview

The overview looks more complicated than it is! It’s basically: collect → analyze → email.

  • No external infrastructure is required. The whole system lives natively inside Snowflake: the DB, the scheduler (Task), the AI, and the email integration.
  • Isolated database + schema. MONITORING.AGENT is its own isolated database+schema that acts as a private audit trail. All four tables are append-only snapshots keyed by run_timestamp – so you get a full history of every monitoring run, not just the latest state. The AGENT_FINDINGS table stores the raw LLM prompt alongside the AI response, which gives you full reproducibility and lets you compare how findings change over time.
  • The execution chain is sequential. QUERY_MONITORING() is the sole entry point. It sets a QUERY_TAG = ‘QUERY_MONITORING’ at the start (so its own queries are excluded), then calls three procs in order: collect → analyze → email.
  • The AI layer sits inside Snowflake entirely. SEND_FINDINGS_TO_CORTEX() converts the query data into a Markdown table, crafts a detailed system prompt (senior DBA persona, output format specified as Outlook-compatible HTML tables, today’s date injected), then calls SNOWFLAKE.CORTEX.AI_COMPLETE() — the LLM model is claude-opus-4-7 by default but configurable. The AI output is stored in AGENT_FINDINGS before being consumed by the email proc.
  • The email is built entirely in SQL. SEND_FINDINGS_EMAIL() assembles a full HTML document then fires SYSTEM$SEND_EMAIL() via the notification integration. The Python to_html_table() helper (Snowpark, Python 3.13) renders Outlook-compatible zebra-striped tables with inline CSS.

Example Output

The format of the HTML output in the email can vary based on the model used. Here’s an example:

Snowflake: Find All Poorly Clustered Tables

(Part 1 gave an overview of Snowflake clustering)

I wanted a way to automatically identify all poorly clustered tables across the entire Snowflake account. ‘SYSTEM$CLUSTERING_INFORMATION’ doesn’t have a built-in way to run against all clustered tables, but it’s straight forward to create a stored procedure to automate this. If you have a lot of clustered tables across all your databases, this could take 4 – 5 minutes to run using a X-SMALL warehouse.

To use ‘SYSTEM$CLUSTERING_INFORMATION’, a role must have the following privileges:

  • USAGE on the database and schema containing the table.
  • SELECT on the table itself.

You can get the code here: https://github.com/Mitch-Wheat/Snowflake-PoorlyClusteredTables/blob/main/Find%20poorly%20clustered%20tables.sql

Snowflake: Clustered Tables

(This is Part 1 with a quick overview of the basics of clustering; Part 2 shows how to automatically detect poorly clustered tables.)

Clustering is probably one of the most misunderstood concepts in Snowflake. Snowflake clustering optimizes query performance on large tables (> 1 terabyte) by organizing data into partitions based on specific keys, enabling efficient partition pruning.

Clustering Key Recommendations:

  • Prioritize Filtering Columns: Choose columns used in WHERE clauses.
  • Optimal Cardinality: Aim for a balance. Too few values (e.g., 3-4) or too many (e.g., unique IDs such as UUIDs) won’t prune effectively, and make maintenance costly.
  • Date Truncation: Use DATE_TRUNC(‘DAY’, …) on timestamps to group data efficiently rather than clustering by the minute or second.
  • Limit Key Length: Use a maximum of 3 or 4 columns per key to avoid high maintenance costs.
  • Clustering keys are not intended for all tables due to the costs of initially clustering the data and maintaining the clustering.

When NOT to Cluster:

  • Small tables (< 1TB) that fit into a few micro-partitions.
  • Tables without a clear filter pattern.
  • Tables that are naturally inserted into in date order.
  • Tables with very high update/delete volume (high maintenance overhead).
    [You should consider Hybrid tables for these.]

Snowflake ref.: (https://docs.snowflake.com/en/user-guide/tables-clustering-keys)

Poor Choices for Clustering Keys:

  • GUIDs or any column containing unique values.
  • VARCHAR columns: Snowflake’s current implementation of clustering for VARCHAR columns only uses the first 5 to 6 bytes of the string. If the initial characters of your strings are identical (e.g., all starting with “HTTPS://”), clustering on the raw column will be ineffective for pruning.
  • Columns with too few values (Such as a BOOLEAN column)
  • Columns containing highly skewed data.

Clustering is justified when either:

  • You require the fastest possible response times, regardless of cost.
  • Your improved query performance offsets the credits required to cluster and maintain the table.

If a large table has been clustered with a poor choice of clustering key(s), you could be using
extra compute (and therefore credits) two ways; firstly, queries scanning many more partitions than necessary (basically an entire table scan), and secondly, if automatic clustering is turned on, extra costs re-clustering the table. Automatic Clustering incurs costs which you can track using the AUTOMATIC_CLUSTERING_HISTORY view to ensure query performance benefits outweigh costs.

It produces information similar to this example of a poorly clustered table, using UUID columns as clustering keys:

{
"cluster_by_keys" : "LINEAR(COL1_GUID, COL2_GUID)",
"total_partition_count" : 88327,
"total_constant_partition_count" : 116,
"average_overlaps" : 162.4355,
"average_depth" : 86.8738,
"partition_depth_histogram" : {
  "00000" : 0,
  "00001" : 116,
  "00002" : 0,
  "00003" : 0,
  "00004" : 0,
  "00005" : 0,
  "00006" : 0,
  "00007" : 1,
  "00008" : 0,
  "00009" : 102,
  "00010" : 0,
  "00011" : 0,
  "00012" : 1,
  "00013" : 2,
  "00014" : 1,
  "00015" : 0,
  "00016" : 3,
  "00032" : 44,
  "00064" : 4803,
  "00128" : 83359
 },
}

Characteristics of a Well-Clustered Histogram

  • Right-Skewed: The vast majority of partitions should be concentrated in the lower-end buckets, ideally between 0 and 16. 00000 or 00001 is the largest bucket: this indicates most partitions have little to no overlap with others.
  • Low average depth: The average depth is close to 1.
  • Minimal Tail (not left skewed): Very little data in the higher-number buckets. If the buckets 00032, 00064, 00128+ have high counts, the table is not well-clustered.
  • High Constant Partition Count: a large number of partitions are in a “constant” state (meaning the data range is unique to that partition and does not overlap with others).
  • Low average overlaps: The average number of overlapping partitions is low.

A poorly clustered table, in contrast, will have a high percentage of partitions in the higher-numbered buckets (e.g., 00128 or higher). i.e. Left Skewed.

Redshift: Generate SQL to Find the Max Length of a Table’s Text Columns

I always end up writing this again as I forget to save it!

It generates the SQL that you then need to execute. Advise caution on very large tables.

-- Generate SQL to find Max length of all of a table's text columns:
SELECT 
	'SELECT ' || 
	LISTAGG('MAX(LEN(' || column_name || ') AS max_' || column_name || ' as ' || column_name, ', ') WITHIN GROUP (ORDER BY ordinal_position) || 
	' FROM ' || table_schema || '.' || table_name
FROM 
	SVV_COLUMNS
WHERE 
	table_schema = 'mySchema' AND table_name= 'myTable'
	AND data_type IN ('character', 'character varying')
GROUP BY
	table_schema,
	table_name
;

Finding Columns with Skewed Data

Queries with parameter sensitive plans can perform poorly when an inappropriate query plan is used.

Even if your statistics are up to date, parameter sensitive plans can be caused by skewed data,
so performing a data skew analysis can identify which filter columns might be involved in poor query plans.

I’ve adapted the code found here into a SQL Server stored procedure that can be run across an entire database, a schema, a single table or just a single column.

It should be relatively easy to convert this to other RDBMS.

Here’s an example of the output when run on the Stackoverflow 2013 downloadable database (approximately 50GB):

SQL Server: Script out all indexes in a database

Kendra Little has a gist to script out all indexes HOW TO SCRIPT OUT INDEXES FROM SQL SERVER but it didn’t include XML or columnstore indexes, so I’ve forked her gist and added a few things to it. I changed the FOR XML/STUFF trick into STRING_AGG() (which is SQL Server 2017 onwards) for no other reason than I’m not working with any instance versions less than that.

The updated gist is here.

SQL Server: Compressing a Table and/or Indexes

I always forget whether the first syntax compresses the NC indexes as well, so posting here so I don’t forget again!

This compresses just the clustered index (i.e. the table data):

-- Just clustered index
ALTER TABLE dbo.Table
REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = PAGE);

This compresses all indexes including the clustered index:

-- All indexes including clustered index
ALTER INDEX ALL ON dbo.Table
REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = PAGE);

Solving the Monty Hall Problem using Bayes Theorem

The ‘Monty Hall’ problem is best described by the wikipedia entry:

The Monty Hall problem is a probability puzzle, loosely based on the American television game show Let’s Make a Deal and named after its original host, Monty Hall. The problem was originally posed (and solved) in a letter by Steve Selvin to the American Statistician in 1975. It became famous as a question from reader Craig F. Whitaker’s letter quoted in Marilyn vos Savant‘s “Ask Marilyn” column in Parade magazine in 1990.

Suppose you’re on a game show, and you’re given the choice of three doors A, B and C.

Behind one door is a car; behind the other two are goats. You pick a door, say door A, and the host, who knows what’s behind the doors, opens another door, say door B, which has a goat. He then says to you, “Do you want to change your choice to door C?”

Is it to your advantage to switch? (It is!)

Many readers of vos Savant’s column refused to believe switching is beneficial and rejected her explanation. After the problem appeared in Parade, approximately 10,000 readers, including nearly 1,000 with PhDs, wrote to the magazine, most of them calling vos Savant wrong.[4] Even when given explanations, simulations, and formal mathematical proofs, many people still did not accept that switching is the best strategy.[5]Paul Erdős, one of the most prolific mathematicians in history, remained unconvinced until he was shown a computer simulation demonstrating vos Savant’s predicted result.[6]

First we need to define some notation:

A, B – events
P(A) – the probability of event A occurring
P(B) – the probability of event B occurring
P(A | B) – the probability of event A occurring, given that event B has already occurred
P(B | A) – the probability of event B occurring, given that event A has already occurred

Bayes Theorem is defined as:

The following reasoning is from Julian Havil’s book “Impossible?”

Assign symbols to the events:

A – the event “car is behind door A”
B – the event “car is behind door B”
C – the event “car is behind door C”
MA – the event “Monty opens door A” … similarly for MB , MC

Assume door A is chosen initially by the player, so Monty can open door B or C:

P(MB | A) = ½ ,   P(MB | B) = 0,   P(MB | C) = 1

So, since A, B and C are mutually exclusive events:

P(MB) = P(MB | A)P(A) + P(MB | B)P(B) + P(MB | C)P(C) = ½ x ⅓ + 0 x ⅓ + 1 x ⅓ = ½ 

Now, the player can stick or change. If they stick with door A, their probability of winning the car is:

P(A | MB) = P(MB | A)P(A) / P(MB) = (½ x ⅓) / ½ = 

If they switch to door C, their probability of winning the car is:

P(C | MB) = P(MB | C)P(C) / P(MB) = (1 x ⅓) / ½ =

Performance Improvements in .NET 6

There have been a large number of Performance Improvements in .NET 6 as evidenced in Stephen Toub’s blog post. Most of the time we don’t care about assembly level performance optimisations because the bottleneck is usually accessing some external resource, such as a database or web service.

If you need to benchmark .NET code take a look at a great tool, BenchmarkDotNet and also take a look at the book referenced there, Pro .NET Benchmarking as getting benchmarking correct can sometimes be quite tricky.

If you’re not yet on .NET 6, and you have code in a large loop that you want to squeeze some performance out of it (and it won’t make the code hard to understand and maintain!) here are a couple of simple tips:

  • Testing if n is even: replace (n % 2 == 0) with ((n & 1) == 0)
  • Dividing by 2: replace n / 2 by n >> 1 (but be aware of the unsigned / signed behaviour of right shift)