Common Table Expression

Does Using TOP In A SQL Server Common Table Expression Materialize It?

Posted on March 22, 2022May 16, 2022 by Erik Darling

Nope.

The sad news for you here is that nothing aside from selecting a CTE into a real or temporary table will materialize the result of the query within in it.

WITH 
    cte AS
(
    SELECT TOP (1)
        u.Id
    FROM dbo.Users AS u
)
SELECT
    c1.*
FROM cte AS c1
JOIN cte AS c2
    ON c1.Id = c2.Id
JOIN cte AS c3
    ON c1.Id = c3.Id;

This query will still have to touch the Users table three times. I’ve blogged about this part before, of course.

Bounce

You may notice something interesting in there, though, once you get past the disappointment of seeing three scans of the Users table.

Each scan is preceded by the TOP operator. This can sometimes be where people confuse the behavior of TOP in a Common Table Expression or Derived Table.

It’s not a physical manifestation of the data into an object, but (at least for now) it is a logical separation of the query semantics.

In short, it’s a fence.

The reason why it’s a fence is because using TOP sets a row goal, and the optimizer has to try to meet (but not exceed) that row goal for whatever part of the query is underneath it.

Strange

Take this query for example, which loads a bunch of work into a Common Table Expression with a TOP in it:

WITH
    Posts AS
(
    SELECT TOP (1000)
         p.*
    FROM dbo.Posts AS p
    WHERE p.PostTypeId = 1
    AND   p.Score > 500
    AND   EXISTS
          (
              SELECT
                  1/0
               FROM dbo.Users AS u
               WHERE u.Id = p.OwnerUserId
          )
    AND EXISTS
        (
            SELECT
                1/0
            FROM dbo.Badges AS b
            WHERE b.UserId = p.OwnerUserId
        )
    ORDER BY p.Score DESC
)
SELECT
    u.DisplayName,
    p.PostTypeId,
    p.Score
FROM dbo.Users AS u
JOIN Posts AS p
    ON p.OwnerUserId = u.Id
ORDER BY u.Reputation DESC;

And the plan for it looks like this:

All the work within the Common Table Expression is fenced by the top.

There are many times you can use this to your advantage, when you know certain joins or predicates can produce a very selective result.

Care Control

As a final note, just be really careful how you position your TOPs. They insert loads of semantic differences to the query.

If you don’t believe me, put a TOP in the wrong place and watch your results change dramatically.

Thanks for reading!

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Software Vendor Mistakes With SQL Server: Misusing Common Table Expressions

Posted on January 21, 2022May 16, 2022 by Erik Darling

Stinko

In this post, I’m gonna show you how stringing together a bunch of CTEs can cause performance problems with one of my paid training videos. If you like it, hit the link below to get 75% off the entire bundle.

Thanks for reading!

Going Further

Things SQL Server vNext Should Address: Common Table Expression Materialization

Posted on May 25, 2021May 16, 2022 by Erik Darling

Repetition Is Everything

I know what you’re thinking: this is another post that asks for a hint to materialize CTEs.

You’re wrong. I don’t want another hint that I can’t add to queries to solve a problem because the code is coming from a vendor or ORM.

No, I want the optimizer to smarten up about this sort of thing, detect CTE re-use, and use one of the New And Improved Spools™ to cache results.

Let’s take a look at where this would come in handy.

Standalone

If we take this query by itself and look at the execution plan, it conveniently shows one access of Posts and Users, and a single join between the two.

SELECT
    u.Id AS UserId,
    u.DisplayName,
    p.Id AS PostId,
    p.AcceptedAnswerId,
    TotalScore = 
        SUM(p.Score)
FROM dbo.Users AS u
JOIN dbo.Posts AS p
    ON p.OwnerUserId = u.Id
WHERE u.Reputation > 100000
GROUP BY 
    u.Id, 
    u.DisplayName,
    p.Id, 
    p.AcceptedAnswerId
HAVING SUM(p.Score) > 1

Now, let’s go MAKE THINGS MORE READABLE!!!

Ality

WITH spool_me AS
(
    SELECT
        u.Id AS UserId,
        u.DisplayName,
        p.Id AS PostId,
        p.AcceptedAnswerId,
        TotalScore = 
            SUM(p.Score)
    FROM dbo.Users AS u
    JOIN dbo.Posts AS p
        ON p.OwnerUserId = u.Id
    WHERE u.Reputation > 100000
    GROUP BY 
        u.Id, 
        u.DisplayName,
        p.Id, 
        p.AcceptedAnswerId
    HAVING SUM(p.Score) > 1
)
SELECT
    a.UserId,
    a.DisplayName,
    a.PostId,
    a.AcceptedAnswerId,
    a.TotalScore,
    q.UserId,
    q.DisplayName,
    q.PostId,
    q.AcceptedAnswerId,
    q.TotalScore
FROM spool_me AS a
JOIN spool_me AS q
    ON a.PostId = q.AcceptedAnswerId
ORDER BY a.TotalScore DESC;

Wowee. We really done did it. But now what does the query plan look like?

There are now two accesses of Posts and two accesses of Users, and three joins (one Hash Join isn’t in the screen cap).

Detection

Obviously, the optimizer knows it has to build a query plan that reflects the CTE being joined.

Since it’s smart enough to do that, it should be smart enough to use a Spool to cache things and prevent the additional accesses.

Comparatively, using a #temp table to simulate a Spool, is about twice as fast. Here’s the CTE plan:

Here’s the Spool Simulator Plan™

Given the optimizer’s penchant for spools, this would be another chance for it to shine on like the crazy diamond it is.

Thanks for reading!

Going Further

CTEs Don’t Make Queries More Readable, Formatting Does

Posted on May 14, 2021May 16, 2022 by Erik Darling

Kākāpō!

One line I see over and over again — I’ve probably said it too when I was young and needed words to fill space — is that CTEs make queries more readable.

Personally, I don’t think they make queries any more readable than derived tables, but whatever. No one cares what I think, anyway.

Working with clients I see a variety of query formatting styles, ranging from quite nice ones that have influenced the way I format things, to completely unformed primordial blobs. Sticking the latter into a CTE does nothing for readability even if it’s commented to heck and back.

There are a number of options for formatting code:

Good Standing

Formatting your code nicely doesn’t just help others read it, it can also help people understand how it works.

Take this example from sp_QuickieStore that uses the STUFF function to build a comma separated list the crappy way.

If STRING_AGG were available in SQL Server 2016, I’d just use that. Darn legacy software.

The text I added probably made things less readable, but formatting the code this way helps me make sure I have everything right.

The opening and closing parens for the STUFF function
The first input to the function is the XML generating nonsense
The last three inputs to the STUFF function that identify the start, length, and replacement text

I’ve seen and used this specific code a million times, but it wasn’t until I formatted it this way that I understood how all the pieces lined up.

Compare that with another time I used the same code fragment in sp_BlitzCache. I wish I had formatted a lot of the stuff I wrote in there better.

With things written this way, it’s really hard to understand where things begin and end and that arguments belong to which part of the code.

Maybe someday I’ll open an issue to reformat all the FRK code ?

Thanks for reading!

Going Further

Common Table Expression Myths In SQL Server: Materialization

Posted on July 13, 2020May 16, 2022 by Erik Darling

I Got Five On It

I wish I had a dollar for every wrong thing I’ve heard about CTEs in my life. I’d buy a really nice cigar and light it with fire made by the gods themselves.

Or, you know, something like that.

One common thing is around some persistence of the queries contained inside of them in some form, whether in memory or in tempdb or something else.

I honestly don’t know where these things begin. They’re not even close to reality.

Getting It Right

Let’s take this query as an example:

SELECT u.Id, 
       u.Reputation
FROM dbo.Users AS u
WHERE u.Reputation * 2 = 22;

If you’ve been tuning queries for longer than a day, you can probably spot the issue here.

Applying expressions to columns in the where clause (or joins) messes up some things. Unfortunately, you can also run into the exact same issues doing this:

WITH cte AS 
(
SELECT u.Id, 
       u.Reputation, 
       (u.Reputation * 2) AS ReputationDoubler
FROM dbo.Users AS u
)
SELECT c.Id,
       c.Reputation
FROM cte AS c
WHERE c.ReputationDoubler = 22;

To be explicit: both of these queries have the same problem.

Erik D Is President

Starting with this index:

CREATE INDEX toodles ON dbo.Users(Reputation);

Both queries have the same execution plan characteristics:

I understand why you think a mature database product might be able to deal with this better:

Locate values in the index with a value of 11
Divide the literal value by 2 instead

But SQL Server doesn’t have anything like that, and neither do CTEs. Both indexes get scanned in entirety to retrieve qualifying rows, with the unseekable expression applied as a residual predicate:

SQL Server Query Plan Tool Tip — Day Planner

Gopherville

To be clear, and hopefully to persuade you to write clear predicates, this is the end result that we’re after:

SELECT u.Id, 
       u.Reputation
FROM dbo.Users AS u
WHERE u.Reputation = 11;

While this is of course intuitive when writing simple queries, the point of this post is to show that expressions in CTEs don’t offer any advantage.

This goes for any flavor of derivation, too. Whether it’s wrapping columns in built in or user defined functions, combining columns, combining columns with values, etc.

Thanks for reading!

Going Further

What Would Materialized CTEs Look Like In SQL Server?

Posted on August 14, 2019May 14, 2022 by Erik Darling

Strange Expectations

A lot of people still expect odd things from CTEs.

Performance fences
Cached results

There’s no clue in how they’re written that you won’t get those.

I’ve gone back and forth on whether or not this would be worthwhile. It totally could be, but it’d have to be pretty thoughtful.

Materialization vs. Fencing

The difference here is subtle but necessary. Right now, people will use TOP, which sets a row goal, and provides some logical isolation of the query in your CTE.

The problem remains that if that CTE is referenced via join > 1 time, the internal syntax is re-run each time.

Even if your query is fenced off, it is not materialized.

Fencing could leverage existing NOEXPAND hints, but materialization would likely require a new hint that performed the equivalent of SELECT… INTO #t, and then replaced references to the CTE alias with a pointer to the temporary object.

Indexing

One appeal of temp tables is that there is additional indexing flexibility, so any syntax would have to allow existing inline index syntax of temp tables to be used.

In other words, an index that may not make sense on a real table given your existing workload might make sense on a temp table. Or like, if a temp table is the result of joining two tables together, there could be a compound index you could create on the temp table that’s otherwise impossible to create.

Next feature request: multi-table indexes ?

Thanks for reading!