Understand Your Plan: Getting Data With A Scan Operator

They Live


I’m not sure why scans got such a bad rap. Perhaps it’s a leftover from the Bad Old Days© when people worried about logical fragmentation and page splits.

What I mean to say is: scans are often treated with a level of revulsion and focus that distracts people from larger issues, many of which come from some malpractice on their part.

Sure, scans can be bad. Sometimes you do need to fix them. But make that decision when you’re looking at an actual execution plan, and not just making meme-ish guesses based on costs and estimates.

You’re better than that.

It Follows


Let’s start with a simple query:

SELECT TOP (1)
    p.*
FROM dbo.Posts AS p;

There's no WHERE clause and no join, so we'll get a scan, but it's driven by the TOP.

If we look at the actual execution plan and hit F4 to get the Properties of the clustered index scan, we'll see it does a minimal number of reads.

[Query plan image: scooting]

The four reads it takes to get a single row are certainly nowhere near the number of reads it would take to read this entire Posts table.

The reason it's not one read is that the Posts table contains the nvarchar(max) Body column, which leads to reading additional off-row/LOB pages.
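
If you want to see those reads for yourself, here's a minimal sketch, assuming you're playing along in one of the StackOverflow databases. STATISTICS IO prints the logical reads, including the lob logical reads that come from the Body column.

SET STATISTICS IO ON;

SELECT TOP (1)
    p.*
FROM dbo.Posts AS p;

SET STATISTICS IO OFF;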

C.H.U.D


Let’s look at a situation where you might see a clustered index scan(!), a missing index request(!), and think you’ll have the problem solved easily(!), but… no.

SELECT TOP (1000)
    u.DisplayName,
    p.Score,
    _ = dbo.InnocentScan(p.OwnerUserId)
FROM dbo.Posts AS p
JOIN dbo.Users AS u
    ON p.OwnerUserId = u.Id
WHERE p.Score = 1
ORDER BY p.Id;

Costs are really wacky sometimes. Like here.

[Query plan image: exhibit a]

In this really unhelpful query plan (thanks, SSMS, for not showing what the clustered thing is), we have:

  • A clustered index scan of Posts
  • A clustered index seek of Users

The optimizer tells us that adding an index will reduce the estimated effort needed to execute the query by around 18%. Mostly because the estimated effort needed to scan the clustered index is 19% of the total operator costs.

CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[Posts] ([Score])
INCLUDE ([OwnerUserId])

Got it? Missing index requests don’t care about joins.

Geistbusters


If one were to look at the actual execution plan, one may come to the conclusion that an index here would not solve a gosh. darn. thing.

[Query plan image: plan, actually]

21 seconds is spent in the operator that costs 0%, and less than 40 milliseconds is spent between the two operators that make up 100% of the plan cost.

Very sad for you if you thought that missing index request would solve all your problems.
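
If you're wondering where those 21 seconds actually go, the prime suspect is that dbo.InnocentScan scalar UDF in the select list. A quick sanity check, sketched under that assumption: run the same query with the UDF call swapped out for the bare column and compare operator times.

SELECT TOP (1000)
    u.DisplayName,
    p.Score,
    _ = p.OwnerUserId --dbo.InnocentScan call removed
FROM dbo.Posts AS p
JOIN dbo.Users AS u
    ON p.OwnerUserId = u.Id
WHERE p.Score = 1
ORDER BY p.Id;

If the 21 seconds disappear along with the function, no missing index was ever going to save you.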

What About Scans?


I know, the title of this post is about retrieving data via an index scan, but we ended up talking about how scans aren’t always the root of performance issues.

It ended up that way because as I was writing this, I had to help a client with an execution plan where the problem had nothing to do with the clustered index scan they were really worried about.

To call back to a few other points about things I’ve made so far in this series:

  • The operator times in actual execution plans are the most important thing for you to look at
  • If you're not getting a seek when you think you should, ask yourself some questions:
      • Do I have an index that I can actually seek to data in?
      • Do I have a query written to take advantage of my indexes?

If you’re not sure, hit the link below. This is the kind of stuff I love helping folks out with.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Understand Your Plan: Getting Data With A Seek Operator

Rumble, Young Man, Rumble


People. People complain. People complain about SQL Server. That’s probably why I get paid to deal with it, but whatever.

One complaint I get to hear week in and week out is that SQL Server isn’t using someone’s index, or that there are too many index scans and they’re slow.

That might actually be a composite of like twelve complaints, but let’s not start counting.

Usually when we start examining the queries, query plans, and indexes for these renegades, the reasons for the lack of a seek become apparent.

  • There’s no good index to seek to
  • The query is written in a way that seeks can’t happen
  • A predicate is on two columns

Desperately Seeking Susan


In a query that doesn’t have any of those problems, you’ll naturally get a seek and feel really good about yourself.

CREATE INDEX v ON dbo.Votes(PostId)
WITH(SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

SELECT
    c = COUNT_BIG(*)
FROM dbo.Votes AS v
WHERE v.PostId = 194812;

We’re set for success! And look how happy things are. Happy little query plans.

[Query plan image: rookie card]

Sargy Bargy


You’re smart people. You’re on top of things. You know that without an index on PostId, the query up above wouldn’t have anything to seek to.

Useful indexes are half the battle. The other half of the battle is not screwing things up.

I’m going to use dynamic SQL as shorthand for any parameterized query. I should probably add that using a local variable would only make things worse.

Don’t do that. Or this.

DECLARE
    @sql nvarchar(MAX) = N'',
    @PostId int = 194812;

SELECT
    @sql = N'
SELECT
    c = COUNT_BIG(*)
FROM dbo.Votes AS v
WHERE (v.PostId = @PostId 
OR     @PostId IS NULL);';

EXEC sys.sp_executesql
    @sql,
    N'@PostId int',
    @PostId;

While we’re on the topic, don’t do this either.

DECLARE
    @sql nvarchar(MAX) = N'',
    @PostId int = 194812;

SELECT
    @sql = N'
SELECT
    c = COUNT_BIG(*)
FROM dbo.Votes AS v
WHERE v.PostId = ISNULL(@PostId, v.PostId);';

EXEC sys.sp_executesql
    @sql,
    N'@PostId int',
    @PostId;

Here are the query plans for these:

[Query plan image: practice makes]

We end up not only scanning the entire index unnecessarily, but the second query also gets a really unfavorable cardinality estimate.

It’s amazing how easy it is to ruin a perfectly good seek with lazy query writing, isn’t it?
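
For contrast, here's a sketch of dynamic SQL that keeps the seek: only append the predicate when there's actually a value to search for. The WHERE 1 = 1 bit is just there so predicates can be appended uniformly.

DECLARE
    @sql nvarchar(MAX) = N'',
    @PostId int = 194812;

SELECT
    @sql = N'
SELECT
    c = COUNT_BIG(*)
FROM dbo.Votes AS v
WHERE 1 = 1';

IF @PostId IS NOT NULL
BEGIN
    SELECT
        @sql += N'
AND    v.PostId = @PostId';
END;

EXEC sys.sp_executesql
    @sql,
    N'@PostId int',
    @PostId;

Each variation of supplied parameters gets its own plan in the cache, and the version with a real @PostId gets its seek back.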

Joint Venture


When you compare two columns in the same table, you might not always see a seek, even if you have the best index ever.

Let’s use this as an example:

CREATE INDEX p ON dbo.Posts(OwnerUserId, LastEditorUserId)
WITH(SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

SELECT
    c = COUNT_BIG(*)
FROM dbo.Posts AS p
WHERE p.OwnerUserId = p.LastEditorUserId;

We have an index that matches our predicate, and you might think that having all that data very nicely in order would make SQL Server’s job really easy to match those columns up.

But no. No in every language.

[Query plan image: scrab]

Big ol' index scan. Even if you try to force the matter, SQL Server says no nein nyet non and all the rest.

SELECT
    c = COUNT_BIG(*)
FROM dbo.Posts AS p WITH(FORCESEEK)
WHERE p.OwnerUserId = p.LastEditorUserId;
Msg 8622, Level 16, State 1, Line 56
Query processor could not produce a query plan because of the hints defined in this query. 
Resubmit the query without specifying any hints and without using SET FORCEPLAN.

Who can deal with all that?

Hail Mary


Let’s really dig our heels in and try to make this work. We’ll create an index on both columns individually and see how things go.

CREATE INDEX p ON dbo.Posts(OwnerUserId)
WITH(SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

CREATE INDEX pp ON dbo.Posts(LastEditorUserId)
WITH(SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

[Query plan image: woof]

Just to be clear, this is horrible. You don’t want this to happen. This sucks, and if you like it you should jump in a deep, deep hole.
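
If you genuinely need a column-to-column predicate like this to be seekable, one workaround is an indexed computed column that pre-answers the comparison. This is a sketch, not a recommendation; the column and index names are made up.

ALTER TABLE dbo.Posts
ADD SelfEdited AS 
    CONVERT(bit, CASE WHEN OwnerUserId = LastEditorUserId 
                      THEN 1 ELSE 0 END);

CREATE INDEX se ON dbo.Posts(SelfEdited)
WITH(SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

SELECT
    c = COUNT_BIG(*)
FROM dbo.Posts AS p
WHERE p.SelfEdited = 1;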

Ends Well


Seeks are often the best possible outcome when a small number of rows are sought. OLTP workloads are the prime candidates for seeks. A seek that reads a large portion of the table isn't necessarily agreeable.

For everyone else, there's nothing wrong with scans. Especially with columnstore indexes, where you shouldn't expect to seek to a gosh darn thing.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Understand Your Plan: Reading Operator Times In SQL Server Execution Plans

Wiseacre


When I was sitting down to figure out what to write about this, one of the topics that I thought would be interesting was the width of lines/arrows in query plans.

I sat here for half a day pecking here and there getting the post started, writing some demos, and… realized I couldn’t remember the last time I really looked at them. I could explain it, but… I’d be lying to you if I said I really cared.

Unless you have an actual execution plan, line thickness is based on estimates. They could be totally wrong, and chances are that if you’ve got a query plan in front of you it’s not because it’s running well.

If you’re going to run it and look at the actual execution plan anyway, you’d have to be seven levels obtuse to look at line widths instead of operator times.

They’re your best friends these days, and help you focus on the parts of the plan that need help.

Looney Tunes


At this point in 2022, most people are going to be looking at plans running only in row mode. As folks out there start to embrace:

  • Column store
  • SQL Server 2019 Enterprise Edition
  • Compatibility level 150+

We’ll start to see more batch mode operators, especially from batch mode on row store. I only mention this because row mode and batch mode operators track time differently, and you need to be really careful when analyzing operator times in actual execution plans.

In plans that contain a mix of row mode and batch mode operators, timing might look really funny in some places.

Let’s chat about that!

Row Mode


As of this writing, all row mode query plan operators accumulate time, going from right to left. That means each operator tracks its own time, along with the time of all the child operators under it.

This is useful for following time accumulation through a query plan to the point where it spikes. For example, if a scan reports 5 seconds and the join directly above it reports 7 seconds, the join itself accounts for roughly 2 of those seconds. I sort of like this because it makes tracking things down a bit easier than the per-operator times in batch mode plans.

Let’s use this plan as an example:

[Query plan image: oh i forgot it's summertime]

Yes, it’s intentionally bad. I made it that way. You’re welcome. There are two things I’d like to point out here:

  • Just about every line looks equally thick
  • Costing is a bit weird, aside from the Eager Index Spool

But you know what? If you look at operator times, you can get a pretty good idea about where things went wrong.

  • Maybe that Key Lookup wasn’t the greatest use of time
  • Boy howdy, that eager index spool took about 90 seconds

I’m not gonna go into how to fix this, I just want you to understand where time is taken, and how it adds up across operators.

Batch Mode


In batch mode plans, the story is a whole lot different.

[Query plan image: BMOR than you can be]

Times for operators are all over the place. They're not wrong, they're just different. Each operator that executes in batch mode tracks its own time, without the child operator times.

Nothing adds up until the very end, when you hit the Gather Streams operator. Right now, none of the parallel exchange operators can run in batch mode, so they’ll add up all of the child operator times.

If you’ve got a plan with a lot of operators mixed between row and batch mode, things will look even weirder.
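
If you'd rather not hover over every operator in SSMS, you can scrape the times out of the plan XML instead. Here's a rough sketch, assuming you've pasted an actual plan's XML into the variable; ActualElapsedms is tracked per thread in the showplan schema, so take the MAX across threads.

DECLARE
    @plan xml = N'<ShowPlanXML xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan" />'; --paste a real plan here

WITH XMLNAMESPACES
(DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
    x.operator,
    x.node_id,
    elapsed_ms = MAX(x.elapsed_ms)
FROM
(
    SELECT
        operator = o.n.value('@PhysicalOp', 'nvarchar(60)'),
        node_id = o.n.value('@NodeId', 'int'),
        elapsed_ms = t.r.value('@ActualElapsedms', 'bigint')
    FROM @plan.nodes('//RelOp') AS o(n)
    CROSS APPLY o.n.nodes('./RunTimeInformation/RunTimeCountersPerThread') AS t(r)
) AS x
GROUP BY
    x.operator,
    x.node_id;

Remember the row mode vs. batch mode caveat from above when you read the results: row mode numbers include children, batch mode numbers don't.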

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Getting The Top Value Per Group With Multiple Conditions In SQL Server: Row Number vs. Cross Apply With MAX

Throat Music


In yesterday's post, we compared a simple situation: finding the highest scoring post for each user.

In today’s post, we’re going to add another condition: we want the highest scoring post for each type of post someone has made.

[Image: the list of post types, captioned "typos"]

Now look, most people don't get involved with most of these post types, but whatever. It just poses an interesting and slightly more complicated problem.

Slightly Different Index


Since we’re going to be using PostTypeId in the window function, we need it in the key of our index:

CREATE INDEX p ON dbo.Posts(OwnerUserId, PostTypeId, Score DESC)
WITH(SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

Now our query looks like this:

SELECT
    u.DisplayName,
    u.Reputation,
    p.PostTypeId,
    p.Score
FROM dbo.Users AS u
JOIN
(
    SELECT
        p.*,
        n = 
            ROW_NUMBER() OVER
            (
                PARTITION BY
                    p.OwnerUserId,
                    p.PostTypeId --This is new!
                ORDER BY
                    p.Score DESC
            )
    FROM dbo.Posts AS p
) AS p
    ON  p.OwnerUserId = u.Id
    AND p.n = 1
WHERE u.Reputation > 50000
ORDER BY 
    u.Reputation DESC,
    p.Score DESC;

S’good? S’good. Let’s go.

Row Number Query Plan


Similar to yesterday’s plan, this one is rather slow, rather serial, and generally not how we want to be spending our precious time.

[Query plan image: scam]

Let’s look at the apply method, because we have to change our query a little bit to accomplish the same thing.

Cross Apply With MAX


Rather than go with TOP (1), we’re going to GROUP BY OwnerUserId and PostTypeId, and get the MAX(Score).

SELECT
    u.DisplayName,
    u.Reputation,
    p.PostTypeId,
    p.Score
FROM dbo.Users AS u
CROSS APPLY
(
    SELECT 
        p.OwnerUserId,
        p.PostTypeId,
        Score = MAX(p.Score)
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = u.Id
    GROUP BY 
        p.OwnerUserId, 
        p.PostTypeId
) AS p
WHERE u.Reputation > 50000
ORDER BY 
    u.Reputation DESC,
    p.Score DESC;

This will give us the same results, but a lot faster. Again.

Cross Apply Query Plan


Like I was saying…

[Query plan image: time is extremely valuable]

Down to ~400ms now. Not bad, right?

Lower Selectivity


If we take those same queries and lower the reputation filter from 50,000 to 1, some interesting changes to the query plans happen.

[Query plan image: years ago]

Repartition Streams and I have had some problems in the past. It's not necessarily to blame; it just has a tough time with some data distributions, especially, it seems, when it's order preserving.

The cross apply with aggregation works really well. It’s kinda neat that both queries get slower by the same amount of time, but the ROW_NUMBER query is still much, much slower.

All of this is interesting and all, but you know what? We haven't looked at batch mode. Batch mode fixes everything.

Sort of. Don’t quote me on that. It’s just really helpful in the kind of BIG QUERY tuning that I end up doing.
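
If you're on SQL Server 2019 Enterprise Edition with compatibility level 150, batch mode on rowstore can kick in on its own. On earlier versions, the usual nudge is an empty filtered columnstore index. This is a sketch with a made-up index name; the contradictory WHERE clause means the index never holds rows, it just makes queries on the table eligible for batch mode.

CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_batch_enabler
ON dbo.Posts(Id)
WHERE Id = -1 AND Id = -2;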

Batch Mode


This is the only thing that makes the ROW_NUMBER query competitive in this scenario, owing to the fact that batch mode often removes Repartition Streams, and we’re eligible for the Window Aggregate operator.

[Query plan image: further reductions]

I’m dumping these results to #temp tables because I don’t want to wait for SSMS to render the large result set, but you can still see the positive overall effect.
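
The dump looks something like this for the CROSS APPLY query; the #temp table name is made up, and the ORDER BY is gone because it wouldn't mean anything for a #temp table dump anyway.

SELECT
    u.DisplayName,
    u.Reputation,
    p.PostTypeId,
    p.Score
INTO #apply_results
FROM dbo.Users AS u
CROSS APPLY
(
    SELECT 
        p.OwnerUserId,
        p.PostTypeId,
        Score = MAX(p.Score)
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = u.Id
    GROUP BY 
        p.OwnerUserId, 
        p.PostTypeId
) AS p
WHERE u.Reputation > 1;

DROP TABLE #apply_results;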

The poorly performing ROW_NUMBER query is now very competitive with the CROSS APPLY query.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Getting The Top Value Per Group In SQL Server: Row Number vs. Cross Apply Performance

Introductions


There are many ways to express queries in SQL. How different rewrites perform will largely be a function of:

  • You not doing anything ridiculous
  • Queries having good indexes in place
  • The optimizer not seeing through your tricks and giving you the same query plan

The first rule of rewrites is that they have to produce the same results, of course. Logical equivalency is tough.

In today and tomorrow’s posts I’m going to compare a couple different scenarios to get the top value.

There are additional ways to rewrite queries like this, of course, but I’m going to show you the most common anti-pattern I see, and the most common solution that tends to work better.

Right And Proper Indexing


For today’s post, we’re going to use this index:

CREATE INDEX p ON dbo.Posts(OwnerUserId, Score DESC) INCLUDE(PostTypeId)
WITH(SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

Which is going to give this query proper support. Sure, we could also add an index to the Users table, but the one scan is trivially fast, and probably not worth it here.

SELECT
    u.DisplayName,
    u.Reputation,
    p.PostTypeId,
    p.Score
FROM dbo.Users AS u
JOIN
(
    SELECT
        p.*,
        n = 
            ROW_NUMBER() OVER
            (
                PARTITION BY
                    p.OwnerUserId
                ORDER BY
                    p.Score DESC
            )
    FROM dbo.Posts AS p
) AS p
    ON  p.OwnerUserId = u.Id
    AND p.n = 1
WHERE u.Reputation > 50000
ORDER BY 
    u.Reputation DESC,
    p.Score DESC;

The general idea is to find all users with a reputation over 50,000, along with their highest scoring post.

I know, you’re looking at this and thinking “jeez Erik, why are you selecting * here? don’t you know how bad and dumb you are for that?”

Well, SQL Server is smart enough to ignore that and only deal with the columns in the outer select.

The Query Plan


If you create the index and run this, the query plan looks something like this:

[Query plan image: pity]

The majority of the time spent in this plan is the ~11 seconds between the scan of the Posts table and the Filter operator.

The filter is there to remove rows where the result of the ROW_NUMBER function is greater than 1.

I guess you could say less than 1, too, but ROW_NUMBER won’t produce rows with 0 or negative numbers naturally. You have to make that happen by subtracting.

A Better Query Pattern?


Since only ~2400 rows are leaving the Users table, and we have a good index on the Posts table, we want to take advantage of it.

Rather than scan the entire Posts table, generate the ROW_NUMBER, apply the filter, then do the join, we can use CROSS APPLY to push things down to where we touch the Posts table.

SELECT
    u.DisplayName,
    u.Reputation,
    p.PostTypeId,
    p.Score
FROM dbo.Users AS u
CROSS APPLY
(
    SELECT TOP (1)
        p.*
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = u.Id
    ORDER BY p.Score DESC
) AS p
WHERE u.Reputation > 50000
ORDER BY 
    u.Reputation DESC,
    p.Score DESC;

This is logically equivalent, but it blows the other query out of the water, performance-wise.

A Better Query Plan?


Here’s the query plan for the cross apply query:

[Query plan image: nice nice]

Why Is This Better?


In this case, having a good index to use, and a small outer result from the Users table, the cross apply query is way better.

This is also due to the Id column of Users being the primary key of the table. For this sort of one-to-many join, it works beautifully. If it were a many-to-many scenario, it could be a toss-up, or ROW_NUMBER could blow it out of the water.

The way this type of Nested Loops Join works (Apply Nested Loops), is to take each row from the outer input (Users table) and seek to it in the Posts table.

Without that good index up there, this would likely be a disaster with an Eager Index Spool in the plan. We definitely don't want that here.

But you know, there are many different types of posts. We might want to know someone’s high score for questions, answers, and more.

In tomorrow’s post, we’ll look at how to do that, and performance tune the query.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Software Vendor Mistakes With SQL Server: Using Left Joins To Find Rows That Don’t Exist

Natural Language


In SQL Server’s T-SQL, and in the more general ANSI-Standard SQL, you’re supposed to write queries in a way that mimics how you’d ask the same question — just don’t call it a query — in English. Because of that, there are some ways to phrase a query that are more natural than others.

Some are more intuitive once you get them down, and others can bewitch you for decades. For example, I’m still not always awesome at reasoning out INTERSECT and EXCEPT queries, or even thinking of them first when writing a query where they’d be useful.

Maybe someday.
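
For the record, here's a quick sketch of what those two do, using tables from this series. Both compare whole rows and treat NULLs as equal, which is part of what makes them handy.

--Users who have left at least one comment
SELECT u.Id FROM dbo.Users AS u
INTERSECT
SELECT c.UserId FROM dbo.Comments AS c;

--Users who have never commented
SELECT u.Id FROM dbo.Users AS u
EXCEPT
SELECT c.UserId FROM dbo.Comments AS c;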

Dirty Pig Latin


Bad advice exists in many places in the world. I don’t mean to single out the internet, though it certainly has made the publishing and proliferation of bad advice much more accessible.

I do a lot of reading about databases in general, and SQL Server specifically, to see what other folks are out there writing about and teaching others. One of my favorite arrondissements of that world is the stuff that attracts beginners.

After all, that’s when you develop habits, good or bad. Much of this series focuses on the bad habits learned in that time, and how they muck up performance.

One of those bad habits I see over and over again is using LEFT JOINs to find rows that don’t exist. The reason I focus on these is because of the effect they have on query plans, due to the way that queries have to be logically processed.

Let’s look at that, first.

Who’s Not On First?


A quite common looking query for doing this (if you want to do it wrong) would look something like this:

SELECT
   c = COUNT_BIG(*)
FROM dbo.Users AS U
LEFT JOIN dbo.Comments AS C
    ON  C.UserId = U.Id
    AND C.Score > 0
WHERE C.Id IS NULL;

The important part of the query plan is right around here:

[Query plan image: slug]

If you're looking extra closely, without any leading or prompting by me whatsoever, you'll notice that the join operation brings the two tables we're querying (Users and Comments) together, expressed as a left outer join of course, and then and only then do we filter out rows where the Id column in Comments is NULL.

The problem is that all of this happens after the join, and in some scenarios that's far less efficient: you have to join many more matching rows together, and the optimizer can sometimes have a weird time ordering outer joins, especially when there are a lot of them.

Note that, for various reasons, this query runs for around 4 seconds total.

Who Doesn’t Exist On First?


An often better way of expressing this sort of query is using the — wait for it — natural expression of the Structured Query Language.

SELECT
   c = COUNT_BIG(*)
FROM dbo.Users AS U
WHERE NOT EXISTS
      (
          SELECT
              1/0
          FROM dbo.Comments AS C
          WHERE  C.UserId = U.Id
          AND    C.Score > 0
      );

This query is logically equivalent to the last one. It may even be arguably better looking. More elegant, as the developers say when they want to describe well-formatted code that no one really understands.

The query plan looks like this now:

[Query plan image: but better]

Now we get this crazy little thing called an Anti Semi Join. That means rows are filtered out at the join rather than at some point later on in an explicit Filter operator.

To highlight things a little further, look at the actual number of rows that pass through the filter in the original query and the join in the second query:

[Query plan image: a number!]

See there? The filter in the LEFT JOIN query reduces the working row set to the same number that the NOT EXISTS query reaches at the join.

In most cases, you’re better off writing queries this way. It may depend a bit on available indexes, batch mode, and server settings. But writing a blog post that takes every single one of those things into account would likely leave you bored out of your gourd.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Software Vendor Mistakes With SQL Server: Using Date Functions On Columns Or Local Variables

Reusable Logic


In most programming languages, it’s quite sensible to create a variable or parameter, use some predefined logic to assign it a value, and then keep reusing it to prevent having to execute the same code over and over again.

But those languages are all procedural, and have a bit of a different set of rules and whatnot. In SQL Server, there are certainly somewhat procedural elements.

  • Functions
  • Control-flow logic
  • Cursors
  • While loops
  • Maybe the inner side of Nested Loops joins

You may be able to name some more, if you really get to thinking about it. That should be a common enough list, though.

Reusable Problems


SQL Server has a wonderful optimizer. It’s capable of many things, but it also has some problems.

Many of those problems exist today for “backwards compatibility”. In other words: play legacy games, win legacy performance.

Lots of people have found “workarounds” that rely on exploiting product behavior, and taking that away or changing it would result in… something else.

That’s why so many changes (improvements?) are hidden behind trace flags, compatibility levels, hints, batch mode, and other “fences” that you have to specifically hop to see if the grass is greener.

One of those things is the use of local variables. The linked post details how lousy those can be.

In this post, I show how you’re better off using the date math expressions instead.

And in this post, I show how you’re better off doing date math on parameters rather than on columns.

Let’s bring all that together!

Reusable Solutions


In SQL Server, context is everything. By context, I mean the way different methods of query execution are able to accept arguments from others.

You’ll sometimes hear this referred to as scope, too. Usually people will say inner context/scope and outer context/scope, or something similar.

What that means is something like this, if we’re talking about stored procedures:

CREATE PROCEDURE
    dbo.InnerContext
(
    @StartDate datetime,
    @EndDate datetime
)
AS
BEGIN
SET NOCOUNT, XACT_ABORT ON;

    SELECT
        C.PostId,
        Score = 
            SUM(C.Score)
    FROM dbo.Comments AS C
    JOIN dbo.Votes AS V
        ON C.PostId = V.PostId
    WHERE C.CreationDate >= @StartDate
    AND   C.CreationDate <  @EndDate
    GROUP BY C.PostId;

END;
GO 

CREATE PROCEDURE
    dbo.OuterContext
(
    @StartDate datetime,
    @EndDate datetime
)
AS
BEGIN
SET NOCOUNT, XACT_ABORT ON;

IF @StartDate IS NULL
BEGIN
   SELECT
       @StartDate = GETDATE();
END;

IF @EndDate IS NULL
BEGIN
   SELECT
       @EndDate = DATEADD(DAY, 30, GETDATE());
END;

EXEC dbo.InnerContext
    @StartDate = @StartDate,
    @EndDate = @EndDate;

END;
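
A quick smoke test for the pattern, since this is the whole point: the NULLs get replaced inside dbo.OuterContext, and dbo.InnerContext compiles against real, sniffable parameter values.

EXEC dbo.OuterContext
    @StartDate = NULL,
    @EndDate = NULL;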

If you’re okay using dynamic SQL, and really, you should be because it’s awesome when you’re not bad at it, you can do something like this:

CREATE PROCEDURE
    dbo.OuterContext
(
    @StartDate datetime,
    @EndDate datetime
)
AS
BEGIN
SET NOCOUNT, XACT_ABORT ON;

IF @StartDate IS NULL
BEGIN
   SELECT
       @StartDate = GETDATE();
END;

IF @EndDate IS NULL
BEGIN
   SELECT
       @EndDate = DATEADD(DAY, 30, GETDATE());
END;

DECLARE 
    @sql nvarchar(MAX) = N'
    /*dbo.OuterContext*/
    SELECT
        C.PostId,
        Score = 
            SUM(C.Score)
    FROM dbo.Comments AS C
    JOIN dbo.Votes AS V
        ON C.PostId = V.PostId
    WHERE C.CreationDate >= @StartDate
    AND   C.CreationDate <  @EndDate
    GROUP BY C.PostId;
    ';

EXEC sys.sp_executesql
    @sql,
  N'@StartDate datetime, 
    @EndDate datetime',
    @StartDate,
    @EndDate;

END;

Which will achieve the same thing.
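
For contrast, the anti-pattern all of this avoids is doing the date math into local variables and querying with those directly in a single batch. Here's a sketch; the optimizer can't sniff @StartDate and @EndDate below, so it guesses at row counts instead of using the histogram.

DECLARE
    @StartDate datetime = GETDATE(),
    @EndDate datetime = DATEADD(DAY, 30, GETDATE());

SELECT
    C.PostId,
    Score = 
        SUM(C.Score)
FROM dbo.Comments AS C
JOIN dbo.Votes AS V
    ON C.PostId = V.PostId
WHERE C.CreationDate >= @StartDate
AND   C.CreationDate <  @EndDate
GROUP BY C.PostId;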

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Software Vendor Mistakes With SQL Server: Not Getting Parallel Inserts Into #Temp Tables

Do That, But Faster


Serial zones in parallel plans can leave a lot of performance on the table. One of the best ways to see that is with insert queries that do a lot of work. A big contrast between @table variables and #temp tables is that the former fully disallows parallelism in modification queries, and the latter do not.

The story gets a little more complicated when inserting into a columnstore index, but this post isn’t really about that.
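
For contrast, here's roughly what you're up against with a @table variable; the insert stays serial no matter how you hint it. The variable name is made up.

DECLARE
    @AvgComments table
(
    UserId int,
    SumScore int
);

INSERT
    @AvgComments
(
    UserId, 
    SumScore
)
SELECT
    C.UserId,
    SumScore = 
        SUM(C.Score)
FROM 
    dbo.Comments AS C
GROUP BY 
    C.UserId
HAVING 
    SUM(C.Score) > 200;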

Let’s look at some scenarios where a fully parallel insert is allowed, and then not allowed.

The thing to keep an eye out for is the insert operator being in the serial zone. For the purposes of this post:

[Query plan image: attention, please]

Works: SELECT INTO


As long as you don't do anything too goofy, a fully parallel insert will "always" be allowed here.

Goofy things will be explained later in the post.

--This will "always" work, as long as you don't do 
--anything listed below in the "broken" select
SELECT
    C.UserId,
    SumScore = 
        SUM(C.Score)
INTO
    #AvgComments_SelectInto
FROM 
    dbo.Comments AS C
GROUP BY 
    C.UserId
HAVING 
    SUM(C.Score) > 200;

DROP TABLE #AvgComments_SelectInto;

Works: INSERT, with TABLOCK


Rather than selecting directly into a table, here we're gonna create the table first and issue an insert statement with the TABLOCK hint.

--This *will* get you a fully parallel insert, unless goofiness is involved.
CREATE TABLE
    #AvgComments_Tablock
(
    UserId int,
    SumScore int
);

INSERT 
    #AvgComments_Tablock WITH (TABLOCK)
(
    UserId, 
    SumScore
)
SELECT 
    C.UserId,
    AvgScore = 
        SUM(C.Score)
FROM 
    dbo.Comments AS C
GROUP BY 
    C.UserId
HAVING 
    SUM(C.Score) > 200;

DROP TABLE #AvgComments_Tablock;

Doesn’t Work: INSERT, without TABLOCK


Without the tablock hint, this will get you the plan we don’t want, where the insert operator is outside the parallel zone.

--This will not get you a fully parallel insert
CREATE TABLE
    #AvgComments_NoTablock
(
    UserId int,
    SumScore int
);

INSERT 
    #AvgComments_NoTablock 
(
    UserId, 
    SumScore
)
SELECT
    C.UserId,
    SumScore = 
        SUM(C.Score)
FROM 
    dbo.Comments AS C
GROUP BY 
    C.UserId
HAVING 
    SUM(C.Score) > 200;

DROP TABLE #AvgComments_NoTablock;

Doesn’t Work: A Whole Laundry List Of Stuff


Basically, any one of the commented-out things below has the ability to deny the parallel insert that we're after.

If you’re doing any of this stuff, like, bye.

--SET ROWCOUNT Any_Number;
--ALTER DATABASE StackOverflow2013 
--    SET COMPATIBILITY_LEVEL = Anything_Less_Than_130;
CREATE TABLE
    #AvgComments_BrokenTablock
(
    --Id int IDENTITY,
    UserId int,
    SumScore int,
    --INDEX c CLUSTERED(UserId)
    --INDEX n NONCLUSTERED(UserId)
);

--Also, if there's a trigger or indexed view on the target table
--But that's not gonna be the case with #temp tables
INSERT 
    #AvgComments_BrokenTablock WITH (TABLOCK)
(
    UserId, 
    SumScore
)
--The rules here are a little weird, so
--be prepared to see weird things if you use OUTPUT
--OUTPUT Inserted.*
--To the client or
--INTO dbo.some_table
--INTO @table_varible
SELECT
    --Id = IDENTITY(bigint, 1, 1),
    --dbo.A_Scalar_UDF_Not_Inlined_By_Froid
    C.UserId,
    SumScore = 
        SUM(C.Score)
FROM 
    dbo.Comments AS C
--Any reference to the table you're inserting into
--Not exists is just an easy example of that
--WHERE NOT EXISTS
--(
--    SELECT
--	    1/0
--	FROM #AvgComments_BrokenTablock AS A
--	WHERE A.UserId = C.UserId
--)
GROUP BY 
    C.UserId
HAVING 
    SUM(C.Score) > 200;

DROP TABLE #AvgComments_BrokenTablock;

Explainer


There are many good reasons to want a fully parallel insert, but you need to make sure that the bottleneck isn’t earlier in the plan.

If it is, you may not see the full performance gains from getting it.

In general, it's a good strategy when building larger #temp tables, but at this point I add a TABLOCK hint to every #temp table insert first and test things out.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Performance Issues With NOT EXISTS Queries In SQL Server

Born Day


I’m going to start this post off sort of like Friday’s post:

Look, I really like EXISTS and NOT EXISTS. I do. They solve a lot of problems.

This post isn’t a criticism of them at all, nor do I want you to stop using them. I would encourage you to use them more, probably.

If you keep your head about you, you’ll do just fine.

The difference here is specific to NOT EXISTS, though, and it has to do with join reordering.

Or rather, the lack of join reordering.

Let’s get after it.

Happy Kids


When we write our query like so, things are fine.

The Users and Badges tables are relatively small, and a parallel hash join query makes short work of the situation.

SELECT
    c = COUNT_BIG(*)
FROM dbo.Users AS u
WHERE u.Reputation > 1000
AND   
(
    NOT EXISTS
        (
            SELECT
                1/0
            FROM dbo.Badges AS b
            WHERE b.UserId = u.Id
        )
    AND  NOT EXISTS
             (
                 SELECT
                     1/0
                 FROM dbo.Comments AS c
                 WHERE c.UserId = u.Id
             )      
);

This query finishes in a shart under a second.

[Query plan image: promised]

Notice that since no rows pass the first join, the Comments table is left unscathed.

BUT THEN™

Bad Times


If we write the query like this, the optimizer leaves things alone, and we get a much worse-performing query.

SELECT
    c = COUNT_BIG(*)
FROM dbo.Users AS u
WHERE u.Reputation > 1000
AND   
(
    NOT EXISTS
        (
            SELECT
                1/0
            FROM dbo.Comments AS c
            WHERE c.UserId = u.Id
        )   
    AND NOT EXISTS
            (
                SELECT
                    1/0
                FROM dbo.Badges AS b
                WHERE b.UserId = u.Id
            )   
);

This one clocks in at around 6 seconds, and complains about an excessive memory grant.

[Query plan image: hello xml!]

The big time suck here is spent hitting the Comments table, which is significantly larger than the Badges table.

Totally Wired


The order that you write joins and WHERE clause elements in generally doesn't matter much, but in the case of NOT EXISTS, it can make a huge difference.

I realize that there are only two NOT EXISTS clauses in these examples, and that hardly makes for a compelling “always” statement. But I did a lot of experimenting with more tables involved, and it really doesn’t seem like the optimizer does any reordering of anti-semi joins.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Performance Issues With EXISTS Queries In SQL Server

Dos Puntos


Look, I really like EXISTS and NOT EXISTS. I do. They solve a lot of problems.

This post isn’t a criticism of them at all, nor do I want you to stop using them. I would encourage you to use them more, probably.

But there’s some stuff you need to be aware of when you use them, whether it’s in control-flow logic, or in queries.

If you keep your head about you, you’ll do just fine.

IF EXISTS


The issue you can hit here is one of row goals. And a T-SQL implementation shortcoming.

If I run this query, it’ll chug along for about 10 seconds.

IF EXISTS
(
    SELECT 
       1/0
    FROM dbo.Posts AS p
    JOIN dbo.Votes AS v
        ON  p.Id = v.PostId
    WHERE v.VoteTypeId = 1
    AND   v.CreationDate >= '2018-12-01'
    AND   p.PostTypeId = 1
)
BEGIN
    SELECT x = 1;
END;

The part of the plan that we care about is a seek into the Votes table.

[Query plan image: eviction]

SQL SERVER'S COST BASED OPTIMIZER™ thinks that 2.52 (rounded to 3) rows will have to get read to find the data we care about, but it ends up having to do way more work than that.

It's worth a short topic detour here to point out that when you're tuning a slow query, paying attention to operator costs can be a real bad time. The reason this query is slow is that the costing was wrong, and it shows. Costed correctly, you would not get this plan, and you would not see the majority of the query execution time spent in the lowest-costed non-zero operator.

Normally, you could explore query hints to figure out why this plan was chosen, but you can't do that in the context of an IF branch. That sucks, because a hash join hinted query finishes in about 400ms. We could solve the problem with that hint, or by disabling row goals for the query.

Fixing It


In order to tune this, we need to fiddle with the logic a little bit. Rather than put a query in the IF EXISTS, we're going to set a variable based on the query, and use the IF logic on that instead.

DECLARE
    @do_it bit;

SELECT
    @do_it = 
    (
        SELECT 
            CONVERT
            (
                bit,
                ISNULL
                (
                    MAX(1),
                    0
                )
            )
        FROM dbo.Posts AS p
        JOIN dbo.Votes AS v
            ON  p.Id = v.PostId
        WHERE v.VoteTypeId = 1
        AND   v.CreationDate >= '2018-12-01'
        AND   p.PostTypeId = 1
    )
OPTION(HASH JOIN);

IF @do_it = 1
BEGIN
    SELECT x = 1;
END;

This produces the fast plan that we’re after. You can’t use a CASE expression here and get a hash join though, for reasons explained in this post by Pablo Blanco.

But here it is. Beautiful hash join.

[Query plan image: blown]

EXISTS With OR Predicates


A common query pattern is to use EXISTS… OR EXISTS to sort out different things, but you can end up with a weird optimizer rewrite (SplitSemiApplyUnionAll) that looks a lot like the LEFT JOIN… IS NULL pattern for finding rows that don't exist. Which is generally a bad pattern, as discussed in the linked post.

Anyhoo.

SELECT
    c = COUNT_BIG(*)
FROM dbo.Users AS u
WHERE u.Reputation > 1000000
AND   EXISTS
      (
          SELECT
              1/0
          FROM dbo.Badges AS b
          WHERE b.UserId = u.Id
      )
OR    EXISTS
      (
          SELECT
              1/0
          FROM dbo.Comments AS c
          WHERE c.UserId = u.Id
      );

This is what I’m talking about, in the plan for this query.

[Query plan image: made for the movies]

Rather than doing two semi joins here for the EXISTS checks, we get two right outer joins. That means (like in the linked post above) all rows between the tables are joined, and filters are applied much later on in the plan. You can see one of the right outer joins, along with the filters (on expressions!), in the nice picture up there.

Fixing It


The fix here, of course (of course!) is to write the query in a way that the optimizer can’t apply that foolishness to.

SELECT
    c = SUM(x.c)
FROM 
(
    SELECT
        c = COUNT_BIG(*)
    FROM dbo.Users AS u
    WHERE u.Reputation > 1000000
    AND   EXISTS
          (
              SELECT
                  1/0
              FROM dbo.Badges AS b
              WHERE b.UserId = u.Id
          )
    
    UNION ALL
    
    SELECT
        c = COUNT_BIG(*)
    FROM dbo.Users AS u
    WHERE EXISTS
          (
              SELECT
                  1/0
              FROM dbo.Comments AS c
              WHERE c.UserId = u.Id
          )
) AS x;

This query completes in around 1.5 seconds, compared to 4.9 seconds above.

[Query plan image: explored]

Seasoned Veteran


It’s rough when you run into these problems, because solutions aren’t always obvious (obvious!), nor is the problem.

Most of the posts I write about query tuning arise from issues I solve for clients. While most performance problems come from predictable places, sometimes you have to deal with edge cases like this, where the optimizer mis-costs things.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.