The How To Write SQL Server Queries Correctly Cheat Sheet: IN And NOT IN

Sweetness, Sweetness


I’ll be brief here, and let you know exactly when I’ll use IN and NOT IN rather than anything else:

  • When I have a list of literal values

That’s it. That’s all. If I have to go looking in another table for anything, I use either EXISTS or NOT EXISTS. The syntax just feels better to me, and I don’t have to worry about getting stupid errors about subqueries returning more than one value.
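As a minimal sketch of that split, assuming the Stack Overflow sample tables used later in this post (dbo.Users and dbo.Posts), the two shapes look something like this:

```sql
/*A list of literal values: IN is fine*/
SELECT
    records =
        COUNT_BIG(*)
FROM dbo.Posts AS p
WHERE p.PostTypeId IN (1, 2);

/*Looking in another table: EXISTS*/
SELECT
    records =
        COUNT_BIG(*)
FROM dbo.Users AS u
WHERE EXISTS
(
    SELECT
        1/0
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = u.Id
);
```

The select list inside EXISTS is never evaluated, which is why 1/0 doesn’t throw a divide by zero error there.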

For IN clauses, it’s far less of an ordeal, usually. But for NOT IN, there are some additional concerns around NULLable columns.

Of course, actual NULL values really screw things up, but even when SQL Server needs to protect itself against potential NULL values, you can end up in performance hospice.

Historical


First, a little bit of history. With NOT IN. Take the below script, and flip the insert into the @bad table variable to use 2 instead of NULL, after you’ve seen what happens with NULL.

DECLARE
    @good table 
(
    id int NOT NULL   
);

DECLARE
    @bad table
(
    id int NULL
);

INSERT
    @good
(
    id
)
VALUES
    (1);

INSERT
    @bad
(
    id
)
VALUES
    (NULL); /*Change this between NULL and 2*/

SELECT
    records =
        COUNT_BIG(*) /*Should be 1, or something*/
FROM @good AS g
WHERE g.id NOT IN
(
    SELECT
        b.id
    FROM @bad AS b
);

You’ll see pretty quickly that NOT IN gives you wonky results when it hits a NULL: the count comes back as 0, not 1.

This is known.
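For anyone who wants the why, here’s a small self-contained sketch that swaps the table variables for VALUES lists (my substitution, not part of the original script). NOT IN behaves like a chain of <> comparisons joined by AND, and any comparison against NULL is UNKNOWN, so no row can ever pass the filter. NOT EXISTS only asks whether a matching row was found, so the NULL is harmless.

```sql
/*
Logically: g.id <> 2 AND g.id <> NULL
TRUE AND UNKNOWN is UNKNOWN, so the row is filtered out
*/
SELECT
    records =
        COUNT_BIG(*) /*Returns 0*/
FROM (VALUES (1)) AS g (id)
WHERE g.id NOT IN
(
    SELECT
        b.id
    FROM (VALUES (2), (NULL)) AS b (id)
);

/*
No row in b satisfies b.id = g.id, so NOT EXISTS is true
*/
SELECT
    records =
        COUNT_BIG(*) /*Returns 1*/
FROM (VALUES (1)) AS g (id)
WHERE NOT EXISTS
(
    SELECT
        1/0
    FROM (VALUES (2), (NULL)) AS b (id)
    WHERE b.id = g.id
);
```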

Imaginary NULLs


Let’s take the below setup. Though each table allows NULLs in its single column, no NULL values will be inserted into them.

CREATE TABLE
    #OldUsers
(
    UserId int NULL
);

CREATE TABLE
    #NewUsers
(
    UserId int NULL
);

/*
But neither one will have any NULL values at all!
*/
INSERT
    #OldUsers WITH (TABLOCK)
(
    UserId
)
SELECT
    p.OwnerUserId
FROM dbo.Posts AS p
WHERE p.OwnerUserId IS NOT NULL;

INSERT
    #NewUsers WITH (TABLOCK)
(
    UserId
)
SELECT
    c.UserId
FROM dbo.Comments AS c
WHERE c.UserId IS NOT NULL;

The real lesson here is that if you know that no NULL values are allowed into your tables, you should specify the columns as NOT NULL.

I know, it’s scary. Really scary. Errors. What if. How dare.

But these are the kind of adult decisions you’ll have to make as an application developer.

Be brave.
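A sketch of what being brave looks like for these two tables. The DROP TABLE IF EXISTS lines are my addition so the script can be rerun in the same session, and they assume SQL Server 2016 or later. With both columns declared NOT NULL, the optimizer should no longer have to plan around NULLs that can’t exist.

```sql
DROP TABLE IF EXISTS
    #OldUsers,
    #NewUsers;

CREATE TABLE
    #OldUsers
(
    UserId int NOT NULL /*An attempt to insert a NULL now fails loudly*/
);

CREATE TABLE
    #NewUsers
(
    UserId int NOT NULL
);
```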

Protect Your NULL


The big problem with NOT IN, is that SQL Server goes into defensive driving mode when you use it under NULLable conditions.

I don’t have a license because it would be irresponsible, and I’ve lived my entire life in big cities where having a car would be more trouble than it’s worth. But I assume that what people complain about, drivers not knowing what to do the second there’s a drizzle of rain or a snow flurry somewhere in the area, is a similarly annoying scenario out there on the roads. All of a sudden, seemingly competent drivers turn into complete basket cases and drive like they’ve got a trunk full of clowns.

Here’s an example of a bad way to deal with the situation, vs. a good way to deal with the situation:

/*Bad Way*/
SELECT
    records = COUNT_BIG(*)
FROM #NewUsers AS nu
WHERE nu.UserId NOT IN
(
    SELECT
        ou.UserId
    FROM #OldUsers AS ou
);

/*Good Way*/
SELECT
    records = 
        COUNT_BIG(*)
FROM #NewUsers AS nu
WHERE NOT EXISTS
(
    SELECT
        1/0
    FROM #OldUsers AS ou
    WHERE nu.UserId = ou.UserId
);

Note the very professional formatting and correct syntax. Ahem. Bask.

Count to 10 while you’re basking.

Results


The resulting execution plan for each example should be illuminating. Here they are:

 

[Figure: SQL Server query plan]

The NOT IN version takes 15 minutes, and the NOT EXISTS version takes 415ms.

Reality Bites


Since there are no NULLs, the first query returns the correct results. But the amount of work SQL Server has to do to make sure there are no NULLs is both absurd and preposterous.

If you’re like me, and you want to throw the query optimizer in the garbage every time you see a TOP over a Scan, you might say something like “an index would make this really fast”.

You wouldn’t be wrong, but most people either:

  • Take bad advice and never index #temp tables
  • Create nonclustered indexes on #temp tables that don’t get used

Well, you get what you pay for.
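If you want to try the index route, here’s a minimal sketch. Going with clustered indexes is my reading of the two bullets above, not something tested here, and the index names are arbitrary. Check the query plan afterwards to see whether they actually get used.

```sql
/*
Create these after the temp tables are loaded,
so the inserts aren't slowed down by index maintenance
*/
CREATE CLUSTERED INDEX
    ou
ON #OldUsers
    (UserId);

CREATE CLUSTERED INDEX
    nu
ON #NewUsers
    (UserId);
```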

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that, and need to solve database performance problems quickly. You can also get a quick, low cost health check with no phone time required.

The How To Write SQL Server Queries Correctly Cheat Sheet: Cross Apply And Outer Apply

Situations


I end up converting a lot of derived joins, particularly those that use windowing functions, to use the apply syntax. Sometimes good indexes are in place to support that, other times they need to be created to avoid an Eager Index Spool.
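For reference, here’s a sketch of the kind of derived join with a windowing function I mean, written against the Stack Overflow Users and Posts tables. It numbers every Question for every User before the join can throw most of them away, which is the work an apply rewrite tries to avoid.

```sql
SELECT
    u.Id,
    u.DisplayName,
    p.Title,
    p.Score,
    p.CreationDate
FROM dbo.Users AS u
JOIN
(
    SELECT
        p.OwnerUserId,
        p.Title,
        p.Score,
        p.CreationDate,
        n =
            ROW_NUMBER() OVER
            (
                PARTITION BY
                    p.OwnerUserId
                ORDER BY
                    p.CreationDate DESC
            )
    FROM dbo.Posts AS p
    WHERE p.PostTypeId = 1
) AS p
  ON p.OwnerUserId = u.Id
WHERE u.CreationDate >= '20131230'
AND   p.n <= 3;
```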

One of the most common questions I get is when developers should consider using apply over other join syntax.

The short answer is that I start mentally picturing the apply syntax being useful when:

  • There’s a small outer table (FROM), and a large inner table (APPLY)
  • I need to do a significant amount of work on the inner side of a join
  • The goal of the query is top N per group, or something similar
  • I’m trying to get parallel nested loops instead of some alternative plan choice
  • I want to replace a scalar UDF in the select list with an inline UDF
  • I want to use the VALUES construct in an odd way

Most of this is situational, and requires a bit of practice and familiarity to spot quickly.

Both cross and outer apply can be used in similar ways to subqueries in the select list, with the added bonus that you can return multiple columns and rows with apply, which you can’t do in a normal subquery.
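As one example of the last item in that list, here’s a sketch that uses VALUES inside an apply to turn two columns into rows and take the larger of the two. The OwnerUserId value is an arbitrary filter I picked to keep the result small.

```sql
SELECT
    p.Id,
    p.CreationDate,
    p.LastActivityDate,
    v.LatestDate
FROM dbo.Posts AS p
CROSS APPLY
(
    SELECT
        LatestDate =
            MAX(x.d)
    FROM
    (
        VALUES
            (p.CreationDate),
            (p.LastActivityDate)
    ) AS x (d)
) AS v
WHERE p.OwnerUserId = 22656; /*Arbitrary user id for illustration*/
```

The apply runs once per row from Posts, and the VALUES list is built from that row’s own columns.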

What Apply Does


The way to think about what apply is doing when a query runs is supplying a table-valued result on the inner side of a join for each row supplied by the outer side of a join.

Here’s a simple example:

SELECT
    u.Id,
    u.DisplayName,
    u.Reputation,
    p.Title,
    p.Score,
    p.CreationDate,
    p.LastActivityDate
FROM dbo.Users AS u
CROSS APPLY
(
    SELECT
        p.*
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = u.Id
    AND   p.PostTypeId = 1
    ORDER BY
        p.CreationDate DESC
    OFFSET 0 ROWS
    FETCH NEXT 3 ROWS ONLY
) AS p
WHERE u.CreationDate >= '20131230'
ORDER BY
    u.Reputation DESC,
    p.CreationDate DESC;

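We’re getting everyone from the Users table whose account was created in the final days of 2013 and who has Posted at least one Question, along with up to three of their most recent Questions. Results are ordered by Reputation, then by when each Question was Created.

For every qualifying User, we get a tabular result showing the Title, Score, CreationDate, and LastActivityDate of their Questions.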

You can picture it sort of like this:

[Figure: SQL Server query results]

Some Users may have more than three results, and some may have fewer than three results, but since the query is self-limited to only the first three, our query sets a row goal and quits once three are found.

More About The Query


Because the Users table is correlated by Id to the Posts table on OwnerUserId, we need to make sure we have a good index to make that access seekable.

[Figure: SQL Server query text]

Because we’re only looking for Questions (PostTypeId = 1), and ordering by the most recent ones (CreationDate DESC), it’s also a wise idea to have those columns in the key.

It’s also worth talking over an interesting point in the query itself: The select from Posts inside of the apply is doing a select star, sure, but the outer query is only pulling a few of the columns. The optimizer can recognize this, which means we don’t need a gigantic covering index to make this query fast. We also… Don’t really need a covering index at all in this case. Just the key columns are good enough.

CREATE INDEX 
    u 
ON dbo.Users 
    (CreationDate, Reputation, Id) 
INCLUDE 
    (DisplayName) 
WITH 
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE); 

CREATE INDEX 
    p 
ON dbo.Posts 
    (OwnerUserId, PostTypeId, CreationDate) 
WITH 
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

Efficiency Unit


An efficient query plan when using apply will look like this:

[Figure: SQL Server query plan]

An inefficient query plan using apply will look like this:

[Figure: SQL Server query plan]

If you’re experimenting with apply, either cross or outer, pay close attention to the query plans. If someone says “cross apply is always slow”, you can bet they stink at indexes.

Choices, Choices


The choice to use apply at all depends on the goal of the query, and the goals of the query tuner. It’s not always a magic performance bullet, but under the right circumstances, it can really make things fly.

The choice to use cross or outer apply depends on the semantics of the starting query. An inner join commutes easily to cross apply, and a left join commutes easily to outer apply.
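Here’s a sketch of the earlier cross apply example switched to outer apply. I’ve used TOP (3) instead of OFFSET/FETCH purely as a shorter way to write the same limit. Like a left join, this version keeps Users who have no Questions at all, with NULLs in the columns from Posts.

```sql
SELECT
    u.Id,
    u.DisplayName,
    u.Reputation,
    p.Title,
    p.CreationDate
FROM dbo.Users AS u
OUTER APPLY
(
    SELECT TOP (3)
        p.Title,
        p.CreationDate
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = u.Id
    AND   p.PostTypeId = 1
    ORDER BY
        p.CreationDate DESC
) AS p
WHERE u.CreationDate >= '20131230'
ORDER BY
    u.Reputation DESC,
    p.CreationDate DESC;
```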

One important difference in how the joins are implemented is in the optimizer’s choice between normal nested loops, where the join is done at the nested loops operator, and apply nested loops, which is when the join keys are pushed to an index seek on the inner side of the join.

For an in-depth look at that, I’d suggest reading Paul White’s post, Apply versus Nested Loops Join.

The optimizer is capable of transforming an apply to a join and vice versa. It generally tries to rewrite apply to join during initial compilation to maximize the searchable plan space during cost-based optimization. Having transformed an apply to a join early on, it may also consider a transformation back to an apply shape later on to assess the merits of e.g. an index loops join.

Just writing a query using apply doesn’t guarantee that you get the apply nested loops version of a nested loops join. Having solid indexes and easy to search predicates can help push things in the right direction.

Thanks for reading!


The How To Write SQL Server Queries Correctly Cheat Sheet: Select List Subqueries

Flash Boy


I think subqueries in select lists are very neat things. You can use them to skip a lot of additional join logic, which can have all sorts of strange repercussions on query optimization, particularly if you have to use left joins to avoid eliminating results.

Subqueries do have their limitations:

  • They can only return one row
  • They can only return one column

But used in the right way, they can be an excellent method to retrieve results without worrying about what kind of join you’re doing, and how the optimizer might try to rearrange it into the mix.
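To make the one row limitation concrete, here’s a sketch against the Stack Overflow Users and Posts tables. The first query compiles fine, but fails at runtime with error 512 (“Subquery returned more than 1 value”) once it reaches a User with more than one Question. The second can only ever return one row per User.

```sql
/*
Runtime error 512 for any User with more than one Question
*/
SELECT
    u.Id,
    QuestionScore =
    (
        SELECT
            p.Score
        FROM dbo.Posts AS p
        WHERE p.OwnerUserId = u.Id
        AND   p.PostTypeId = 1
    )
FROM dbo.Users AS u;

/*
An aggregate without GROUP BY always returns exactly one row
*/
SELECT
    u.Id,
    TopQuestionScore =
    (
        SELECT
            MAX(p.Score)
        FROM dbo.Posts AS p
        WHERE p.OwnerUserId = u.Id
        AND   p.PostTypeId = 1
    )
FROM dbo.Users AS u;
```

A TOP (1) with an ORDER BY that breaks ties is the other common way to guarantee a single row.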

Since subqueries are in the select list, and can’t eliminate rows from results, you will most commonly see them incorporated late in query plans (much further to the left) as outer joins. The optimizer is generally smart enough to retrieve data for select list subqueries after as much other filtering as possible has been applied, so they can be evaluated for as few rows in the outer results as possible.

The most important thing you can do as a developer to make sure your select list subqueries are fast is to make sure you have good supporting indexes for them.

Well Done


Let’s start with this query:

SELECT
    u.Id,
    u.DisplayName,
    u.Reputation,
    TopQuestionScore =     
    (
        SELECT TOP (1)
            p.Score
        FROM dbo.Posts AS p
        WHERE p.PostTypeId = 1
        AND   p.OwnerUserId = u.Id
        ORDER BY
            p.Score DESC,
            p.Id DESC
    ),
    TopAnswerScore =     
    (
        SELECT TOP (1)
            p.Score
        FROM dbo.Posts AS p
        WHERE p.PostTypeId = 2
        AND   p.OwnerUserId = u.Id
        ORDER BY
            p.Score DESC,
            p.Id DESC
    ),
    TotalPosts =    
    (
        SELECT
            COUNT_BIG(*)
        FROM dbo.Posts AS p
        WHERE p.OwnerUserId = u.Id
        AND   p.PostTypeId IN (1, 2)
    )
FROM dbo.Users AS u
WHERE u.Reputation > 500000
ORDER BY
    u.Reputation DESC;

The goal is to find every User with a Reputation over 500,000, and then find their:

  • Top scoring question (with a unique tie-breaker on most recent post id)
  • Top scoring answer (with a unique tie-breaker on most recent post id)
  • Total questions and answers

You might look at this query with a deep sense of impending dread, wondering why we should make three trips to the Posts table to get this information. I totally get that.

But let’s say we have these indexes in place:

CREATE INDEX 
    u
ON dbo.Users 
    (Reputation, Id)
INCLUDE 
    (DisplayName)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

CREATE INDEX
    p
ON dbo.Posts
    (OwnerUserId, PostTypeId, Score)
WITH
    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);

They have everything we need to support quick, navigational lookups.

Query Planner


The query plan for this arrangement looks like this, finishing in 23 milliseconds.

[Figure: SQL Server query plan]

If you write select list subqueries, and they’re terribly slow, there’s a very good chance that the indexes you have in place are not up to the job, particularly if you see Eager Index Spools in the query plan.

All of the time in the plan is spent in the final subquery, that counts the total number of questions and answers. But even that, at 23 milliseconds, is not worth heaving our chests over.

Three round trips are not at all a problem here, but let’s compare.

One Way Ticket


I’m not opposed to experimentation. After all, it’s a great way to learn, observe, and become enraged with the state of query optimization generally.

Here are two rewrites of the above query, to only make one trip to the Posts table. The first one uses a derived join, and the second uses apply. They’re both written to use outer joins, to match the semantics of the three subqueries:

/*
Join
*/
SELECT
    u.Id,
    u.DisplayName,
    u.Reputation,
    p.TopQuestionScore,
    p.TopAnswerScore,
    p.TotalPosts
FROM dbo.Users AS u
LEFT JOIN
(
    SELECT
        p.OwnerUserId,
        TopQuestionScore = 
            MAX(CASE WHEN p.PostTypeId = 1 THEN p.Score ELSE NULL END),
        TopAnswerScore = 
            MAX(CASE WHEN p.PostTypeId = 2 THEN p.Score ELSE NULL END),
        TotalPosts = 
            COUNT_BIG(*)
    FROM dbo.Posts AS p
    WHERE p.PostTypeId IN (1, 2)
    GROUP BY
        p.OwnerUserId
) AS p
  ON p.OwnerUserId = u.Id
WHERE u.Reputation > 500000
ORDER BY
    u.Reputation DESC;

/*
Apply
*/
SELECT
    u.Id,
    u.DisplayName,
    u.Reputation,
    p.TopQuestionScore,
    p.TopAnswerScore,
    p.TotalPosts
FROM dbo.Users AS u
OUTER APPLY
(
    SELECT
        p.OwnerUserId,
        TopQuestionScore = 
            MAX(CASE WHEN p.PostTypeId = 1 THEN p.Score ELSE NULL END),
        TopAnswerScore = 
            MAX(CASE WHEN p.PostTypeId = 2 THEN p.Score ELSE NULL END),
        TotalPosts = 
            COUNT_BIG(*)
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = u.Id 
    AND   p.PostTypeId IN (1, 2)
    GROUP BY
        p.OwnerUserId
) AS p
WHERE u.Reputation > 500000
ORDER BY
    u.Reputation DESC;

A somewhat brief digression here: Query rewrites to use specific syntax arrangements are often not available in ORMs. Many times while working with clients, we’ll stumble across a bushel of quite painful application-generated queries. I’ll show them useful rewrites to improve things, and we’ll all marvel together at how much better things are. I’ll suggest implementing the rewrite as a stored procedure, and all hell will break loose.

Please don’t be one of those developers. Understand the limitations of the technology stack you’re working with. Not everything produced by code is good.
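For what it’s worth, the stored procedure route doesn’t have to be a big production. Here’s a minimal sketch, where the procedure name and parameter are hypothetical, CREATE OR ALTER assumes SQL Server 2016 SP1 or later, and the body is a trimmed-down version of the select list subquery query from earlier:

```sql
CREATE OR ALTER PROCEDURE
    dbo.TopUserScores /*Hypothetical name*/
(
    @Reputation integer
)
AS
BEGIN
    SET NOCOUNT, XACT_ABORT ON;

    SELECT
        u.Id,
        u.DisplayName,
        u.Reputation,
        TopQuestionScore =
        (
            SELECT TOP (1)
                p.Score
            FROM dbo.Posts AS p
            WHERE p.PostTypeId = 1
            AND   p.OwnerUserId = u.Id
            ORDER BY
                p.Score DESC,
                p.Id DESC
        )
    FROM dbo.Users AS u
    WHERE u.Reputation > @Reputation
    ORDER BY
        u.Reputation DESC;
END;
GO /*Batch separator for SSMS and sqlcmd*/

EXECUTE dbo.TopUserScores
    @Reputation = 500000;
```

Most ORMs can call a procedure like this and map the result set, so the hand-written query shape survives without fighting the query generator.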

Compare and Contrast


In this case, both of the attempts at rewrites result in identical query plans. The optimizer does a good job here, but both of the single-trip queries are about 2x slower than the original.

In this case, that difference will look and feel microscopic. And it is, mostly because I provided indexes so good that you could write this query any which way and it would work out pretty well.

[Figure: SQL Server query plan]

While one round trip certainly felt more efficient than three, each trip from outer to inner side of the nested loops had a bit more work to do each time, and that added up.

It’s nothing consequential here, but you may run into plenty of situations where it’s far worse (or to be fair, far better).

For me, the real advantage of writing out the three separate subqueries is to better understand which one(s) do the most work, and might need additional work done to make them fast.

When you do everything all at once, you have no idea which piece is responsible for slowdowns. We know from the very first query plan that getting the full count does the most work, but that wouldn’t be obvious to me, you, or anyone else looking at the two query plans in this section.

Thanks for reading!


Join me In Boston May 10 For A Full Day Of SQL Server Performance Tuning Training

Spring Training


This May, I’ll be presenting my full day training session The Foundations Of SQL Server Performance Tuning.

All attendees will get free access for life to my SQL Server performance tuning training. That’s about 25 hours of great content.

Get your tickets here for this event, taking place Friday, May 10th 2024 at the Microsoft Offices in Burlington.

Here’s what I’ll be presenting:

The Foundations Of SQL Server Performance Tuning

Session Abstract:

Whether you want to be the next great query tuning wizard, or you just need to learn how to start solving tough business problems at work, you need a solid understanding of not only what makes things fast, but also what makes them slow.

I work with consulting clients worldwide fixing complex SQL Server performance problems. I want to teach you how to do the same thing using the same troubleshooting tools and techniques I do.

I’m going to crack open my bag of tricks and show you exactly how I find which queries to tune, indexes to add, and changes to make. In this day long session, you’re going to learn about hardware, query rewrites that work, effective index design patterns, and more.

Before you get to the cutting edge, you need to have a good foundation. I’m going to teach you how to find and fix performance problems with confidence.

Event Details:

Get your tickets here for this event!


bit Obscene Episode 2: SQL Server Agent Jobs


SQL Server Index Design: Getting Key Column Order Right


How To Evaluate Index Effectiveness While Tuning SQL Server Queries


How To Tell If Lazy Table Spools Are Slowing Your SQL Server Queries Down, And Fix Them


Why Multiple Seeks Are Slower For Some SQL Server Queries
