What SQL Server Indexes Don’t Do With Data

You Fine And All

You can read a lot about how indexes might improve queries.

But there are certain query mechanics that indexes don’t solve for, and you can not only get stuck with an index that isn’t really helping, but all the same query problems you were looking to solve.

Understanding how indexes and queries interact together is a fundamental part of query tuning.

In this post, we’re going to look at some query patterns that indexes don’t do much to fix.

Part Of The Problem

Indexes don’t care about your queries. They care about storing data in order.

There are very definite ways that you can encourage queries to use them efficiently, and probably even more definite ways for you to discourage them from using them efficiently.

Even with syntax that seems completely transparent, you can get into trouble.

Take the simple example of two date columns: there’s nothing in your index that tracks how two dates in a row relate to each other.

Nothing tracks how many years, months, weeks, days, hours, minutes, seconds, milliseconds, or whatever magical new unit Extended Events has to make itself less usable.

Heck, it doesn’t even track if one is greater or less than another.

Soggy Flautas

Ah, heck, let’s stick with Stack Overflow. Let’s even create an index.

CREATE INDEX whatever 
ON dbo.Posts(CreationDate, ClosedDate);

Now let’s look at these super important queries.

    SELECT   COUNT(*) AS records
    FROM     dbo.Posts AS p
    WHERE    p.CreationDate < p.ClosedDate;


    SELECT   COUNT(*) AS records
    FROM     dbo.Posts AS p
    WHERE    p.CreationDate > p.ClosedDate;

How much work will the optimizer think it has to do, here? How many rows will it estimate? How will it treat different queries, well, differently? If you said “it won’t”, you’re a smart cookie. Aside from the “Actual rows”, each plan has the same attributes across.

A SQL Server query plan — And I’m Homosapien like you

Treachery

Neither of these query plans is terrible on its own.

The problem is really in larger plans, where bad decisions like these have their way with other parts of a query plan.

Nasty little lurkers they are, because you expect things to get better when creating indexes and writing SARGable predicates.

Yet for both queries, SQL Server does the same thing, based on the same guesses on the perceived number of rows at play. It’s one of those quirky things — if it’s a data point we care about, then it’s one we should express somehow.

A computed column might work here:

    ALTER TABLE dbo.Posts 
        ADD created_less_closed AS 
            CONVERT(BIT, CASE WHEN CreationDate < ClosedDate 
                              THEN 1
                              WHEN CreationDate > ClosedDate
                              THEN 0
                         END)


    CREATE INDEX apathy 
        ON dbo.Posts (created_less_closed);

Which means we’d have to change our queries a bit, since expression matching has a hard time reasoning this one out:

    SELECT   COUNT(*) AS records
    FROM     dbo.Posts AS p
    WHERE    1 = CONVERT(BIT, CASE WHEN CreationDate < ClosedDate 
                              THEN 1
                              WHEN CreationDate > ClosedDate
                              THEN 0
                         END)
    AND 1 = (SELECT 1);


    SELECT   COUNT(*) AS records
    FROM     dbo.Posts AS p
    WHERE    0 = CONVERT(BIT, CASE WHEN CreationDate < ClosedDate 
                              THEN 1
                              WHEN CreationDate > ClosedDate
                              THEN 0
                         END)
    AND 1 = (SELECT 1);

What Would Be Better?

Practically speaking, it isn’t the job of an index (or statistic) to track things like this. We need to have data that represents things that are important to users.

Though neither of these plans is terrible in isolation, bad guesses like these flow all through query plans and can lead to other bad decisions and oddities.

It’s one of those terrible lurkers just waiting to turn an otherwise good query into that thing you regard with absolute loathing every time it shows up in [your favorite monitoring tool].

Knowing this, we can start to design our data to better reflect data points we care about. A likely hero here is a computed column to return some value based on which is greater, or a DATEDIFF of the two columns.

Going Further

If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

6 thoughts on “What SQL Server Indexes Don’t Do With Data”

Pingback: What Indexes Don’t Do – Erik Darling Data
Bryan Rebok says:

February 5, 2019 at 6:27 am

Interesting post, Erik. When I was reading it, I also thought of using a computed column with an index. I then wondered if an appropriate check constraint would do something similar:

CONSTRAINT CK_the_dates CHECK (CreationDate ClosedDate. Unfortunately, I couldn’t come up with a way to make the check constraint “short circuit” the query execution and bypass the table/index scan.
1. Erik Darling says:
  
  February 5, 2019 at 8:26 am
  
  Yeah, you could put some more logic in a udf, but then you’ll make everything horrible by putting a udf in your table.
Lori says:

February 7, 2019 at 6:00 am

Great post Erik!

Am I correct that you’re using the ” AND 1 = (SELECT 1);” for forcing a FULL optimization level? I remember to a post writen by you but can’t find it…
1. Erik Darling says:
  
  February 7, 2019 at 8:36 am
  
  Hi Lori!
  
  Yep, exactly right. I’ve mentioned it in passing in a bunch of posts, but never written specifically about it.
  
  Thanks!
  1. Luke says:
    
    August 11, 2019 at 9:34 pm
    
    There was a nice section on it included in this post from Erik back in 2017: https://www.brentozar.com/archive/2017/06/query-plans-trivial-optimization-vs-simple-parameterization/

Comments are closed.