Joins With OR Clauses = 💔

A Long Time Ago

I had to write some hand-off training about query tuning when I was starting a new job.

As part of the training, I had to explain why writing “complicated logic” could lead to poor plan choices.

So I did what anyone would do: I found a picture of a pirate, named him Captain Or, and told the story of how he got Oared to death for giving confusing ORders.

This is something that I unfortunately still see people doing quite a bit, and then throwing their hands up as queries run forever.

I’m going to show you a simple example of when this can go wrong, and also beg and plead for the optimizer team to do something about it.

Big Bully

“Write the query in the simplest way possible”, they said.

So we did, and we got this.

Note the OR in the join condition — we can match on either of those columns.

Here’s the index we created to make this SUPERFAST.

If we’re good DBAs, still doing as we’re told, we’ll read the query plan from right to left.

The first section we’re greeted with is this:

No Cardio

We spend a full minute organizing and ordering data. If you want to poke around, the plan XML is here.

The columns in the Compute Scalars are OwnerUserId and LastEditorUserId.

Next in the plan is this fresh hell:

Squozed

24 seconds seeking into the Users table and joining that to the results of the Constant Scans, etc.

What’s a little confusing here is that the scan on the Posts table occurs on the outer side of Nested Loops.

It’s also responsible for feeding rows through the Constant Scans. That’s their data source.

Overall, this query takes 1:36 seconds to run.

My gripe with it is that it’s possible to rewrite this query in an obvious way to fix the problem.

Magick

Using a second join to Posts clears things up quite a bit.

I know, it probably sounds counterintuitive to touch a table twice.

Someone will scream that we’re doing more reads.

Someone else will faint at all the extra code we wrote.

But when we run this query, it finishes in 10 seconds.

That’s better!

This plan does something a bit different. It joins the nonclustered index we have on Posts to itself.

Seminar

The optimizer has a rule that makes this possible, called Index Intersection.

Extra Magick

A more accurate description of what I’d want the optimizer to consider here would be the plan we get when we rewrite the query like this.

Mush.

This query runs a bit faster than the second one (around 7 seconds), and the plan is a little different.

Rather than a Hash Join between the index on the Posts table, we have a Concatenation operator.

The rest of the plan looks like this:

Yabadabadoo!

The optimizer has a rule that can produce this plan, too, called Index Union.

Problemagick

The thing is, these rules seem to be favored more with WHERE clauses than with JOINs.

Index Union

Index Intersection

Knackered

It is possible to get these kind of plans with joins, but not without join hints and a couple indexes.

There’s more background from, of course, Paul White, here and here.

Even with Paul White ~*~Magick~*~, the hinted query runs for ~16 seconds.

If you remember, the Index Intersection plan ran for around 10 seconds, and the Index Union plan ran for around 7 seconds.

2slow

This plan uses Index Union:

Spillzo

Thanks for reading!

4 thoughts on “Joins With OR Clauses = 💔”

  1. Hi Erik,

    I’m not sure the query in the first Magick section is correct? Wouldn’t having two inner joins to the posts table only return records where the user Id is in both OwnerUserId and LastEditorUserId fields?

    Re-writing it to use two left outer joins with a COALESCE() call brings back the expected results on my system

    SELECT u.Id,
    /*because the score could come from either post table join, use a COALESCE() function*/
    MAX(COALESCE(p.Score, p2.score))
    FROM dbo.Users AS u
    LEFT JOIN dbo.Posts AS p
    ON u.Id = p.OwnerUserId
    AND p.PostTypeId IN (1, 2)
    LEFT JOIN dbo.Posts AS p2
    ON u.Id = p2.LastEditorUserId
    AND p2.PostTypeId IN (1, 2)
    GROUP BY u.Id;

    Or, I could be reading it wrong, it’s first thing in the morning this side of the pond

    Keep up the good work

  2. Hi,

    Have thought about it some more, and the solution in the first Magick section only works for users that have an entry in both OwnerUserId and LastEditorUserId.

    I don’t have SO database to test whether this is always true for this particular dataset, but conceivably a user could have a single post that they are the owner of, but that was last edited by someone else, in which case one of the two inner joins wouldn’t return anything so they would be excluded from the resultset.

    I think, could still be barking up the wrong tree

Leave a Reply

Your email address will not be published. Required fields are marked *