Query Tuning SQL Server 2019 Part 1: Changing Databases

Teeth To Grit


I’ve always had trouble standing still on SQL Server versions, but most companies don’t. Hardly anyone I talk to is on SQL Server 2017, though these days SQL Server 2016 seems more common than SQL Server 2012, so at least there’s that. Mostly I’m happy to not see SQL Server 2014. God I hate SQL Server 2014.

Despite the lack of adoption, I’ve been moving all my training material to SQL Server 2019. Heck, in a few years, my old posts might come in handy for you.

But during that process, I kept running into the same problem: The demos generally still worked for the OLTP-ish queries, but for the report-ish queries Batch Mode On Rowstore (BMOR, from here) was kicking butt (most of the time anyway, we’re gonna look at some hi-jinks this week).

The problem, so far as I could tell, was that the Stack Overflow 2013 database just wasn’t enough database for SQL Server 2019 (at least with my hardware). My laptop is quad core (8 with HT) @2.9GHz, with 64GB of RAM, and max server memory set to 50GB. The SO2013 database is… just about 50GB.

While it’s fun to be able to create performance problems even with the whole database in memory, it doesn’t match what a lot of people are dealing with in real life.

Especially you poor saps on Standard Edition.

My options seemed to be:

  • Drop max server memory down
  • Use a VM with lower memory
  • Use the full size Stack Overflow database

Flipping and Flopping


Each of these has problems, though.

Dropping max server memory down is okay for the buffer pool, but SQL Server (especially with columnstore and batch mode, it seems) is keen to use memory above that for other things, like memory grants.

A lot of the interesting struggle I see on client servers between the buffer pool and query memory grants didn’t happen when I did that.
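
If you want to see that for yourself, sys.dm_os_memory_clerks is a decent place to look. Here’s a minimal sketch, with an arbitrary 1MB floor just to keep the list readable:

    /* Rough look at where memory goes beyond the buffer pool clerk
       (MEMORYCLERK_SQLBUFFERPOOL); memory grants show up under
       MEMORYCLERK_SQLQERESERVATIONS */
    SELECT   mc.[type],
             SUM(mc.pages_kb) / 1024. AS memory_mb
    FROM     sys.dm_os_memory_clerks AS mc
    GROUP BY mc.[type]
    HAVING   SUM(mc.pages_kb) / 1024. > 1
    ORDER BY memory_mb DESC;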

Using a VM with lower memory, while convenient, just didn’t seem as fun. Plus, part of the problem is that, while I make fun of other sample databases for being unrealistically tiny, at least some of them have relatively modern dates.

I was starting to feel really goofy having time stop on December 31st, 2013.

I suppose I could have updated all the CreationDate columns to modernize things, but who knows what that would have thrown off.

Plus, here’s a dirty little secret: all the date columns that start with “Last”, the ones that track stuff like when someone last logged in, or when a post was last active/edited, don’t stop at 2013-12-31. They extend up to when the database was originally chopped down to size, in 2017 or so. I always found that a bit jarring, and I’d have to go and add time to them, too, to preserve the gaps.

It all starts to feel a bit like revisionist history.

The End Is Thigh


In the end, I settled on using the most recent version available here, but with a couple of the tables I don’t regularly use in demos cut out: PostHistory and PostLinks. Once you drop those out, a 360GB database drops down to a much more manageable 150GB or so.
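
For the curious, the trim itself is nothing fancy. Here’s a rough sketch, assuming you restored the full-size database as StackOverflow (the shrink is slow and normally inadvisable, but as a one-time cost to claw back a couple hundred gigs, I can live with it):

    USE StackOverflow;

    /* The two tables I don't use in demos */
    DROP TABLE dbo.PostHistory;
    DROP TABLE dbo.PostLinks;

    /* One-time space reclaim; don't make a habit of this */
    DBCC SHRINKDATABASE (StackOverflow);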

If you’d like to get a copy, here’s the magnet link.

[Image: SQL Server Database Properties. Caption: Four users, huh?]

The nice thing is that the general cadence of the data is the same in many ways and places, so it doesn’t take a lot to adjust demos to work here. Certain Post and Vote Types, User Ids, Reputations, etc. remain skewed, and outliers are easy to find. Plus, at 3:1 data to memory, it’s a lot harder to keep everything safely in the buffer pool.

This does present different challenges, like the time it takes to create indexes when setting things up, distributing the database, and so on.

But if I can give you better demos, that seems worth it.

Plus, I hear everything is in the cloud now anyway.

Alluding To


In the process of taking old demos and seeing how they work with the new database, I discovered some interesting stuff that I want to highlight a little bit. So far as I can tell, they’re not terribly common (yet), but that’s what makes them interesting.

If you’re the kind of person who’s looking forward to SQL Server 2019’s performance features solving some problems for you auto-magick-ally, these may be things you need to watch out for, and depending on your workload they may end up being quite a bit more common than I perceive.

I’m going to be specifically focusing on how BMOR (and to some extent Adaptive Joins) can end up not solving performance issues, and how you may end up having to do some good ol’ fashioned query tuning on your own.

In the next post, we’ll look at how one of my favorite demos continues to keep on giving.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.

Why Functions Are Bad in SQL Server Where Clauses

Raised Right


It seems like every time I check out a server, the query plans are a nightmare. Users are freaking out, and management’s coffee is more Irish than Colombian.

Many times, the issue is that people are using presentation-layer functions for relational processes: the WHERE clause, joins, GROUP BY, and ORDER BY parts of a query.

Think about built-in string and date functions, wrapped around columns, and the problems they can cause.
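
To make that concrete, here’s a minimal sketch (Stack Overflow-ish table and column names, but the pattern is the point). Wrap the column and the predicate stops being SARGable; leave it bare and an index on CreationDate can actually get seeked:

    /* Non-SARGable: YEAR() has to run against every row */
    SELECT COUNT(*)
    FROM   dbo.Users AS u
    WHERE  YEAR(u.CreationDate) = 2013;

    /* SARGable rewrite: same logic, seekable predicate */
    SELECT COUNT(*)
    FROM   dbo.Users AS u
    WHERE  u.CreationDate >= '20130101'
    AND    u.CreationDate <  '20140101';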

These are things you should actively be targeting in existing code, and fighting to keep out of new code.

Nooptional


When you’re trying to get rid of them, remember your better options:

  • Cleaning data on input, or via triggers: Better than wrapping everything in RTRIM/LTRIM
  • Using computed columns: Better than relying on runtime calculations like DATEADD/DATEDIFF (see the sketch after this list)
  • Breaking queries up: Use UNION ALL to query for either outcome (think ISNULL)
  • Using indexed views: If you need to calculate things in columns across tables
  • Creating reporting tables: Sometimes it’s easier to denormalize a bit to make writing and indexing easier
  • Using #temp tables: If you have data that you need to persist a calculation in and the query to generate it is complicated
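
Here’s that computed column sketch, with hypothetical column names. The calculation happens at write time instead of query time, and since DATEDIFF is deterministic, you can index it:

    /* Hypothetical: persist the calculation in the table */
    ALTER TABLE dbo.Posts
        ADD AgeInDays AS DATEDIFF(DAY, CreationDate, LastActivityDate);

    /* Deterministic computed columns can be indexed */
    CREATE INDEX ix_AgeInDays ON dbo.Posts (AgeInDays);

    /* The predicate now hits a plain column, not a runtime calculation */
    SELECT COUNT(*)
    FROM   dbo.Posts AS p
    WHERE  p.AgeInDays > 365;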

Note the things I’m not suggesting here:

  • CTEs: Don’t materialize anything
  • @table variables: Cause more problems than they solve
  • Views: Don’t materialize unless indexed
  • Functions: Just no, thanks

More Work


Yes, finding and fixing this stuff is more work for you. But it’s a whole lot less work for the optimizer, and your server, when you’re done.

If that’s the kind of thing you need help with, drop me a line.

Thanks for reading!

How SQL Server Statistics Get Updated Automatically

Spawning Monsters


Here we go again, with me promising to blog about something later.

This time it’s an attempt to explain how SQL Server chooses which statistics to update.

It’s not glamorous, and it may even make you angry, but you know.

They can’t all be posts about…

*checks notes*

*stares into the camera*

*tears up notes*

*tears up*

*stares off camera until someone cuts to commercials*

And We’re Back


Let’s start with the query we’re going to use to examine our statistics.

    SELECT      t.name,
                s.name,
                s.stats_id,
                sp.last_updated,
                sp.rows,
                sp.rows_sampled,
                sp.modification_counter
    FROM        sys.stats AS s
    JOIN        sys.tables AS t
        ON s.object_id = t.object_id
    CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
    WHERE       t.name = 'UserStats';

Right now, the results aren’t too interesting, because we only have a statistics object attached to the Primary Key.

We’re not gonna touch that column. We’re gonna use another column.

Running this query will get a system-generated statistics object created on the AccountId column.

    SELECT COUNT(*)
    FROM   dbo.UserStats AS u
    WHERE  u.AccountId > 1000
    AND    u.AccountId < 9999
    OPTION(RECOMPILE);

[Image: SQL Server Statistics. Caption: How nice of you to ask.]

By itself, this isn’t very interesting. Let’s create an index, too.

    CREATE INDEX ix_AccountId ON dbo.UserStats ( AccountId );

[Image: SQL Server Statistics. Caption: Take Me Out Tonight]

The index created statistics, too. With the equivalent of a full scan! See that rows_sampled column?

I mean, why not, if you’re already scanning the whole table to get the data you need for the index, right?

Right.

I’m gonna use a couple updates to flip values around.

    UPDATE u
    SET u.AccountId = u.UpVotes + u.DownVotes
    FROM dbo.UserStats AS u
    WHERE 1 = 1;

    UPDATE u
    SET u.AccountId = u.UpVotes - u.DownVotes
    FROM dbo.UserStats AS u
    WHERE 1 = 1;

Don’t ask me why I swallowed a fly.

But the WHERE 1 = 1 is enough to get SQL Prompt to not warn me about running an update with no where clause.

[Image: SQL Server Statistics. Caption: Modifideded.]

Both stats objects have been modified the same number of times.

Let’s run our COUNT query and see what happens!

[Image: SQL Server Statistics. Caption: Oh, dammit.]

We can see that only the stats for the index were updated (and with the default sampling rate, not a full scan).

Now let’s create another stats object with FULLSCAN.

    CREATE STATISTICS s_AccountId ON dbo.UserStats ( AccountId ) WITH FULLSCAN;

We’ll also go ahead and run an update again.

[Image: SQL Server Statistics. Caption: B-b-b-b-back]

And then our COUNT query…

[Image: SQL Server Statistics. Caption: Ayeeeeeeee]

SQL Server took two perfectly good fully sampled statistics and reduced them to the default sampling.

This doesn’t hurt our query, but it certainly is annoying to see.

That’s why newer versions of SQL Server allow you to persist the sampling rate.
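
If you’re on SQL Server 2016 SP1 CU4 or better, PERSIST_SAMPLE_PERCENT pins the sampling rate so later automatic updates reuse it:

    /* Pin the full scan so auto-updates stop knocking it back down */
    UPDATE STATISTICS dbo.UserStats s_AccountId
    WITH FULLSCAN, PERSIST_SAMPLE_PERCENT = ON;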

Latest and Greatest


A lot of the statistics options people call “rocket science”, like auto create and auto update stats, are there for a reason.

When you let SQL Server make choices, they’re not always the best ones.

Tracking this stuff down and understanding when and if it’s a problem is hard work, though. Don’t flip those switches lightly, my friends.

Thanks for reading!

Why Does My Serial Query Plan Have Parallel Wait Stats Like CXPACKET And CXCONSUMER?

Harkening


In dramatic fashion, I’m revisiting something from this post about stats updates.

It’s a quick post, because uh… Well. Pick a reason.

Get In Gear


Follow along as I repeat all the steps in the linked post to:

  • Load > 2 billion rows into a table
  • Create a stats object on every column
  • Load enough new data to trigger a stats refresh
  • Query the table to trigger the stats refresh

Except this time, I’m adding a mAxDoP 1 hint to it:

    SELECT COUNT(*)
    FROM   dbo.Vetos
    WHERE  UserId = 138
    AND    PostId = 138
    AND    BountyAmount = 138
    AND    VoteTypeId = 138
    AND    CreationDate = 138
    OPTION(MAXDOP 1);

Here’s Where Things Get Interesting


[Image: SQL Server Wait Stats. Caption: Bothsies]

Our MaXdOp 1 query registers nearly the same amount of time on stats updates and parallelism.
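
If you want to check this on your own queries, sys.dm_exec_session_wait_stats (SQL Server 2016 and up) tracks waits per session. Something like this, run from the same session right after the query finishes:

    /* Parallel waits, logged for a query we hinted to be serial */
    SELECT ws.wait_type,
           ws.waiting_tasks_count,
           ws.wait_time_ms
    FROM   sys.dm_exec_session_wait_stats AS ws
    WHERE  ws.session_id = @@SPID
    AND    ws.wait_type IN (N'CXPACKET', N'CXCONSUMER');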

[Image: SQL Server Query Plan. Caption: If this is madness…]

But our plan is indeed serial. Because we told it to be.

By setting maxDOP to 1.

Not Alone


So, if you’re out there in the world wondering why this crazy kinda thing goes down, here’s one explanation: the automatic stats update that fires before your query isn’t bound by your query’s MAXDOP hint, so it can still go parallel and rack up CXPACKET and CXCONSUMER waits while your plan stays serial.

Are there others? Probably.

But you’ll have to find out by setting MAXdop to 1 on your own.

Thanks for reading!
