Stack Overflow Demo Database Column Store Edition Introduction

In The Way Back Machine


I really wanted a version of the Stack Overflow data dump that was all clustered column store. In SQL Server 2016, that didn’t really work, because clustered column store indexes couldn’t include MAX data types. SQL Server 2017 lifted that restriction, but… if your data warehouse has a bunch of MAX data type columns, that’s bad and you should feel bad.
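
To put that in concrete terms, here’s a minimal sketch (the table is made up purely for illustration): on 2016, the index creation below errors out because of the NVARCHAR(MAX) column, while on 2017 and later it goes through.

    /* Hypothetical table just to show the MAX column restriction */
    CREATE TABLE dbo.BigColumnDemo
    (
        Id INT NOT NULL,
        CreationDate DATETIME NOT NULL,
        Body NVARCHAR(MAX) NULL
    );

    /* Fails on SQL Server 2016; succeeds on 2017+, where LOB columns are allowed */
    CREATE CLUSTERED COLUMNSTORE INDEX ccx_BigColumnDemo
        ON dbo.BigColumnDemo;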

The problem here is that once you drop out the “big” columns (AboutMe from Users, Text from Comments, Body from Posts), the entire ~300GB database compresses down to about 6GB. That means if we want a realistically sized data warehouse, we’d need a “Make Big” script, like the ones people used to use for Adventure Works before it went out of business.
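
The general shape of a make-big script is just cross joining a table against a small numbers set to multiply its rows. This isn’t the actual script I use (that’s coming later this week); it’s just a rough sketch, where dbo.Votes_big is a hypothetical, pre-created copy of the Votes schema and the 5x multiplier is arbitrary.

    /* Not the real build script -- just the general "make big" idea */
    INSERT dbo.Votes_big
        (PostId, UserId, BountyAmount, VoteTypeId, CreationDate)
    SELECT
        v.PostId,
        v.UserId,
        v.BountyAmount,
        v.VoteTypeId,
        DATEADD(MINUTE, n.n, v.CreationDate) /* nudge dates so copies aren't identical */
    FROM dbo.Votes AS v
    CROSS JOIN (VALUES (1), (2), (3), (4), (5)) AS n (n);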

This week I’m going to talk about that process, and share links to download the examples so you can mess with them, or create your own.

Today, I’m going to talk about some of the design considerations, and what the initial setup script does, if you want to build your own version.

Hand In Hand


“Large” column store tables should be partitioned. While no one in their right mind would consider partitioning to be a performance feature with row store indexes, it can be beneficial to column store indexes. One of the first things I create after the standard database script-out is a partition function and scheme.

    CREATE PARTITION FUNCTION pfunc (DATETIME)
    AS RANGE RIGHT FOR VALUES
    (
        N'2007-01-01T00:00:00.000',
        N'2008-01-01T00:00:00.000',
        N'2009-01-01T00:00:00.000',
        N'2010-01-01T00:00:00.000',
        N'2011-01-01T00:00:00.000',
        N'2012-01-01T00:00:00.000',
        N'2013-01-01T00:00:00.000',
        N'2014-01-01T00:00:00.000',
        N'2015-01-01T00:00:00.000',
        N'2016-01-01T00:00:00.000',
        N'2017-01-01T00:00:00.000',
        N'2018-01-01T00:00:00.000',
        N'2019-01-01T00:00:00.000',
        N'2020-01-01T00:00:00.000'
    );

The years here go up to 2020, which covers you whether your source database is the full-size version or one of the smaller 2010 or 2013 versions.

The scheme I create puts everything on the primary filegroup. Since this database starts life as a backup/restore, it has four data files, but they’re all in the primary filegroup. You’re welcome to change that, but I don’t find it necessary.

    CREATE PARTITION SCHEME pscheme AS PARTITION pfunc ALL TO ([PRIMARY]);

Also I’m a bit lazy.
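
To show how the function and scheme get used, here’s a sketch of a table landing on them. The column list is trimmed way down from the real Posts table; it’s only here to show the ON clauses.

    /* Trimmed-down example: a partitioned table with an aligned
       clustered column store index, both on the partition scheme */
    CREATE TABLE dbo.Posts
    (
        Id INT NOT NULL,
        CreationDate DATETIME NOT NULL,
        OwnerUserId INT NULL,
        PostTypeId INT NOT NULL,
        Score INT NOT NULL
    ) ON pscheme (CreationDate);

    CREATE CLUSTERED COLUMNSTORE INDEX ccx_Posts
        ON dbo.Posts
        ON pscheme (CreationDate);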

Swimmin’ In Synonyms With Their Own Condominiums


I did something kind of goofy at first. When I was experimenting with doing this, everything was in one database. So uh, I suffixed all the column store tables with “_cs”.

That turned out to be really annoying when running different demo scripts against this database, because I’d have to change all the names. To get around that, I created synonyms, but that felt hacky too.

For instance, any time I needed to write a DMV query that referenced a table, I’d screw up and reference the synonym, which doesn’t quite work as well as you’d hope. By that I mean not at all.
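If you haven’t run into that, here’s roughly what the trap looks like, assuming a synonym named dbo.Posts pointing at a dbo.Posts_cs table. OBJECT_ID resolves to the synonym’s own object, not the base table, so DMV queries keyed on it come back empty.

    /* Assuming dbo.Posts is a synonym for the real dbo.Posts_cs table */
    CREATE SYNONYM dbo.Posts FOR dbo.Posts_cs;

    /* Regular queries against the synonym work fine... */
    SELECT COUNT_BIG(*) AS records FROM dbo.Posts;

    /* ...but this returns nothing, because OBJECT_ID() hands back
       the synonym's object_id, and the partitions belong to dbo.Posts_cs */
    SELECT *
    FROM sys.dm_db_partition_stats
    WHERE object_id = OBJECT_ID(N'dbo.Posts');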

In the final version, all object names match those in other versions of the Stack Overflow database.

Tomorrow


In tomorrow’s post, I’ll show you parts of the script that use a StackOverflow database of your choice as a source to build up the column store version.

Thanks for reading!

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.