In the recent months I had the pleasure of hearing a presentation by Randolph West, data platform MVP, on how he took on a challenge of replicating a trillion row insert into a table in SQL Server. This was a combination of the Mount Everest idiom, and to prove that SQL Server performance is not as bad as some people say it is. This lit a spark in me, and I wanted to try it out as well. Joe Obbish was the first to do it, and Randolph was the second, so now I wanted to be in that group.
In order to remain unbiased in my attempt I refused to read either blog about how it was accomplished. I instead decided to run a series of test at 1 billion rows, to see if I could get it running with the best efficiency possible and create a baseline for how long I could expect the entire process to take. Joe’s method accomplished the task in approximately two and a half days.
I will state early on that I did not manage to match this, instead the total operation at the end took 111 hours to insert 1,000,000,715,000 rows into a clustered columnstore index. Given that I accomplished this with a simple query structure though, I feel that I was able to achieve what I set out to do.
The data lab machine I have at home is called the Databox, and is a stand-alone custom build I put together specifically for data operations testing and data science competitions I am involved in. I built it myself, and I designed the build to be optimized for disk IO and data operations.
· Intel 8th generation I3 (4 core, no hyperthreading) i3-8100 3.6 ghz 16 PCI Express lanes
· MSI H370M Bazooka Motherboard
· 16 GB DDR4-2400 Ram
· 32 GB Intel Optane 3dX-Point Memory Cache (4 PCIe Lanes)
· GTX 750 ti (8 PCIe Lanes)
· 2 TB SeaGate FireCuda SHDD (Hybrid Hdd) (8gb NAND Flash on 2tb 4 head disk with 64 mb cache)
(Benchmark Disk Write is just shy of 300mbs second on the disk, achieving between 2x and 3x the throughput of the FireCuda alone, and achieving results that rival well over half the current market of SATA SSDs, and for a fraction of the out of pocket cost of going to a large volume NVMe drive)
The data lab machine runs the current build of insider edition Windows 10 Pro, and SQL Server 2017 (Version 14.0.2002.14) Developer Edition. I also run the SQL ML Server, and ata the time of testing the March 2018 Edition of Power BI Report Server, and Power BI Desktop, but the reporting server is not the important part today.
For the purposes of the query, my instance was configured with minimum 6gb, maximum 12gb of memory is reserved for SQL Server, MaxDOP is set to 4 Cost threshold was set to 4. Query store was active on the database set to all, single disk filegroup on Primary fast disk, and Memory optimized filegroup of 200 MB.
The Testing Process
First thing first, I needed a structure. After some quick reading on the Clustered Columnstore Index compression structures, I found that a columnstore contains 1,048,576 rows. So I determined that this would be the optimal load volume per cycle, as it would allow the columnstore to be loaded without having to manage a deltastore in a separate column throughout the operation.
I built two tables, one that would hold an insert value, and one that would be the target table. I populated the insert table with 1,048,576 rows of the value “False”, in a bit format column with a clustered columnstore index.
SET NOCOUNT ON;
DECLARE @i INT = 0
WHILE @i <= 1048575
INSERT INTO dbo.InstTbl (Beta) VALUES (0)
SET @i = @i + 1
SET NOCOUNT OFF
If at First you don’t Succeed, Fail and Fail Again.
Now to the testing, with some guesswork along the way. My first thought was to run a single transaction process, in the same way I ran the InstTbl. This led to some disappoint results, as the compression wouldn’t run until after the insert was completed, and this created a large amount of extra write operations which dragged performance down considerably.
My second thought was to create a stored procedure to run the inserts, and to call the stored procedure, loading the insert through temp. This was an improvement, but still a disappointing result. Onward and upward.
My third attempt was to load the insert table into memory optimized temp storage and then drop that into the insert operation. This too yielded disappointing results, as the memory optimized table required a separate index and as such, it decreased index scan performance. This was also the cause of a concern with running a partition swap operation and chunking the insert into larger sections but was not possible because the indexes did not properly match. (If someone has a suggestion for this approach I am open to hear it.)
My fourth attempt was to load a table variable and then insert from there, but again, this created too much overhead on the bus, and my write speeds went down.
Fifth attempt, and the final one that I ended up using was simple, probably a bit too simple, but it worked amazingly well. Set stats off, Set processor affinity, (Was supposed to set nocount on, but forgot during the test run), and insert with OPTION MaxDOP 4 and force a parallel plan with a query hint (I know, but it worked)
In the course of running the test I averaged out to 150 million rows per minute being inserted into the table and was able to complete it in just over 111 hours, as stated. I can put Trillion Row + Dataset on my resume and move on with some other projects. I will circle back, after now reading how Joe Accomplished it, and see if I can achieve similar or better results.
ALTER SERVER CONFIGURATION
SET PROCESS AFFINITY CPU=0;
SET NOCOUNT OFF
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
SET NOCOUNT ON;
INSERT INTO TrillionPlusTwo (Alpha)
SELECT Beta FROM InstTbl
OPTION (MAXDOP 4, USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE'))
We don’t always get things a 100% right the first time. There are many factors that affect performance, and a series of smaller scale test was able to get me into the performance range I wanted. If I had gone ahead with attempt 1, it would have taken over a month to run. By looking at different configurations, and settings I was able to achieve what I wanted to, and now have this table available to me for additional testing.
When I speak about AGILE and DevOps, this test shows a microcosm of why I prefer this method. I don’t go through a lot of prepatory planning, but use some expertise and develop working options. We toss a bunch of small value tests through these options (ideally with an end user, but for this case it was just me), and within a very short time, come up with a direction that will work long term, and check all the boxes necessary and if necessary make cuts where we can regain value fast, and circle back later on these.
This was a fantastic experience, and something I will build on in future.
PS Special thanks to Joe Obbish and Randoph West.