In recent months I had the pleasure of hearing a presentation by
Randolph West, Data Platform MVP, on how he took on the challenge of replicating
a trillion-row insert into a table in SQL Server. It was part Mount Everest,
done because it was there, and part proof that SQL Server performance is not as
bad as some people say it is. This lit a spark in me, and I wanted to try it as
well. Joe Obbish was the first to do it and Randolph was the second, so now I
wanted to join that group.
In order to remain unbiased in my attempt, I refused to read either blog
post about how it was accomplished. Instead, I decided to run a series of tests
at 1 billion rows to see whether I could get the process running as efficiently
as possible and to create a baseline for how long the entire operation might
take. Joe's method accomplished the task in approximately two and a half days.
I will state early on that I did not manage to match this; the total
operation ended up taking 111 hours to insert 1,000,000,715,000 rows into
a clustered columnstore index. Given that I accomplished this with a simple
query structure, though, I feel I achieved what I set out to do.
The Hardware
The data lab machine I have at home is called the Databox, a
stand-alone custom build I put together specifically for data operations
testing and the data science competitions I am involved in. I built it myself,
and I designed the build to be optimized for disk IO and data operations.
· Intel 8th generation i3 (i3-8100, 3.6 GHz, 4 cores, no hyperthreading, 16 PCI Express lanes)
· MSI H370M Bazooka motherboard
· 16 GB DDR4-2400 RAM
· 32 GB Intel Optane 3D XPoint memory cache (4 PCIe lanes)
· GTX 750 Ti (8 PCIe lanes)
· 2 TB Seagate FireCuda SSHD (hybrid HDD: 8 GB NAND flash on a 2 TB four-head disk with a 64 MB cache)
(Benchmarked disk writes come in just shy of 300 MB per second, between 2x and
3x the throughput of the FireCuda alone, results that rival well over half the
current market of SATA SSDs at a fraction of the out-of-pocket cost of a
large-volume NVMe drive.)
Software
The data lab machine runs the current Insider build of Windows
10 Pro and SQL Server 2017 Developer Edition (version 14.0.2002.14). I also
run SQL ML Server, and at the time of testing the March 2018 editions of
Power BI Report Server and Power BI Desktop, but the reporting server is not
the important part today.
For the purposes of the query, my instance was configured with a minimum of
6 GB and a maximum of 12 GB of memory reserved for SQL Server, MaxDOP set to 4,
and the cost threshold for parallelism set to 4. Query Store was active on the
database with the capture mode set to ALL, the data sat in a single-disk
filegroup on the primary fast disk, and there was a memory-optimized filegroup
of 200 MB.
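For reference, here is a minimal sketch of what applying those settings might look like in T-SQL. The database name ColumstoreTest comes from the scripts later in this post; the memory-optimized filegroup name, file name, and path are placeholders of my own.
-- Instance-level memory and parallelism settings (values from the text above).
EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure 'min server memory (MB)', 6144;
EXEC sys.sp_configure 'max server memory (MB)', 12288;
EXEC sys.sp_configure 'max degree of parallelism', 4;
EXEC sys.sp_configure 'cost threshold for parallelism', 4;
RECONFIGURE;
GO
-- Query Store capturing all queries on the test database.
ALTER DATABASE ColumstoreTest
SET QUERY_STORE = ON (QUERY_CAPTURE_MODE = ALL);
GO
-- Memory-optimized filegroup (filegroup, file name, and path are hypothetical).
ALTER DATABASE ColumstoreTest
ADD FILEGROUP MemOptFG CONTAINS MEMORY_OPTIMIZED_DATA;
ALTER DATABASE ColumstoreTest
ADD FILE (NAME = 'MemOptData', FILENAME = 'D:\Data\ColumstoreTest_MemOpt')
TO FILEGROUP MemOptFG;
GO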
The Testing Process
First things first, I needed a structure. After some quick reading on
clustered columnstore index compression structures, I found that a columnstore
rowgroup holds a maximum of 1,048,576 rows. So I determined that this would be
the optimal load volume per cycle, as it would allow each batch to be loaded
and compressed directly into the columnstore without having to manage a
deltastore throughout the operation.
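As a quick sanity check during testing, a small query against the rowgroup DMV can confirm that batches are landing as full, compressed rowgroups rather than piling up in the deltastore (the target table name is taken from the final script below):
-- Count rowgroups by state; OPEN or CLOSED rows indicate deltastore activity.
SELECT state_desc,
       COUNT(*)        AS row_groups,
       SUM(total_rows) AS total_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('dbo.TrillionPlusTwo')
GROUP BY state_desc;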
I built two tables: one to hold an insert value, and one to be the target
table. I populated the insert table with 1,048,576 rows of the value "False",
stored in a bit column in a table with a clustered columnstore index.
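For reference, the two-table structure looks roughly like this sketch (the index names here are just for illustration):
USE ColumstoreTest;
GO
-- Insert source: a single bit column under a clustered columnstore index.
CREATE TABLE dbo.InstTbl (Beta BIT);
CREATE CLUSTERED COLUMNSTORE INDEX CCI_InstTbl ON dbo.InstTbl;
GO
-- Target table for the trillion-row insert, also clustered columnstore.
CREATE TABLE dbo.TrillionPlusTwo (Alpha BIT);
CREATE CLUSTERED COLUMNSTORE INDEX CCI_TrillionPlusTwo ON dbo.TrillionPlusTwo;
GO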
USE ColumstoreTest;
GO
SET NOCOUNT ON;
-- Load 1,048,576 rows of 0 ("False") into the insert source table.
DECLARE @i INT = 0;
BEGIN TRANSACTION;
WHILE @i <= 1048575
BEGIN
    INSERT INTO dbo.InstTbl (Beta) VALUES (0);
    SET @i = @i + 1;
END;
COMMIT TRANSACTION;
SET NOCOUNT OFF;
GO
If at First You Don't Succeed, Fail and Fail Again.
Now to the testing, with some guesswork along the way. My first thought
was to run a single-transaction process, in the same way I loaded InstTbl.
This led to some disappointing results: compression would not run until after
the insert completed, which created a large amount of extra write
operations and dragged performance down considerably.
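Sketched out, that first attempt looked something like the following (simplified for illustration, not the final approach):
-- Attempt 1 (sketch): one big transaction trickling rows into the target.
SET NOCOUNT ON;
DECLARE @i BIGINT = 0;
BEGIN TRANSACTION;
WHILE @i < 1048576
BEGIN
    INSERT INTO dbo.TrillionPlusTwo (Alpha) VALUES (0);
    SET @i = @i + 1;
END;
COMMIT TRANSACTION;
-- Rows land in the deltastore first and are only compressed after the insert.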
My second thought was to create a stored procedure to run the inserts
and to call that procedure, loading the insert through a temp table. This was
an improvement, but still a disappointing result. Onward and upward.
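Sketched out, that second attempt looked something like this (the procedure and temp table names here are just for illustration):
-- Attempt 2 (sketch): a procedure that stages the batch in a temp table first.
CREATE OR ALTER PROCEDURE dbo.LoadBatchViaTemp
AS
BEGIN
    SET NOCOUNT ON;
    SELECT Beta INTO #Stage FROM dbo.InstTbl;   -- stage 1,048,576 rows
    INSERT INTO dbo.TrillionPlusTwo (Alpha)
    SELECT Beta FROM #Stage;                    -- push them into the target
END;
GO
EXEC dbo.LoadBatchViaTemp;   -- one call per batch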
My third attempt was to load the insert table into memory-optimized
storage and then feed that into the insert operation. This too yielded
disappointing results: the memory-optimized table required a separate index,
which decreased index scan performance. That index was also why a partition
swap operation, chunking the insert into larger sections, was not possible;
the indexes did not properly match. (If someone has a suggestion for this
approach, I am open to hearing it.)
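For context, a durable memory-optimized staging table has to carry its own primary key index, which is where the mismatch came from. A sketch of what that staging table might look like (the table and column names are my own for illustration):
-- Attempt 3 (sketch): memory-optimized staging table; the required PK index
-- is what prevented the indexes from matching the columnstore target.
CREATE TABLE dbo.InstTblMem
(
    Id   BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY NONCLUSTERED,
    Beta BIT NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
GO
INSERT INTO dbo.InstTblMem (Beta)
SELECT Beta FROM dbo.InstTbl;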
My fourth attempt was to load a table variable and then insert from
there, but again this created too much overhead on the bus, and my write
speeds went down.
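Sketched out, roughly:
-- Attempt 4 (sketch): staging the batch in a table variable.
DECLARE @Stage TABLE (Beta BIT NOT NULL);
INSERT INTO @Stage (Beta)
SELECT Beta FROM dbo.InstTbl;
INSERT INTO dbo.TrillionPlusTwo (Alpha)
SELECT Beta FROM @Stage;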
The fifth attempt, the one I ended up using, was simple, probably a bit
too simple, but it worked amazingly well: set statistics off, set processor
affinity (I meant to set NOCOUNT ON as well, but forgot during the test run),
and run the insert with OPTION (MAXDOP 4) while forcing a parallel plan with a
query hint (I know, but it worked).
Over the course of the run I averaged roughly 150 million rows per minute
inserted into the table, about 9 billion rows per hour, and completed it in
just over 111 hours, as stated. I can put "trillion-row-plus dataset" on my
resume and move on to other projects. I will circle back, now having read how
Joe accomplished it, and see if I can achieve similar or better results.
USE ColumstoreTest;
GO
-- Pin SQL Server to CPU 0 (the processor affinity step described above).
ALTER SERVER CONFIGURATION
SET PROCESS AFFINITY CPU = 0;
GO
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
SET NOCOUNT ON;
GO
-- One full rowgroup (1,048,576 rows) per batch, repeated 954,675 times.
INSERT INTO TrillionPlusTwo (Alpha)
SELECT Beta
FROM InstTbl
OPTION (MAXDOP 4, USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE'));
GO 954675
The Lesson
We don't always get things 100% right the first time. Many factors
affect performance, and a series of smaller-scale tests was able to get me
into the performance range I wanted. If I had gone ahead with attempt 1, it
would have taken over a month to run. By looking at different configurations
and settings I was able to achieve what I wanted, and I now have this table
available for additional testing.
When I speak about Agile and DevOps, this test is a microcosm of why
I prefer that method. I don't do a lot of preparatory planning; I use some
expertise to develop working options, toss a bunch of small, quick tests at
those options (ideally with an end user, though in this case it was just me),
and within a very short time arrive at a direction that will work long term
and checks the necessary boxes, making cuts where we can regain value fast and
circling back to them later.
This was a fantastic experience, and something I will build on in the future.
SQL Doch
PS Special thanks to Joe Obbish and Randolph West.