The merge bug
This is the seventh post in a series of twelve in Tim Ford's Entry-Level Content Challenge.
Last month, I filed my very first Connect item (Microsoft Connect is a portal where you can submit bugs in Microsoft products such as SQL Server and Visual Studio). I had found a bug in SQL Server's query processor, which had to give up producing a query plan for a SQL merge statement and throw an exception instead. In this blog post, I will tell the story of… the merge bug. Not so much to understand merge internals, but more to describe a troubleshooting process, which ultimately ended in finding a bug in SQL Server.
It all started with a support ticket from one of our Effektor customers. Effektor is (among other things) a data warehouse generator, and I'm the product manager for the product. I triage every single bug reported, and for bugs reported in the generated data warehouse objects, I also try to reproduce them. In this case, the error occurred in the data import layer between the DSA and EDW data warehouse layers (if you don't know anything about data warehouse architecture, just ignore these details, or browse my blog post on this topic. See link below). The customer reported that the auto-generated stored procedure for loading into the dimension DimCustomer failed with the following error message:
Msg 8618, Level 16, State 2, Procedure Import_CRM_DimCustomer, Line 108 [Batch Start Line 0]
The query processor could not produce a query plan because a worktable is required, and its minimum row size exceeds the maximum allowable of 8060 bytes. A typical reason why a worktable is required is a GROUP BY or ORDER BY clause in the query. If the query has a GROUP BY or ORDER BY clause, consider reducing the number and/or size of the fields in the clause. Consider using prefix (LEFT()) or hash (CHECKSUM()) of fields for grouping or prefix for ordering. Note however that this will change the behavior of the query.
Hmm. I looked at the error message. The query processor could not produce a query plan. This was not good. I have worked with SQL tuning for some years now, and the query optimizer might spit out a badly performing plan now and then, but I had never been in a situation where no plan could be produced at all. The error message also gave some general advice on typical reasons for this problem. I therefore took a look at the SQL statement at line 108 (I have edited the code a bit to illustrate the structure; the full merge statement for the dimension is more than 200 lines long):
INSERT #t2mergeoutput (
      ScdType
    , <DimCustomer attributes here>
)
SELECT 2 AS ScdType
    , mergeOutput.<DimCustomer attributes here>
FROM (
    MERGE [EDW].[DimCustomer] AS target
    USING #Source AS source
        ON ( target.BusinessKey = source.BusinessKey )
    WHEN NOT MATCHED BY TARGET THEN
        INSERT ( <DimCustomer attributes here> )
        VALUES ( source.<DimCustomer attributes here> )
    WHEN MATCHED
        AND IsCurrent = 1
        AND <checks on whether any of the attribute values have changed>
        THEN UPDATE SET
              target.IsCurrent = 0
            , target.EndDate = DATEADD( second, -1, GETDATE() )
    OUTPUT
          $Action AS MergeAction
        , deleted.<DimCustomer attributes here>
        , inserted.<DimCustomer attributes here>
) AS mergeOutput;
This piece of code in the stored procedure checks if rows from the source table with a given business key exist in the EDW table. If not, a new row is inserted into the EDW table. If a row exists with a given business key, it checks if any non-key attributes differ between the two rows. If so, the current row in the T2 history in the EDW table is updated to a historical row. The result of the merge (all new and updated rows) is then written to a temp table for further processing.
Looking at the merge statement, I couldn't see anything related to GROUP BY and/or ORDER BY, so the advice in the error message did not really help here. Then I read the error message again, and another small hint presented itself: …minimum row size exceeds the maximum allowable of 8060 bytes… I had a look at the source and destination tables for the merge statement. They looked like this:
CREATE TABLE EDW.DimCustomer (
    [CRM Name] nchar(255),
    ... <many more columns here>
)
Bingo! I found the problem. Summing up all those nchar(255) columns would definitely exceed 8060 bytes, since an nchar(255) is 2*255 = 510 bytes long, and there were a lot of them. To test my hypothesis, I removed all columns after the column OldCRMName, and… the query processor still threw an exception. Bugger. I now suspected that I might have hit a bug in SQL Server. Wow. That was almost a thought of heresy. When I first started in database consulting at Miracle back in 2008, I sometimes heard colleagues suggest that a problem could be caused by a bug in the database software. I was almost in shock about this. Think about it: (relational) databases support SQL, and SQL builds on relational algebra, which builds on set theory. And set theory is without flaws (not really, but let's keep the focus here…). So database software cannot have bugs. Well, of course it can.
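By the way, you don't have to add up column widths by hand to test a hypothesis like this. A minimal sketch of a catalog query (the table name is the one from this example; note that sys.columns.max_length reports bytes, so an nchar(255) shows up as 510, and the max types report -1):

```sql
-- Sum the declared maximum byte width of all columns in EDW.DimCustomer.
SELECT SUM(c.max_length) AS declared_row_bytes
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID(N'EDW.DimCustomer');
```

A total well above 8060 makes the "row too wide" hypothesis at least plausible, even if, as it turned out here, it is not the whole story.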
With a suspicion of a SQL Server optimizer bug, I now started experimenting with the query, adding columns to the select part, the match part, and the output part. And I found something really odd: when the output part of the statement contained more than 8 attributes, I could trigger the exception. At the SQL Nexus conference in Copenhagen in May 2016, I consulted Itzik Ben-Gan on the matter, and his first reaction was: it's a bug. So I filed a Connect ticket and tweeted about it (being a little bit anxious when I added the #sqlhelp hashtag).
A few hours later, Paul White wrote me, pointing me to one of his posts.
I had a look at the link (see reference below), but decided that it wasn't relevant in this particular case, because:
- The target table was a heap
- The statement fails even when the target table is empty
But I did send him the repro scripts, and he answered:
MERGE is required to raise an error if it would affect the same target table row more than once. Without constraints that guarantee this condition cannot occur, the query processor has to add extra operators to detect this runtime condition and raise an error if it occurs. The extra operators are a Segment, Sequence Project, and Assert. The Segment tags rows at the start of each new group (as defined by the MERGE's ON clause), the Sequence Project numbers each row in the group, and the Assert throws an error if any row gets a number > 1 (indicating a non-deterministic change).
The Segment requires input sorted by the ON clause column(s), which is BusinessKey in your example. Without an index to provide this order, a Sort is required. The Sort's worktable has a maximum in-row storage size of 8060 bytes. When you add more columns to the OUTPUT clause, you are increasing the width of the rows that have to be stored in the sort worktable.
The optimizer does not currently consider all possible placements for Sort operators. If it did, it might be able to reduce the in-row size below the limit in some cases, but that is not something it does today. The bottom line is that if a Sort cannot be avoided, some MERGE statements will fail to produce an execution plan if the only Sort positions considered would all involve a worktable that exceeds the maximum. The query writer has no control over this, beyond avoiding the Sort in the first place.
Ideally, as documented, both tables in the MERGE would have a unique index on the join column (BusinessKey) and would also be covering or clustered. The two unique indexes guarantee that non-deterministic updates cannot occur, so the Sort, Segment, Sequence Project, and Assert operators are not needed.
I doubt these indexes are possible in your case, since BusinessKey does not appear to be unique in the target table. It may not even be unique in the Source table (though it should be). You could try adding a (covering) unique index just on BusinessKey on the Source table, even if that means preprocessing to remove duplicates. There is no guarantee this will work though. Even if a plan is produced, it may not be very efficient (be wary of any Eager Table Spool that pops up on the dimension table).
My honest advice here is to avoid MERGE for this. It might seem cool to do an insert and update in the same statement, with an output clause to an outer insert (and it is!) but my experience has been that almost without exception, separate DML operations are simpler, easier to tune, and ultimately better performing than a single MERGE. This might sound counter-intuitive, but I can only share what my experience has been, and I like to think I know my way around execution plans and the SQL Server execution engine pretty well.
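Paul's suggested mitigation, a unique index on BusinessKey on the source table, could look something like this (a sketch; the deduplication step and the index name are illustrative, not from our correspondence):

```sql
-- Remove duplicate business keys from the source first; a unique index
-- cannot be created while duplicates exist. Which duplicate survives is
-- arbitrary here; real code would pick a deterministic winner.
;WITH numbered AS (
    SELECT BusinessKey,
           ROW_NUMBER() OVER (PARTITION BY BusinessKey ORDER BY BusinessKey) AS rn
    FROM #Source
)
DELETE FROM numbered WHERE rn > 1;

-- A unique clustered index both guarantees key uniqueness and provides
-- the sort order the Segment operator needs.
CREATE UNIQUE CLUSTERED INDEX ux_Source_BusinessKey ON #Source (BusinessKey);
```

As Paul notes, there is no guarantee this helps, because the target side of the merge cannot get a matching unique index.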
So I tried to create a unique constraint on the target table, and that didn't help either. I wrote him back:
Thank you for the very informative explanation. I tried to make a unique constraint on (BusinessKey, StartDate) on the target, as these two attributes are unique. As you suggested, it doesn't help. The weird thing is that the query processor can produce a plan if I remove (parts of) the OUTPUT clause.
I think we will follow your advice and rewrite the code generator to produce separate DML statements instead of using merge.
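The rewrite follows the classic SCD Type 2 pattern of separate statements. The post does not show the code Effektor actually generates, so the following is only a sketch of the shape, with column lists elided as in the merge example above (the ScdType column name is inferred from SELECT 2 AS ScdType in the merge version):

```sql
-- 1) Close the current version of every changed row, capturing the
--    closed rows for further processing (as the MERGE's OUTPUT did).
UPDATE target
SET target.IsCurrent = 0,
    target.EndDate = DATEADD(second, -1, GETDATE())
OUTPUT 2, deleted.<DimCustomer attributes here>
INTO #t2mergeoutput (ScdType, <DimCustomer attributes here>)
FROM EDW.DimCustomer AS target
JOIN #Source AS source
  ON target.BusinessKey = source.BusinessKey
WHERE target.IsCurrent = 1
  AND <checks on whether any of the attribute values have changed>;

-- 2) Insert new rows: brand-new business keys, plus new current
--    versions for the keys whose current row was just closed.
INSERT EDW.DimCustomer (<DimCustomer attributes here>)
SELECT source.<DimCustomer attributes here>
FROM #Source AS source
WHERE NOT EXISTS (
    SELECT 1
    FROM EDW.DimCustomer AS target
    WHERE target.BusinessKey = source.BusinessKey
      AND target.IsCurrent = 1
);
```

Separate statements like these give the optimizer two simple plans instead of one wide one, and each can be indexed and tuned independently, which is exactly Paul's point.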
Then we started the tedious work of changing the code generator in Effektor to produce insert/update statements instead of merge statements. Now our product works again, the customer has patched his installation, and I learned a lot during the process. I hope you learned something as well.
And that's it for this month's Entry-Level Content Challenge blog post.
Read more on data warehouse architecture here:
Read Paul White's post on Stack Exchange here:
Or visit Paul White's blog here:
See the Connect item here: