November 7, 2009 at 5:40 am
Hi friends
I had a quiet strange behavior with joining a table valued function last week.
Business case:
We've got a distributed system. To keep data in sync we use export and import processes. We have to import files with several thousand rows (using XML). After parsing the file we have to update existing data and add new rows. (I must not use any technology like BCP, SSIS or SS2k8 MERGE.)
To avoid thousands of single selects I used a inline table-valued function which gets all concatenated keys. Internally, this function calls a T-SQL split function (like Jeff Moden's function). I concatenate all keys within the client process, send the keys to the server and JOIN the split function to the data table.
My first split function used a inline Numbers table created with CROSS JOINS:
---========================================================
-- create a Jeff Moden style string split function
-- using a inline Numbers table
IF (OBJECT_ID('SplitStringInlineNumbers') IS NULL)
EXECUTE ('CREATE FUNCTION SplitStringInlineNumbers () RETURNS TABLE AS RETURN SELECT 1 A');
GO
ALTER FUNCTION SplitStringInlineNumbers (
@text VARCHAR(MAX)
,@separator CHAR(1)
)
RETURNS TABLE
AS
RETURN
WITH
n1 (Num) AS (SELECT 1 UNION ALL SELECT 1), -- 2
n2 (Num) AS (SELECT 1 FROM n1 CROSS JOIN n1 b), -- 4
n3 (Num) AS (SELECT 1 FROM n2 CROSS JOIN n2 b), -- 16
n4 (Num) AS (SELECT 1 FROM n3 CROSS JOIN n3 b), -- 256
n5 (Num) AS (SELECT 1 FROM n4 CROSS JOIN n4 b), -- 65536
-- 4294967296
Numbers (Num) AS (
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1))
FROM n5 CROSS JOIN n5 b
)
SELECT
SUBSTRING(
@text
,t.Num + 1
,CHARINDEX(@separator, @text, t.Num + 1) - t.Num - 1
) AS Item
FROM Numbers t
WHERE
t.Num < LEN(@text)
AND SUBSTRING(@text, t.Num, 1) = @separator
GO
Everything worked fine, though, as I looked into the ERRORLOG I noticed that SQL Server has no clue about SQL :-D. There have been 32 "Missing Join Predicate" warnings to tell me that my cool function appears to be scrap whenever SQL Server created a execution plan for this function.
Next thought was, no problem use a database Numbers table. (Bad idea...). I slightly changed the function and removed the inline Numbers table:
---========================================================
-- create a Jeff Moden style string split function
-- using a database Numbers table
IF (OBJECT_ID('SplitStringDbNumbers') IS NULL)
EXECUTE ('CREATE FUNCTION SplitStringDbNumbers () RETURNS TABLE AS RETURN SELECT 1 A');
GO
ALTER FUNCTION SplitStringDbNumbers (
@text VARCHAR(MAX)
,@separator CHAR(1)
)
RETURNS TABLE
AS
RETURN
SELECT
SUBSTRING(
@text
,t.Num + 1
,CHARINDEX(@separator, @text, t.Num + 1) - t.Num - 1
) AS Item
FROM Numbers t
WHERE
t.Num < LEN(@text)
AND SUBSTRING(@text, t.Num, 1) = @separator
Warnings gone, but ends up with a completely strange execution plan using a huge index spool and table spool. The old (potentially bad warned) function took 110ms for thousand keys, the new function took about 85,000ms!?!
I noticed both functions are marked as non-deterministic (why?), so I changed my previously used TVFs to a procedure which executes the split function into a table variable and joins this table to the data. Finally everything works fine.
Anyway, why are both functions marked as non-deterministic? Why does the inline-numbers table work 850 times faster than the database-numbers table?
In respect to help you to help me, here is a sample environment which can be used as copy-paste. (Also copy the previous two functions into tempdb.)
Setup tables and data on tempdb
SET NOCOUNT ON;
USE tempdb;
GO
---========================================================
-- drop existing test tables
IF (OBJECT_ID('TestReferences') IS NOT NULL)
DROP TABLE TestReferences;
IF (OBJECT_ID('TestData') IS NOT NULL)
DROP TABLE TestData;
IF (OBJECT_ID('Numbers') IS NOT NULL)
DROP TABLE Numbers;
GO
---========================================================
-- create test tables
-- we need a numbers table
CREATE TABLE Numbers
(
Num INT NOT NULL
PRIMARY KEY CLUSTERED
WITH (FILLFACTOR = 100)
);
-- any sample data
CREATE TABLE TestData
(
Id INT NOT NULL IDENTITY
PRIMARY KEY CLUSTERED
,SomeInt INT
);
-- several references which relate to the test data
CREATE TABLE TestReferences
(
Id INT NOT NULL IDENTITY
PRIMARY KEY CLUSTERED
,DataId INT NOT NULL
REFERENCES TestData (Id)
,Reference VARCHAR(36)
);
-- reference column is indexed
CREATE INDEX IX_TestReferences_Reference ON TestReferences (Reference);
GO
---========================================================
-- data
-- create a numbers table with 50,000 rows
INSERT INTO Numbers
SELECT TOP(50000)
ROW_NUMBER() OVER (ORDER BY (SELECT 1))
FROM master.sys.all_columns c1
CROSS JOIN master.sys.all_columns c2;
-- create 10,000 data rows
INSERT INTO TestData
SELECT TOP(10000)
Num
FROM Numbers
-- create two references for each data row
INSERT INTO TestReferences
SELECT
d.Id
,CONVERT(VARCHAR(36), NEWID())
FROM TestData d
CROSS JOIN (SELECT 1 A UNION ALL SELECT 2) blah
GO
Call the functions
SET NOCOUNT ON;
USE tempdb;
GO
--DBCC FREEPROCCACHE;
---==================================================================
-- call the inline-numbers function joined to the data table
-- get some references from TestReferences table to search
DECLARE @keys VARCHAR(MAX);
SELECT @keys =
CHAR(9)
+ (SELECT TOP(1000) Reference + CHAR(9) FROM TestReferences FOR XML PATH(''));
--SET STATISTICS XML ON;
--SET STATISTICS TIME ON;
-- select all TestData related to the specified keys
SELECT DISTINCT
d.*
FROM SplitStringInlineNumbers(@keys, CHAR(9)) keys
JOIN TestReferences r ON keys.Item = r.Reference
JOIN TestData d ON r.DataId = d.Id;
--SET STATISTICS TIME OFF;
--SET STATISTICS XML OFF;
GO
---==================================================================
-- call the database-numbers function joined to the data table
-- get some references from TestReferences table to search
DECLARE @keys VARCHAR(MAX);
SELECT @keys =
CHAR(9)
+ (SELECT TOP(1000) Reference + CHAR(9) FROM TestReferences FOR XML PATH(''));
--SET STATISTICS XML ON;
--SET STATISTICS TIME ON;
-- select all TestData related to the specified keys
SELECT DISTINCT
d.*
FROM SplitStringDbNumbers(@keys, CHAR(9)) keys
JOIN TestReferences r ON keys.Item = r.Reference
JOIN TestData d ON r.DataId = d.Id;
--SET STATISTICS TIME OFF;
--SET STATISTICS XML OFF;
GO
---==================================================================
-- call the database-numbers function into a table variable and
-- join this to the data table
-- get some references from TestReferences table to search
DECLARE @keys VARCHAR(MAX);
SELECT @keys =
CHAR(9)
+ (SELECT TOP(1000) Reference + CHAR(9) FROM TestReferences FOR XML PATH(''));
-- copy result of split function into a table variable
DECLARE @lookup TABLE (Item VARCHAR(36) PRIMARY KEY CLUSTERED);
INSERT INTO @lookup
SELECT
Item
FROM SplitStringDbNumbers(@keys, CHAR(9));
-- select all data by joining the lookup table
SELECT DISTINCT
d.*
FROM @lookup keys
JOIN TestReferences r ON keys.Item = r.Reference
JOIN TestData d ON r.DataId = d.Id;
First part calls the inline-numbers split function joined to the data table.
Second part calls the database-numbers split function joined to the data table. BAD PART!
Third part calls the database-numbers split function into a table-variable and joins it. (Performs good).
(I cannot attach the execution plans, since I get an error while upload)
I'd be glad about any suggestions about the WHY?
Thanks
Flo
PS to the regulars: Splitting will never end :hehe:
November 7, 2009 at 6:46 am
Flo,
The first part of your question is easy: SQL Server marks all functions as non-deterministic unless they are schema-bound. Schema-binding a function forces the engine to evaluate the content of the function to decide if it is deterministic or not.
Simply change: RETURNS TABLE to RETURNS TABLE WITH SCHEMABINDING and then
select OBJECTPROPERTYEX(object_id('SplitStringInlineNumbers', 'IF'), 'IsDeterministic') returns '1'.
For details, see http://blogs.msdn.com/sqlprogrammability/default.aspx?p=5
That's one of my most favouritest links of all time by the way 🙂
Paul
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
November 7, 2009 at 6:49 am
Ha, so on second look, the same thing answers the second question. When the function does data access (the database numbers table) it can't be deterministic, so SQL Server can't rely on it always returning the same output for the same input, so it can't safely cache the results in a table spool - it must be recalculated every time, and it must access the database each time too! The link I posted before explains it pretty well, let me know if it is less than clear.
Paul
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
November 7, 2009 at 6:52 am
Paul White (11/7/2009)
Simply change: RETURNS TABLE to RETURNS TABLE WITH SCHEMABINDING and thenselect OBJECTPROPERTYEX(object_id('SplitStringInlineNumbers', 'IF'), 'IsDeterministic') returns '1'.
Cool, thanks! Learned something new
For details, see http://blogs.msdn.com/sqlprogrammability/default.aspx?p=5
That's one of my most favouritest links of all time by the way 🙂
Immediately added to my blog list 🙂
Greets
Flo
November 7, 2009 at 6:56 am
Paul White (11/7/2009)
Ha, so on second look, the same thing answers the second question. When the function does data access (the database numbers table) it can't be deterministic, so SQL Server can't rely on it always returning the same output for the same input, so it can't safely cache the results in a table spool - it must be recalculated every time, and it must access the database each time too! The link I posted before explains it pretty well, let me know if it is less than clear.Paul
Thanks again, the link you posted explains everything. Apparently SS2k5 does not mark the inline-numbers based solution as deterministic but handles it as a deterministic function.
Thanks for opening my eyes!
Flo
November 7, 2009 at 7:09 am
Hmmm well the issue is correct even if my rushed explanation there isn't quite right. The cause is still the deterministic thing, and the query plan problems still stem directly from that, but not in quite the way I had assumed. Nevertheless, you seem to be totally across the reasons for the behaviour, and that was the point of the thread really 🙂
It's also good to be able to post that link again - there are still far too many experienced SQL people that don't know about that behaviour with SCHEMABINDING. Even those that do don't always realise that it applies to functions that don't do any data access...
One of these days I will find a T-SQL function that does data access which doesn't suck horribly, but I am not holding my breath. I sometimes wish data access from T-SQL functions would be deprecated and swiftly removed, for all the problems that accompany it. If functions couldn't do data access, they could always be checked for determinism without (sometimes pointless) schema-binding. Oh well.
Paul
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
November 7, 2009 at 7:22 am
Paul White (11/7/2009)
Nevertheless, you seem to be totally across the reasons for the behaviour, and that was the point of the thread really 🙂
Yep 🙂
It's also good to be able to post that link again - there are still far too many experienced SQL people that don't know about that behaviour with SCHEMABINDING. Even those that do don't always realise that it applies to functions that don't do any data access...
Like me.. Seems to be a topic I have to study - when I find the time. Unfortunately there are too many other important things I have to study (especially WCF, what is a huge universe with many gains and pains).
I sometimes wish data access from T-SQL functions would be deprecated and swiftly removed, for all the problems that accompany it. If functions couldn't do data access, they could always be checked for determinism without (sometimes pointless) schema-binding. Oh well.
Good point. And I learned something really important with problem. TVFs (especially single-statement) look really cool but should be used carefully.
Greets
Flo
Viewing 7 posts - 1 through 6 (of 6 total)
You must be logged in to reply to this topic. Login to reply