July 23, 2009 at 12:14 am
Comments posted to this topic are about the item Gain Space Using XML data type
August 4, 2009 at 1:28 am
Hi,
I notice you're using very short tag names in your XML, presumably to reduce the space used.
If you use a typed XML column, i.e. associate it with an XSD, the length of the tags shouldn't matter, and you may get other associated benefits. When I have some time I'll do a comparison.
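For anyone who wants to experiment, here's a minimal sketch of what a typed XML column might look like for the <RT>/<RN> structure used in the article (the schema collection name and the attribute types are my own assumptions):

CREATE XML SCHEMA COLLECTION dbo.RainfallSchema AS
N'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="RT">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="RN" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="H" type="xs:unsignedByte" />
            <xs:attribute name="V" type="xs:decimal" />
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>';
GO
-- The Rainfall column would then be declared as typed XML:
-- [Rainfall] XML (dbo.RainfallSchema) NOT NULL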
Regards,
David McKinney.
August 4, 2009 at 1:53 am
Hi,
You can achieve a similar space reduction without XML by using a pair of normalized tables instead of the XML column:
CREATE TABLE Rainfall_LocDate
(LocDateId int IDENTITY(1, 1)
,LocID int
,Date datetime
,CONSTRAINT PK_Rainfall_LocDate PRIMARY KEY NONCLUSTERED(LocID,Date)
)
CREATE TABLE Rainfall_Data
(LocDateId int
,Hour tinyint
,Rainfall numeric(10,8)
,CONSTRAINT PK_Rainfall_Data PRIMARY KEY NONCLUSTERED(LocDateId,Hour)
)
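The hourly readings can then be retrieved with a simple join; a quick sketch (the date and hour literals are just for illustration):

SELECT ld.LocID,
       ld.[Date],
       d.[Hour],
       d.Rainfall
FROM   Rainfall_LocDate ld
INNER JOIN Rainfall_Data d
        ON d.LocDateId = ld.LocDateId
WHERE  ld.[Date] = '20080106'
  AND  d.[Hour]  = 15;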
Regards,
Pavel Krticka
August 4, 2009 at 2:53 am
The space savings are impressive, and the technique does lend itself to a number of real-world situations, but you also need to highlight the shortcomings of the XML data type, as described in SQL Server Books Online:
http://msdn.microsoft.com/en-us/library/ms189887.aspx
Limitations of the xml Data Type
Note the following general limitations that apply to the xml data type:
The stored representation of xml data type instances cannot exceed 2 GB.
It cannot be used as a subtype of a sql_variant instance.
It does not support casting or converting to either text or ntext. Use varchar(max) or nvarchar(max) instead.
It cannot be compared or sorted. This means an xml data type cannot be used in a GROUP BY statement.
It cannot be used as a parameter to any scalar, built-in functions other than ISNULL, COALESCE, and DATALENGTH.
It cannot be used as a key column in an index. However, it can be included as data in a clustered index or explicitly added to a nonclustered index by using the INCLUDE keyword when the nonclustered index is created.
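For example, while an xml column can never be an index key, it can still ride along as an included column; a rough sketch against the article's table (the index name and key columns here are just for illustration):

CREATE NONCLUSTERED INDEX IX_Rainfall_XML_Date
ON dbo.Rainfall_XML ([Date], [LocID])
INCLUDE ([Rainfall]);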
August 4, 2009 at 4:21 am
A very interesting article indeed!
However, I don't think you should compare query performance without any index on the conventional table.
If you add an index (code below) and rerun your query, performance will be roughly 1,000 times better, although the space requirement will also increase considerably.
Rgds,
Pexi
CREATE NONCLUSTERED INDEX [IX_Rainfall_Con_Date_Hour] ON [Rainfall_Con]
(
    [Date] ASC,
    [Hour] ASC,
    [LocID] ASC,
    [Rainfall] ASC
)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
      IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF,
      ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
ON [PRIMARY];
August 4, 2009 at 4:53 am
This script is closely based on that in the article.
I would strongly encourage the author of this article to try it out 🙂
Nice idea for an article, but size and speed are not XML strengths! 😉
Paul
-- *****************************************************************************************
-- WARNING: This script contains commands to clear the server-wide data and procedure caches
-- Do not run on anything other than a TEST instance!
--
-- Please ensure Actual Execution Plan is OFF for representative results
-- Total test time is 10 to 60 seconds (including test data creation), depending on hardware
-- *****************************************************************************************
USE tempdb;
GO
--
-- Drop the test tables if they exist
--
IF OBJECT_ID(N'dbo.Rainfall_Con', N'U') IS NOT NULL DROP TABLE dbo.Rainfall_Con;
IF OBJECT_ID(N'dbo.Rainfall_XML', N'U') IS NOT NULL DROP TABLE dbo.Rainfall_XML;
IF OBJECT_ID(N'tempdb..#BitBucket', N'U') IS NOT NULL DROP TABLE #BitBucket;
GO
--
-- Create the conventional table
--
CREATE TABLE dbo.Rainfall_Con
(
[LocID] INTEGER NOT NULL,
[Date] DATETIME NOT NULL,
[Hour] TINYINT NOT NULL,
[Rainfall] DECIMAL(10,8) NOT NULL,
);
GO
--
-- Populate the conventional table with one year's data for 100 locations
-- This should complete in less than 30 seconds
--
DECLARE @Rows BIGINT;
SET @Rows = 100 * 365 * 24; -- 24 hours' data for 365 days for 100 locations
RAISERROR('Creating test data...', 0, 1) WITH NOWAIT;
;WITH Numbers (N)
AS (
-- Dynamically create rows numbered 0 to 875999
SELECT TOP (@Rows)
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
FROM master.sys.all_columns C1,
master.sys.all_columns C2,
master.sys.all_columns C3
)
INSERT dbo.Rainfall_Con WITH (TABLOCKX)
(
[LocID],
[Date],
[Hour],
[Rainfall]
)
SELECT LocID = (N / 365 / 24) + 1,
[Date] = DATEADD(DAY, (N / 24) % 366 , '2008-01-01 00:00:00.000'),
[Hour] = N % 24,
[Rainfall] = RAND(CHECKSUM(NEWID())) * 10 -- A new random value for each row
FROM Numbers;
GO
--
-- Now create the primary key
-- Doing this *after* inserting the data results in a compact and contiguous index
-- If the index were to exist during the insert, index pages may split resulting in
-- a lower data density, and significant fragmentation
--
RAISERROR('Creating index...', 0, 1) WITH NOWAIT;
ALTER TABLE dbo.Rainfall_Con
ADD CONSTRAINT PK_Rainfall_Con
PRIMARY KEY CLUSTERED([Date], [Hour], [LocID])
WITH (FILLFACTOR = 100);
GO
--
-- Create the XML table
--
CREATE TABLE dbo.Rainfall_XML
(
[LocID] INTEGER NOT NULL,
[Date] DATETIME NOT NULL,
[Rainfall] XML NOT NULL,
);
GO
--
-- Populate the XML table from the conventional table
--
RAISERROR('Creating XML data...', 0, 1) WITH NOWAIT;
INSERT Rainfall_XML WITH (TABLOCKX)
SELECT [LocID],
[Date],
(
-- The hourly data as XML
SELECT H = [Hour],
V = [Rainfall]
FROM dbo.Rainfall_Con A WITH (READUNCOMMITTED)
WHERE A.[LocID] = B.[LocID]
AND A.[Date] = B.[Date]
FOR XML RAW('RN'), ROOT('RT')
)
FROM Rainfall_Con B WITH (READUNCOMMITTED)
GROUP BY
[LocID],
[Date];
GO
--
-- Add the primary key now, for the same reasons as before
--
RAISERROR('Creating index...', 0, 1) WITH NOWAIT;
ALTER TABLE dbo.Rainfall_XML
ADD CONSTRAINT PK_Rainfall_XML
PRIMARY KEY CLUSTERED([Date], [LocID])
WITH (FILLFACTOR = 100);
GO
--
-- Show the space used by each table
--
-- Results on my system:
--
-- Conventional : 876,000 rows, 27016KB reserved, 26856KB data, 96KB index, 64KB unused
-- XML : 36,500 rows, 36504KB reserved, 36504KB data, 112KB index, 64KB unused
--
-- ### XML representation is 35% *larger* ###
--
EXECUTE sp_spaceused Rainfall_Con;
EXECUTE sp_spaceused Rainfall_XML;
GO
--
-- This table is used to capture the results from the test
-- SELECT statements. This avoids any delays in sending the
-- rows to the client from skewing the results
--
CREATE TABLE #BitBucket
(
[LocID] INTEGER NOT NULL,
[Date] DATETIME NOT NULL,
[Hour] TINYINT NOT NULL,
[Rainfall] DECIMAL(10,8) NOT NULL,
);
GO
--
-- TEST START
--
RAISERROR('Starting test', 0, 1) WITH NOWAIT;
--
-- Remove all ad-hoc SQL query plans from cache
--
DBCC FREESYSTEMCACHE (N'SQL Plans');
GO
--
-- Start with a cold data cache
-- Each implementation is run 10 times
-- One time with a cold cache, then nine times with a warm cache
-- This seems more representative to me, feel free to change it
--
DBCC DROPCLEANBUFFERS;
GO
/* SQL */
INSERT #BitBucket WITH (TABLOCKX)
SELECT TP.[LocID],
TP.[Date],
TP.[Hour],
TP.[Rainfall]
FROM dbo.Rainfall_Con TP WITH (READUNCOMMITTED)
WHERE TP.[Date] = '20080106'
AND TP.[Hour] = 15
ORDER BY
LocID ASC;
GO 10
--
-- For fairness, recreate the dump table for the second run
-- TRUNCATE TABLE might give the second run an unfair advantage
-- since mixed and uniform extent allocations would not need
-- to be re-done
--
DROP TABLE #BitBucket;
GO
CREATE TABLE #BitBucket
(
[LocID] INTEGER NOT NULL,
[Date] DATETIME NOT NULL,
[Hour] TINYINT NOT NULL,
[Rainfall] DECIMAL(10,8) NOT NULL,
);
GO
--
-- Second run - XML
--
DBCC DROPCLEANBUFFERS;
GO
/* XML */
INSERT #BitBucket WITH (TABLOCKX)
SELECT TP.LocID,
TP.[Date],
[Hour] = Rainfall.value('(/RT/RN[@H=15]/@H)[1]','tinyint'),
Rainfall_Con = Rainfall.value('(/RT/RN[@H=15]/@V)[1]','numeric(10,8)')
FROM dbo.Rainfall_XML TP WITH (READUNCOMMITTED)
WHERE TP.[Date] = '20080106'
ORDER BY
LocID;
GO 10
--
-- Show the total execution resource usage for all ten runs
--
SELECT query = LEFT(QT.[text], 9),
execution_plan = QP.query_plan,
run_count = QS.execution_count,
total_cpu_time_µs = QS.total_worker_time,
total_logical_reads = QS.total_logical_reads,
total_elapsed_time_µs = QS.total_elapsed_time,
avg_cpu_time_µs = QS.total_worker_time / QS.execution_count,
avg_logical_reads = QS.total_logical_reads / QS.execution_count,
avg_elapsed_time_µs = QS.total_elapsed_time / QS.execution_count
FROM sys.dm_exec_query_stats QS
CROSS
APPLY sys.dm_exec_sql_text (QS.[sql_handle]) QT
CROSS
APPLY sys.dm_exec_query_plan (QS.[plan_handle]) QP
WHERE QT.[text] LIKE '%Rainfall%'
AND QT.[text] NOT LIKE '%dm_exec_query_stats%'
ORDER BY
QS.last_execution_time ASC;
GO
--
-- Sample results:
--
-- SQL Averages: CPU 683µs; Elapsed time: 683µs; Logical reads: 114
-- XML Averages: CPU 80078µs; Elapsed time: 83691µs; Logical reads: 126
--
-- Tidy up
--
GO
IF OBJECT_ID(N'dbo.Rainfall_Con', N'U') IS NOT NULL DROP TABLE dbo.Rainfall_Con;
IF OBJECT_ID(N'dbo.Rainfall_XML', N'U') IS NOT NULL DROP TABLE dbo.Rainfall_XML;
IF OBJECT_ID(N'tempdb..#BitBucket', N'U') IS NOT NULL DROP TABLE #BitBucket;
GO
RAISERROR('Test run complete', 0, 1) WITH NOWAIT;
--
-- END SCRIPT
--
August 4, 2009 at 5:53 am
Nice article, I learned something about XML in SQL Server today.
As Paul pointed out, it's impossible that XML storage needs less space than conventional storage.
With appropriate indexing, there's no way that XML works faster than a relational table.
I suggest that you add some more comments to your article to avoid misleading readers on this topic.
Nice idea, anyway.
Gianluca
-- Gianluca Sartori
August 4, 2009 at 6:45 am
If you make your primary key in both tables CLUSTERED (which would be a reasonable choice in this case given that some of the queries are ORDERing/GROUPing by locid), the storage requirements are somewhat different... XML now uses more storage.
name          rows     reserved   data       index_size   unused
Rainfall_XML  36600    36744 KB   36600 KB   112 KB       32 KB
Rainfall_Con  878400   27080 KB   26928 KB   96 KB        56 KB
August 4, 2009 at 6:53 am
Where I've found the XML data type useful is in situations where you can't control schema changes to a relation (table) but want to record all of its data even if columns are added, modified, or removed.
Logging triggers or archiving processes in these cases can take advantage of SELECT * FROM table FOR XML and insert the result into an XML column in a log/archive/audit table.
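A rough sketch of the idea, assuming a hypothetical dbo.SourceTable to audit (all names here are made up, not from the article):

CREATE TABLE dbo.SourceTable_Audit
(
    AuditId   int IDENTITY(1,1) PRIMARY KEY,
    AuditDate datetime NOT NULL DEFAULT GETDATE(),
    OldRows   xml NULL,  -- image of the rows before the change
    NewRows   xml NULL   -- image of the rows after the change
);
GO
CREATE TRIGGER dbo.trg_SourceTable_Audit
ON dbo.SourceTable
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- One audit row per statement, capturing whatever columns dbo.SourceTable has today
    INSERT dbo.SourceTable_Audit (OldRows, NewRows)
    SELECT (SELECT * FROM deleted  FOR XML RAW('row'), ROOT('old'), TYPE),
           (SELECT * FROM inserted FOR XML RAW('row'), ROOT('new'), TYPE);
END;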
For semi-structured data that has variable schema elements, XML is pretty ideal. Beyond that, relational data structures are better.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
August 4, 2009 at 7:54 am
Another thing to point out about XML's use of space is that, barring compression, the data is being stored as characters wrapped in tags, making storage of integer, numeric, or floating point data LESS efficient.
Datatype    SQL Length   Example value    Length as XML
---------   ----------   --------------   -------------
TinyInt     1 byte       255              10
SmallInt    2 bytes      32767            12
Int         4 bytes      2147483647       17
FLOAT       8 bytes      -2.23.40E-308    20
In the author's example, he chose numeric(10,8) to store the rainfall amount, which would require only 9 bytes for SQL to store, but 18 for XML to store (90.12345678).
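A quick way to see the difference for yourself; a small sketch using the article's attribute markup (the exact XML byte count depends on SQL Server's internal binary representation):

DECLARE @d numeric(10,8);
DECLARE @x xml;
SET @d = 90.12345678;
SET @x = N'<RN H="15" V="90.12345678"/>';
SELECT numeric_bytes = DATALENGTH(@d),  -- 9 bytes at this precision
       xml_bytes     = DATALENGTH(@x);  -- internal binary XML; noticeably larger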
The author also chose to add a separate [Hour] column to the conventional table, which was unnecessary because the hour could have been folded into the [Date] column.
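For what it's worth, the folded-date variant might look like this (the table and column names are mine, not the author's):

CREATE TABLE dbo.Rainfall_Con2
(
    [LocID]    INTEGER       NOT NULL,
    [ObsTime]  DATETIME      NOT NULL,  -- date *and* hour carried in one column
    [Rainfall] DECIMAL(10,8) NOT NULL,
    CONSTRAINT PK_Rainfall_Con2 PRIMARY KEY CLUSTERED ([ObsTime], [LocID])
);
-- Populated from the original layout with something like:
-- INSERT dbo.Rainfall_Con2
-- SELECT [LocID], DATEADD(HOUR, [Hour], [Date]), [Rainfall] FROM dbo.Rainfall_Con;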
Finally, he could have chosen to have a separate column for each hour in his schema for the conventional table, rather than making it one row per hour. (That wouldn't have been my first choice, because I feel it lacks flexibility in manipulating the data. But it is essentially the approach that was taken with the table that used XML.)
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
August 4, 2009 at 8:24 am
It's an interesting look from a developer point of view. Compared to the power and speed of the relational engine, I think space savings are a minimal thing to worry about. Disk prices fall at a rate far beyond that of the improvements in CPU and memory when you compare price/performance.
I agree with many of the comments, that a well designed relational system is probably better, but this is an interesting solution and idea. If disk space matters a lot, perhaps it's worth considering, but I would recommend that you investigate some of the other suggestions first.
August 4, 2009 at 8:37 am
I think the merit in this article lies more in the techniques that are covered, rather than in the premise that it ultimately failed to prove.
The article does show a couple of good xquery examples in context - how to select a value from a node and incorporate it in a resultset, and how to modify the value of a node.
SELECT Rainfall.value('(/RT/RN[@H=15]/@H)[1]','tinyint')        AS [Hour]
      ,Rainfall.value('(/RT/RN[@H=15]/@V)[1]','numeric(10,8)')  AS Rainfall
FROM   dbo.Rainfall_XML;

UPDATE TP
SET    Rainfall.modify('replace value of (/RT/RN[@H="20"]/@V)[1] with "8.12345678"')
FROM   dbo.Rainfall_XML TP;  -- add a WHERE clause to target specific rows
All in all, not a bad article, well written and clear explanations, even if in some respects fundamentally flawed!
August 4, 2009 at 3:51 pm
Paul White (8/4/2009)
This script is closely based on that in the article. I would strongly encourage the author of this article to try it out 🙂
Nice idea for an article, but size and speed are not XML strengths! 😉
Paul
MythBuster extraordinaire... that script goes into my "Pork Chop Hall of Fame". Nice job, Paul.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 4, 2009 at 3:54 pm
Steve Jones - Editor (8/4/2009)
It's an interesting look from a developer point of view. Compared to the power and speed of the relational engine, I think space savings are a minimal thing to worry about.
Space savings are certainly important to me, especially when it comes to backups and restores, but according to Paul's script... what space savings?
--Jeff Moden
Change is inevitable... Change for the better is not.
August 4, 2009 at 4:54 pm
Didn't test it, and perhaps there are none. In any case, I think it's been a nice debate/discussion of the technique.