November 5, 2017 at 9:53 pm
Comments posted to this topic are about the item Name Value Pairs Table Handling
November 6, 2017 at 3:42 am
Hi there,
thank you for the article, I really enjoyed reading it and following the thought process, and as such I feel kind of bad for criticizing it, but...
... I really, really don't like the approach taken to Transform and Load the data into the target-tables as it screams row-by-row processing.
I can see how the Extract part (read: scraping the non-structured source-data from the web) and storing it in name/value pairs is a versatile method and by its nature is something that works in a 'one at a time' rhythm. But for the next step I truly think you should consider a SET-based operation.
Going row-by-row I can see how the user-defined table type comes in handy here, especially as you can then use it as a parameter in a scalar function, but again, performance-wise you're really shooting yourself in the foot because row-by-row (aka RBAR, the A stands for agonizing, and with reason =) will take MUCH longer to process (100k rows in 3s might be acceptable, but 1s would be better). Furthermore, scalar functions look like the best thing since sliced bread in MSSQL as they make code much easier to read and allow for code reuse... but they really hurt performance-wise when it comes down to it. They have their uses, but in general I'd simply advise against them.
Some things that came to mind:
SELECT @OrderNumber = [TransactionParameterValue]
FROM [dbo].[NameValueTable]
WHERE [TransactionGUID] = @TransactionGUID
AND [TransactionParameterName] = 'OrderNumber';
and
SELECT @OrderAmount = [TransactionParameterValue]
FROM [dbo].[NameValueTable]
WHERE [TransactionGUID] = @TransactionGUID
AND [TransactionParameterName] = 'OrderAmount';
Can be merged into :
SELECT @OrderNumber = (CASE [TransactionParameterName] WHEN 'OrderNumber' THEN [TransactionParameterValue] ELSE @OrderNumber END),
@OrderAmount = (CASE [TransactionParameterName] WHEN 'OrderAmount' THEN [TransactionParameterValue] ELSE @OrderAmount END)
FROM [dbo].[NameValueTable]
WHERE [TransactionGUID] = @TransactionGUID
AND [TransactionParameterName] IN ('OrderNumber', 'OrderAmount');
This will easily outperform fetching the values separately, especially if you're going to be using more than 2 name/value combinations per transaction-id.
Additionally, you probably want the PK of the NameValueTable to be TransactionGUID + TransactionParameterName + all the rest, rather than have TransactionParameterName at the very end.
(Then again, this might be purely because you simplified the example for the sake of the article, I can't tell. It might also cause gigantic fragmentation depending on how the data is coming in from the CLR part; I'm simply pointing out that for this part it isn't optimal.)
Anyway, combining the fetch of the variables into a single select will still force you to use RBAR-based processing (in this case: transaction-id by transaction-id processing).
To get it 'really' set-based you'll have to write a query along the lines of:
;WITH Transactions
AS (SELECT DISTINCT [TransactionGUID]
FROM [dbo].[NameValueTable])
-- INSERT [dbo].[TargetTable] ( [TransactionGUID], [OrderNumber], [OrderAmount] )
SELECT [TransactionGUID] = t.[TransactionGUID],
[OrderNumber] = Convert(varchar(10), nbr.[TransactionParameterValue]),
[OrderAmount] = Convert(DECIMAL(12, 4), amt.[TransactionParameterValue])
FROM Transactions t
LEFT OUTER JOIN [dbo].[NameValueTable] nbr
ON nbr.[TransactionGUID] = t.[TransactionGUID]
AND nbr.[TransactionParameterName] = 'OrderNumber'
LEFT OUTER JOIN [dbo].[NameValueTable] amt
ON amt.[TransactionGUID] = t.[TransactionGUID]
AND amt.[TransactionParameterName] = 'OrderAmount'
I'll agree that for 1 record this seems overkill and probably will show a higher cost than the (merged) query above; however, when unleashing this on the 100,000 records you spoke of, the gain will be substantial.
(Bonus: I also think it's fairly simple to read/maintain. Making changes to the user-defined table type comes with its own headaches, mostly regarding dependencies.)
My 2 cents,
Roby
November 6, 2017 at 7:16 am
@jcboyer-1091017 What you're doing in this article is called pivoting. And you're also doing it in about the most inefficient way possible. Using the PIVOT statement or using CASE statements would likely be much faster.
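For example, a rough PIVOT sketch against the NameValueTable layout from the article might look like this (the IN list of parameter names is the part you'd have to maintain by hand, or generate dynamically):
SELECT p.[TransactionGUID], p.[OrderNumber], p.[OrderAmount]
FROM (SELECT [TransactionGUID], [TransactionParameterName], [TransactionParameterValue]
      FROM [dbo].[NameValueTable]) AS src
PIVOT (MAX([TransactionParameterValue])
       FOR [TransactionParameterName] IN ([OrderNumber], [OrderAmount])) AS p; -- one row per TransactionGUID, one column per listed parameter name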
I'm not sure what the criteria are for having an article posted on SQL Server Central, but it's disconcerting that misinformation like this can be published front and center. The website needs better editorial controls or there's not much value in being subscribed to it.
November 6, 2017 at 7:44 am
Hi, I think you have a typo: you create the type with one name and then use it with a slightly different name. Good article, Rick
CREATE TYPE dbo.NameValueType AS TABLE versus DECLARE @NameValue AS NameValuePairType
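Either name works as long as the two statements agree; a minimal sketch, with an illustrative column list rather than the article's:
CREATE TYPE dbo.NameValueType AS TABLE
(
    [TransactionParameterName]  VARCHAR(128) NOT NULL,
    [TransactionParameterValue] VARCHAR(MAX) NULL
);
GO -- the type has to exist before a later batch can declare a variable of it
DECLARE @NameValue AS dbo.NameValueType;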
November 6, 2017 at 8:12 am
For such an EAV design I would expect the distribution of @TransactionParameterName values to be extremely variable. Some @TransactionParameterName values are likely to be singular, while other @TransactionParameterName values are likely to be ubiquitous. A parameter sniffing concern can thus be inherent in this proposed EAV design. If the passed @TransactionParameterName value has low cardinality (and if SQL Server's plan cache for a query that references @TransactionParameterName is cold), an index seek can be compiled and cached. But if the passed @TransactionParameterName value has high cardinality (and if SQL Server's plan cache for the exact same query is cold), an index scan can be compiled and cached. Whether a cached seek or a cached scan is appropriate for subsequent @TransactionParameterName values (for the same query) can be very problematic in terms of resource consumption. Because performance can be unpredictable, the code may be forced into submitting WITH RECOMPILE or one of the handful of alternatives used to address parameter sniffing concerns. Some of the alternatives force the code into choosing which of its "babies" it wants to "kill".
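A minimal sketch of the recompile route, reusing the column and variable names from the discussion above (one of several ways to keep a sniffed plan from being reused):
SELECT [TransactionParameterValue]
FROM [dbo].[NameValueTable]
WHERE [TransactionGUID] = @TransactionGUID
  AND [TransactionParameterName] = @TransactionParameterName
OPTION (RECOMPILE); -- compile a fresh plan for each execution, trading CPU for a plan that fits the current parameter values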
Even though some of the proposed code looks simple, keep in mind that SQL is an interpreted language: What looks simple can impose complex requirements on the client. Be sure to test all uses of this single table design (using a variety of @TransactionParameterName values), and think about prohibiting developers from adding functionality that JOINs to other tables.
Overall, this type of design makes my namesake spin in his grave. I would give SSIS a serious look before allowing this type of design (or CLR) into a SQL Server system :).
November 6, 2017 at 8:31 am
MgSam - Monday, November 6, 2017 7:16 AM
@jcboyer-1091017 What you're doing in this article is called pivoting. And you're also doing it in about the most inefficient way possible. Using the PIVOT statement or using CASE statements would likely be much faster. I'm not sure what the criteria are for having an article posted on SQL Server Central but it's disconcerting that mis-information like this can be published front and center. The website needs some better editorial controls or it's not of much value to be subscribed to it.
Thanks. The source data is in a string (JSON or XML). As noted in the post, the goal was not creating tables off these strings - the 'normal' way - but being more flexible in getting 'any' data as returned by the Web site API. Less flexibility would have meant more coding down the road.
Publishing odd ways to do some actions in SQL is valuable -- as your post itself shows by suggesting PIVOT to replace my code. It would help me a lot if you could give an example of how you would deal with the problem I am faced with. The important objective is to keep the code flexible enough to harvest API data with minimal code changes. Speed was not the only criterion in this application.
Using SQL 2016, with its native JSON support, has opened new possibilities as well. For example, I found it more efficient to parse JSON in T-SQL than in the .NET code of a CLR assembly. However, my original code was created in 2014, before MS noticed JSON. Improvements in T-SQL mean that I need to revisit the code and see how it can be improved. I am currently doing this and may propose a postscript to my initial post.
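As a rough illustration of the 2016-era T-SQL route (OPENJSON shredding a payload straight into name/value rows; the JSON shape and GUID here are invented for the example, and the database needs compatibility level 130 or higher):
DECLARE @TransactionGUID UNIQUEIDENTIFIER = NEWID();
DECLARE @json NVARCHAR(MAX) = N'{"OrderNumber":"AB-1234","OrderAmount":"19.9500"}';

INSERT INTO [dbo].[NameValueTable] ([TransactionGUID], [TransactionParameterName], [TransactionParameterValue])
SELECT @TransactionGUID, j.[key], j.[value]
FROM OPENJSON(@json) AS j; -- without a WITH clause, OPENJSON returns one key/value row per property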
Another response used the CASE statement. I will look at it in more detail. When dealing with over 100 variables, CASE may start to look HEAVY and possibly be harder to maintain. One gains speed, with extra maintenance costs.
November 6, 2017 at 9:14 am
Thanks Roby.
The proposed CASE strategy was NOT on my radar when I first tried to harvest API data from Web sites in 2007. At the time I simply used OPENXML and created tables fitting the XML documents to parse. Back then we were doing the heavy lifting at the site level, formatting the data there and sending it according to our specifications. <customername> was cast in stone and its result was a column customername. The point is that life was simpler - predefined XML (no JSON then) loading into predefined tables.
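For illustration, that kind of shredding in today's syntax (shown with nodes()/value() rather than OPENXML; the element names and values are invented):
DECLARE @xml XML = N'<order><customername>Acme Corp</customername><orderamount>19.9500</orderamount></order>';

SELECT CustomerName = o.value('(customername/text())[1]', 'VARCHAR(100)'),
       OrderAmount  = o.value('(orderamount/text())[1]',  'DECIMAL(12,4)')
FROM @xml.nodes('/order') AS t(o); -- one row per <order> element, columns fixed at design time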
There was a significant cost in transforming the source XML into our required design. Time to production became an issue when dealing with new Web engines.
I like what you are suggesting. What concerns me is the lengthy code involved if we need to deal with 100 CASE statements. Name mapping (from <customername> to CustomerName, say) can be handled at the same time - a bonus. But I am not sure how to maintain such complex code, or how to add new variables without breaking it.
PIVOT and CASE should definitely be considered. I am working on code refactoring at this point and will take these tools into consideration.
November 6, 2017 at 9:19 am
rick.foster - Monday, November 6, 2017 7:44 AM
Hi, I think you have a typo, you create the type with one name and then use it with a slightly different name. Good Article, Rick
CREATE TYPE dbo.NameValueType AS TABLE versus DECLARE @NameValue AS NameValuePairType
Good catch. Thanks.
November 6, 2017 at 9:21 am
I always read articles like this because I want to see what others are having to do in the EAV world. One thing that I've noticed is that these kinds of articles tend to be rated very poorly because of a knee-jerk "this is bad and non-relational" reaction. Another thing I've noticed is that the comments never, ever propose a workable solution when you need this kind of flexibility. EAV is bad. Yeah, we get it. Now please tell us how to get rid of it without giving our users carte blanche to create innumerable flat unrelated tables. How is that better?
November 6, 2017 at 9:30 am
SoHelpMeCodd - Monday, November 6, 2017 8:12 AM
For such an EAV design I would expect the distribution of @TransactionParameterName values to be extremely variable. Some @TransactionParameterName values are likely to be singular, while other @TransactionParameterName values are likely to be ubiquitous. A parameter sniffing concern can thus be inherent in this proposed EAV design. If the passed @TransactionParameterName value has low cardinality (and if SQL Server's plan cache for a query that references @TransactionParameterName is cold), an index seek can be compiled and cached. But if the passed @TransactionParameterName value has high cardinality (and if SQL Server's plan cache for the exact same query is cold), an index scan can be compiled and cached. Whether a cached seek or a cached scan is appropriate for subsequent @TransactionParameterName values (for the same query) can be very problematic in terms of resource consumption. Because performance can be unpredictable, the code may be forced into submitting WITH RECOMPILE or one of the handful of alternatives used to address parameter sniffing concerns. Some of the alternatives force the code into choosing which of its "babies" it wants to "kill".
Even though some of the proposed code looks simple, keep in mind that SQL is an interpreted language: What looks simple can impose complex requirements on the client. Be sure to test all uses of this single table design (using a variety of @TransactionParameterName values), and think about prohibiting developers from adding functionality that JOINs to other tables.
Overall, this type of design makes my namesake spin in his grave. I would give SSIS a serious look before allowing this type of design (or CLR) into a SQL Server system :).
I did expect this type of push back, and was hoping to get some better expert views on how to magically transform JSON (or XML) arriving in multiple designs into workable, normalized database objects. Your comments on the distribution of data are noted. In fact, I tested pushing the NameValue table into OLTP, and it makes a huge difference in speed.
I am not clear why CLR should not be part of the SQL tool kit. SSIS was considered and not used back in the 2010 time frame. I cannot remember why not.
As usual, if you only have a hammer, everything looks like a nail. And that may have been my case. Thanks for your comments.
November 6, 2017 at 9:37 am
EAV datasets are always challenging to normalize, but they are a fact of life when working with manufacturing historian systems and batch recipes in manufacturing execution systems. I have had to use XSLT to perform the extract and transform; interesting to see a variety of approaches in T-SQL. Thanks.
November 6, 2017 at 9:42 am
dmccarron.maine - Monday, November 6, 2017 9:37 AM
EAV datasets are always challenging to normalize, but they are a fact of life when working with manufacturing historian systems and batch recipes in manufacturing execution systems. I have had to use XSLT to perform the extract and transform; interesting to see a variety of approaches in T-SQL. Thanks.
Indeed. I've done everything from product configuration to environmental data management (my current job) and have not found a way around at least some level of EAV. I tried to design around it in one part of my current job and ended up with column names that had units and analytes hard-coded, like "arsenic_mg_l". It is a freaking nightmare. I'm still trying to extract that data into a sample-result-parameter-unit-value design, which is far more flexible and elegant.
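Something along these lines, sketched from scratch (table, column names, and types are purely illustrative):
CREATE TABLE dbo.SampleResult
(
    SampleID    INT            NOT NULL,  -- the field/lab sample
    ParameterID INT            NOT NULL,  -- the analyte, e.g. arsenic
    UnitID      INT            NOT NULL,  -- the unit of measure, e.g. mg/L
    ResultValue DECIMAL(18, 6) NULL,
    CONSTRAINT PK_SampleResult PRIMARY KEY (SampleID, ParameterID)
);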
November 6, 2017 at 9:49 am
brucepierson2 - Monday, November 6, 2017 9:42 AM
dmccarron.maine - Monday, November 6, 2017 9:37 AM
EAV datasets are always challenging to normalize, but they are a fact of life when working with manufacturing historian systems and batch recipes in manufacturing execution systems. I have had to use XSLT to perform the extract and transform; interesting to see a variety of approaches in T-SQL. Thanks.
Indeed. I've done everything from product configuration to environmental data management (my current job) and have not found a way around at least some level of EAV. I tried to design around it in one part of my current job, and ended up with column names that had units and analytes hard-coded like "arsenic_mg_l". It is a freaking nightmare. I'm still trying to extract that data into a sample-result-parameter-unit-value design which is far more flexible and elegant.
I should have added LIMS to the list of offending systems. 🙂
November 6, 2017 at 10:04 am
jcboyer-1091017 - Monday, November 6, 2017 9:14 AM
I like what you are suggesting. What concerns me is the lengthy code involved if we need to deal with 100 CASE statements. Name mapping (from <customername> to CustomerName, say) can be handled at the same time, a bonus. But I am not sure how to maintain such complex code and how to add new variables without breaking the code.
PIVOT and CASE should definitely be considered. I am working on code refactoring at this point and will take these tools in consideration.
Personally I'd probably try to have all the mapping sitting in a "meta" table that maps ParameterName to DestinationColumn for each ETL process and then build a stored procedure that creates stored procedures that contain all the actual SQL needed to do the heavy lifting. Maintaining the table should be fairly straightforward (add a description/notes column so the thing is self-documenting!) and after each change you can simply regenerate the mapper-procedure. Whether there are 10 fields to be mapped or 100, who cares. Heck, you could probably allow for formulas in there if you don't need to fear evil users who might try to inject dangerous SQL code.
Assuming all ETL target-tables share 'more or less' the same logic, you'd only need to be concerned with maintaining the mapping table and the (single) stored procedure that creates all the 'actual' stored procedures. Working with dynamic SQL isn't always as clear as writing code directly, but I'd rather that than have to maintain a stored procedure for each and every destination table separately.
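A bare-bones sketch of that idea (the mapping table, column, and variable names are invented; a production version would need proper quoting of the parameter names, error handling, and a CREATE PROCEDURE wrapper per target table):
CREATE TABLE dbo.EtlColumnMap
(
    TargetTable       SYSNAME      NOT NULL,
    ParameterName     VARCHAR(128) NOT NULL,
    DestinationColumn SYSNAME      NOT NULL,
    Notes             VARCHAR(400) NULL      -- self-documenting, as suggested above
);
GO
DECLARE @TargetTable SYSNAME = N'dbo.TargetTable', @sql NVARCHAR(MAX);

-- Build one MAX(CASE ...) expression per mapped column, pivoting per TransactionGUID.
SELECT @sql = N'SELECT [TransactionGUID]'
            + (SELECT N', ' + QUOTENAME(m.DestinationColumn)
                      + N' = MAX(CASE WHEN [TransactionParameterName] = '''
                      + m.ParameterName
                      + N''' THEN [TransactionParameterValue] END)'
               FROM dbo.EtlColumnMap AS m
               WHERE m.TargetTable = @TargetTable
               FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)')
            + N' FROM [dbo].[NameValueTable] GROUP BY [TransactionGUID];';

EXEC sys.sp_executesql @sql; -- the real version would wrap this text in a generated stored procedure instead of running it directly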
November 6, 2017 at 10:04 am
dmccarron.maine - Monday, November 6, 2017 9:49 AM
brucepierson2 - Monday, November 6, 2017 9:42 AM
dmccarron.maine - Monday, November 6, 2017 9:37 AM
EAV datasets are always challenging to normalize, but they are a fact of life when working with manufacturing historian systems and batch recipes in manufacturing execution systems. I have had to use XSLT to perform the extract and transform; interesting to see a variety of approaches in T-SQL. Thanks.
Indeed. I've done everything from product configuration to environmental data management (my current job) and have not found a way around at least some level of EAV. I tried to design around it in one part of my current job, and ended up with column names that had units and analytes hard-coded like "arsenic_mg_l". It is a freaking nightmare. I'm still trying to extract that data into a sample-result-parameter-unit-value design which is far more flexible and elegant.
I should have added LIMS to the list of offending systems. 🙂
Haha, no doubt! It's kind of amazing, isn't it, the number of heavily analytical systems out there that are "designed all wrong"? The fact is that by designing our current system "wrong", we have been able to do amazing things like add result qualifiers, limits, quality control, public/private visibility, validation results... the list goes on. Imagine adding a new column, for every single new requirement... for every single combination of analyte/unit... and you start to get some idea of the absolute insanity of trying to be "relationally pure". With the great (but admittedly complex) PIVOT stuff in T-SQL, we can make the reports look just as if the database were physically designed that way. Win-win.