November 26, 2014 at 6:39 am
skanker (11/26/2014)
I misunderstood what the question was asking so chose neither.I can see now from the discussion that this was testing other things - however based on just the question i could only choose neither. Sorry as it is probably my inexperience that led to this choice. Thanks to everyone, as ever, for the discussion as this is where I pick up loads. π
I also chose "neither", on the assumption that MERGE would perform better than the two options offered. Some might suggest "who cares, it's only five rows". I'd argue that developers should make the effort to get it right, every time.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
November 26, 2014 at 6:59 am
This was removed by the editor as SPAM
November 26, 2014 at 7:02 am
Hugo Kornelis (11/26/2014)
Sorry for the harsh words, but this is absolute rubbish.The code in option 2 will always do two lookups. One for the EXISTS, and then another one for either the INSERT or the UPDATE. The code in option 1 will do a single lookup if the row is already there (for the UPDATE), and then only do a second action when the row was not there. So on average, this code is 50% faster (assuming an equal chance of the row being new or already present).
The only correct answer is option 1.
The links provided are bogus. They are about a different comparison: no UPDATE involved, and the WHERE clause was not on the primary key.
Thanks Hugo. This is what I thought as well, and seemed to be supported by the execution plans.
I couldn't figure out what I must be missing. :crazy:
November 26, 2014 at 7:06 am
Hugo Kornelis (11/26/2014)
Sorry for the harsh words, but this is absolute rubbish.The code in option 2 will always do two lookups. One for the EXISTS, and then another one for either the INSERT or the UPDATE. The code in option 1 will do a single lookup if the row is already there (for the UPDATE), and then only do a second action when the row was not there. So on average, this code is 50% faster (assuming an equal chance of the row being new or already present).
The only correct answer is option 1.
The links provided are bogus. They are about a different comparison: no UPDATE involved, and the WHERE clause was not on the primary key.
Actually Hugo, I think you are wrong.
If you make two tables, once called @table and the other called @tible, say, both with the same content, and make one of the methods operate on @table and the other on @tible, you can have a batch that contains both methods. That makes it possible to bring up the estimated execution plans, and discover that the optimizer believes option 2 is better. In case you don't trust the optimizer, you can then run the batch and look at the actual execution plans. The second option turns out better than the first again - indeed the whole of the second option (two plans - a plan whose last part is a cond-with-query op, and a second plan for the query component of that op) costs less than the first of the two queries in the first option (two plans, one for each of the two queries). Now this is of course for SQL Server 2014 developer edition on my laptop, with other SQL Server releases and different hardware the result may be different; but the measurement that your comment provoked me to undertake seems to indicate that your comment is excessively critical of the question, unwarrantedly harsh.
edit: Of course looking at the two queries in two separate batches will give no useful comparisons - the total cost rounds to 0 in each case; having both methods in one batch allows the engine to do the comparison using actual costs, instead of giving you rounded answers which you can't usefully compare because of the rounding.
And of course like many others I can see no point in actually doing this other than with MERGE; but I do see some point in knowing which of the two wrong methods performs worst.
Tom
November 26, 2014 at 7:23 am
TomThomson (11/26/2014)
Hugo Kornelis (11/26/2014)
Sorry for the harsh words, but this is absolute rubbish.The code in option 2 will always do two lookups. One for the EXISTS, and then another one for either the INSERT or the UPDATE. The code in option 1 will do a single lookup if the row is already there (for the UPDATE), and then only do a second action when the row was not there. So on average, this code is 50% faster (assuming an equal chance of the row being new or already present).
The only correct answer is option 1.
The links provided are bogus. They are about a different comparison: no UPDATE involved, and the WHERE clause was not on the primary key.
Actually Hugo, I think you are wrong.
If you make two tables, once called @table and the other called @tible, say, both with the same content, and make one of the methods operate on @table and the other on @tible, you can have a batch that contains both methods. That makes it possible to bring up the estimated execution plans, and discover that the optimizer believes option 2 is better. In case you don't trust the optimizer, you can then run the batch and look at the actual execution plans. The second option turns out better than the first again - indeed the whole of the second option (two plans - a plan whose last part is a cond-with-query op, and a second plan for the query component of that op) costs less than the first of the two queries in the first option (two plans, one for each of the two queries). Now this is of course for SQL Server 2014 developer edition on my laptop, with other SQL Server releases and different hardware the result may be different; but the measurement that your comment provoked me to undertake seems to indicate that your comment is excessively critical of the question, unwarrantedly harsh.
edit: Of course looking at the two queries in two separate batches will give no useful comparisons - the total cost rounds to 0 in each case; having both methods in one batch allows the engine to do the comparison using actual costs, instead of giving you rounded answers which you can't usefully compare because of the rounding.
And of course like many others I can see no point in actually doing this other than with MERGE; but I do see some point in knowing which of the two wrong methods performs worst.
Hi Tom, I'm sorry but I have to disagree with you on this. You cannot judge performance based on the relative cost of the two plans. Too many of us have been bitten too often by this. Timing using profiler or simply using date variables is accurate but requires a scaled-up test with likely thousands or even tens of thousands of rows to show any significant difference - and of course, when you get to these volumes, you're not going to use a table variable π
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
November 26, 2014 at 7:28 am
Anju Renjith (11/26/2014)
Could you please reword the question and let me know where I went wrong in wording the question.Thanks for your time
You phrased the question as what is the best performance way to insert a fifth row with value 2000 in column 2 and neither option is good.
Perhaps if you'd only given the choices of options 1 and 2 it would have been a more useful question as that would have forced people to pick between two bad options.
November 26, 2014 at 8:23 am
TomThomson (11/26/2014)
If you make two tables, once called @table and the other called @tible, say, both with the same content, and make one of the methods operate on @table and the other on @tible, you can have a batch that contains both methods. That makes it possible to bring up the estimated execution plans, and discover that the optimizer believes option 2 is better. In case you don't trust the optimizer, you can then run the batch and look at the actual execution plans. The second option turns out better than the first again - indeed the whole of the second option (two plans - a plan whose last part is a cond-with-query op, and a second plan for the query component of that op) costs less than the first of the two queries in the first option (two plans, one for each of the two queries). Now this is of course for SQL Server 2014 developer edition on my laptop, with other SQL Server releases and different hardware the result may be different; but the measurement that your comment provoked me to undertake seems to indicate that your comment is excessively critical of the question, unwarrantedly harsh.
Do not trust the cost figures in execution plans for tuning. They are estimates. (Yes, even in an actual execution plan). So if you have performance issues caused by a misestimate, the figure you are looking at tells you exactly the same incorrect information that the optimizer based its bad plan choice on.
In this specific case, the estimated execution plan will always show both queries in the IF EXISTS version, because the entire batch is compiled. The actual execution plan will show only the queries that actually ran. If you run the batch a second time, I expect the UPDATE + condition INSERT to become cheaper because now the actual plan will no longer include the INSERT.
However, you are still looking at lies.
Bottom line is, option 1 executes either 1 or 2 queries; option 2 always executes 2 queries. One of those queries may or may not be marginally cheaper than the other (I doubt it, though), but the cost of skipping a full query outweighs that by a huge margin.
November 26, 2014 at 9:50 am
peter.row (11/26/2014)
I answered neither because neither option answers the question posed.You say you have an identity column, you add 4 rows and want to add a 5th.
Since this is a table variable it only lives in your own scope and so it is not unreasonable to say that the best way is to simply say:
insert into @table (col2) values (5000)
There is no need to check for the existence of a col1 = 5 because you know it doesn't exist.
Come on people - I know you're trying to keep the questions as short as possible but at least be explicit or have a 1 line "assume that ..." so that we don't have to use the multiple choices answers to figure out what the assumptions are.
Given the question, the "neither" answer indicates that you believe they are equal in performance, not the you should chose "neither" method.
Don Simpson
November 26, 2014 at 9:53 am
Hugo Kornelis (11/26/2014)
Sorry for the harsh words, but this is absolute rubbish.The code in option 2 will always do two lookups. One for the EXISTS, and then another one for either the INSERT or the UPDATE. The code in option 1 will do a single lookup if the row is already there (for the UPDATE), and then only do a second action when the row was not there. So on average, this code is 50% faster (assuming an equal chance of the row being new or already present).
The only correct answer is option 1.
The links provided are bogus. They are about a different comparison: no UPDATE involved, and the WHERE clause was not on the primary key.
Actually, given that the question was which method is fastest for the INSERT, then option 2 is correct. If the question were "Which method is most efficient for adding and updating data in this table," then you would be correct.
Don Simpson
November 26, 2014 at 10:39 am
I really don't like this question because it could create wrong assumptions.
As Hugo said, option 1 could be faster if there are large possibilities of a fifth row existing. MERGE could be better if we wanted to do an UPSERT, but the question doesn't mention the update.
If there's a possibility of having more than 4 rows in the table, then there's the possibility of not having a row with col1=5 as identity columns can have gaps.
If there's no possibility of having more than 4 rows, then the validation is not needed and a simple INSERT would be better.
Finally, any performance advice should be based on real tests with enough rows to show a significant difference. That's why my answer was neither.
November 26, 2014 at 10:41 am
I have to disagree with your answer. You simply said you want a fifth row. Doing a simple insert, as you did at the start, is the fastest way. Therefore the correct answer is 'Neither'. If you intended to add special requirements for pre-checking for specific ID of 5, you should state that clearly.
Given your point that EXISTS is efficient, I agree. A specific UPDATE example, without the confusion of the INSERT, would have been much clearer.
Thanks.
November 26, 2014 at 1:45 pm
DonlSimpson (11/26/2014)
Actually, given that the question was which method is fastest for the INSERT, then option 2 is correct. If the question were "Which method is most efficient for adding and updating data in this table," then you would be correct.
No, in the case of an INSERT, they would be the same.
Option 1 will first scan the entire table (since there is no index) to find and update any matching rows. (The lack of a primary key or unique constraint on the identity column will result in a plan that entertains the option of multiple matches!). But in the INSERT case, there are no matches, so this UPDATE will read all rows and do nothing else; and then move to the INSERT.
Option 2 will first scan the entire table (since there is no index) to find the first matching row. The lack of a primary key or unique constraint on the identity column will result in a plan that entertains the option of multiple matches, and "fixes" this by adding logic to short-circuit after the first match. But in the INSERT case, there are no matches, so this EXISTS test will read all rows and do nothing else; and then move to the INSERT.
If there is a difference at all, I expect option 1 to be ever so slightly faster even in this case (compare the execution plans to see the extra operators needed to implement the short-circuit logic!). But in reality, this difference is probably too small to measure. I tried to, but was unable to come up with a good way to actually measure the performance of the two options; the STATISTICS IO results are exactly identical; STATISTICS TIME rounds down to zero, and showing SYSDATETIME() before and after execution also was inconsistent (and in most cases implied a zero duration). Repeatedly executing the batch might work, but I think the overhead of either creating and filling the table variable or undoing the insert would have a bigger impact than the actual statements to be tested.
November 26, 2014 at 1:49 pm
Final note: I do want to stress that the message that the author tried to convey is correct and relevant. He just chose an incorrect context.
But if you need to test for existence of a row and are in doubt between either (a) SELECT ... FROM ... / IF @@ROWCOUNT = 0; or (b) IF (SELECT COUNT(*) FROM ...) = 0; or (c) IF EXISTS (SELECT * FROM ...) - than always go for option c. There will be no measurable difference if the row does not exist, but if the row does exist, the third option will save time by stopping after the first match, whereas options a and b will continue to look for and process / count the remaining matches.
And when possible, try to make sure that there is an index available for the optimizer to use an efficient index seek instead of having to scan all rows....
November 27, 2014 at 3:50 am
Good question, guessed it right, thanks.
Thanks & Best Regards,
Hany Helmy
SQL Server Database Consultant
November 27, 2014 at 4:43 am
Thanks all for your valuable comments.
I will surely consider all your comments when I frame my next question.
Sorry to all who felt that its a bad question
and
Thanks to all who supported me a lot
π
Viewing 15 posts - 16 through 30 (of 30 total)
You must be logged in to reply to this topic. Login to reply