February 15, 2016 at 6:50 am
500k in 10 minutes on production. SQL plan attached.
SELECT TOP (500000) X.ADDRESS_ID
FROM ( SELECT ADDRESS_ID FROM dbo.ADDRESS
EXCEPT SELECT ADDRESS_ID1 FROM EARLY_PAYOFF_INFO EPI1
EXCEPT SELECT ADDRESS_ID1 FROM EARLY_PAYOFF_INFO EPI2
EXCEPT SELECT ADDRESS_ID2 FROM EARLY_PAYOFF_INFO EPI3
EXCEPT SELECT ADDRESS_ID FROM CHECK_REQUEST_DEFAULT_ADDRESS
EXCEPT SELECT PREV_LIENHOLDER_MAIL_TO_ADDRESS_ID FROM DIRECT_LENDING_SERVICE
EXCEPT SELECT ADDRESS_ID FROM PACKET
EXCEPT SELECT ADDRESS_ID FROM WORK_QUEUE_ITEM
EXCEPT SELECT ADDRESS_ID FROM LEGAL_ENTITY
EXCEPT SELECT ADDRESS_ID FROM SERVICE_REQUEST
EXCEPT SELECT ADDRESS_ID FROM LIENHOLDER) X(ADDRESS_ID);
February 15, 2016 at 6:56 am
Can you try it ordered from most rows to least? See Jeff's follow-up post to mine. The idea is to reduce the set as much as possible, as early as possible.
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
February 15, 2016 at 9:32 am
Sergiy (2/14/2016)
Jeff Moden (2/14/2016)
Sergiy (2/14/2016)
I wonder - how long would the SELECT part of the query take? The DELETE part would probably take as long as not doing any deletes, and would probably delete the same number of rows... none. Mission accomplished! :-P
Why do you think it's gonna be no rows deleted?
I could be wrong, but it looks as if all it takes is one address ID in the other tables that matches the first address ID in the address table, and nothing will be deleted.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 15, 2016 at 1:52 pm
Jeff Moden (2/15/2016)
Sergiy (2/14/2016)
Jeff Moden (2/14/2016)
Sergiy (2/14/2016)
I wonder - how long would the SELECT part of the query take? The DELETE part would probably take as long as not doing any deletes, and would probably delete the same number of rows... none. Mission accomplished! :-P
Why do you think it's gonna be no rows deleted?
I could be wrong, but it looks as if all it takes is one address ID in the other tables that matches the first address ID in the address table, and nothing will be deleted.
No, you're not wrong. (As always, b...y h..l!) :-P
But look at the data.
500 million addresses stored in the database.
USPS delivers to 154 million addresses.
So, the client is 3 times bigger than the US.
Who's bigger?
China?
I doubt the Chinese would outsource a database of national scale to India.
India?
IndianRock said they are an outsourcing company, so it's not India.
Only one conclusion left - this is not normalised data storage.
Every new case adds a record for a new client, its lender, associated legal entity, etc.
Each of those records has an address, which is recorded as an ad-hoc set of strings.
In an attempt to simulate data normalisation the developers put all the addresses into a single table, but each of those 9 addresses creates a new record in that table no matter what.
The lending agency most likely has not moved once over the course of the last 10 years, but the application adds a new record for its address every time it's mentioned in a new case.
That's the only way I can explain such a number of addresses in a car dealership database.
Now, what happens when a case expires?
They remove it from the database.
And they have to remove all the records associated with that case.
Because they are all one-off records, and they cannot be reused in other cases.
Logically, cases are removed in chronological order, the oldest first.
The oldest cases created the addresses with the lowest ADDRESS_IDs.
When you remove them, the minimum ADDRESS_ID in use moves up, and my DELETE will find some work to do.
Yes, I admit, there might be some exceptions, some jumps in the queue.
Some addresses may be late for deletion by a week or two.
But they'll catch up pretty soon, I'm sure.
IndianRock - how far is my guess from the actual state of affairs?
_____________
Code for TallyGenerator
February 16, 2016 at 4:42 am
Orlando Colamatteo (2/15/2016)
Can you try it ordered from most rows to least? See Jeff's follow-up post to mine. The idea is to reduce the set as much as possible, as early as possible.
You'd think the optimiser would rearrange the join order, but it doesn't seem to, and the natural order in the query does have an influence, as you'd expect.
I set up some sample data like this:
SELECT TOP 15000000
    Address_ID = ID
INTO #Address
FROM dbo.SomeTableWithAUniqueIDColumn
ORDER BY NEWID(); -- capture 15,000,000 IDs in random order
CREATE UNIQUE CLUSTERED INDEX ucx_Stuff ON #Address (Address_ID);
with another two tables having the same number of rows captured at random, plus another four tables with a million rows captured at random (from 21 million rows in the permanent table).
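The remaining sample tables could be built the same way; a sketch reusing the pattern above (the table names match those in the queries below, and the source table is the same placeholder):
SELECT TOP 15000000 Address_ID = ID INTO #Address01 FROM dbo.SomeTableWithAUniqueIDColumn ORDER BY NEWID();
CREATE UNIQUE CLUSTERED INDEX ucx_Stuff01 ON #Address01 (Address_ID);
-- ...same again for #Address02 (15,000,000 rows)...
SELECT TOP 1000000 Address_ID = ID INTO #Address03 FROM dbo.SomeTableWithAUniqueIDColumn ORDER BY NEWID();
CREATE UNIQUE CLUSTERED INDEX ucx_Stuff03 ON #Address03 (Address_ID);
-- ...same again for #Address04 through #Address06 (1,000,000 rows each)...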
Then I ran these two queries and recorded the best of three executions for each of the three conditions:
DROP TABLE #Temp1;
SELECT Address_ID
INTO #Temp1 FROM #Address -- 15,000,000
EXCEPT SELECT Address_ID FROM #Address01 -- 15,000,000
EXCEPT SELECT Address_ID FROM #Address02 -- 15,000,000
EXCEPT SELECT Address_ID FROM #Address03 -- 1,000,000
EXCEPT SELECT Address_ID FROM #Address04 -- 1,000,000
EXCEPT SELECT Address_ID FROM #Address05 -- 1,000,000
EXCEPT SELECT Address_ID FROM #Address06 -- 1,000,000
-- no hint: 00:00:37
-- OPTION(MERGE JOIN): 00:00:13
-- OPTION(HASH JOIN): 00:00:42
DROP TABLE #Temp2;
SELECT Address_ID
INTO #Temp2 FROM #Address -- 15,000,000
EXCEPT SELECT Address_ID FROM #Address06 -- 1,000,000
EXCEPT SELECT Address_ID FROM #Address05 -- 1,000,000
EXCEPT SELECT Address_ID FROM #Address04 -- 1,000,000
EXCEPT SELECT Address_ID FROM #Address03 -- 1,000,000
EXCEPT SELECT Address_ID FROM #Address02 -- 15,000,000
EXCEPT SELECT Address_ID FROM #Address01 -- 15,000,000
-- no hint: 00:00:46
-- OPTION(MERGE JOIN): 00:00:16
-- OPTION(HASH JOIN): 00:01:05
Note that both of these queries, even without the hints, joined the tables in the same order in which they appear in the query, using hash joins for the big tables and nested loops for the smaller ones.
I tested with different table sizes and the results were consistent - without hints, the optimiser joined the tables in the order in which they appeared in the query, and joining in descending order of size always gave the best performance. Also, a plan utilising merge joins never arose without the merge join hint, and with the hint it always performed better than the alternatives.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
February 16, 2016 at 10:03 am
Plan attached for tables ordered large to small in the EXCEPT lines, with the hint.
-- 50k in QA, tables ordered large to small, OPTION(MERGE JOIN): 02:34
-- 50k in QA, tables ordered large to small, no hint: 36 minutes
-- original table order in prod: 500k in 10 minutes
SELECT TOP (50000) X.ADDRESS_ID
FROM ( SELECT ADDRESS_ID FROM dbo.ADDRESS
EXCEPT SELECT ADDRESS_ID FROM LEGAL_ENTITY
EXCEPT SELECT ADDRESS_ID FROM LIENHOLDER
EXCEPT SELECT ADDRESS_ID FROM SERVICE_REQUEST
EXCEPT SELECT ADDRESS_ID FROM WORK_QUEUE_ITEM
EXCEPT SELECT ADDRESS_ID FROM PACKET
EXCEPT SELECT PREV_LIENHOLDER_MAIL_TO_ADDRESS_ID FROM DIRECT_LENDING_SERVICE
EXCEPT SELECT ADDRESS_ID1 FROM EARLY_PAYOFF_INFO EPI1
EXCEPT SELECT ADDRESS_ID1 FROM EARLY_PAYOFF_INFO EPI2
EXCEPT SELECT ADDRESS_ID2 FROM EARLY_PAYOFF_INFO EPI3
EXCEPT SELECT ADDRESS_ID FROM CHECK_REQUEST_DEFAULT_ADDRESS) X(ADDRESS_ID)
OPTION(MERGE JOIN)
February 16, 2016 at 1:43 pm
ChrisM@Work (2/16/2016)
Orlando Colamatteo (2/15/2016)
Can you try it ordered from most rows to least? See Jeff's follow-up post to mine. The idea is to reduce the set as much as possible, as early as possible.
You'd think the optimiser would rearrange the join order, but it doesn't seem to, and the natural order in the query does have an influence, as you'd expect.
My understanding is that order matters when using EXCEPT. In this case it does not (logically, as long as we have ADDRESS first), but the QO cannot know our intent.
IF OBJECT_ID(N'tempdb..#a') IS NOT NULL
DROP TABLE #a;
IF OBJECT_ID(N'tempdb..#b') IS NOT NULL
DROP TABLE #b;
GO
CREATE TABLE #a (item VARCHAR(100));
CREATE TABLE #b (item VARCHAR(100));
INSERT INTO #a (item) VALUES ('Soup');
INSERT INTO #b (item) VALUES ('Soup');
INSERT INTO #b (item) VALUES ('Crackers');
-- returns no rows: 'Soup' also appears in #b
SELECT item FROM #a
EXCEPT
SELECT item FROM #b;
-- returns 'Crackers': it is in #b but not in #a
SELECT item FROM #b
EXCEPT
SELECT item FROM #a;
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
February 16, 2016 at 2:12 pm
IndianRock,
Is it possible to record the unique ADDRESS_IDs from deleted records of, say, table SERVICE_REQUEST when you clean it up?
The best way would be to use FOR DELETE triggers on the tables involved to copy the ADDRESS_IDs from deleted records into a table AddressID_ToDelete (ADDRESS_ID int PRIMARY KEY); a sketch is below.
But it's not the only method you can use, of course.
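A minimal sketch of one such trigger, assuming SERVICE_REQUEST has an ADDRESS_ID column (the trigger name is illustrative only):
CREATE TABLE AddressID_ToDelete (ADDRESS_ID int PRIMARY KEY);
GO
CREATE TRIGGER trg_SERVICE_REQUEST_delete
ON SERVICE_REQUEST
FOR DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- Queue each deleted row's ADDRESS_ID, skipping NULLs and IDs already queued
    INSERT INTO AddressID_ToDelete (ADDRESS_ID)
    SELECT DISTINCT d.ADDRESS_ID
    FROM deleted d
    WHERE d.ADDRESS_ID IS NOT NULL
      AND NOT EXISTS (SELECT * FROM AddressID_ToDelete t
                      WHERE t.ADDRESS_ID = d.ADDRESS_ID);
END;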
_____________
Code for TallyGenerator
February 16, 2016 at 3:10 pm
Try this one:
SELECT A.ADDRESS_ID
INTO #ToDelete
FROM dbo.ADDRESS A
WHERE NOT EXISTS (SELECT * FROM LEGAL_ENTITY L
                  WHERE L.ADDRESS_ID = A.ADDRESS_ID
                 );
IF EXISTS (SELECT * FROM #ToDelete )
DELETE A
FROM #ToDelete A
INNER JOIN SERVICE_REQUEST B ON B.ADDRESS_ID = A.ADDRESS_ID
IF EXISTS (SELECT * FROM #ToDelete )
DELETE A
FROM #ToDelete A
INNER JOIN LIENHOLDER B ON B.ADDRESS_ID = A.ADDRESS_ID
IF EXISTS (SELECT * FROM #ToDelete )
DELETE A
FROM #ToDelete A
INNER JOIN WORK_QUEUE_ITEM B ON B.ADDRESS_ID = A.ADDRESS_ID
IF EXISTS (SELECT * FROM #ToDelete )
DELETE A
FROM #ToDelete A
INNER JOIN PACKET B ON B.ADDRESS_ID = A.ADDRESS_ID
IF EXISTS (SELECT * FROM #ToDelete )
DELETE A
FROM #ToDelete A
INNER JOIN DIRECT_LENDING_SERVICE B ON B.PREV_LIENHOLDER_MAIL_TO_ADDRESS_ID = A.ADDRESS_ID
IF EXISTS (SELECT * FROM #ToDelete )
DELETE A
FROM #ToDelete A
INNER JOIN CHECK_REQUEST_DEFAULT_ADDRESS B ON B.ADDRESS_ID = A.ADDRESS_ID
IF EXISTS (SELECT * FROM #ToDelete )
DELETE A
FROM #ToDelete A
INNER JOIN EARLY_PAYOFF_INFO B ON B.ADDRESS_ID1 = A.ADDRESS_ID OR B.ADDRESS_ID2 = A.ADDRESS_ID OR B.ADDRESS_ID3 = A.ADDRESS_ID
IF EXISTS (SELECT * FROM #ToDelete )
DELETE B
FROM #ToDelete A
INNER JOIN dbo.ADDRESS B ON B.ADDRESS_ID = A.ADDRESS_ID
It replaces 9 "fat joins" with just one.
The rest operate on the thousands of rows remaining in #ToDelete instead of the millions of rows in the Address table.
If you find a way to record the candidates for deletion (see my previous post) you'll be able to get rid of that last "fat join" as well.
You might wish to limit the number of rows deleted by the last statement; see the sketch below.
But most likely, after a couple of runs, the number of records remaining in #ToDelete will never exceed a few hundred, so it would not matter.
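If you do want to cap that final delete, one way is DELETE TOP (the batch size here is just an example):
DELETE TOP (50000) B
FROM #ToDelete A
INNER JOIN dbo.ADDRESS B ON B.ADDRESS_ID = A.ADDRESS_ID;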
Try it on QA and see how it goes.
_____________
Code for TallyGenerator