purging old data and outer joins

  • 500k rows in 10 minutes on production. Query plan (.sqlplan) attached

    SELECT TOP (500000) X.ADDRESS_ID
    FROM ( SELECT ADDRESS_ID FROM dbo.ADDRESS
    EXCEPT SELECT ADDRESS_ID1 FROM EARLY_PAYOFF_INFO EPI1
    EXCEPT SELECT ADDRESS_ID1 FROM EARLY_PAYOFF_INFO EPI2
    EXCEPT SELECT ADDRESS_ID2 FROM EARLY_PAYOFF_INFO EPI3
    EXCEPT SELECT ADDRESS_ID FROM CHECK_REQUEST_DEFAULT_ADDRESS
    EXCEPT SELECT PREV_LIENHOLDER_MAIL_TO_ADDRESS_ID FROM DIRECT_LENDING_SERVICE
    EXCEPT SELECT ADDRESS_ID FROM PACKET
    EXCEPT SELECT ADDRESS_ID FROM WORK_QUEUE_ITEM
    EXCEPT SELECT ADDRESS_ID FROM LEGAL_ENTITY
    EXCEPT SELECT ADDRESS_ID FROM SERVICE_REQUEST
    EXCEPT SELECT ADDRESS_ID FROM LIENHOLDER) X(ADDRESS_ID);

  • Can you try it in order from most rows to least? See Jeff's follow-up post to mine. The idea is to reduce the set as much as possible, as early as possible.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Sergiy (2/14/2016)


    Jeff Moden (2/14/2016)


    Sergiy (2/14/2016)


    I wonder: how long would the SELECT part of the query take?

    The DELETE part would probably take just as long while not deleting anything, and would probably delete the same number of rows... none. Mission accomplished! πŸ˜€

    Why do you think no rows are going to be deleted?

    I could be wrong, but it looks as if all it takes is one address ID in the other tables that matches the first address ID in the ADDRESS table, and nothing will be deleted.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (2/15/2016)


    Sergiy (2/14/2016)


    Jeff Moden (2/14/2016)


    Sergiy (2/14/2016)


    I wonder: how long would the SELECT part of the query take?

    The DELETE part would probably take just as long while not deleting anything, and would probably delete the same number of rows... none. Mission accomplished! πŸ˜€

    Why do you think no rows are going to be deleted?

    I could be wrong, but it looks as if all it takes is one address ID in the other tables that matches the first address ID in the ADDRESS table, and nothing will be deleted.

    No, you're not wrong. (as always, b...y h..l!) πŸ™‚

    But look at the data.

    500 mil addresses stored in the database.

    USPS delivers to 154 mil addresses.

    So the client is three times bigger than the US.

    Who's bigger?

    China?

    I doubt the Chinese would outsource a database of national scale to India.

    India?

    IndianRock said they are an outsourcing company, so it's not India.

    Only one conclusion is left: this is not normalised data storage.

    Every new case adds a record of a new client, its lender, associated legal entity, etc.

    Each of those records has an address, which is recorded as an ad-hoc set of strings.

    In an attempt to simulate data normalisation, the developers put all addresses into a single table, but each of those 9 addresses creates a new record in the table no matter what.

    The lending agency has most likely not moved once over the course of the last 10 years, but the application adds a new record for its address every time it is mentioned in a new case.

    That's the only way I can explain such a number of addresses in the car dealership database.

    Now, what happens when a case expires?

    They remove it from the database.

    And they have to remove all the records associated with that case.

    Because they are all one-off, and they cannot be reused in other cases.

    Logically, cases are removed in chronological order, the oldest first.

    The oldest cases created the addresses with the lowest ADDRESS_ID values.

    When you remove them, the minimum used ADDRESS_ID moves up, and my DELETE will find some work to do.

    Yes, I admit, there might be some exceptions, some jumps in the queue.

    Some addresses may be late for deletion by a week or two.

    But they'll catch up pretty soon, I'm sure.

    IndianRock - how far is my guess from the actual state of affairs?

    _____________
    Code for TallyGenerator

  • Orlando Colamatteo (2/15/2016)


    Can you try it in order from most rows to least? See Jeff's follow-up post to mine. The idea is to reduce the set as much as possible, as early as possible.

    You'd think the optimiser would rearrange the join order, but it doesn't seem to, and the natural order in the query does have an influence, as you'd expect.

    I set up some sample data like this:

    SELECT TOP 15000000 Address_ID = ID
    INTO #Address
    FROM dbo.SomeTableWithAUniqueIDColumn
    ORDER BY NEWID();

    CREATE UNIQUE CLUSTERED INDEX ucx_Stuff ON #Address (Address_ID);

    with another two tables having the same number of rows captured at random, plus another four tables with a million rows each, captured at random (from 21 million rows in the permanent table).

    Then I ran these two queries and recorded the best of three executions for each of the three conditions:

    IF OBJECT_ID(N'tempdb..#Temp1') IS NOT NULL DROP TABLE #Temp1;

    SELECT Address_ID
    INTO #Temp1 FROM #Address                -- 15,000,000
    EXCEPT SELECT Address_ID FROM #Address01 -- 15,000,000
    EXCEPT SELECT Address_ID FROM #Address02 -- 15,000,000
    EXCEPT SELECT Address_ID FROM #Address03 --  1,000,000
    EXCEPT SELECT Address_ID FROM #Address04 --  1,000,000
    EXCEPT SELECT Address_ID FROM #Address05 --  1,000,000
    EXCEPT SELECT Address_ID FROM #Address06 --  1,000,000
    -- no hint:             00:00:37
    -- OPTION (MERGE JOIN): 00:00:13
    -- OPTION (HASH JOIN):  00:00:42

    IF OBJECT_ID(N'tempdb..#Temp2') IS NOT NULL DROP TABLE #Temp2;

    SELECT Address_ID
    INTO #Temp2 FROM #Address                -- 15,000,000
    EXCEPT SELECT Address_ID FROM #Address06 --  1,000,000
    EXCEPT SELECT Address_ID FROM #Address05 --  1,000,000
    EXCEPT SELECT Address_ID FROM #Address04 --  1,000,000
    EXCEPT SELECT Address_ID FROM #Address03 --  1,000,000
    EXCEPT SELECT Address_ID FROM #Address02 -- 15,000,000
    EXCEPT SELECT Address_ID FROM #Address01 -- 15,000,000
    -- no hint:             00:00:46
    -- OPTION (MERGE JOIN): 00:00:16
    -- OPTION (HASH JOIN):  00:01:05

    Note that both of these queries, even without the hints, joined the tables in the same order that they appear in the query and used hash joins for the big tables and nested loops for the smaller tables.

    I tested with different sizes of tables and the results were consistent: without hints, the optimiser joined the tables in the same order in which they appeared in the query, and joining in descending order of size always gave the best performance. Also, a plan utilising merge joins never arose without the merge join hint, and with the hint it always performed better than the alternatives.
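    If the current table sizes aren't known offhand, approximate row counts for ordering the EXCEPT branches from largest to smallest can be read from the catalog views. A minimal sketch for SQL Server (the counts are approximate, which is fine for ordering purposes):

    ```sql
    -- Approximate row count per user table, largest first.
    -- index_id 0 = heap, 1 = clustered index; summing over partitions
    -- gives the table-level count without scanning the data.
    SELECT t.name      AS table_name,
           SUM(p.rows) AS approx_rows
    FROM sys.tables t
    JOIN sys.partitions p
      ON p.object_id = t.object_id
     AND p.index_id IN (0, 1)
    GROUP BY t.name
    ORDER BY approx_rows DESC;
    ```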

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Plan attached for tables ordered large to small in the "except" lines, with hint.

    -- 50k in QA, tables ordered large to small, OPTION (MERGE JOIN): 02:34
    -- 50k in QA, tables ordered large to small, no hint: 36 minutes
    -- original table order in prod: 500k in 10 minutes

    SELECT TOP (50000) X.ADDRESS_ID
    FROM ( SELECT ADDRESS_ID FROM dbo.ADDRESS
    EXCEPT SELECT ADDRESS_ID FROM LEGAL_ENTITY
    EXCEPT SELECT ADDRESS_ID FROM LIENHOLDER
    EXCEPT SELECT ADDRESS_ID FROM SERVICE_REQUEST
    EXCEPT SELECT ADDRESS_ID FROM WORK_QUEUE_ITEM
    EXCEPT SELECT ADDRESS_ID FROM PACKET
    EXCEPT SELECT PREV_LIENHOLDER_MAIL_TO_ADDRESS_ID FROM DIRECT_LENDING_SERVICE
    EXCEPT SELECT ADDRESS_ID1 FROM EARLY_PAYOFF_INFO EPI1
    EXCEPT SELECT ADDRESS_ID1 FROM EARLY_PAYOFF_INFO EPI2
    EXCEPT SELECT ADDRESS_ID2 FROM EARLY_PAYOFF_INFO EPI3
    EXCEPT SELECT ADDRESS_ID FROM CHECK_REQUEST_DEFAULT_ADDRESS) X(ADDRESS_ID)
    OPTION (MERGE JOIN);

  • ChrisM@Work (2/16/2016)


    Orlando Colamatteo (2/15/2016)


    Can you try it in order from most rows to least? See Jeff's follow-up post to mine. The idea is to reduce the set as much as possible, as early as possible.

    You'd think the optimiser would rearrange the join order, but it doesn't seem to, and the natural order in the query does have an influence, as you'd expect.

    My understanding is that order matters when using EXCEPT. In this case it does not (logically, as long as we have ADDRESS first), but the QO cannot know our intent.

    IF OBJECT_ID(N'tempdb..#a') IS NOT NULL DROP TABLE #a;
    IF OBJECT_ID(N'tempdb..#b') IS NOT NULL DROP TABLE #b;
    GO

    CREATE TABLE #a (item VARCHAR(100));
    CREATE TABLE #b (item VARCHAR(100));

    INSERT INTO #a (item) VALUES ('Soup');
    INSERT INTO #b (item) VALUES ('Soup');
    INSERT INTO #b (item) VALUES ('Crackers');

    SELECT item FROM #a
    EXCEPT
    SELECT item FROM #b; -- returns no rows

    SELECT item FROM #b
    EXCEPT
    SELECT item FROM #a; -- returns 'Crackers'

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • IndianRock,

    Is it possible to record the unique ADDRESS_IDs from the deleted records of, say, SERVICE_REQUEST when you clean it up?

    The best way would be to use FOR DELETE triggers on the tables involved to copy the ADDRESS_IDs from the deleted records into a table AddressID_ToDelete (ADDRESS_ID int PRIMARY KEY).

    But it's not the only method you can use, of course.
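    A minimal sketch of such a FOR DELETE trigger for one of the tables, assuming the AddressID_ToDelete table described above (the trigger name is illustrative, and the NOT EXISTS check keeps the primary key from failing on repeated captures):

    ```sql
    CREATE TRIGGER trg_ServiceRequest_CaptureAddressIDs
    ON dbo.SERVICE_REQUEST
    FOR DELETE
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Record each deleted row's ADDRESS_ID once, for later address cleanup.
        INSERT INTO dbo.AddressID_ToDelete (ADDRESS_ID)
        SELECT DISTINCT d.ADDRESS_ID
        FROM deleted d
        WHERE d.ADDRESS_ID IS NOT NULL
          AND NOT EXISTS (SELECT 1 FROM dbo.AddressID_ToDelete t
                          WHERE t.ADDRESS_ID = d.ADDRESS_ID);
    END;
    ```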

    _____________
    Code for TallyGenerator

  • Try this one:

    SELECT A.ADDRESS_ID
    INTO #ToDelete
    FROM dbo.ADDRESS A
    WHERE NOT EXISTS (SELECT * FROM LEGAL_ENTITY L
                      WHERE L.ADDRESS_ID = A.ADDRESS_ID);

    IF EXISTS (SELECT * FROM #ToDelete)
        DELETE A
        FROM #ToDelete A
        INNER JOIN SERVICE_REQUEST B ON B.ADDRESS_ID = A.ADDRESS_ID;

    IF EXISTS (SELECT * FROM #ToDelete)
        DELETE A
        FROM #ToDelete A
        INNER JOIN LIENHOLDER B ON B.ADDRESS_ID = A.ADDRESS_ID;

    IF EXISTS (SELECT * FROM #ToDelete)
        DELETE A
        FROM #ToDelete A
        INNER JOIN WORK_QUEUE_ITEM B ON B.ADDRESS_ID = A.ADDRESS_ID;

    IF EXISTS (SELECT * FROM #ToDelete)
        DELETE A
        FROM #ToDelete A
        INNER JOIN PACKET B ON B.ADDRESS_ID = A.ADDRESS_ID;

    IF EXISTS (SELECT * FROM #ToDelete)
        DELETE A
        FROM #ToDelete A
        INNER JOIN DIRECT_LENDING_SERVICE B ON B.PREV_LIENHOLDER_MAIL_TO_ADDRESS_ID = A.ADDRESS_ID;

    IF EXISTS (SELECT * FROM #ToDelete)
        DELETE A
        FROM #ToDelete A
        INNER JOIN CHECK_REQUEST_DEFAULT_ADDRESS B ON B.ADDRESS_ID = A.ADDRESS_ID;

    IF EXISTS (SELECT * FROM #ToDelete)
        DELETE A
        FROM #ToDelete A
        INNER JOIN EARLY_PAYOFF_INFO B ON B.ADDRESS_ID1 = A.ADDRESS_ID OR B.ADDRESS_ID2 = A.ADDRESS_ID OR B.ADDRESS_ID3 = A.ADDRESS_ID;

    IF EXISTS (SELECT * FROM #ToDelete)
        DELETE B
        FROM #ToDelete A
        INNER JOIN dbo.ADDRESS B ON B.ADDRESS_ID = A.ADDRESS_ID;

    It replaces 9 "fat joins" with just one.

    The rest will operate on the thousands of rows remaining in #ToDelete instead of the millions of rows in the ADDRESS table.

    If you find a way to record the candidates for deletion (see my previous post), you'll be able to get rid of that last "fat join".

    You might wish to limit the number of rows to be deleted by the last statement.

    But most likely, after a couple of runs, the number of records remaining in #ToDelete will never go over a few hundred, so it would not matter.
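    If that limit does turn out to be needed, the final statement can be batched with TOP. A sketch of one common pattern (the batch size of 50,000 is illustrative):

    ```sql
    -- Delete the surviving candidates in batches to keep the
    -- transaction log growth and lock footprint small.
    DECLARE @rc INT = 1;
    WHILE @rc > 0
    BEGIN
        DELETE TOP (50000) B
        FROM #ToDelete A
        INNER JOIN dbo.ADDRESS B ON B.ADDRESS_ID = A.ADDRESS_ID;

        SET @rc = @@ROWCOUNT;
    END;
    ```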

    Try it on QA and see how it goes.

    _____________
    Code for TallyGenerator
