May 19, 2008 at 9:56 am
the results are in:
straight delete took 8 min 47 secs and created 4GB of tran log
crawler delete took 21:49 at batches of 100,000 (40MB of log) and 9:22 at batches of 500,000 (320MB of log)
so batch size is important! I will go for the crawler delete at 500,000 as it is almost as quick and does not grow my log.
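For anyone wanting to reproduce the crawler timing, one way to run such a batched delete on SQL 2000 looks roughly like the sketch below - not necessarily the exact script used here; the duplicate-detection columns are the ones from the posts further down.
SET ROWCOUNT 500000
WHILE 1 = 1
BEGIN
    -- delete up to one batch of duplicates, keeping MIN(RecordId) per group
    DELETE FROM dbo.AssociationLinks
    WHERE RecordId NOT IN
          (SELECT MIN(RecordId)
           FROM dbo.AssociationLinks
           GROUP BY WorkspaceEntityId, SourceEntityId, TargetEntityId, Name, RecordStatus)
    IF @@ROWCOUNT = 0 BREAK
    CHECKPOINT  -- SIMPLE recovery: lets the log truncate between batches
END
SET ROWCOUNT 0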
As it turned out, my estimate of the number of duplicates was off (it was based on when the table ballooned). Rows before delete: 6,344,074; rows after: 2,908! No wonder the optimiser sometimes chose a bad plan!
Of course I will optimise the table after the delete (a quick DBCC DBREINDEX).
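That reindex step is a one-liner - a minimal sketch, where the fill factor of 90 is an illustrative choice rather than anything from the post:
DBCC DBREINDEX ('dbo.AssociationLinks', '', 90)  -- rebuilds all indexes on the table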
Presumably the select out into a temp table would work quickly, but I will not use it, for two reasons:
a) this is a one-off, so a 10-minute delete completed in one step is good enough for me
b) I cannot get the select into to work - it returns an error on execution (not parse): 'No column was specified for column 1 of 'd'.' (or #KeptIDs)
---------------------------------------------------------------------
May 19, 2008 at 12:55 pm
George,
Glad to hear this good news.
So what was your recovery mode setting?
Surely selecting the good records (2908 out of 6M) into a table, truncating and going back would have been infinitely faster.
Precisely which 'select into' can't you get working?
Jeff,
What do you make of this? George deleted 99.96% from a table holding 6.3 million rows - all in one go and in less than 9 minutes. Can you reproduce this? Or rather, any ideas on where this performance issue you mentioned is coming from?
May 19, 2008 at 2:52 pm
Michael,
The db was in simple mode; I monitored log space used to see how large it got.
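The monitoring itself can be as simple as polling the standard log space report between batches - presumably something along the lines of:
DBCC SQLPERF(LOGSPACE)  -- reports log file size and percent used for each database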
I would expect the select to be fast, but it does have the disadvantage of being a multi-stage process (select into, drop original table, rename new table, build indexes). I was happy with what I had, so I stuck with it and, as I said, could not get select into to work in time. Both selects posted failed with the same error:
SELECT MIN(RecordID)
INTO #KeptIDs
FROM AssociationLinks
GROUP BY
WorkspaceEntityId,SourceEntityId,TargetEntityId,Name,RecordStatus
and
SELECT d.*
INTO new_table_name
FROM (select min(RecordId)
from dbo.AssociationLinks
group by WorkspaceEntityId,SourceEntityId,TargetEntityId,Name,RecordStatus)d
INNER JOIN dbo.AssociationLinks al
ON al.RecordID = d.RecordID
Parse works, but it errors on execution.
It appears the tables do not have a matching number of columns?
---------------------------------------------------------------------
May 19, 2008 at 3:18 pm
Oh, and the amount of log was per batch...
---------------------------------------------------------------------
May 19, 2008 at 5:48 pm
Michael Meierruth (5/19/2008)
Jeff, what do you make of this? George deleted 99.96% from a table holding 6.3 million rows - all in one go and in less than 9 minutes. Can you reproduce this? Or rather, any ideas on where this performance issue you mentioned is coming from?
Dunno for sure... on the production box at work (2k, SP4, 8 CPUs, 8 GB of RAM, probably too many indexes), a delete of over 3 million rows took nearly a day... a delete of 2 million rows took minutes. Similar deletes continue to have the same problem and a crawler is necessary. It may have something to do with the way the hard disks are set up, but I'm not privy to that setup. The other thing is that our production box is set to full recovery, not simple recovery like George has (if I read that right).
--Jeff Moden
Change is inevitable... Change for the better is not.
May 19, 2008 at 5:52 pm
george sibbald (5/19/2008)
the results are in:
straight delete took 8 min 47 secs and created 4GB of tran log
crawler delete took 21:49 at batches of 100,000 (40MB of log) and 9:22 at batches of 500,000 (320MB of log)
so batch size is important! I will go for the crawler delete at 500,000 as it is almost as quick and does not grow my log.
As it turned out, my estimate of the number of duplicates was off (it was based on when the table ballooned). Rows before delete: 6,344,074; rows after: 2,908! No wonder the optimiser sometimes chose a bad plan!
Of course I will optimise the table after the delete (a quick DBCC DBREINDEX).
Presumably the select out into a temp table would work quickly, but I will not use it, for two reasons:
a) this is a one-off, so a 10-minute delete completed in one step is good enough for me
b) I cannot get the select into to work - it returns an error on execution (not parse): 'No column was specified for column 1 of 'd'.' (or #KeptIDs)
George,
Would you post the "straight delete" code... I'd like to find out why it went so quickly. I also want to test it against the production boxes at work because they seem to have a problem with deletes of about 3 million rows.
Thanks.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 19, 2008 at 5:58 pm
Michael Meierruth (5/18/2008)
First of all, the delete statement I posted has a slight problem. It works correctly for the duplicates and would delete the duplicates and leave only one row with the lowest RecordId. But at the same time it would also delete all the unique rows. Thus the having-clause must be removed for this delete statement - as Jeff noted - sort of.
Jeff, as for the performance issue when deleting a large subset of rows from a large table, I cannot at all confirm this. I really don't understand why. Thus I ran your code and generated 5 million rows followed by doing a
delete from JBMTestDetail where Time_Stamp>='20060101' and Time_Stamp<='20131231'
which deletes 80% of the rows in about 6 minutes.
I also generated random data in George's table with 6 million rows of which 87% are duplicates and my delete statement takes 3.5 minutes. By changing the sub-select to a real table with a unique index I can get it down to 2.5 minutes.
I tried this on SS2000PE/SP4 and SS2005SE.
By the way, on SS2005 you must not have a unique clustered index on your table, otherwise generating data using newid() produces the same value for all columns of each row. Thus try the following:
Nicely done... I can't get that kind of performance out of the production box at work... I don't know if it's because it's in full recovery mode or not, or if there's a disk problem, or something else like maybe the width of a row. Thanks for posting your stats... you've helped me uncover a possible production problem.
I'll do some more testing and let you know.
--Jeff Moden
Change is inevitable... Change for the better is not.
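For readers who missed the earlier page, the distinction Michael describes works out roughly as follows - an illustrative reconstruction using George's column names, not Michael's exact statement:
-- With HAVING COUNT(*) > 1, the sub-select returns a MIN(RecordId) only for
-- duplicated groups, so every unique row's RecordId falls outside the kept
-- list and gets deleted along with the duplicates:
DELETE FROM dbo.AssociationLinks
WHERE RecordId NOT IN
      (SELECT MIN(RecordId)
       FROM dbo.AssociationLinks
       GROUP BY WorkspaceEntityId, SourceEntityId, TargetEntityId, Name, RecordStatus
       HAVING COUNT(*) > 1)
-- Without the HAVING clause, every group (unique or not) contributes its
-- MIN(RecordId), so exactly one row per group survives:
DELETE FROM dbo.AssociationLinks
WHERE RecordId NOT IN
      (SELECT MIN(RecordId)
       FROM dbo.AssociationLinks
       GROUP BY WorkspaceEntityId, SourceEntityId, TargetEntityId, Name, RecordStatus)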
May 20, 2008 at 12:14 am
george sibbald (5/19/2008)
b) I cannot get the select into to work - it returns an error on execution (not parse): 'No column was specified for column 1 of 'd'.' (or #KeptIDs)
Whoops... that's the trouble with being unable to run the code to test for syntax issues. Fixed (see edited post).
Personally I like this method; it's fast, and I always feel safer knowing I'm snapshotting the rows I want to keep.
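In outline, that approach looks something like the sketch below, using George's column names; the new table name and the swap steps are illustrative rather than the exact edited post.
-- Snapshot the RecordIds to keep (one per duplicate group)
SELECT MIN(RecordId) AS RecordId
INTO   #KeptIDs
FROM   dbo.AssociationLinks
GROUP BY WorkspaceEntityId, SourceEntityId, TargetEntityId, Name, RecordStatus
-- Copy just those rows into a new table
SELECT al.*
INTO   dbo.AssociationLinks_Keep
FROM   dbo.AssociationLinks al
       INNER JOIN #KeptIDs k ON k.RecordId = al.RecordId
-- Once the copy checks out, swap it in and rebuild the indexes:
-- DROP TABLE dbo.AssociationLinks
-- EXEC sp_rename 'dbo.AssociationLinks_Keep', 'AssociationLinks'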
May 20, 2008 at 5:37 am
Jeff, 'straight' delete code -
delete from AssociationLinks
where RecordId not in
(
select min(RecordId)
from AssociationLinks
group by WorkspaceEntityId,SourceEntityId,TargetEntityId,Name,RecordStatus)
Janine - the fix worked. Could you explain to me why, and what that error message had to do with requiring the AS clause? Your code wins the speed test (not surprisingly): the extract of the min RecordIds took 10 seconds, and the populate of the temp table based on the min RecordIds was instantaneous.
These runs have been done on my UAT server: SQL 2000 Standard SP4 (so only 2GB of memory available to SQL), 3.6GB RAM, 2 CPUs hyperthreaded. Data, log and tempdb separated on RAID 5 devices. Server not heavily loaded.
The UAT DB is in simple mode, but prod will be in full mode. I do not see how this would make any difference to performance, as in either mode exactly the same amount of logging is done; it is just a question of when the log gets truncated. If you do a log backup whilst the delete is running there may be a small hit.
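For the full-mode prod run that just means taking log backups between (or during) batches so the log can reuse space rather than grow - something along these lines, with an illustrative database name and path:
BACKUP LOG AssociationsDB
TO DISK = 'E:\Backups\AssociationsDB_log.trn'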
I did have to do a large crawler delete recently, deleting 600 million rows out of a 1.5 billion row, 450GB table. Batched up in 600,000 row chunks I got 7-8 million rows an hour out of that. The DB was in simple mode. Same server spec as above but 4GB memory and 4 CPUs hyperthreaded. That server is IO bound, so Jeff, I would get your IO stats checked first (avg sec/read, avg sec/write, avg disk queue read and write lengths).
If you want to test against it, the schema is:
CREATE TABLE [dbo].[SC_SampledNumericDataFact_Table] (
[Computer_FK] [bigint] NOT NULL ,
[ConfigurationGroup_FK] [bigint] NOT NULL ,
[CounterDetail_FK] [bigint] NOT NULL ,
[DateSampled_FK] [bigint] NOT NULL ,
[DateTimeAdded] [datetime] NOT NULL ,
[DateTimeSampled] [datetime] NOT NULL ,
[LocalDateSampled_FK] [bigint] NOT NULL ,
[LocalDateTimeSampled] [datetime] NOT NULL ,
[LocalTimeSampled_FK] [bigint] NOT NULL ,
[SampleValue] [float] NOT NULL ,
[SMC_InstanceID] [bigint] IDENTITY (1, 1) NOT NULL ,
[TimeSampled_FK] [bigint] NOT NULL
) ON [PRIMARY]
GO
CREATE CLUSTERED INDEX [SampledNumericDataFact_LocalDateTimeSampled_ClusteredIndex] ON [dbo].[SC_SampledNumericDataFact_Table]([LocalDateTimeSampled]) WITH FILLFACTOR = 98 ON [PRIMARY]
GO
CREATE INDEX [SampledNumericDataFact_ComputerFK_Index] ON [dbo].[SC_SampledNumericDataFact_Table]([Computer_FK]) WITH FILLFACTOR = 98 ON [PRIMARY]
GO
CREATE INDEX [SampledNumericDataFact_CounterDetailFK_Index] ON [dbo].[SC_SampledNumericDataFact_Table]([CounterDetail_FK]) WITH FILLFACTOR = 98 ON [PRIMARY]
GO
CREATE INDEX [SampledNumericDataFact_ConfigurationGroupFK_DateTimeAdded_Index] ON [dbo].[SC_SampledNumericDataFact_Table]([ConfigurationGroup_FK], [DateTimeAdded] DESC ) WITH FILLFACTOR = 98 ON [PRIMARY]
GO
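For reference, the crawler against that table ran along the lines of the sketch below; the cutoff date and the WHERE clause are illustrative, not the actual retention rule used.
DECLARE @Cutoff datetime
SET @Cutoff = '20080101'
SET ROWCOUNT 600000
WHILE 1 = 1
BEGIN
    -- deleting on the clustered LocalDateTimeSampled column keeps each batch
    -- on a contiguous range of pages
    DELETE FROM dbo.SC_SampledNumericDataFact_Table
    WHERE LocalDateTimeSampled < @Cutoff
    IF @@ROWCOUNT = 0 BREAK
    CHECKPOINT  -- SIMPLE recovery: allow log truncation between batches
END
SET ROWCOUNT 0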
---------------------------------------------------------------------
May 20, 2008 at 5:57 am
When creating tables on the fly using INTO you have to specify column names. You'd think it would be intelligent enough to know that MIN(RecordID) should be a column called RecordID, but alas... no.
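Applied to the second statement George posted, the fix is just an alias on the aggregate - a sketch, selecting al.* so the new table gets the full kept rows (new_table_name is George's placeholder):
SELECT al.*
INTO   new_table_name
FROM   (SELECT MIN(RecordId) AS RecordId
        FROM   dbo.AssociationLinks
        GROUP BY WorkspaceEntityId, SourceEntityId, TargetEntityId, Name, RecordStatus) d
       INNER JOIN dbo.AssociationLinks al
               ON al.RecordId = d.RecordId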