December 30, 2013 at 2:14 pm
I have an issue with a stored procedure that takes roughly 25 minutes to complete. The table that this stored procedure runs against has roughly 12 million records. I am not sure if I have coded this in the most efficient way. I am thinking that even though this works, I might need to consider a different approach altogether.
Here is the SP:
USE [FIS2]
GO
/****** Object: StoredProcedure [dbo].[admin_DeleteE7TimeClockDetailDuplicates] Script Date: 12/30/2013 16:07:26 ******/
SET ANSI_NULLS OFF
GO
SET QUOTED_IDENTIFIER OFF
GO
ALTER PROCEDURE [dbo].[admin_DeleteE7TimeClockDetailDuplicates] AS
--If Time Clock Details Table doesn't have the number of columns we're expecting, then abort
DECLARE @ColumnCount INT
SET @ColumnCount = (SELECT COUNT(column_name) FROM information_schema.columns WHERE table_name = 'ps_micros_e7_time_clock_details' AND column_name NOT LIKE 'keyCol')
IF @ColumnCount <> 37 -- As of 5/21/10
BEGIN
SELECT 'TABLE ps_micros_e7_time_clock_details appears to have changed ... unable to remove duplicates!' AS Error
RETURN 1
END
IF EXISTS (SELECT * FROM tempdb.dbo.sysobjects WHERE xtype = 'U' AND id = object_id( N'tempdb..#StoreOIDs')) DROP TABLE #StoreOIDs
DECLARE @PollDate AS VARCHAR(10) SET @PollDate = CONVERT(VARCHAR(10), GETDATE(), 101)
DECLARE @StoreOID AS VARCHAR(50)
SELECT DISTINCT store_oid INTO #StoreOIDs FROM ps_micros_e7_time_clock_details ORDER BY store_oid
WHILE (SELECT COUNT(store_OID) FROM #StoreOIDs) > 0
BEGIN
SET @StoreOID = (SELECT TOP 1 store_oid FROM #StoreOIDs)
IF EXISTS (SELECT * FROM tempdb.dbo.sysobjects WHERE xtype = 'U' AND id = object_id( N'tempdb..#StoreRecs')) DROP TABLE #StoreRecs
BEGIN TRANSACTION
-- Select out all distinct records for a given store into a temp table
SELECT DISTINCT store_oid, Seq, EmplSeq, JobSeq, OvertimeRuleSeq,
ReasonDefSeq, ClockInStatus, ClockInTimeUtc, ClockOutStatus, ClockOutTimeUtc,
InAdjustEmplSeq, OutAdjustEmplSeq, RegularSeconds, Overtime1Seconds,
Overtime2Seconds, Overtime3Seconds, Overtime4Seconds, AccumulatedDailySeconds,
AccumulatedPeriodSeconds, RegularPay, Overtime1Pay, Overtime2Pay,
Overtime3Pay, Overtime4Pay, RegularRate, AccumulatedDays, ConsecutiveDays,
Computed, BreakSeconds, PaidBreakSeconds, PaidBreakPay, ReportingTimeSeconds, ReportingTimePay, AdjustedClockInTime, AdjustedClockOutTime
INTO #StoreRecs
FROM ps_micros_e7_time_clock_details
WHERE store_oid = @StoreOID
IF @@ERROR <> 0 GOTO ERR_HANDLER
-- Delete all records for this store from time clock details table
DELETE FROM ps_micros_e7_time_clock_details WHERE store_oid = @StoreOID
IF @@ERROR <> 0 GOTO ERR_HANDLER
-- Insert distinct records back in for the given store
INSERT INTO ps_micros_e7_time_clock_details
SELECT store_oid, @PollDate AS pollDate, '' AS batchId, Seq, EmplSeq, JobSeq, OvertimeRuleSeq,
ReasonDefSeq, ClockInStatus, ClockInTimeUtc, ClockOutStatus, ClockOutTimeUtc,
InAdjustEmplSeq, OutAdjustEmplSeq, RegularSeconds, Overtime1Seconds,
Overtime2Seconds, Overtime3Seconds, Overtime4Seconds, AccumulatedDailySeconds,
AccumulatedPeriodSeconds, RegularPay, Overtime1Pay, Overtime2Pay,
Overtime3Pay, Overtime4Pay, RegularRate, AccumulatedDays, ConsecutiveDays,
Computed, BreakSeconds, PaidBreakSeconds, PaidBreakPay, ReportingTimeSeconds, ReportingTimePay, AdjustedClockInTime, AdjustedClockOutTime
FROM #StoreRecs
IF @@ERROR <> 0 GOTO ERR_HANDLER
DELETE FROM #StoreOIDs WHERE store_oid = @StoreOID
IF @@ERROR <> 0 GOTO ERR_HANDLER
COMMIT TRANSACTION
END
RETURN 0
ERR_HANDLER:
SELECT 'Unexpected error occurred!'
ROLLBACK TRANSACTION
RETURN 1
Any suggestions?
thanks in advance
PC
December 30, 2013 at 2:31 pm
You don't need to go through each store with a WHILE loop. You can remove all duplicates at once. There are several ways to do this, and the best for one scenario might not be the best for another.
I have some questions for you:
Why do you get duplicates in the first place?
How often will you run this SP?
Does your table have a natural key to identify duplicates instead of having to compare entire rows?
Could you share the table and index definitions?
December 30, 2013 at 2:40 pm
Hello Luis,
Thank you very much for your quick reply. I raised the same questions myself. I just started this job a week ago and this is one of my first tasks. I have rewritten this SP so it actually works and gets rid of the duplicates; however, the performance is terrible.
Here are the answers:
1. We get duplicates because there is an automated polling process from the stores once a day, and the POS machines do not have the ability to send just the last day of transactions. In other words, every day we get all of the data as of right now. I would love to change that once I have a better understanding of how the POS polling works.
2. We run the SP every 4 hours for now.
3. The table as of right now does not have any unique key or primary key. That is a big no-no for me. It was done by the previous developer.
4. CREATE TABLE [dbo].[ps_micros_e7_time_clock_details](
[store_oid] [varchar](50) NULL,
[pollDate] [datetime] NULL,
[batchId] [varchar](100) NULL,
[Seq] [varchar](25) NULL,
[EmplSeq] [varchar](25) NULL,
[JobSeq] [varchar](25) NULL,
[OvertimeRuleSeq] [varchar](25) NULL,
[ReasonDefSeq] [varchar](25) NULL,
[ClockInStatus] [varchar](25) NULL,
[ClockInTimeUtc] [datetime] NULL,
[ClockOutStatus] [varchar](25) NULL,
[ClockOutTimeUtc] [datetime] NULL,
[InAdjustEmplSeq] [varchar](25) NULL,
[OutAdjustEmplSeq] [varchar](25) NULL,
[RegularSeconds] [varchar](25) NULL,
[Overtime1Seconds] [varchar](25) NULL,
[Overtime2Seconds] [varchar](25) NULL,
[Overtime3Seconds] [varchar](25) NULL,
[Overtime4Seconds] [varchar](25) NULL,
[AccumulatedDailySeconds] [varchar](25) NULL,
[AccumulatedPeriodSeconds] [varchar](25) NULL,
[RegularPay] [varchar](25) NULL,
[Overtime1Pay] [varchar](25) NULL,
[Overtime2Pay] [varchar](25) NULL,
[Overtime3Pay] [varchar](25) NULL,
[Overtime4Pay] [varchar](25) NULL,
[RegularRate] [varchar](25) NULL,
[AccumulatedDays] [varchar](25) NULL,
[ConsecutiveDays] [varchar](25) NULL,
[Computed] [varchar](25) NULL,
[BreakSeconds] [int] NOT NULL,
[PaidBreakSeconds] [int] NOT NULL,
[PaidBreakPay] [varchar](25) NULL,
[ReportingTimeSeconds] [int] NOT NULL,
[ReportingTimePay] [varchar](25) NULL,
[AdjustedClockInTime] [datetime] NULL,
[AdjustedClockOutTime] [datetime] NULL
) ON [PRIMARY]
December 30, 2013 at 2:45 pm
Forgot to answer the index question:
Two indexes are set up, both non-unique and non-clustered.
CREATE NONCLUSTERED INDEX [idx_e7_time_clock_details_startbusdate] ON [dbo].[ps_micros_e7_time_clock_details]
([ClockInTimeUtc] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
and
CREATE NONCLUSTERED INDEX [idx_e7_time_clock_details_store_oid] ON [dbo].[ps_micros_e7_time_clock_details]
([store_oid] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
December 30, 2013 at 2:50 pm
In pieces:
--If Time Clock Details Table doesn't have the number of columns we're expecting, then abort
DECLARE @ColumnCount INT
SET @ColumnCount = (SELECT COUNT(column_name) FROM information_schema.columns WHERE table_name = 'ps_micros_e7_time_clock_details' AND column_name NOT LIKE 'keyCol')
IF @ColumnCount <> 37 -- As of 5/21/10
BEGIN
SELECT 'TABLE ps_micros_e7_time_clock_details appears to have changed ... unable to remove duplicates!' AS Error
RETURN 1
END
I'm not sure if I want to applaud your thoroughness or am scared that schema changes can be pushed through without a code review/regression test. I shall do both.
DECLARE @PollDate AS VARCHAR(10)
SET @PollDate = CONVERT(VARCHAR(10), GETDATE(), 101)
DECLARE @StoreOID AS VARCHAR(50)
Faster date conversion code for reference. It won't make enough of a difference here, but if you're trying to report on or convert 1 million+ rows it does: DATEADD(dd, DATEDIFF(dd, 0, GETDATE()), 0)
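A quick side-by-side for reference (the variable names here are just for illustration):
DECLARE @AsString VARCHAR(10)
DECLARE @AsDatetime DATETIME
SET @AsString = CONVERT(VARCHAR(10), GETDATE(), 101) -- string round trip, then implicit convert back on insert
SET @AsDatetime = DATEADD(dd, DATEDIFF(dd, 0, GETDATE()), 0) -- integer date math: strips the time, stays DATETIME
SELECT @AsString AS StringVersion, @AsDatetime AS DatetimeVersion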
SELECT DISTINCT store_oid INTO #StoreOIDs FROM ps_micros_e7_time_clock_details ORDER BY store_oid
WHILE (SELECT COUNT(store_OID) FROM #StoreOIDs) > 0
BEGIN
You have just created a cursor. You could as easily have created a FAST_FORWARD, READ_ONLY cursor here. SQL is set based, not loop based. Do you know the original reason this was coded this way?
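For illustration only, the WHILE/TOP 1/DELETE pattern above is doing roughly the same work as this minimal explicit cursor sketch (LOCAL FAST_FORWARD is implicitly read-only):
DECLARE store_cur CURSOR LOCAL FAST_FORWARD FOR
SELECT store_oid FROM #StoreOIDs
OPEN store_cur
FETCH NEXT FROM store_cur INTO @StoreOID
WHILE @@FETCH_STATUS = 0
BEGIN
-- per-store work would go here
FETCH NEXT FROM store_cur INTO @StoreOID
END
CLOSE store_cur
DEALLOCATE store_cur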
SELECT DISTINCT store_oid, Seq, EmplSeq, JobSeq, OvertimeRuleSeq,
ReasonDefSeq, ClockInStatus, ClockInTimeUtc, ClockOutStatus, ClockOutTimeUtc,
InAdjustEmplSeq, OutAdjustEmplSeq, RegularSeconds, Overtime1Seconds,
Overtime2Seconds, Overtime3Seconds, Overtime4Seconds, AccumulatedDailySeconds,
AccumulatedPeriodSeconds, RegularPay, Overtime1Pay, Overtime2Pay,
Overtime3Pay, Overtime4Pay, RegularRate, AccumulatedDays, ConsecutiveDays,
Computed, BreakSeconds, PaidBreakSeconds, PaidBreakPay, ReportingTimeSeconds, ReportingTimePay, AdjustedClockInTime, AdjustedClockOutTime
INTO #StoreRecs
FROM ps_micros_e7_time_clock_details
WHERE store_oid = @StoreOID
IF @@ERROR <> 0 GOTO ERR_HANDLER
-- Delete all records for this store from time clock details table
DELETE FROM ps_micros_e7_time_clock_details WHERE store_oid = @StoreOID
IF @@ERROR <> 0 GOTO ERR_HANDLER
-- Insert distinct records back in for the given store
INSERT INTO ps_micros_e7_time_clock_details
SELECT store_oid, @PollDate AS pollDate, '' AS batchId, Seq, EmplSeq, JobSeq, OvertimeRuleSeq,
ReasonDefSeq, ClockInStatus, ClockInTimeUtc, ClockOutStatus, ClockOutTimeUtc,
InAdjustEmplSeq, OutAdjustEmplSeq, RegularSeconds, Overtime1Seconds,
Overtime2Seconds, Overtime3Seconds, Overtime4Seconds, AccumulatedDailySeconds,
AccumulatedPeriodSeconds, RegularPay, Overtime1Pay, Overtime2Pay,
Overtime3Pay, Overtime4Pay, RegularRate, AccumulatedDays, ConsecutiveDays,
Computed, BreakSeconds, PaidBreakSeconds, PaidBreakPay, ReportingTimeSeconds, ReportingTimePay, AdjustedClockInTime, AdjustedClockOutTime
FROM #StoreRecs
IF @@ERROR <> 0 GOTO ERR_HANDLER
COMMIT TRANSACTION
This is going from the US to Canada via Mexico. Is there a PK for row recognition on ps_micros_e7_time_clock_details?
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
December 30, 2013 at 2:59 pm
If there's no primary key, then performance won't be great. Even if there's no PK defined as a constraint, you could identify a candidate key that will help you find the duplicates.
Here's an option to eliminate the duplicates, but it might not be better.
BEGIN TRY
DECLARE @PollDate AS VARCHAR(10)
SET @PollDate = CONVERT(VARCHAR(10), GETDATE(), 101) -- poll date stamped onto the reloaded rows
BEGIN TRANSACTION
IF OBJECT_ID(N'tempdb..#StoreRecs') IS NOT NULL
DROP TABLE #StoreRecs
SELECT DISTINCT store_oid,
Seq,
EmplSeq,
JobSeq,
OvertimeRuleSeq,
ReasonDefSeq,
ClockInStatus,
ClockInTimeUtc,
ClockOutStatus,
ClockOutTimeUtc,
InAdjustEmplSeq,
OutAdjustEmplSeq,
RegularSeconds,
Overtime1Seconds,
Overtime2Seconds,
Overtime3Seconds,
Overtime4Seconds,
AccumulatedDailySeconds,
AccumulatedPeriodSeconds,
RegularPay,
Overtime1Pay,
Overtime2Pay,
Overtime3Pay,
Overtime4Pay,
RegularRate,
AccumulatedDays,
ConsecutiveDays,
Computed,
BreakSeconds,
PaidBreakSeconds,
PaidBreakPay,
ReportingTimeSeconds,
ReportingTimePay,
AdjustedClockInTime,
AdjustedClockOutTime
INTO #StoreRecs
FROM ps_micros_e7_time_clock_details
TRUNCATE TABLE ps_micros_e7_time_clock_details
-- Insert distinct records back in for the given store
INSERT INTO ps_micros_e7_time_clock_details
SELECT store_oid,
@PollDate AS pollDate,
'' AS batchId,
Seq,
EmplSeq,
JobSeq,
OvertimeRuleSeq,
ReasonDefSeq,
ClockInStatus,
ClockInTimeUtc,
ClockOutStatus,
ClockOutTimeUtc,
InAdjustEmplSeq,
OutAdjustEmplSeq,
RegularSeconds,
Overtime1Seconds,
Overtime2Seconds,
Overtime3Seconds,
Overtime4Seconds,
AccumulatedDailySeconds,
AccumulatedPeriodSeconds,
RegularPay,
Overtime1Pay,
Overtime2Pay,
Overtime3Pay,
Overtime4Pay,
RegularRate,
AccumulatedDays,
ConsecutiveDays,
Computed,
BreakSeconds,
PaidBreakSeconds,
PaidBreakPay,
ReportingTimeSeconds,
ReportingTimePay,
AdjustedClockInTime,
AdjustedClockOutTime
FROM #StoreRecs
DROP TABLE #StoreRecs
COMMIT TRANSACTION
END TRY
BEGIN CATCH
SELECT 'Unexpected error occurred!' AS Error
IF @@TRANCOUNT > 0
ROLLBACK TRANSACTION
END CATCH
December 30, 2013 at 3:01 pm
Evil Craig,
absolutely love your sense of humor and your comments.
On the first issue: 1. There is a very distinct reason for this piece of code and I have no control over it. I understand your point, though. I would not allow schema changes to be pushed through. There is a POS system that is very quirky, and this validation was added by me to address a particular problem with the device that the company had trouble with in the past.
2. Very good point. The code right now is running through 12 million rows (growing daily) so it will make a difference for sure. I don't like the fact that it is looping through all records and not just the duplicates. Without a unique key, I have a hard time figuring out a better way.
3. No idea on why this was done this way. Open to suggestions on how to rewrite it.
4. The PK does not exist. Not sure why it was done this way and don't like it. Again open to better design suggestions.
December 30, 2013 at 3:05 pm
This is definitely a better way of doing it. However, I am not sure about the TRUNCATE TABLE ps_micros_e7_time_clock_details statement. I don't want/need to truncate the table altogether as there are stores that don't have any duplicates.
thanks,
Petr
December 30, 2013 at 3:21 pm
petr.caslavka (12/30/2013)
This is definitely a better way of doing it. However, I am not sure about the TRUNCATE TABLE ps_micros_e7_time_clock_details statement. I don't want/need to truncate the table altogether as there are stores that don't have any duplicates. thanks,
Petr
The current process is looping through all the stores and deleting the information from one store at a time. In the end it is doing the same thing, but with all stores at once instead of one by one.
December 30, 2013 at 4:02 pm
petr.caslavka (12/30/2013)
Evil Craig, absolutely love your sense of humor and your comments.
Well, thank you. :)
On the first issue: 1. There is a very distinct reason for this piece of code and I have no control over it. I understand your point, though. I would not allow schema changes to be pushed through. There is a POS system that is very quirky, and this validation was added by me to address a particular problem with the device that the company had trouble with in the past.
Well, fair enough. Just... wow. Fearsome.
2. Very good point. The code right now is running through 12 million rows (growing daily) so it will make a difference for sure. I don't like the fact that it is looping through all records and not just the duplicates. Without a unique key, I have a hard time figuring out a better way.
Yeah, in your SELECT statement I'd worry. Setting a variable? Meh, certainly not worth a code change. Just worth mentioning.
3. No idea on why this was done this way. Open to suggestions on how to rewrite it.
4. The PK does not exist. Not sure why it was done this way and don't like it. Again open to better design suggestions.
Okay, is it possible to slap an identity column on this table simply for row recognition as the PK? Also, are you allowed to recommend a clustered index on... well, ANYTHING... for this table? MS SQL and heaps are not friendly without a significant review of the benefits and costs and coding exclusively towards the benefits... and that's pretty much a never in a reporting system, which is what this sounds like (since the OLTP system is actually your POS system you import from).
Basically, the only thing you have to work with is the RID (RowIDs) under the two non-clustered indexes. If this system is under load, I could see why they're looping the store_oid simply because that's the only really useful splitter they have to keep memory consumption and I/O load down to reasonable for each pass.
My personal thought on this is to re-evaluate from the ground up. Here's how I would approach this.
1) Get a key on e7_time_clock... you HAVE to be able to identify a row specifically, particularly in a load region that can dupe. You need something that the loads can't control for this, because they're duping the business keys; thus, an identity column (see the sketch just after this list).
2) Get a clustered index on same table. If it's the Identity column you should STILL see some improvements, but by preference you'll have a nonclustered key on the identity and a clustered index on how this table is usually connected to other ones for reference.
3) Re-work the import process. Stage the inbound data and reject duplicates to avoid the issue entirely. MERGE is great for this (toy sketch at the end of this post). It'll still be slow, but at least you'll have a fighting chance without massive TABX locks due to the mass deletions.
4) If this proc needs to continue to exist even after a staging control for duplicate removals, adjust the procedure to do a single deletion pass, not a select/delete/insert construct. How you'd do it is you recognize the duplicates by adding a ROW_NUMBER() function, partition it on all the columns except the surrogate key (identity), and any time the number is greater than 1 you have a deletion candidate. CTE the ROW_NUMBER pull, join it to the main table on the key, and remove dupes that way.
5) If #3 is not available, add a second column to the table (besides the identity) for timestamp. The timestamp column isn't a datetime, it's actually an ever increasing value for when you update any row... or insert one. By using that in conjunction with storing the maximum timestamp value from the last run, you can limit what you have to check for duplicates instead of having to run the entire table in one shot.
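A minimal sketch of #1 and #2, assuming you're allowed to alter the table (the RowID column name and index name are mine):
-- #1: surrogate key the loads can't duplicate (RowID is a hypothetical name)
ALTER TABLE dbo.ps_micros_e7_time_clock_details
ADD RowID INT IDENTITY(1,1) NOT NULL
GO
-- #2: cluster the heap; the identity is the simple default choice
CREATE UNIQUE CLUSTERED INDEX cix_e7_time_clock_details_RowID
ON dbo.ps_micros_e7_time_clock_details (RowID)
GO
Fair warning: adding the column and building that clustered index will rewrite the whole 12-million-row heap, so treat it as an off-hours operation.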
Basically, all of this is trying to get away from loading the entire table into memory to start performing work. There's no way out of the pain of the time this will take until you're allowed to either adjust the methodology or the schema in significant ways.
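And a toy-scale sketch of the staging idea in #3, with two throwaway tables standing in for the real ones (all names hypothetical):
-- #target stands in for the time clock details table, #inbound for the staged daily poll
CREATE TABLE #target (store_oid VARCHAR(50) NOT NULL, Seq VARCHAR(25) NOT NULL)
CREATE TABLE #inbound (store_oid VARCHAR(50) NOT NULL, Seq VARCHAR(25) NOT NULL)
INSERT INTO #target VALUES ('s1', '1')
INSERT INTO #inbound VALUES ('s1', '1'), ('s2', '2') -- ('s1', '1') is a re-polled duplicate

-- only rows not already in the target get inserted; the duplicate never lands
MERGE #target AS tgt
USING (SELECT DISTINCT store_oid, Seq FROM #inbound) AS src
ON tgt.store_oid = src.store_oid AND tgt.Seq = src.Seq
WHEN NOT MATCHED BY TARGET THEN
INSERT (store_oid, Seq) VALUES (src.store_oid, src.Seq);
On the real table the ON clause would list all the business-key columns, and the NULLable varchars would need NULL-safe handling (ISNULL wrappers or an EXISTS ... INTERSECT predicate).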
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
December 30, 2013 at 10:18 pm
petr.caslavka (12/30/2013)
This is definitely a better way of doing it. However, I am not sure about the TRUNCATE TABLE ps_micros_e7_time_clock_details statement. I don't want/need to truncate the table altogether as there are stores that don't have any duplicates.thanks,
Petr
Yeah, I don't really get this statement. You were basically deleting it all, but slower, in the original query. Also, you are overwriting the values of pollDate and batchId, so I assume that even for the non-duplicates these could have erroneous values you want to overwrite.
Truncate and reload is likely to be the quickest approach if you can't follow Craig's advice on adding a unique identifier.
If you don't need to update the singleton rows at all, and expect a very low percentage of the total records to be duplicates, you could try replacing the DISTINCT in the select into a temp table with a GROUP BY on all the fields, with a HAVING COUNT(*) > 1 clause. Then delete from the main table with a join to the temp table, then insert back (toy sketch below).
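At toy scale, that might look something like this (a three-column temp table standing in for the real one; the real version would list all the columns, and the INTERSECT trick handles the NULLable columns that a plain equality join would miss):
CREATE TABLE #t (store_oid VARCHAR(50) NULL, Seq VARCHAR(25) NULL, RegularPay VARCHAR(25) NULL)
INSERT INTO #t VALUES ('s1', '1', '10'), ('s1', '1', '10'), ('s2', '7', NULL)

-- capture one copy of each duplicated row only
SELECT store_oid, Seq, RegularPay
INTO #Dupes
FROM #t
GROUP BY store_oid, Seq, RegularPay
HAVING COUNT(*) > 1

-- delete every copy of those rows (INTERSECT compares NULLs safely)
DELETE t
FROM #t AS t
WHERE EXISTS (SELECT t.store_oid, t.Seq, t.RegularPay
INTERSECT
SELECT d.store_oid, d.Seq, d.RegularPay FROM #Dupes AS d)

-- put a single copy back
INSERT INTO #t SELECT store_oid, Seq, RegularPay FROM #Dupes
DROP TABLE #Dupes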
But my guess is Luis' answer will perform better.
December 31, 2013 at 3:29 am
The good ol' ROW_NUMBER() trick might be worth a try. A few tests on a similar-sized table here indicate relative costs of 20% for the table scan, 30% for the sort and 50% for the delete. That's about an order of magnitude nicer than I was expecting to see.
Also, see point 10 of this Brad McGehee article. Deleting from a heap is bad news.
BEGIN TRANSACTION
-- Use a CTE to identify dupe rows for this @StoreOID
;WITH Deleter AS (
SELECT rn = ROW_NUMBER() OVER(PARTITION BY
store_oid, Seq, EmplSeq, JobSeq, OvertimeRuleSeq,
ReasonDefSeq, ClockInStatus, ClockInTimeUtc, ClockOutStatus, ClockOutTimeUtc,
InAdjustEmplSeq, OutAdjustEmplSeq, RegularSeconds, Overtime1Seconds,
Overtime2Seconds, Overtime3Seconds, Overtime4Seconds, AccumulatedDailySeconds,
AccumulatedPeriodSeconds, RegularPay, Overtime1Pay, Overtime2Pay,
Overtime3Pay, Overtime4Pay, RegularRate, AccumulatedDays, ConsecutiveDays,
Computed, BreakSeconds, PaidBreakSeconds, PaidBreakPay, ReportingTimeSeconds, ReportingTimePay, AdjustedClockInTime, AdjustedClockOutTime
ORDER BY (SELECT NULL)),
store_oid, Seq, EmplSeq, JobSeq, OvertimeRuleSeq,
ReasonDefSeq, ClockInStatus, ClockInTimeUtc, ClockOutStatus, ClockOutTimeUtc,
InAdjustEmplSeq, OutAdjustEmplSeq, RegularSeconds, Overtime1Seconds,
Overtime2Seconds, Overtime3Seconds, Overtime4Seconds, AccumulatedDailySeconds,
AccumulatedPeriodSeconds, RegularPay, Overtime1Pay, Overtime2Pay,
Overtime3Pay, Overtime4Pay, RegularRate, AccumulatedDays, ConsecutiveDays,
Computed, BreakSeconds, PaidBreakSeconds, PaidBreakPay, ReportingTimeSeconds, ReportingTimePay, AdjustedClockInTime, AdjustedClockOutTime
FROM ps_micros_e7_time_clock_details
WHERE store_oid = @StoreOID
)
-- and delete them
DELETE FROM Deleter WHERE rn > 1
IF @@ERROR <> 0 GOTO ERR_HANDLER
COMMIT TRANSACTION
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden