March 31, 2005 at 2:50 am
Hi all,
I have a problem. I have the following script which deletes two kinds of duplicates - ones where there are 2 or more rows with the same keys, and ones where there are 2 or more rows of duplicates according to business rules. The problem I have is that I also need to delete the corresponding row on another table, but this is proving difficult. The script selects the duplicates, then subtracts 1 from rowcount and then does the delete. This way all but 1 of the duplicates is deleted. The problem is that I do not know which row is left behind and cannot therefore identify the corresponding row to delete. I cannot select the key of the corresponding row when I select the duplicates as this field would make each row uniqe and therefore it would not appear in my duplicates list (each row on sickness has an individual sickid). The tables look like this:
Sickness SicknessDays
payroll sickid
staffid other details....
start date
end date
sickid
And here is the script:
/*Delete complete duplicates from Sickness table*/
Declare @payroll varchar(8),
@staff_id int,
@start_date datetime,
@end_date datetime,
@cnt int
Declare getfullrecords cursor local static For
Select count (1), payroll, staff_id, start_date, end_date
from sickness (nolock)
group by payroll, staff_id, start_date, end_date having count(1)>1
Open getfullrecords
Fetch next from getfullrecords into @cnt,@payroll,@staff_id,@start_date, @end_date
--Cursor to check with all other records
While @@fetch_status=0
Begin
Set @cnt= @cnt-1
Set rowcount @cnt
Delete from sickness where payroll = @payroll and staff_id = @staff_id
and start_date = @start_date and end_date = @end_date
--Delete the corresponding rec from sicknessdays but how to find it? Link is sickid - not retrieved.
Set rowcount 0
Fetch next from getfullrecords into @cnt, @payroll, @staff_id, @start_date, @end_date
End
Close getfullrecords
Deallocate getfullrecords
/* Deletion of duplicate records with different payroll nums from the Sickness table */
Declare getallrecords cursor local static For
Select count (1), staff_id, start_date, end_date
from sickness (nolock)
group by staff_id, start_date, end_date having count(1)>1
Open getallrecords
Fetch next from getallrecords into @cnt,@staff_id,@start_date, @end_date
--Cursor to check with all other records
While @@fetch_status=0
Begin
Set @cnt= @cnt-1
Set rowcount @cnt
Delete from sickness where staff_id = @staff_id
and start_date = @start_date and end_date = @end_date
--Delete the corresponding rec from sicknessdays but how to find it? Link is sickid - not retrieved.
Set rowcount 0
Fetch next from getallrecords into @cnt, @staff_id, @start_date, @end_date
End
Close getallrecords
Deallocate getallrecords
I would be very grateful of any suggestions as to how I would delete the corresponding row.
Thank you,
Paula.
March 31, 2005 at 3:01 am
If the correct link to sickness days is sickid, then you should retrieve it, and use that to find the corresponding sicknessdays.
If you can't use it because the dupes is made out of other column(s) values (according to the business rules) then you need to use those value(s) as 'duplicate-PK' in order to find the rows.
/Kenneth
March 31, 2005 at 3:09 am
If I do this:
Declare getfullrecords cursor local static For
Select count (1), payroll, staff_id, start_date, end_date, sickid
from sickness (nolock)
group by payroll, staff_id, start_date, end_date, sickid having count(1)>1
instead of this:
Declare getfullrecords cursor local static For
Select count (1), payroll, staff_id, start_date, end_date
from sickness (nolock)
group by payroll, staff_id, start_date, end_date having count(1)>1
then I end up with no duplicates because sickid is unique on each row. I cannot select it as part of my duplicates criteria. And I can't do another select statement to get it after this because I do not know which rows are being deleted (because of the rowcount code) and therefore do not know which to select.
March 31, 2005 at 3:10 am
If I do this:
Declare getfullrecords cursor local static For
Select count (1), payroll, staff_id, start_date, end_date, sickid
from sickness (nolock)
group by payroll, staff_id, start_date, end_date, sickid having count(1)>1
instead of this:
Declare getfullrecords cursor local static For
Select count (1), payroll, staff_id, start_date, end_date
from sickness (nolock)
group by payroll, staff_id, start_date, end_date having count(1)>1
then I end up with no duplicates because sickid is unique on each row. I cannot select it as part of my duplicates criteria. And I can't do another select statement to get it after this because I do not know which rows are being deleted (because of the rowcount code) and therefore do not know which to select.
March 31, 2005 at 3:18 am
Ok, I guess you must rewrite the delete procedure..
Let's start from the beginning What makes up a duplicate (ie which columns) from the 'main' table?
/Kenneth
March 31, 2005 at 3:24 am
Thanks so much for helping.....
The tables look like I mentioned in my first post. A duplicate on Sickness is classed as one of two things:
1. Where payroll, staffid, start_date and end_date are the same.
2. Where staffid, start_date, and end_date are the same but payroll is different (this is because our company swaps people's payroll nums when they feel like it!)
Each row that is inserted on this table gets assigned a unique sickid automatically. This is the field I need to use to find the corresponding rec on Sicknessdays and delete it.
March 31, 2005 at 3:45 am
The other thing I should mention is that there can be more than 2 of each record. There can be any number of duplicates for each record. (I say this because I originally wrote a script which deleted the corresponding rec no probs but could only handle 1 dup per rec!).
March 31, 2005 at 3:59 am
Ah, but if you already have a working script that deals with 1 dup, just run it several times then..?? Shouldn't that do it?
/Kenneth
March 31, 2005 at 4:09 am
Ha ha, well kind of except I have other things to do than sit running de-dup scripts for the rest of my life! Also this job needs to be scheduled because although in a perfect world it should be a one-off, in my world the idiots send the same file of rubbish data to us every week! They reckon their old mainframe system can't do it any other way and so we have to pick up the pieces until we buy a lovely new system to do it all. Incidentally, I work in the public sector, hence the council not wanting to spend tax payers money on replacing old systems that in their eyes work! Sorry, just my rant for the day!
So as a result, I have two scripts and I need a combination of them but have been sat staring at it for days now. The one I posted seemed like the best option, as the other can only delete the corresponding rec in situation 2 (diff payroll nums) and cannot handle more than one dup. Sorry to be such a pain!!
March 31, 2005 at 6:36 am
Okay, so.. you say that a dupe is made up of either of these, right..?
1. Where payroll, staffid, start_date and end_date are the same.
2. Where staffid, start_date, and end_date are the same but payroll is different (this is because our company swaps people's payroll nums when they feel like it!)
By these rules, if 2) is a dupe, then by definition so is also 1) - so you can skip 1) as a rule, 2) will suffice.
Unless... if it's possible that you could have the same staffid with different payrolls and the're actuall not a dupe..? In that case, you're in deep ****
So, your 'working script' seems to be the only one you need? (unless I'm missing something here..)
/Kenneth
March 31, 2005 at 6:44 am
For a minute there I thought you were right! You're right about the 2 scenarios - yes I do only need to do one de-dup. But that leaves me with my original problem. I have a script which deals with scenario 2 (diff payrolls) which as you have pointed out will also deal with scenario 1 (thanks for that by the way! - can't see the wood for the trees anymore!). It also deletes the corresponding record which was my initial reason for starting this thread. However what it doesn't do is more than one duplicate. It can't cope with 3 or more dups.
My other script (the one I posted above) can cope with x num of dups but I cannot delete the corresponding record on sicknessdays.
Aaaaargh! This is driving me nuts!
But thank you for your help, I have got in so deep now I'm just going round in circles, a fresh viewpoint is helping me see clearly again!
March 31, 2005 at 7:18 am
Hi just a thought on the logic: First the critical information is StaffID, start date, sickness and sicknessdays. The rest of the information has no use at this time. If I understand what you are trying to do is. Find duplicates for a unique employee (StaffID) where the dates of the sickness are equal.
If you select for start-date and end-date being equal then an employee can be paid twice. ie. start-date1 = 1/1/00 and end-date1=1/5/00 for stafid xyz. start-date2=1/1/00 and end-date2 = 2/5/00 and stafid = xyz. Your select statement leaves both records as the end dates are different.
If your DB relatinships are correct deleting a sickness should also delete the sicknessdays as there is a one to one relationship. If not the DBA should fix these relationships to prevent orphan records.
I have to run and chase the almighty $ HTH
Mike
March 31, 2005 at 7:28 am
Thanks for reminding me of that! As a rule we do not use cascading deletes as we leave it up to the programmers to decide. It's safer that way, but in this instance and as I have been banging my head against a brick wall I think you are right. Cascading deletes it is! And now I am kicking myself for not remembering that fact earlier (or at all!).
Many thanks everyone!
March 31, 2005 at 7:32 am
Hmm... quite complicated, not sure whether I haven't missed something... what about this (I will write as I'm thinking, so that you can follow my reasoning and decide whether I'm not completely off topic):
1. get the sickid of all records that should not be deleted (say, the one with lowest sickid for every dupe group)
select payroll, staffid, start_date, end_date, min(sickid)
from sickness
group by payroll, staffid, start_date, end_date
2. get the sickid of records that should be deleted (=other than those in the previous step)
SELECT s.sickid
FROM sickness s
LEFT JOIN
(select payroll, staffid, start_date, end_date, min(sickid)
from sickness
group by payroll, staffid, start_date, end_date) AS ok ON ok.sickid = s.sickid
WHERE ok.sickid IS NULL
3. now you have a list of all the sickid that you will delete, and you can also delete any corresponding records.
Would this work on your DB, or have I missed something?
March 31, 2005 at 7:44 am
Ok this makes sense and I'd rather do this than add cascade delete. I'll try to get it working and let you all know. First problem though is that the second statement gives the error:
"No column was specified for column 5 of 'ok'."
I've tried a couple of things but no luck yet.
Anyway, thanks, I'll work on it.
Viewing 15 posts - 1 through 15 (of 32 total)
You must be logged in to reply to this topic. Login to reply