March 4, 2016 at 12:57 am
Hi All,
Hope you are doing great. Here I have a problem and I am sure you will help me out.
I have a table with full of duplicates. I have the following fields in my table.
Contactid (Unique), FirstName, LastName, Officerid, PhoneNo email, cellphone, address1
select contactid,FirstName,lastname,officerid,
email,,HomeAddress1
from contacts
where firstname='Niladri' and lastname='Biswas'
and officerid='jfortw'
order by contactid desc
If I run the above query I am getting 6 records which is fine. I am getting the 6 unique contact id also. I need to remove the all the 5 duplicates except the first one. As the first contactid is newest one which is inserted into the table.
Before I remove the records I need to update the other info like phoneno, email, address1 from other IDs.
Now I have 6 contact Ids like 1, 2,3,4,5 and 6. I will show only the reocrd 6 but it will update phoneno, email, address1 from other ids. for example if id 1 has the phone no then it will update the phone no of contact 6, if id 2 has the address then it will update the address of contact 6 and go on...
Lastly it will keep the records only 6 contact id and then delete all the other records.
The records are unique based on OfficerID, LastName,FirstName
I need to do same task for all the unique records. what is the best way to clean the table.
Please guide me
Thanks in advance
Niladri
March 4, 2016 at 2:03 am
niladri.primalink (3/4/2016)
Hi All,Hope you are doing great. Here I have a problem and I am sure you will help me out.
I have a table with full of duplicates. I have the following fields in my table.
Contactid (Unique), FirstName, LastName, Officerid, PhoneNo email, cellphone, address1
select contactid,FirstName,lastname,officerid,
email,,HomeAddress1
from contacts
where firstname='Niladri' and lastname='Biswas'
and officerid='jfortw'
order by contactid desc
If I run the above query I am getting 6 records which is fine. I am getting the 6 unique contact id also. I need to remove the all the 5 duplicates except the first one. As the first contactid is newest one which is inserted into the table.
Before I remove the records I need to update the other info like phoneno, email, address1 from other IDs.
Now I have 6 contact Ids like 1, 2,3,4,5 and 6. I will show only the reocrd 6 but it will update phoneno, email, address1 from other ids. for example if id 1 has the phone no then it will update the phone no of contact 6, if id 2 has the address then it will update the address of contact 6 and go on...
Lastly it will keep the records only 6 contact id and then delete all the other records.
The records are unique based on OfficerID, LastName,FirstName
I need to do same task for all the unique records. what is the best way to clean the table.
Please guide me
Thanks in advance
Niladri
When you post questions like this it is much appreciated when you help us help you by providing DDL for your table in the form of a CREATE TABLE statement plus all indexes and constraints -and- DML to create test data in that table in the form of INSERT INTO statements so folks can easily test in their sandbox environment and provide you with a working and tested code-solution that matches your environment.
Is it safe to assume there are other address columns in the table, e.g. city, state, zip? And will those all need to be moved forward together? For example if row 6 in your example has a HomeAddress value but not City but Row 3 has City would you want to blend the City from one row with the HomeAddress of another risking creating an incorrect address in row 6 instead of just having an incomplete one? What other columns in the table would you want this solution to handle bringing data forward for?
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
March 4, 2016 at 2:16 am
Hi,
Thanks fro you reply. Next time I will be careful. Here are all the columns the table I have
contactid,FirstName,lastname,officerid,
email,Homephone,workphone,Fax,HomeAddress1,HomeAddress2,Homecity,HomeState,
HomeZip,Title,workaddress1,workaddress2,workcity,workcity,workstate,workzip
Hope it will help you. Please guide me.
Also let me know if you need any other info
March 4, 2016 at 2:30 am
niladri.primalink (3/4/2016)
Hi,Thanks fro you reply. Next time I will be careful. Here are all the columns the table I have
contactid,FirstName,lastname,officerid,
email,Homephone,workphone,Fax,HomeAddress1,HomeAddress2,Homecity,HomeState,
HomeZip,Title,workaddress1,workaddress2,workcity,workcity,workstate,workzip
Hope it will help you. Please guide me.
Also let me know if you need any other info
Nice try, but please re-read my post. You can get the DDL from SSMS Object Explorer. Plus we will need some INSERT statements to create some test data, from you.
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
March 4, 2016 at 2:59 am
OK..Thanks got it. Here is the query which will help
/****** Object: Table [dbo].[contacts2] Script Date: 3/4/2016 3:26:55 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[contacts2](
[ContactID] [int] IDENTITY(1,1) NOT NULL,
[OfficerID] [varchar](8) NULL,
[FirstName] [varchar](50) NULL,
[LastName] [varchar](75) NULL,
[Email] [varchar](80) NULL,
[HomePhone] [varchar](25) NULL,
[CellPhone] [varchar](25) NULL,
[WorkPhone] [varchar](25) NULL,
[HomeAddress1] [varchar](80) NULL,
[HomeAddress2] [varchar](50) NULL,
[HomeCity] [varchar](50) NULL,
[HomeState] [varchar](2) NULL,
[HomeZip] [varchar](10) NULL,
[WorkAddress1] [varchar](80) NULL,
[WorkAddress2] [varchar](50) NULL,
[WorkCity] [varchar](50) NULL,
[WorkState] [varchar](2) NULL,
[WorkZip] [varchar](10) NULL,
CONSTRAINT [PK_Contacts2] PRIMARY KEY CLUSTERED
(
[ContactID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
SET IDENTITY_INSERT [dbo].[contacts2] ON
GO
INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (1, N'jfortw', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'257-098-3456', NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
GO
INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (2, N'jfortq', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'257-098-3456', N'232-567-879', NULL, N'114 Pacifica Ste 120', NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
GO
INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (3, N'jfortm', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'257-098-3456', N'232-567-879', NULL, N'114 Pacifica Ste 120', NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
GO
INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (4, N'jfortn', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'257-098-3456', N'232-567-879', NULL, N'114 Pacifica Ste 120', N'Irvine', NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
GO
INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (5, N'jfortt', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'257-098-3456', N'232-567-879', NULL, N'114 Pacifica Ste 120', NULL, N'Irvine', N'CA', N'92618', NULL, NULL, NULL, NULL, NULL)
GO
INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (6, N'jfortp', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'', N'', NULL, N'', NULL, NULL, NULL, NULL, N'114 Pacifica Ste 120', NULL, N'Irvine', N'CA', N'92618')
GO
SET IDENTITY_INSERT [dbo].[contacts2] OFF
GO
let me know if you need anything from me
Again thanks for your reply
Niladri
March 4, 2016 at 3:10 am
Great, thanks. From my earlier post on how to handle address data, using your sample data what do you want retained in the HomeAddress of the surviving row? Would you simply want the complete address from Contact 5 or would you want HomeAddress1, HomeCity, HomeState and HomeZip from Contact 5 and HomeAddress2 from Contact 4 blended together? Or something else?
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
March 4, 2016 at 3:31 am
if there is address is Available in last row (in this case row 6) we don't need to update. But if the row no 6 has not the address and the row no 2, 4, 5 has the address then we can update from any of the value. if the the row 2 has the city name but not the address and row no 5 has the address not the city name then we need the city name from row 2 and address from row 5. We need the complete info for row no 6 if the other row has the value . if there is nothing we can omit. After the updating it will delete all the rows except the latest one (in that case row no 6)
we need to go other example after that.
Thanks a lot for using your valuable time for me.
March 4, 2016 at 3:54 am
niladri.primalink (3/4/2016)
if there is address is Available in last row (in this case row 6) we don't need to update. But if the row no 6 has not the address and the row no 2, 4, 5 has the address then we can update from any of the value. if the the row 2 has the city name but not the address and row no 5 has the address not the city name then we need the city name from row 2 and address from row 5. We need the complete info for row no 6 if the other row has the value . if there is nothing we can omit. After the updating it will delete all the rows except the latest one (in that case row no 6)we need to go other example after that.
Thanks a lot for using your valuable time for me.
Your sample data is throwing me off. All OfficerID values are different. I think for a valid example you need a set where they are all the same, along with FirstName and LastName. Correct?
Can you please post a new set of INSERT statements along with the expected results from that sample data?
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
March 4, 2016 at 4:09 am
My Sample data is looking like below. Sorry image is not working in my system.
contactidFirstNamelastnameofficeridemailcellphoneHomephoneworkphoneHomeAddress1HomeAddress2HomecityHomeStateHomeZipworkaddress1workaddress2workcityworkcityworkstateworkzip
1NiladriBiswasjfortwniladrri@gmail.comNULL257-098-3456NULLNULLNULLNULLNULLNULLNULLNULLNULLNULLNULLNULL
2NiladriBiswasjfortwniladrri@gmail.com232-567-879257-098-3456NULL114 Pacifica Ste 120NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL
3NiladriBiswasjfortwniladrri@gmail.com232-567-879257-098-3456NULL114 Pacifica Ste 120NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL
4NiladriBiswasjfortwniladrri@gmail.com232-567-879257-098-3456NULL114 Pacifica Ste 120IrvineNULLNULLNULLNULLNULLNULLNULLNULLNULL
5NiladriBiswasjfortwniladrri@gmail.com232-567-879257-098-3456NULL114 Pacifica Ste 120NULLIrvineCA92618NULLNULLNULLNULLNULLNULL
6NiladriBiswasjfortwniladrri@gmail.comNULLNULLNULLNULLNULL114 Pacifica Ste 120NULLIrvineIrvineCA92618
My output should like this
contactidFirstNamelastnameofficeridemailcellphoneHomephoneworkphoneHomeAddress1HomeAddress2HomecityHomeStateHomeZipworkaddress1workaddress2workcityworkcityworkstateworkzip
6NiladriBiswasjfortwniladrri@gmail.com232-567-879257-098-3456NULL114 Pacifica Ste 120IrvineIrvineCA92618114 Pacifica Ste 120NULLIrvineIrvineCA92618
Here is the insert query
SET IDENTITY_INSERT [dbo].[contacts2] ON
GO
INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (1, N'jfortw', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'257-098-3456', NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
GO
INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (2, N'jfortw', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'257-098-3456', N'232-567-879', NULL, N'114 Pacifica Ste 120', NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
GO
INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (3, N'jfortw', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'257-098-3456', N'232-567-879', NULL, N'114 Pacifica Ste 120', NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
GO
INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (4, N'jfortw', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'257-098-3456', N'232-567-879', NULL, N'114 Pacifica Ste 120', N'Irvine', NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
GO
INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (5, N'jfortw', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'257-098-3456', N'232-567-879', NULL, N'114 Pacifica Ste 120', NULL, N'Irvine', N'CA', N'92618', NULL, NULL, NULL, NULL, NULL)
GO
INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (6, N'jfortw', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'', N'', NULL, N'', N'', NULL, NULL, NULL, N'114 Pacifica Ste 120', NULL, N'Irvine', N'CA', N'92618')
GO
SET IDENTITY_INSERT [dbo].[contacts2] OFF
GO
Yes you are right we need to consolidate records. the records are unique based on OfficerID, LastName,FirstName
Again thanks.
Niladri
March 4, 2016 at 7:34 am
A word of caution, this query will not perform well over large datasets. Your data needs to be processed in a partitioned fashion (by OfficerID, FirstName and LastName) and then within that row-by-row, per-column. What I handed over is basically a cursor masquerading as set-based code but I could not come up with another way to get there short of writing an actual cursor (I tried a variation of OUTER APPLY as well but this actually did better over a large dataset). Maybe someone else will come along and provide a better way to solve this one. I think we did a good job of getting to the bottom of your requirement at least.
You did not provide any indexes with your table definition. Does that mean there are none or that you just did not provide them? If there is no index where the leading edge consists of OfficerID, FirstName and LastName in some order you should consider creating one to support this query.
[font="Courier New"]WITH cte AS (SELECT
MAX(ContactID) AS ContactID,
OfficerID,
FirstName,
LastName
FROM dbo.contacts2
GROUP BY OfficerID,
FirstName,
LastName)
SELECT ContactID,
OfficerID,
FirstName,
LastName,
(SELECT TOP 1 HomePhone FROM dbo.contacts2 WHERE OfficerID = cte.OfficerID AND FirstName = cte.FirstName AND LastName = cte.LastName AND HomePhone > '' ORDER BY ContactID DESC) AS HomePhone,
(SELECT TOP 1 CellPhone FROM dbo.contacts2 WHERE OfficerID = cte.OfficerID AND FirstName = cte.FirstName AND LastName = cte.LastName AND CellPhone > '' ORDER BY ContactID DESC) AS CellPhone,
(SELECT TOP 1 WorkPhone FROM dbo.contacts2 WHERE OfficerID = cte.OfficerID AND FirstName = cte.FirstName AND LastName = cte.LastName AND WorkPhone > '' ORDER BY ContactID DESC) AS WorkPhone,
(SELECT TOP 1 HomeAddress1 FROM dbo.contacts2 WHERE OfficerID = cte.OfficerID AND FirstName = cte.FirstName AND LastName = cte.LastName AND HomeAddress1 > '' ORDER BY ContactID DESC) AS HomeAddress1,
(SELECT TOP 1 HomeAddress2 FROM dbo.contacts2 WHERE OfficerID = cte.OfficerID AND FirstName = cte.FirstName AND LastName = cte.LastName AND HomeAddress2 > '' ORDER BY ContactID DESC) AS HomeAddress2,
(SELECT TOP 1 HomeCity FROM dbo.contacts2 WHERE OfficerID = cte.OfficerID AND FirstName = cte.FirstName AND LastName = cte.LastName AND HomeCity > '' ORDER BY ContactID DESC) AS HomeCity,
(SELECT TOP 1 HomeState FROM dbo.contacts2 WHERE OfficerID = cte.OfficerID AND FirstName = cte.FirstName AND LastName = cte.LastName AND HomeState > '' ORDER BY ContactID DESC) AS HomeState,
(SELECT TOP 1 HomeZip FROM dbo.contacts2 WHERE OfficerID = cte.OfficerID AND FirstName = cte.FirstName AND LastName = cte.LastName AND HomeZip > '' ORDER BY ContactID DESC) AS HomeZip
FROM cte;
[/font]
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
March 4, 2016 at 8:43 am
I'm very cautious about automatically updating records like this, because there is so much variation for entering the same information. For instance, addresses with apartment numbers. If you have records like
100 Main St, Apt 2A | | Anywhere | PA | 191xx
100 Main St | Apt 2A | Anywhere | PA | 191xx
After merging, you will get
100 Main St, Apt 2A | Apt 2A | Anywhere | PA | 191xx
with the apartment number appear in both lines 1 and 2 of the address.
I also had one person sign his name in each of the following ways:
First Middle Last
F. Middle Last
F. M. Last
I would constantly have to merge his records together, because people would reintroduce records with one of the abbreviated forms.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
March 4, 2016 at 9:25 am
drew.allen (3/4/2016)
I'm very cautious about automatically updating records like this, because there is so much variation for entering the same information. For instance, addresses with apartment numbers. If you have records like100 Main St, Apt 2A | | Anywhere | PA | 191xx
100 Main St | Apt 2A | Anywhere | PA | 191xx
After merging, you will get
100 Main St, Apt 2A | Apt 2A | Anywhere | PA | 191xx
with the apartment number appear in both lines 1 and 2 of the address.
I also had one person sign his name in each of the following ways:
First Middle Last
F. Middle Last
F. M. Last
I would constantly have to merge his records together, because people would reintroduce records with one of the abbreviated forms.
Drew
I had the same concern in my earlier line of questions but it seems it is not a problem for the OP to blend the columns across rows. Even in the sample data you can see where HomeAddress2 has a data quality issue that will result in a city name appearing twice on a printed mailing label. Luckily this requirement does not involve names but it has the same type of pitfalls as you point out.
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
March 4, 2016 at 11:25 pm
Hi Thanks for your reply...
It is absolutely the same I am looking for. Thanks a lot. I will update the table from cte. Is there any chance to delete the other records ? I want to automate this through SSIS.
But like you I am also wondering. there are almost 43 lakhs records in that table and unique is almost 31 lakhs (comination of firstname,lastname and officerid ).
Again thanks a lot.
March 5, 2016 at 1:47 am
niladri.primalink (3/4/2016)
Hi Thanks for your reply...It is absolutely the same I am looking for. Thanks a lot. I will update the table from cte. Is there any chance to delete the other records ? I want to automate this through SSIS.
But like you I am also wondering. there are almost 43 lakhs records in that table and unique is almost 31 lakhs (comination of firstname,lastname and officerid ).
Again thanks a lot.
You might want to persist the results from the CTE to a #temp table so you can not only use it to update what will be the surviving rows but also to delete all other rows where the ContactID is not in the #temp table. You may also want to consider only deleting rows with a ContactID less than the max value in the #temp table in case new rows enter the permanent table after you start processing.
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
March 7, 2016 at 1:44 am
Hi Orlando,
your query works like a magic. Thanks a lot. What I have done is that I have taken the records into a new table and then match the records with the contacts and then Updated the base table and after that I have deleted the duplicates.
Thanks a lot. Definitely I will love to take help from you later.
Regards,
Niladri
Viewing 15 posts - 1 through 15 (of 15 total)
You must be logged in to reply to this topic. Login to reply