July 25, 2011 at 8:25 am
Hello guys,
I have the following table with the following data:
create table tbl1
(
ClusterID VARCHAR(20) null,
ID1 VARCHAR(20) null,
ID2 VARCHAR(20) null
)
GO
insert into tbl1 tbl1 values
(null, '1', '2');
insert into tbl1 tbl1 values
(null, '2', '3');
insert into tbl1 tbl1 values
(null, '1', '3');
insert into tbl1 tbl1 values
(null, '4', '5');
insert into tbl1 tbl1 values
(null, '5', '6');
insert into tbl1 tbl1 values
(null, '6', '7');
GO
Is there a way to generate some sort of a Cluster ID based on the columns ID1 and ID2?
The result should be something like this:
ClusterIDID1ID2
-------------------------
CL112
CL123
CL113
CL245
CL256
CL267
Any help is much appreciated.
Thank you!
Andrei
July 25, 2011 at 11:58 am
What defines a cluster? You need something to indicate what each grouping should be.
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
July 25, 2011 at 2:20 pm
I gave my input on this thread!
July 26, 2011 at 12:25 am
Guys, thanks for the interest in solving this problem. Apologies for not giving enough information. I have it in my head and sometimes I have the wrong impression that just a few details are enough. I will try again 🙂
I have a table in my database with company information. Unfortunately it does contain duplicated records. I was giving the task to find a solution to remove these duplicates. The problem is that the duplicates are not exact and I used some fuzzy algorithms to identify them.
This is a sample of my table with company info:
IDCompanyName
------------------------
1Alfa Inc
2Alfa
3Alfa Co.
4Beta New York
5Beta Incorporated
6Beta Co
7Beta Oy
8Omega
After deduping I got the following table:
ClusterIDID1ID2
-------------------------
CL112
CL123
CL113
CL245
CL256
CL275
I will explain it a bit:
The company with ID=1 is the same as the one with ID = 2 (Alfa Inc and Alfa). At the same time, the company with ID = 2 is the same as the company with ID=3. Also, ID=1 is the same with ID=3. That means that the first 3 lines should get the same CLusterID. The same with the next three lines (company name = Beta). They should all get a new Cluster ID => CL2.
It is not a general rule that the Cluster is given to only 3 records. It can be only 2 or 5, or etc.
Thanks
July 26, 2011 at 5:32 am
What cluster Id you would give to company named as "Alfa Beta Omega Ltd"?
Do you check company names do match by the first word only?
July 26, 2011 at 7:02 am
I've found your sample inconsistent. I guess you will have all pairs of related Id's
I have updated your sample to have all ID pairs. However my query should work even for incomplete pairs.
But there is one important prerequisite: Id's for related items should be allocated sequentially. For example you should not have something like:
1, 2
1, 6
1, 7
2, 6
2, 7
3, 4
3, 5
4, 5
6, 7
Please also note that clustered index must be created to guarantee results!
create table tbl1
(
ClusterID VARCHAR(20) null,
ID1 VARCHAR(20) null,
ID2 VARCHAR(20) null
)
GO
insert into tbl1 values
(null, '1', '2');
insert into tbl1 values
(null, '2', '3');
insert into tbl1 values
(null, '1', '3');
insert into tbl1 values
(null, '4', '5');
insert into tbl1 values
(null, '4', '6');
insert into tbl1 values
(null, '4', '7');
insert into tbl1 values
(null, '4', '8');
insert into tbl1 values
(null, '5', '6');
insert into tbl1 values
(null, '5', '7');
insert into tbl1 values
(null, '5', '8');
insert into tbl1 values
(null, '6', '7');
insert into tbl1 values
(null, '6', '8');
insert into tbl1 values
(null, '7', '8');
insert into tbl1 values
(null, '9', '10');
GO
select * from tbl1
select cast(null as varchar(13)) as CID, ID1, ID2
into #cid
from tbl1
create unique clustered index ixc_#cid on #cid(ID1, ID2)
declare @cid int
declare @idlist varchar(max)
set @cid = 0
UPDATE #cid
SET @cid = CASE WHEN @idlist IS NULL OR
(CHARINDEX('|'+ CAST( ID1 as varchar(11))+'|', @idlist,0) <= 0 AND
CHARINDEX('|'+ CAST( ID2 as varchar(11))+'|', @idlist,0) <= 0)
THEN @cid + 1 ELSE @cid END
, CID = 'CL' + CAST(@cid as VARCHAR(11))
, @idlist = CASE WHEN @idlist IS NULL OR
(CHARINDEX('|'+ CAST( ID1 as varchar(11))+'|', @idlist,0) <= 0 AND
CHARINDEX('|'+ CAST( ID2 as varchar(11))+'|', @idlist,0) <= 0)
THEN '|' + CAST( ID1 as varchar(11)) + '||' + CAST( ID2 as varchar(11)) + '|'
ELSE @idlist + '|' + CAST( ID2 as varchar(11)) + '|'
END
select * FROM #cid
July 26, 2011 at 7:34 am
Thanks, but if I insert into tbl1 the following row in addition to your rows, the approach will not work anymore:
insert into tbl1 values
(null, '10', '1');
This row should get CL1 as well as 1 is already clustered with CL1. The thing is that it is not me who creates this table. I receive it from a data entry department. they are doing manual checks and provide me companies that are duplicates and I have to generate an unique ID for them.
Not every time the value in the column ID1 < ID2.
Best regards
Andrei
July 26, 2011 at 7:43 am
Sounds like you have a nightmare to deal with. It sounds like your "cluster" is based on the first word of the company name. Is that correct? How do you guarantee that will always work. What if you added Alpha Romeo Motor company? They are no longer part of your cluster because it isn't a dupe. Or you have Alpha Graphics (printing company) and Alpha Company (military supply)? We are willing and able to help but there just isn't much clarity to the problem still.
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
July 26, 2011 at 7:51 am
Sean, if the Data Entry department makes mistakes in finding matches, this is their problem, I don't care. My job is to create Cluster IDs based on what they provide me with.
I have written the below code. It works fine and it is what I was searching for. The only small issue I have now is that its performance is low. I have to tell you that I expect a table of ~ 50000 records to process. nice, huh ... :-(.
The code:
------------------------------
drop table dbo.tbl1
go
create table dbo.tbl1
(
ClusterID VARCHAR(20) null,
ID1 VARCHAR(20) null,
ID2 VARCHAR(20) null
)
GO
delete from tbl1;
insert into tbl1 values
(null, '1', '2');
insert into tbl1 values
(null, '2', '3');
insert into tbl1 values
(null, '1', '3');
insert into tbl1 values
(null, '4', '5');
insert into tbl1 values
(null, '5', '6');
insert into tbl1 values
(null, '6', '7');
insert into tbl1 values
(null, '7', '8');
insert into tbl1 values
(null, '9', '10');
insert into tbl1 values
(null, '11', '12');
insert into tbl1 values
(null, '13', '14');
insert into tbl1 values
(null, '14', '1');
insert into tbl1 values
(null, '14', '13');
GO
update dbo.tbl1 set clusterid=null
go
DECLARE @id1 VARCHAR(85)
DECLARE @id2 VARCHAR(85)
declare @found_cluster int
declare @max_cluster int
DECLARE cluster_cursor CURSOR FOR SELECT id1, id2 FROM tbl1
OPEN cluster_cursor
FETCH NEXT FROM cluster_cursor INTO @id1, @id2
WHILE @@FETCH_STATUS = 0
BEGIN
set @max_cluster = (select MAX(clusterid) from dbo.tbl1)
if @max_cluster is null
set @max_cluster = 0
set @found_cluster = (select min(clusterID) from dbo.tbl1
where id1 = @id1 or id2 = @id2 or id1 = @id2 or id2 = @id1)
if @found_cluster is not null
update dbo.tbl1 set clusterID = @found_cluster where id1 = @id1 or id2 = @id2 or id1 = @id2 or id2 = @id1
else
update dbo.tbl1 set clusterid = @max_cluster + 1 where id1 = @id1 and id2=@id2
FETCH NEXT FROM cluster_cursor INTO @id1, @id2
END
CLOSE cluster_cursor
DEALLOCATE cluster_cursor
GO
select * from dbo.tbl1
--------------------------------------
Any help in improving the completion time is much appreciated.
Thanks guys!
Andrei
July 26, 2011 at 8:06 am
OK, sounds like you have inherited some dodgy data, but what can you do.
My understanding of your problem is that you want to "cluster" CompanyNames where the first word is the same.
I'm going to break my approach down into a few steps just for simplicity.
So, first I would add a column, maybe called ShortName, and that would hold the first word of the CompanyName. You can always delete it afterwards. You can do that thus:
alter table #ClusterMe add ShortName varchar(10)
Then populate the shortnames:
UPDATE #ClusterMe SET ShortName = CASE CHARINDEX(' ', CompanyName, 1)
WHEN 0 THEN CompanyName
ELSE SUBSTRING(CompanyName, 1, CHARINDEX(' ', CompanyName, 1) - 1)
END
Coolio. Now we have a column with ShortNames in.
Now to create the clusters, I think you can use dense_rank for this like so:
select CompanyName, ShortName, 'CL' + cast(dense_rank() over ( order by ShortName)as varchar(5)) as ClusterID from #ClusterMe
Hope this helps,
Jamie
July 26, 2011 at 8:18 am
andrei.burdun (7/26/2011)
Thanks, but if I insert into tbl1 the following row in addition to your rows, the approach will not work anymore:insert into tbl1 values
(null, '10', '1');
This row should get CL1 as well as 1 is already clustered with CL1. The thing is that it is not me who creates this table. I receive it from a data entry department. they are doing manual checks and provide me companies that are duplicates and I have to generate an unique ID for them.
Not every time the value in the column ID1 < ID2.
Best regards
Andrei
I have mentioned it as important prerequisite, haven't I?
If the decision to group companies are made by the business, I would suggest that business should allocate some kind of "Company Group Name" and set it to whatever they like.
If companies are groupped by SQL, then you should allocate Cluster ID as part of this SQL, otherwise, I'm afraid, you will need to use cursor as you now do and it will be slow. However 50000 records doesn't sound a lot to me, even my iPad can handle it quite fast 😀
To make your cursor a bit faster, you can reduce number of unnecessary updates: you should only update records where clusterid is not already set.
....
if @found_cluster is not null
update dbo.tbl1 set clusterID = @found_cluster where (id1 = @id1 or id2 = @id2 or id1 = @id2 or id2 = @id1)
and clusterID is null
else
update dbo.tbl1 set clusterid = @max_cluster + 1 where (id1 = @id1 and id2=@id2) and clusterID is null
...
July 26, 2011 at 8:35 am
I am thinking there has to be a set based solution to this but my brain is too full with my own stuff today. 😉
At any rate I moved a few things around inside you cursor which should help a little bit. Unfortunately **cough** cursors **cough** are slow beasts.
DECLARE @id1 VARCHAR(85)
DECLARE @id2 VARCHAR(85)
declare @found_cluster int
--declare @max_cluster int --don't really need this
DECLARE cluster_cursor CURSOR FOR SELECT id1, id2 FROM tbl1
OPEN cluster_cursor
FETCH NEXT FROM cluster_cursor INTO @id1, @id2
WHILE @@FETCH_STATUS = 0
BEGIN
set @found_cluster = (select min(clusterID) from dbo.tbl1
where id1 in (@id1, @id2) or id2 in (@id1, @id2)
if @found_cluster is not null
update dbo.tbl1 set clusterID = @found_cluster where id1 in (@id1, @id2) or id2 in (@id1, @id2) and clusterID is null
else
begin
--moved this inside so you don't do this every iteration.
--also moved the if null and such to a single statement
update dbo.tbl1 set clusterid = (select ISNULL(MAX(clusterid), 0) + 1 from dbo.tbl1)
where id1 = @id1 and id2 = @id2 and clusterID is null
end
FETCH NEXT FROM cluster_cursor INTO @id1, @id2
END
CLOSE cluster_cursor
DEALLOCATE cluster_cursor
If I get caught up and have some time later this afternoon I will try to revisit this and see if there isn't a way to make this set based.
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
July 26, 2011 at 9:30 am
Sean Lange (7/26/2011)
I am thinking there has to be a set based solution to this but my brain is too full with my own stuff today. 😉...
There is...
At least one, and hope it will be faster than a cursor :hehe::
--drop table tbl1
create table tbl1
(
ClusterID VARCHAR(20) null,
ID1 VARCHAR(20) null,
ID2 VARCHAR(20) null
)
GO
insert into tbl1 values
(null, '1', '2');
insert into tbl1 values
(null, '1', '6');
insert into tbl1 values
(null, '1', '7');
insert into tbl1 values
(null, '2', '6');
insert into tbl1 values
(null, '2', '7');
insert into tbl1 values
(null, '3', '4');
insert into tbl1 values
(null, '3', '5');
insert into tbl1 values
(null, '4', '5');
insert into tbl1 values
(null, '6', '7');
GO
--drop table #cid
select cast(null as varchar(30)) as CID, ID1, ID2
into #cid
from tbl1
create unique clustered index ixc_#cid on #cid(ID1, ID2)
declare @cid int
declare @clcid char(13)
declare @idlist varchar(max)
set @cid = 0
set @idlist = ''
UPDATE #cid
SET @cid = CASE WHEN @idlist = '' OR
(CHARINDEX('|'+ CAST( ID1 as char(11))+'|', @idlist,0) <= 0 AND
CHARINDEX('|'+ CAST( ID2 as char(11))+'|', @idlist,0) <= 0)
THEN @cid + 1 ELSE @cid END
, CID = rtrim(CASE WHEN CHARINDEX('|'+ CAST( ID1 as char(11))+'|', @idlist,0) > 0
THEN SUBSTRING(@idlist,
CHARINDEX('|'+ CAST( ID1 as char(11))+'|', @idlist,0) + 14,
13)
WHEN CHARINDEX('|'+ CAST( ID2 as varchar(11))+'|', @idlist,0) > 0
THEN SUBSTRING(@idlist,
CHARINDEX('|'+ CAST( ID2 as char(11))+'|', @idlist,0) + 14,
13)
ELSE 'CL' + CAST(@cid as VARCHAR(11))
END)
, @clcid = 'CL' + CAST(@cid as VARCHAR(11))
, @idlist = CASE WHEN @idlist = '' OR
(CHARINDEX('|'+ CAST( ID1 as char(11))+'|', @idlist,0) <= 0 AND
CHARINDEX('|'+ CAST( ID2 as char(11))+'|', @idlist,0) <= 0)
THEN @idlist + '|' + CAST( ID1 as char(11)) + '|~' + @clcid +'~|' + CAST( ID2 as char(11)) + '|~' +@clcid + '~'
ELSE @idlist + '|' + CAST( ID2 as char(11)) + '|~' + @clcid + '~'
END
select * FROM #cid
July 26, 2011 at 10:42 am
Sweet. I am guessing that is way faster than that old nasty cursor. Nice work! Will have to see if this meets the criteria for the OP but it looks like it does to the best of my understanding anyway.
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
January 25, 2012 at 1:19 pm
Thank you to all those involved in this brainstorming. I really appreciate your help. Will go on with the last suggested approach. It seems correct and fast.
Andrei
Viewing 15 posts - 1 through 14 (of 14 total)
You must be logged in to reply to this topic. Login to reply