July 21, 2008 at 9:40 am
Hi,
I have a table (Cleansing.Customers) which contains AccountNum, DM_Source_Id columns (both form part of a key) and an Address column.
There may be several occurences of AccountNum each with a different DM_Source_Id
I need to return the most frequent occurrence of an Address 'value' for each AccountNum. The 'Address' may change for each DM_Source_Id and I need to represent the most common value (or the first if they are all different).
Any ideas how I do this with a non-numeric (nvarchar) field?
Thanks in advance,
Neal
July 21, 2008 at 9:47 am
DO you have sample data for us?
----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley
:w00t:
Posting Best Practices[/url]
Numbers / Tally Tables[/url]
July 21, 2008 at 9:58 am
Ok I thought I might have a look at this with my own data.
Would this work for you?
SELECT DISTINCT [AccountNum],[Address]
FROM
(SELECT
RANK() OVER (PARTITION BY [AccountNum] ORDER BY [Address]) as [Rank]
,[AccountNum]
,[Address]
FROM Cleansing.Customers) nest
WHERE [Rank] = 1
It sounds like a ranking problem so this should head you in the correct direction.
I supose you could also add the rank query to a CTE and then select from there where Rank = 1 not sure how much of a performance difference it would make.
But hope this helps.
Thanks
Chris
----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley
:w00t:
Posting Best Practices[/url]
Numbers / Tally Tables[/url]
July 21, 2008 at 10:01 am
I think this may do what you want (I created some dummy data):
[font="Courier New"]DECLARE @customers TABLE(AccountNum INT, DM_Source_Id INT, Address VARCHAR(10))
DECLARE @i INT,
@j INT
SET @i = 1
SET @j = 1
WHILE @i <= 10
BEGIN
WHILE @j <= 10
BEGIN
IF @i = 1
BEGIN
INSERT INTO @customers
SELECT
@i,
@j,
@j
END
ELSE
BEGIN
IF @j % 3 = 0
BEGIN
INSERT INTO @customers
SELECT
@i,
@j,
@j
END
ELSE
BEGIN
INSERT INTO @customers
SELECT
@i,
@j,
@i
END
END
SET @j = @j+ 1
END
SET @i = @i + 1
SET @j = 1
END
;WITH cteTopAddresses AS
(
SELECT
Row_Number() OVER(Partition BY AccountNum ORDER BY AccountNum, Address) AS row_Id,
COUNT(*) AS AddressCount,
Address,
AccountNum
FROM
@customers
GROUP BY
Address,
AccountNum
)
SELECT * FROM cteTopAddresses WHERE row_id = 1
[/font]
Also you need to realize that "first" is a relative expression in SQL Server, it will vary unless you provide an order.
Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question
July 21, 2008 at 10:01 am
You may also want to add this to the order by in the OVER clause
[DM_Source_Id],[Address] Instead of juts address.
That way if they all the different, you will get the one with the lowest [DM_Source_Id], if thats what you want?
Thanks
Chris
----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley
:w00t:
Posting Best Practices[/url]
Numbers / Tally Tables[/url]
July 21, 2008 at 10:02 am
This is for Christoper's first reply above...
In the first example, records 1,2,3 & 5 have the same Address value, therefore 'Mcmullen Road/N /n/nDarlington/nDurham' must be returned as it's the most frequent.
AccountNum DM_Source_Id Address
46786 1 Mcmullen Road/N /n/nDarlington/nDurham
46786 2 Mcmullen Road/N /n/nDarlington/nDurham
46786 3 Mcmullen Road/N /n/nDarlington/nDurham
46786 4 Varleys/NMcmullen Road/n/nDarlington/nDurham
46786 5 Mcmullen Road/N /n/nDarlington/nDurham
In the next example there are 2 values which appear twice, so either can be returned (I don't mind)
AccountNum DM_Source_Id Address
46786 1 Mcmullen Road/N /n/nDarlington/nDurham
46786 2 Varleys/NMcmullen Road/n/nDarlington/nDurham
46786 3 Mcmullen Road/N /n/nDarlington/nDurham
46786 4 Varleys/NMcmullen Road/n/nDarlington/nDurham
Hope this helps.
July 21, 2008 at 10:06 am
HI Jack,
Nice solution.
Is there a big performance difference between mine using a Rank and yours using a Row_number?
Thanks
Chris
----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley
:w00t:
Posting Best Practices[/url]
Numbers / Tally Tables[/url]
July 21, 2008 at 10:25 am
Chris,
Thanks. I have not used all the new statistical functions in 2005 yet and had forgotten about rank. I changed my solution to use RANK and had about the same performance, based on Set Statistics Time On and IO ON results. The execution plans were nearly identical, mine had one more step.
I would think that on a real table RANK would probably work better because all mine does is create it's own rank and probably would require more IO.
Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question
July 21, 2008 at 10:30 am
Thanks Jack,
I also did some testing on the data with 5mil rows and the rank seems to be slightly fast (6secs).
With small datasets of 100-1000's the times are pretty much the same in fact it changes with which method is faster.
Thanks , made for some interesting result testing this afternoon 🙂
----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley
:w00t:
Posting Best Practices[/url]
Numbers / Tally Tables[/url]
July 21, 2008 at 6:51 pm
Jack Corbett (7/21/2008)
I think this may do what you want (I created some dummy data):
Jack... after you get done building it, change all the Addresses to 0 for where the DM_Source_ID = 1... see what happens... 🙂 I think the data you built is great but I also think it leads to an incorrect result.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 21, 2008 at 7:26 pm
nealwright44 (7/21/2008)
This is for Christoper's first reply above...In the first example, records 1,2,3 & 5 have the same Address value, therefore 'Mcmullen Road/N /n/nDarlington/nDurham' must be returned as it's the most frequent.
AccountNum DM_Source_Id Address
46786 1 Mcmullen Road/N /n/nDarlington/nDurham
46786 2 Mcmullen Road/N /n/nDarlington/nDurham
46786 3 Mcmullen Road/N /n/nDarlington/nDurham
46786 4 Varleys/NMcmullen Road/n/nDarlington/nDurham
46786 5 Mcmullen Road/N /n/nDarlington/nDurham
In the next example there are 2 values which appear twice, so either can be returned (I don't mind)
AccountNum DM_Source_Id Address
46786 1 Mcmullen Road/N /n/nDarlington/nDurham
46786 2 Varleys/NMcmullen Road/n/nDarlington/nDurham
46786 3 Mcmullen Road/N /n/nDarlington/nDurham
46786 4 Varleys/NMcmullen Road/n/nDarlington/nDurham
Hope this helps.
And, what do you want returned with the eventuality below?
AccountNum DM_Source_Id Address
46786 1 McMahn Road/N /n/nDarlington/nDurham
46786 2 McDonald Road/N /n/nDarlington/nDurham
46786 3 Mcmullen Road/N /n/nDarlington/nDurham
46786 4 St. Ives Road/N /n/nDarlington/nDurham
--Jeff Moden
Change is inevitable... Change for the better is not.
July 22, 2008 at 2:25 am
HI Jeff,
I think he doesn't mind which is returned if they are all different.
I added an extra ordering to my rank query so that the first is returned but that is up to him really.
Thanks
Chris
----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley
:w00t:
Posting Best Practices[/url]
Numbers / Tally Tables[/url]
July 22, 2008 at 3:00 am
Possibly I'm oversimplifying, but the following meets my understanding of the requirements...
select top 1 Address from myTable group by Address order by count(*) desc
Ryan Randall
Solutions are easy. Understanding the problem, now, that's the hard part.
July 22, 2008 at 3:50 am
HI Ryan,
That won't work as it will only return the 1 row.
----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley
:w00t:
Posting Best Practices[/url]
Numbers / Tally Tables[/url]
July 22, 2008 at 4:45 am
Hi Chris
Yes, you're right. I was going by the wording by in the examples post, but reading back I see this is needed for each account number. My apologies.
Ryan Randall
Solutions are easy. Understanding the problem, now, that's the hard part.
Viewing 15 posts - 1 through 15 (of 33 total)
You must be logged in to reply to this topic. Login to reply