Return the most frequent value

Question

Return the most frequent value

DerbyNeal

Hall of Fame

Points: 3854
More actions
July 21, 2008 at 9:40 am

#188315

Hi,
I have a table (Cleansing.Customers) which contains AccountNum, DM_Source_Id columns (both form part of a key) and an Address column.
There may be several occurences of AccountNum each with a different DM_Source_Id
I need to return the most frequent occurrence of an Address 'value' for each AccountNum. The 'Address' may change for each DM_Source_Id and I need to represent the most common value (or the first if they are all different).
Any ideas how I do this with a non-numeric (nvarchar) field?
Thanks in advance,
Neal

Viewing 15 posts - 1 through 15 (of 33 total)

You must be logged in to reply to this topic. Login to reply

Christopher Stobbs SSC-Insane Points: 21098 More actions · Answer 1

DO you have sample data for us?

----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley

:w00t:
Posting Best Practices[/url]
Numbers / Tally Tables[/url]

SQL-4-Life

Christopher Stobbs SSC-Insane Points: 21098 More actions · Answer 2

Ok I thought I might have a look at this with my own data.

Would this work for you?

SELECT DISTINCT [AccountNum],[Address]

FROM

(SELECT

RANK() OVER (PARTITION BY [AccountNum] ORDER BY [Address]) as [Rank]

,[AccountNum]

,[Address]

FROM Cleansing.Customers) nest

WHERE [Rank] = 1

It sounds like a ranking problem so this should head you in the correct direction.

I supose you could also add the rank query to a CTE and then select from there where Rank = 1 not sure how much of a performance difference it would make.

But hope this helps.

Thanks

Chris

----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley

:w00t:
Posting Best Practices[/url]
Numbers / Tally Tables[/url]

SQL-4-Life

Jack Corbett SSC Guru Points: 184394 More actions · Answer 3

I think this may do what you want (I created some dummy data):

[font="Courier New"]DECLARE @customers TABLE(AccountNum INT, DM_Source_Id INT, Address VARCHAR(10))

DECLARE @i INT,

@j INT

SET @i = 1

SET @j = 1

WHILE @i <= 10

BEGIN

WHILE @j <= 10

BEGIN

IF @i = 1

BEGIN

INSERT INTO @customers

SELECT

@i,

@j,

@j

END

ELSE

BEGIN

IF @j % 3 = 0

BEGIN

INSERT INTO @customers

SELECT

@i,

@j,

@j

END

ELSE

BEGIN

INSERT INTO @customers

SELECT

@i,

@j,

@i

END

SET @j = @j+ 1

END

SET @i = @i + 1

SET @j = 1

END

;WITH cteTopAddresses AS

(

SELECT

Row_Number() OVER(Partition BY AccountNum ORDER BY AccountNum, Address) AS row_Id,

COUNT(*) AS AddressCount,

Address,

AccountNum

FROM

@customers

GROUP BY

Address,

AccountNum

)

SELECT * FROM cteTopAddresses WHERE row_id = 1

[/font]

Also you need to realize that "first" is a relative expression in SQL Server, it will vary unless you provide an order.

Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question

Christopher Stobbs SSC-Insane Points: 21098 More actions · Answer 4

You may also want to add this to the order by in the OVER clause

[DM_Source_Id],[Address] Instead of juts address.

That way if they all the different, you will get the one with the lowest [DM_Source_Id], if thats what you want?

Thanks

Chris

----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley

:w00t:
Posting Best Practices[/url]
Numbers / Tally Tables[/url]

SQL-4-Life

DerbyNeal Hall of Fame Points: 3854 More actions · Answer 5

This is for Christoper's first reply above...

In the first example, records 1,2,3 & 5 have the same Address value, therefore 'Mcmullen Road/N /n/nDarlington/nDurham' must be returned as it's the most frequent.

AccountNum DM_Source_Id Address

46786 1 Mcmullen Road/N /n/nDarlington/nDurham

46786 2 Mcmullen Road/N /n/nDarlington/nDurham

46786 3 Mcmullen Road/N /n/nDarlington/nDurham

46786 4 Varleys/NMcmullen Road/n/nDarlington/nDurham

46786 5 Mcmullen Road/N /n/nDarlington/nDurham

In the next example there are 2 values which appear twice, so either can be returned (I don't mind)

AccountNum DM_Source_Id Address

46786 1 Mcmullen Road/N /n/nDarlington/nDurham

46786 2 Varleys/NMcmullen Road/n/nDarlington/nDurham

46786 3 Mcmullen Road/N /n/nDarlington/nDurham

46786 4 Varleys/NMcmullen Road/n/nDarlington/nDurham

Hope this helps.

Christopher Stobbs SSC-Insane Points: 21098 More actions · Answer 6

HI Jack,

Nice solution.

Is there a big performance difference between mine using a Rank and yours using a Row_number?

Thanks

Chris

----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley

:w00t:
Posting Best Practices[/url]
Numbers / Tally Tables[/url]

SQL-4-Life

Jack Corbett SSC Guru Points: 184394 More actions · Answer 7

Chris,

Thanks. I have not used all the new statistical functions in 2005 yet and had forgotten about rank. I changed my solution to use RANK and had about the same performance, based on Set Statistics Time On and IO ON results. The execution plans were nearly identical, mine had one more step.

I would think that on a real table RANK would probably work better because all mine does is create it's own rank and probably would require more IO.

Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question

Christopher Stobbs SSC-Insane Points: 21098 More actions · Answer 8

Thanks Jack,

I also did some testing on the data with 5mil rows and the rank seems to be slightly fast (6secs).

With small datasets of 100-1000's the times are pretty much the same in fact it changes with which method is faster.

Thanks , made for some interesting result testing this afternoon 🙂

----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley

:w00t:
Posting Best Practices[/url]
Numbers / Tally Tables[/url]

SQL-4-Life

Jeff Moden SSC Guru Points: 1004409 More actions · Answer 9

Jack Corbett (7/21/2008)
I think this may do what you want (I created some dummy data):

Jack... after you get done building it, change all the Addresses to 0 for where the DM_Source_ID = 1... see what happens... 🙂 I think the data you built is great but I also think it leads to an incorrect result.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004409 More actions · Answer 10

nealwright44 (7/21/2008)
This is for Christoper's first reply above...
In the first example, records 1,2,3 & 5 have the same Address value, therefore 'Mcmullen Road/N /n/nDarlington/nDurham' must be returned as it's the most frequent.
AccountNum DM_Source_Id Address
46786 1 Mcmullen Road/N /n/nDarlington/nDurham
46786 2 Mcmullen Road/N /n/nDarlington/nDurham
46786 3 Mcmullen Road/N /n/nDarlington/nDurham
46786 4 Varleys/NMcmullen Road/n/nDarlington/nDurham
46786 5 Mcmullen Road/N /n/nDarlington/nDurham
In the next example there are 2 values which appear twice, so either can be returned (I don't mind)
AccountNum DM_Source_Id Address
46786 1 Mcmullen Road/N /n/nDarlington/nDurham
46786 2 Varleys/NMcmullen Road/n/nDarlington/nDurham
46786 3 Mcmullen Road/N /n/nDarlington/nDurham
46786 4 Varleys/NMcmullen Road/n/nDarlington/nDurham
Hope this helps.

And, what do you want returned with the eventuality below?

AccountNum DM_Source_Id Address

46786 1 McMahn Road/N /n/nDarlington/nDurham

46786 2 McDonald Road/N /n/nDarlington/nDurham

46786 3 Mcmullen Road/N /n/nDarlington/nDurham

46786 4 St. Ives Road/N /n/nDarlington/nDurham

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Christopher Stobbs SSC-Insane Points: 21098 More actions · Answer 11

HI Jeff,

I think he doesn't mind which is returned if they are all different.

I added an extra ordering to my rank query so that the first is returned but that is up to him really.

Thanks

Chris

----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley

:w00t:
Posting Best Practices[/url]
Numbers / Tally Tables[/url]

SQL-4-Life

RyanRandall SSChampion Points: 13623 More actions · Answer 12

Possibly I'm oversimplifying, but the following meets my understanding of the requirements...

select top 1 Address from myTable group by Address order by count(*) desc

Ryan Randall

Solutions are easy. Understanding the problem, now, that's the hard part.

Christopher Stobbs SSC-Insane Points: 21098 More actions · Answer 13

HI Ryan,

That won't work as it will only return the 1 row.

----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley

:w00t:
Posting Best Practices[/url]
Numbers / Tally Tables[/url]

SQL-4-Life

RyanRandall SSChampion Points: 13623 More actions · Answer 14

Hi Chris

Yes, you're right. I was going by the wording by in the examples post, but reading back I see this is needed for each account number. My apologies.

Ryan Randall

Solutions are easy. Understanding the problem, now, that's the hard part.