Return the most frequent value

  • Hi,

    I have a table (Cleansing.Customers) which contains AccountNum, DM_Source_Id columns (both form part of a key) and an Address column.

    There may be several occurences of AccountNum each with a different DM_Source_Id

    I need to return the most frequent occurrence of an Address 'value' for each AccountNum. The 'Address' may change for each DM_Source_Id and I need to represent the most common value (or the first if they are all different).

    Any ideas how I do this with a non-numeric (nvarchar) field?

    Thanks in advance,

    Neal

  • DO you have sample data for us?

    ----------------------------------------------
    Try to learn something about everything and everything about something. - Thomas Henry Huxley

    :w00t:
    Posting Best Practices[/url]
    Numbers / Tally Tables[/url]

    SQL-4-Life
  • Ok I thought I might have a look at this with my own data.

    Would this work for you?

    SELECT DISTINCT [AccountNum],[Address]

    FROM

    (SELECT

    RANK() OVER (PARTITION BY [AccountNum] ORDER BY [Address]) as [Rank]

    ,[AccountNum]

    ,[Address]

    FROM Cleansing.Customers) nest

    WHERE [Rank] = 1

    It sounds like a ranking problem so this should head you in the correct direction.

    I supose you could also add the rank query to a CTE and then select from there where Rank = 1 not sure how much of a performance difference it would make.

    But hope this helps.

    Thanks

    Chris

    ----------------------------------------------
    Try to learn something about everything and everything about something. - Thomas Henry Huxley

    :w00t:
    Posting Best Practices[/url]
    Numbers / Tally Tables[/url]

    SQL-4-Life
  • I think this may do what you want (I created some dummy data):

    [font="Courier New"]DECLARE @customers TABLE(AccountNum INT, DM_Source_Id INT, Address VARCHAR(10))

    DECLARE @i INT,

                 @j INT

    SET @i = 1

    SET @j = 1

    WHILE @i <= 10

        BEGIN

            WHILE @j <= 10

                BEGIN

                    IF @i = 1

                        BEGIN

                            INSERT INTO @customers

                                SELECT

                                    @i,

                                    @j,

                                    @j

                        END

                    ELSE

                        BEGIN

                            IF @j % 3 = 0

                                BEGIN

                                    INSERT INTO @customers

                                        SELECT

                                            @i,

                                            @j,

                                            @j

                                END

                            ELSE

                                BEGIN

                                       INSERT INTO @customers

                                            SELECT

                                                @i,

                                                @j,

                                                @i

                                END

                        END

                    SET @j = @j+ 1

                END

            

            SET @i = @i + 1

            SET @j = 1

        END

    ;WITH cteTopAddresses AS

        (

        SELECT

            Row_Number() OVER(Partition BY AccountNum ORDER BY AccountNum, Address) AS row_Id,

            COUNT(*) AS AddressCount,

            Address,

            AccountNum

        FROM

            @customers

        GROUP BY

            Address,

            AccountNum

      

        )

        SELECT * FROM cteTopAddresses WHERE row_id = 1

    [/font]

    Also you need to realize that "first" is a relative expression in SQL Server, it will vary unless you provide an order.

  • You may also want to add this to the order by in the OVER clause

    [DM_Source_Id],[Address] Instead of juts address.

    That way if they all the different, you will get the one with the lowest [DM_Source_Id], if thats what you want?

    Thanks

    Chris

    ----------------------------------------------
    Try to learn something about everything and everything about something. - Thomas Henry Huxley

    :w00t:
    Posting Best Practices[/url]
    Numbers / Tally Tables[/url]

    SQL-4-Life
  • This is for Christoper's first reply above...

    In the first example, records 1,2,3 & 5 have the same Address value, therefore 'Mcmullen Road/N /n/nDarlington/nDurham' must be returned as it's the most frequent.

    AccountNum DM_Source_Id Address

    46786 1 Mcmullen Road/N /n/nDarlington/nDurham

    46786 2 Mcmullen Road/N /n/nDarlington/nDurham

    46786 3 Mcmullen Road/N /n/nDarlington/nDurham

    46786 4 Varleys/NMcmullen Road/n/nDarlington/nDurham

    46786 5 Mcmullen Road/N /n/nDarlington/nDurham

    In the next example there are 2 values which appear twice, so either can be returned (I don't mind)

    AccountNum DM_Source_Id Address

    46786 1 Mcmullen Road/N /n/nDarlington/nDurham

    46786 2 Varleys/NMcmullen Road/n/nDarlington/nDurham

    46786 3 Mcmullen Road/N /n/nDarlington/nDurham

    46786 4 Varleys/NMcmullen Road/n/nDarlington/nDurham

    Hope this helps.

  • HI Jack,

    Nice solution.

    Is there a big performance difference between mine using a Rank and yours using a Row_number?

    Thanks

    Chris

    ----------------------------------------------
    Try to learn something about everything and everything about something. - Thomas Henry Huxley

    :w00t:
    Posting Best Practices[/url]
    Numbers / Tally Tables[/url]

    SQL-4-Life
  • Chris,

    Thanks. I have not used all the new statistical functions in 2005 yet and had forgotten about rank. I changed my solution to use RANK and had about the same performance, based on Set Statistics Time On and IO ON results. The execution plans were nearly identical, mine had one more step.

    I would think that on a real table RANK would probably work better because all mine does is create it's own rank and probably would require more IO.

  • Thanks Jack,

    I also did some testing on the data with 5mil rows and the rank seems to be slightly fast (6secs).

    With small datasets of 100-1000's the times are pretty much the same in fact it changes with which method is faster.

    Thanks , made for some interesting result testing this afternoon 🙂

    ----------------------------------------------
    Try to learn something about everything and everything about something. - Thomas Henry Huxley

    :w00t:
    Posting Best Practices[/url]
    Numbers / Tally Tables[/url]

    SQL-4-Life
  • Jack Corbett (7/21/2008)


    I think this may do what you want (I created some dummy data):

    Jack... after you get done building it, change all the Addresses to 0 for where the DM_Source_ID = 1... see what happens... 🙂 I think the data you built is great but I also think it leads to an incorrect result.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • nealwright44 (7/21/2008)


    This is for Christoper's first reply above...

    In the first example, records 1,2,3 & 5 have the same Address value, therefore 'Mcmullen Road/N /n/nDarlington/nDurham' must be returned as it's the most frequent.

    AccountNum DM_Source_Id Address

    46786 1 Mcmullen Road/N /n/nDarlington/nDurham

    46786 2 Mcmullen Road/N /n/nDarlington/nDurham

    46786 3 Mcmullen Road/N /n/nDarlington/nDurham

    46786 4 Varleys/NMcmullen Road/n/nDarlington/nDurham

    46786 5 Mcmullen Road/N /n/nDarlington/nDurham

    In the next example there are 2 values which appear twice, so either can be returned (I don't mind)

    AccountNum DM_Source_Id Address

    46786 1 Mcmullen Road/N /n/nDarlington/nDurham

    46786 2 Varleys/NMcmullen Road/n/nDarlington/nDurham

    46786 3 Mcmullen Road/N /n/nDarlington/nDurham

    46786 4 Varleys/NMcmullen Road/n/nDarlington/nDurham

    Hope this helps.

    And, what do you want returned with the eventuality below?

    AccountNum DM_Source_Id Address

    46786 1 McMahn Road/N /n/nDarlington/nDurham

    46786 2 McDonald Road/N /n/nDarlington/nDurham

    46786 3 Mcmullen Road/N /n/nDarlington/nDurham

    46786 4 St. Ives Road/N /n/nDarlington/nDurham

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • HI Jeff,

    I think he doesn't mind which is returned if they are all different.

    I added an extra ordering to my rank query so that the first is returned but that is up to him really.

    Thanks

    Chris

    ----------------------------------------------
    Try to learn something about everything and everything about something. - Thomas Henry Huxley

    :w00t:
    Posting Best Practices[/url]
    Numbers / Tally Tables[/url]

    SQL-4-Life
  • Possibly I'm oversimplifying, but the following meets my understanding of the requirements...

    select top 1 Address from myTable group by Address order by count(*) desc

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • HI Ryan,

    That won't work as it will only return the 1 row.

    ----------------------------------------------
    Try to learn something about everything and everything about something. - Thomas Henry Huxley

    :w00t:
    Posting Best Practices[/url]
    Numbers / Tally Tables[/url]

    SQL-4-Life
  • Hi Chris

    Yes, you're right. I was going by the wording by in the examples post, but reading back I see this is needed for each account number. My apologies.

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

Viewing 15 posts - 1 through 15 (of 33 total)

You must be logged in to reply to this topic. Login to reply