Work out which column value is repeated the most

Question

Post reply

Work out which column value is repeated the most

webskater

SSCarpal Tunnel

Points: 4214
More actions
February 29, 2012 at 5:08 pm

#251627

CREATE Table #Fred(OrganisationID int, LocationID int)
INSERT INTO #Fred(OrganisationID, LocationID)
VALUES(1,2)
INSERT INTO #Fred(OrganisationID, LocationID)
VALUES(1,2)
INSERT INTO #Fred(OrganisationID, LocationID)
VALUES(1,7)
INSERT INTO #Fred(OrganisationID, LocationID)
VALUES(1,7)
INSERT INTO #Fred(OrganisationID, LocationID)
VALUES(1,7)
INSERT INTO #Fred(OrganisationID, LocationID)
VALUES(1,5)
INSERT INTO #Fred(OrganisationID, LocationID)
VALUES(1,6)
INSERT INTO #Fred(OrganisationID, LocationID)
VALUES(2,18)
INSERT INTO #Fred(OrganisationID, LocationID)
VALUES(2,19)
INSERT INTO #Fred(OrganisationID, LocationID)
VALUES(2,19)
SELECT * FROM #Fred
DROP Table #Fred
The code above produces data that looks like:
1,2
1,2
1,7
1,7
1,7
1,5
1,6
2,18
2,19
2,19
In the second column - for any distinct value in the first column - how can I find out which value in the second column occurs most often - and update all the values in Column 2 to that value which occurs most often.
So, in the example above - how do I establish that, for value 1 in the first column, 7 occurs more often than 2, 5 and 6 - and update all rows where the value in the first column is 1 so that the value in the second column is 7.
And ... that, for value 2 in the first column, 19 occurs more often than 18 - and update all rows where the value in the first column is 2 so that the value in the second column is 19.
Etc.
I haven't shown it above, but each row has a Primary Key.
Thanks for any help.

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply

ColdCoffee SSC-Dedicated Points: 39972 More actions · Answer 1

THis?

SELECT * FROM #Fred;

; WITH CTE AS

(

SELECT OrganisationID , LocationID ,CT = COUNT(*) --, Cnt = COUNT(*) OVER ( PARTITION BY OrganisationID , LocationID)

FROM #Fred

GROUP BY OrganisationID , LocationID

)

, Rankd AS

(

SELECT OrganisationID , LocationID

, RN = ROW_NUMBER() OVER (PARTITION BY OrganisationID ORDER BY CT DESC)

FROM CTE

)

, TopPerformer AS

(

SELECT OrganisationID , LocationID

FROM Rankd

WHERE RN = 1

)

UPDATE F

SET F.LocationID = TP.LocationID

FROM #Fred F

INNER JOIN TopPerformer TP

ON TP.OrganisationID = F.OrganisationID

SELECT * FROM #Fred;

webskater SSCarpal Tunnel Points: 4214 More actions · Answer 2

Wow, thanks very much. Yes, that's exactly what I need.

Seeing your code makes me realise what a novice I am at this lark.

Can I ask ...

Using that 'With' statement, you seem to be creating several temporary tables but you don't have to drop them - is that right?

And you start that statement with a semi-colon - and use commas to separate the statements that create the temporary tables?

Cheers

Cadavre SSC-Forever Points: 41690 More actions · Answer 3

sku370870 (3/1/2012)
Wow, thanks very much. Yes, that's exactly what I need.
Seeing your code makes me realise what a novice I am at this lark.
Can I ask ...
Using that 'With' statement, you seem to be creating several temporary tables but you don't have to drop them - is that right?
And you start that statement with a semi-colon - and use commas to separate the statements that create the temporary tables?
Cheers

Does it make it easier for you to understand if I rewrite his query like this?

UPDATE F

SET F.LocationID = updateValues.LocationID

FROM #Fred F

INNER JOIN (SELECT OrganisationID, LocationID

FROM (SELECT OrganisationID, LocationID, ROW_NUMBER() OVER (PARTITION BY OrganisationID ORDER BY Cnt DESC)

FROM (SELECT OrganisationID, LocationID, COUNT(*)

FROM #Fred

GROUP BY OrganisationID, LocationID) originalDataCount(OrganisationID, LocationID, Cnt)

) dataRanked(OrganisationID, LocationID, Rn)

WHERE RN = 1) updateValues ON updateValues.OrganisationID = F.OrganisationID;

Basically, a CTE is a derived table that can be referenced in your query. ColdCoffee has created 3 CTEs (named CTE, Rankd and TopPerformer) then used each one to build up the data he needed to update your table.

The semi-colon at the start of his query is actually bad when demonstrating the method used for creating a CTE. It's there to terminate the previous statement but is often written next to the "WITH" instead.

Forever trying to learn
My blog - http://www.cadavre.co.uk/
For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

webskater SSCarpal Tunnel Points: 4214 More actions · Answer 4

Thanks for your help in trying to get me to understand this.

I realise in your code you are doing up update query joining #fred to a table created by this statement.

SELECT OrganisationID, LocationID

FROM (SELECT OrganisationID, LocationID, ROW_NUMBER() OVER (PARTITION BY OrganisationID ORDER BY Cnt DESC)

FROM (SELECT OrganisationID, LocationID, COUNT(*) FROM #Fred

GROUP BY OrganisationID, LocationID)

originalDataCount(OrganisationID, LocationID, Cnt))

dataRanked(OrganisationID, LocationID, Rn)

WHERE RN = 1

Although I get the idea ... I don't understand the syntax at all ... there seem to be three consecutive 'FROMS'

I have only ever written

SELECT FieldA FROM TableB

I don't get .. SELECT something FROM something FROM something FROM #Fred

I'd be grateful for any further explanation - I'd really like to understand this stuff. Cheers.

Edit: Just as an aside - I spend most of time writing in C# in a .net environment ... where every line ends with a semi-colon or it won't compile.

In SQL Server I never use semi-colons to finish lines ... but you do?

Cadavre SSC-Forever Points: 41690 More actions · Answer 5

OK, I'll give it a go but I've been told before that my teaching method comes across as abrasive.

First, we want to figure out the counts for each organisationID and locationID. We do this by doing a simple count:

SELECT OrganisationID, LocationID, COUNT(*)

FROM #Fred

GROUP BY OrganisationID, LocationID

Which returns:

OrganisationID LocationID

-------------- ----------- -----------

1 2 2

1 5 1

1 6 1

1 7 3

2 18 1

2 19 2

We need to know which is the highest count for each orgnanisationID, so we use a ranking function, but we want to base it from the data above so we use a derived table.

SELECT OrganisationID, LocationID, ROW_NUMBER() OVER (PARTITION BY OrganisationID ORDER BY Cnt DESC)

FROM (

--This is your derived table

SELECT OrganisationID, LocationID, COUNT(*)

FROM #Fred

GROUP BY OrganisationID, LocationID

) derivedTable(OrganisationID, LocationID, Cnt) --alias for derived table and column names

Which returns:

OrganisationID LocationID

-------------- ----------- --------------------

1 7 1

1 2 2

1 5 3

1 6 4

2 19 1

2 18 2

Now, those with a ROW_NUMBER of 1 are the highest counts for each organisationID, so we want to use another derived table to grab the above data and filter where ROW_NUMBER is 1.

SELECT OrganisationID, LocationID

FROM (

--This is your derived table

SELECT OrganisationID, LocationID, ROW_NUMBER() OVER (PARTITION BY OrganisationID ORDER BY Cnt DESC)

FROM (

--This is your derived table

SELECT OrganisationID, LocationID, COUNT(*)

FROM #Fred

GROUP BY OrganisationID, LocationID

) derivedTable(OrganisationID, LocationID, Cnt) --alias for derived table and column names

) derivedTable2(OrganisationID, LocationID, Rnk) --alias for derived table and column names

WHERE Rnk = 1

Which returns:

OrganisationID LocationID

-------------- -----------

1 7

2 19

Now we can join this back to #Fred based on the organisationID, allowing us to use the highest locationID.

SELECT f.OrganisationID AS originalOrganisationID, f.LocationID AS originalLocationID,

derivedTable3.OrganisationID AS derivedOrganisationID, derivedTable3.LocationID AS derivedLocationID

FROM #Fred f

INNER JOIN (--This is your derived table

SELECT OrganisationID, LocationID

FROM (

--This is your derived table

SELECT OrganisationID, LocationID, ROW_NUMBER() OVER (PARTITION BY OrganisationID ORDER BY Cnt DESC)

FROM (

--This is your derived table

SELECT OrganisationID, LocationID, COUNT(*)

FROM #Fred

GROUP BY OrganisationID, LocationID

) derivedTable(OrganisationID, LocationID, Cnt) --alias for derived table and column names

) derivedTable2(OrganisationID, LocationID, Rnk) --alias for derived table and column names

WHERE Rnk = 1

) derivedTable3 ON f.OrganisationID = derivedTable3.OrganisationID

Which results in:

originalOrganisationID originalLocationID derivedOrganisationID derivedLocationID

---------------------- ------------------ --------------------- -----------------

1 2 1 7

1 7 1 7

1 5 1 7

1 6 1 7

2 18 2 19

2 19 2 19

Does that make it clearer? Derived tables are sort of essential to almost all SQL development. . .

sku370870 (3/1/2012)
Edit: Just as an aside - I spend most of time writing in C# in a .net environment ... where every line ends with a semi-colon or it won't compile.
In SQL Server I never use semi-colons to finish lines ... but you do?

No, I don't. But I should 😀

CTE is one of the first things in SQL that require the previous statement to have been terminated correctly.

Forever trying to learn
My blog - http://www.cadavre.co.uk/
For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

ColdCoffee SSC-Dedicated Points: 39972 More actions · Answer 6

Cadavre (3/1/2012)
OK, I'll give it a go but I've been told before that my teaching method comes across as abrasive.

Not at all Cadavre! Awesome explanation! Thanks for that. Hope the OP sees the code clearer now.

ColdCoffee SSC-Dedicated Points: 39972 More actions · Answer 7

I must also ask a question here. What should happen if there are 2 LocationIDs with same number of appearances for a particular OrgID ?

Jared SSC Guru Points: 61793 More actions · Answer 8

The semi-colon at the start of his query is actually bad when demonstrating the method used for creating a CTE. It's there to terminate the previous statement but is often written next to the "WITH" instead.

To the OP, although the semi-colon is used to terminate a preceding statement... a CTE must ALWAYS begin with one. This is why many people put it before the WITH. If there is no preceding statement, it still requires the semi-colon. So it has become practice to just place it before the WITH as demonstrated.

Jared
CE - Microsoft

GSquared SSC Guru Points: 260824 More actions · Answer 9

SQLKnowItAll (3/1/2012)
The semi-colon at the start of his query is actually bad when demonstrating the method used for creating a CTE. It's there to terminate the previous statement but is often written next to the "WITH" instead.
To the OP, although the semi-colon is used to terminate a preceding statement... a CTE must ALWAYS begin with one. This is why many people put it before the WITH. If there is no preceding statement, it still requires the semi-colon. So it has become practice to just place it before the WITH as demonstrated.

That's not fully accurate.

A CTE in a View definition not only doesn't require a semicolon, it won't work with one.

CREATE VIEW dbo.MyCTEView

AS

WITH CTE

AS (SELECT *

FROM ( VALUES ( 1), ( 2) ) MyTVF (Col))

SELECT *

FROM CTE ;

GO

SELECT *

FROM dbo.MyCTEView ;

A CTE at the beginning of a batch doesn't require a preceding semicolon.

WITH CTE

AS (SELECT *

FROM ( VALUES ( 1), ( 2) ) MyTVF (Col))

SELECT *

FROM CTE

A CTE as the first item in a stored procedure doesn't require one.

CREATE PROC dbo.MyCTEProc

AS

WITH CTE

AS (SELECT *

FROM ( VALUES ( 1), ( 2) ) MyTVF (Col))

SELECT *

FROM CTE

GO

EXEC dbo.MyCTEProc

A CTE in the definition of an inline-select UDF doesn't require one and won't work with one (just like a View).

They're only required when there's a prior DML or DDL statement in a script/batch/procedure/function, and the use is to terminate that item, not to start the CTE. The reason for this is that With is an overloaded word, which can be used to add options to some commands, or to define a CTE.

select *

from MyTable

with CTEName as (CTE definition)

Is that a syntax error in defining a table hint, or is it a CTE? The dev can tell, but the SQL compiler can't. Hence, a semicolon is needed to terminate the prior statement.

If it's the first item in a batch, use one or not, at your option. I generally do, out of habit, but not out of necessity.

Edit: Note, all scripts except the last one work. The last one deliberately does NOT work, and will generate an error, which is the point for it.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Jared SSC Guru Points: 61793 More actions · Answer 10

GSquared (3/1/2012)
SQLKnowItAll (3/1/2012)
The semi-colon at the start of his query is actually bad when demonstrating the method used for creating a CTE. It's there to terminate the previous statement but is often written next to the "WITH" instead.
To the OP, although the semi-colon is used to terminate a preceding statement... a CTE must ALWAYS begin with one. This is why many people put it before the WITH. If there is no preceding statement, it still requires the semi-colon. So it has become practice to just place it before the WITH as demonstrated.
That's not fully accurate.
A CTE in a View definition not only doesn't require a semicolon, it won't work with one.
CREATE VIEW dbo.MyCTEView
AS
WITH CTE
AS (SELECT *
FROM ( VALUES ( 1), ( 2) ) MyTVF (Col))
SELECT *
FROM CTE ;
GO
SELECT *
FROM dbo.MyCTEView ;
A CTE at the beginning of a batch doesn't require a preceding semicolon.
WITH CTE
AS (SELECT *
FROM ( VALUES ( 1), ( 2) ) MyTVF (Col))
SELECT *
FROM CTE
A CTE as the first item in a stored procedure doesn't require one.
CREATE PROC dbo.MyCTEProc
AS
WITH CTE
AS (SELECT *
FROM ( VALUES ( 1), ( 2) ) MyTVF (Col))
SELECT *
FROM CTE
GO
EXEC dbo.MyCTEProc
A CTE in the definition of an inline-select UDF doesn't require one and won't work with one (just like a View).
They're only required when there's a prior DML or DDL statement in a script/batch/procedure/function, and the use is to terminate that item, not to start the CTE. The reason for this is that With is an overloaded word, which can be used to add options to some commands, or to define a CTE.
select *
from MyTable
with CTEName as (CTE definition)
Is that a syntax error in defining a table hint, or is it a CTE? The dev can tell, but the SQL compiler can't. Hence, a semicolon is needed to terminate the prior statement.
If it's the first item in a batch, use one or not, at your option. I generally do, out of habit, but not out of necessity.
Edit: Note, all scripts except the last one work. The last one deliberately does NOT work, and will generate an error, which is the point for it.

Well heck! That's good to know! I was "taught" that a CTE must always have the semi-colon. Those who teach... 🙂 Thanks G!

Jared
CE - Microsoft

GSquared SSC Guru Points: 260824 More actions · Answer 11

To the question about the nested From statements:

Those are what are called "derived tables".

You can do those in the body of the "outer query":

select *

from

(select *

from dbo.MyTable) as DerivedTable ;

or in a "Common Table Expression" (CTE):

with CTE (ColumnNames) as

(select *

from dbo.MyTable)

select *

from CTE ;

The advantage to CTEs is that they can be recursive. That means a CTE can query itself. This is used to query hierarchies, where each level in the hierarchy needs to query each level below it, till there are no more levels left, for example. There are other uses, but that's the most common good one. (There are a LOT of bad ways to use recursive CTEs, mainly because they can become serious performance killers.)

The other advantage to CTEs is that many people find them easier to read than the inline ones in the From clause. Not everyone benefits from this, but it is common.

Anything you can do in a regular query, you can do in a derived table. This can be very advantageous in breaking code up to make it more readable, without having to do things like nest views inside each other (which does the same thing as a derived table, but hides it, also often hiding performance-inhibiting code at the same time).

Does that help?

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

webskater SSCarpal Tunnel Points: 4214 More actions · Answer 12

Thank you very much for taking the time to explain that - and - I understand it!

I will read, re-read and write if for myself with different tables and columns so that I can learn how to apply it without getting a pen and paper out and drawing tables all over it.

I've been using SQL Server for years now - and although I often join to what I call 'tables created on the fly with a select statement' - which, I think, is a 'derived table'(?), I had no idea you could use derived tables as you have - that reference the one above etc.

I rarely use cursors in stored procedures - seem to manage set based for most things. But, sometimes, when needing to manipulate data on a one off basis - I would usually create a temp table - bung it in cursor, loop through, create a second temp table, loop through etc. - gradually working my way down to get the result set I need - so I can do an update or whatever.

But your code has opened my eyes - I really must learn this stuff properly. It's a lot more interesting than boring .net front end stuff.

Edit: this reply was supposed to be a reply to Cadavre for his explanation ... and I want to thank Cold Coffee again for his original response and further explanations .. and SQLKnowItAll and GSQuared (wow, beginning to sound like an acceptance speech at the Oscars!) - anyway many thanks to all.

My real world example is a table with about 40,000 contacts belonging to about 3000 organisations where each organisation has between 1 and 10 locations - with the organisations / locations / contacts belong to one of about 200 projects. In the table (imported from another database) many of the contacts are assigned to lots of different locations for the same organisation. I need to update the data so that each contact is only assigned to one location for each organisation - and I want to assign them to the one that appears most often for that organisation.

Now, I can do it without a severe headache and about 4 nested cursors and temp tables.

Jeff Moden SSC Guru Points: 1004441 More actions · Answer 13

ColdCoffee (3/1/2012)
I must also ask a question here. What should happen if there are 2 LocationIDs with same number of appearances for a particular OrgID ?

Hey there folks. ColdCoffee asked a really valuable question here. If you think not, try using the following data and tell me whether or not the "1"s should be updated to 1,5 or 1,7.

CREATE Table #Fred(OrganisationID int, LocationID int)