September 25, 2011 at 10:29 am
I want to select the id and name of all companies in [prospects] table that have the same name and occur more than once.
So if Microsoft and Google occur more than once I want as a result:
[prospects]
id companyname
2 Microsoft
5 Microsoft
7 Microsoft
10 Google
14 Google
I now have the following sql, but that only returns the companyname once:
SELECT id,companyname,COUNT(companyname) AS NumOccurrences FROM prospects P
GROUP BY companyname,id
HAVING ( COUNT(companyname) > 1 )
September 25, 2011 at 4:35 pm
First of all, Joe is absolutely correct... you'll get better answers more quickly if you invest just a minute or two in how you post your data. Please see the first link in my signature line at the bottom of this post for how to best do that. It's also the polite thing to do so people providing answers can actually test their code to make sure it does what you want it to. 😉
Joe is also correct about the "ID" column naming. It's pretty much non-descriptive and can become terribly confusing when you're joining multiple tables.
That, notwithstanding, I left your column names as you provided them in the following test code setup. The only thing I changed was I used a Temp Table to test with so we don't accidently drop a real table in the process...
Here's how we'd like to see code posted in the future... It instantly answers nearly any question that anyone could have about the data and table(s) involved with your request.
--========================================================
-- Create the test data. Nothing in this section has
-- anything to do with the solution to the problem.
-- We're just creating test data here.
-- THIS IS HOW YOU SHOULD POST DATA IN THE FUTURE!!!!
--========================================================
--===== Conditionally drop the test table to make reruns
-- easier in SSMS.
IF OBJECT_ID('tempdb..#Prospects','U') IS NOT NULL
DROP TABLE #Prospects
;
--===== Create the table.
CREATE TABLE #Prospects
(
ID INT NOT NULL,
CompanyName VARCHAR(20) NOT NULL
)
;
--===== Populate the table with test data for
-- both positive AND negative results.
INSERT INTO #Prospects
(ID, CompanyName)
SELECT 2,'Microsoft' UNION ALL
SELECT 5,'Microsoft' UNION ALL
SELECT 7,'Microsoft' UNION ALL
SELECT 10,'Google' UNION ALL
SELECT 14,'Google' UNION ALL
SELECT 22,'SomeCompany1' UNION ALL
SELECT 24,'SomeCompany2'
;
--===== Build the PK and indexes
ALTER TABLE #Prospects
ADD PRIMARY KEY (ID)
;
-- This is a nice little covering index
CREATE INDEX IX_#Prospect_CompanyName
ON #Prospects (CompanyName, ID)
;
I tried using a couple of the classic methods for solving this problem and threw Joe's code (after changing his code to your column names) into the mix, as well. Here's that test code...
PRINT '--===== Solve the problem using just one of the half dozen ways to do this. ===';
SET STATISTICS IO, TIME ON;
SELECT ID, CompanyName
FROM #Prospects
WHERE CompanyName IN
( --=== Find companies with multiple occurrences
SELECT CompanyName
FROM #Prospects p2
GROUP BY CompanyName
HAVING COUNT(*) > 1
)
ORDER BY CompanyName, ID;
SET STATISTICS IO, TIME OFF;
------------------------------------------------------------------------------------------
PRINT '--===== Solve the problem using another of the half dozen ways to do this. ===';
SET STATISTICS IO, TIME ON;
SELECT ID, CompanyName
FROM #Prospects p1
WHERE EXISTS
( --=== Find companies with multiple occurrences
SELECT 1
FROM #Prospects p2
WHERE p1.CompanyName = p2.CompanyName
GROUP BY CompanyName
HAVING COUNT(ID) > 1
)
ORDER BY CompanyName, ID;
SET STATISTICS IO, TIME OFF;
------------------------------------------------------------------------------------------
PRINT '--===== Joe''s method =========================================================';
SET STATISTICS IO, TIME ON;
SELECT x.ID, x.CompanyName
FROM (
SELECT ID,
CompanyName,
COUNT(*) OVER (PARTITION BY CompanyName) AS CompanyOccurrences
FROM #Prospects
) AS x
WHERE CompanyOccurrences > 1;
SET STATISTICS IO, TIME OFF;
Here's the output from the messages tab. See the "Worktable" entry and all of the READs associated with it? 31 page reads to cover only 7 rows of data? That smacks of a "Table Spool" and what I call "Hidden RBAR". "Hidden RBAR" is a very bad thing because it will crush performance if a larger number of rows is involved. While it doesn't look bad below, it can get a lot worse in a hurry.
--===== Solve the problem using just one of the half dozen ways to do this. ===
(5 row(s) affected)
Table '#Prospects__________________________________________________________________________________________________________00000000003D'.
Scan count 3, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
--===== Solve the problem using another of the half dozen ways to do this. ===
(5 row(s) affected)
Table '#Prospects__________________________________________________________________________________________________________00000000003D'.
Scan count 3, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
--===== Joe's method =========================================================
(5 row(s) affected)
Table 'Worktable'. Scan count 3, logical reads 31, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#Prospects__________________________________________________________________________________________________________00000000003D'.
Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Ok... let's see how bad the "Hidden RBAR" can get. I know you probably don't have a million rows of data but it'll show just how bad the "Hidden RBAR" in the code actually is. Here's the code to build a million row test table for the problem at hand...
--========================================================
-- Create the test data. Nothing in this section has
-- anything to do with the solution to the problem.
-- We're just creating test data here.
--========================================================
--===== Conditionally drop the test table to make reruns
-- easier in SSMS.
IF OBJECT_ID('tempdb..#Prospects','U') IS NOT NULL
DROP TABLE #Prospects
;
--===== Create and build a large test table on the fly
SELECT TOP 1000000
ID = IDENTITY(INT,1,1), --Creates NOT NULL column
CompanyName = ISNULL(LEFT(NEWID(),7),'') --Creates NOT NULL column
INTO #Prospects
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
;
--===== Build the PK and indexes
ALTER TABLE #Prospects
ADD PRIMARY KEY (ID)
;
CREATE INDEX IX_#Prospect_CompanyName
ON #Prospects (CompanyName, ID)
;
When we run the test code against THAT test data, we find out that the "Hidden RBAR" method is actually more than 10 times slower than the two classic methods and is thousands of times more aggressive for reads than the two classic methods. Here's the output that shows that.
--===== Solve the problem using just one of the half dozen ways to do this. ===
(3759 row(s) affected)
Table '#Prospects__________________________________________________________________________________________________________00000000003C'.
Scan count 2, logical reads 4474, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 2062 ms, elapsed time = 2404 ms.
--===== Solve the problem using another of the half dozen ways to do this. ===
(3759 row(s) affected)
Table '#Prospects__________________________________________________________________________________________________________00000000003C'.
Scan count 2, logical reads 4474, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 2016 ms, elapsed time = 2319 ms.
--===== Joe's method =========================================================
(3759 row(s) affected)
Table 'Worktable'. Scan count 3, logical reads 5992481, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#Prospects__________________________________________________________________________________________________________00000000003C'.
Scan count 1, logical reads 2237, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 26203 ms, elapsed time = 27219 ms.
Please let us know if you have any additional questions on this problem or any of the solutions provided so far.
--Jeff Moden
Change is inevitable... Change for the better is not.
September 25, 2011 at 4:44 pm
CELKO (9/25/2011)
>> I want to select the prospect_nbr and name of all companies in [prospects] table that have the same name and occur more than once. <<Please be polite enough to post DDL for us. And some sample data is nice, too. Since there is no such thing as a magical, universal, generic “id”, can I assume that the prospects are on a call list of some kinds with a ticket number?
That's actually a pretty nice post, Joe. Thanks.
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 3 posts - 1 through 2 (of 2 total)
You must be logged in to reply to this topic. Login to reply