January 22, 2008 at 8:14 pm
I have a database with products and their (unique) serial numbers. I would like a list of the serial numbers not present, or the serial numbers that border the gaps.
For instance, consider a list of:
1, 2, 3, 5, 6, 7, 8, 9
What ways could I discover "4" or "3,5"?
I'm thinking of creating a table with the complete numerical sequence and performing an OUTER JOIN (which has its own side benefits), but there has to be a more elegant solution.
Ideas?
January 22, 2008 at 8:25 pm
There is... assuming that your serial numbers are actually integers...
SELECT GapStart = (SELECT ISNULL(MAX(b.SerialNumber),0)+1
                     FROM #yourtable b
                    WHERE b.SerialNumber < a.SerialNumber),
       GapEnd = SerialNumber - 1
  FROM #yourtable a
 WHERE a.SerialNumber - 1 NOT IN (SELECT SerialNumber FROM #yourtable)
   AND a.SerialNumber - 1 > 0
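[Editor's note: for anyone who wants to verify the correlated-subquery logic outside SQL Server, here is the same method sketched in Python against SQLite (ISNULL becomes IFNULL); the table name and sample data follow the thread's example, everything else is my own sketch.]

```python
import sqlite3

# Sketch of the correlated-subquery gap finder, run against SQLite.
# Sample data is the thread's 1,2,3,5,6,7,8,9 list -- 4 is missing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (SerialNumber INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO yourtable VALUES (?)",
                 [(n,) for n in (1, 2, 3, 5, 6, 7, 8, 9)])

gaps = conn.execute("""
    SELECT (SELECT IFNULL(MAX(b.SerialNumber), 0) + 1
              FROM yourtable b
             WHERE b.SerialNumber < a.SerialNumber) AS GapStart,
           a.SerialNumber - 1 AS GapEnd
      FROM yourtable a
     WHERE a.SerialNumber - 1 NOT IN (SELECT SerialNumber FROM yourtable)
       AND a.SerialNumber - 1 > 0
""").fetchall()

print(gaps)  # [(4, 4)] -- the single-number gap at 4
```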
--Jeff Moden
Change is inevitable... Change for the better is not.
January 23, 2008 at 5:46 am
Here's another way
WITH CTE AS (
SELECT SerialNumber,
ROW_NUMBER() OVER(ORDER BY SerialNumber) AS rn
FROM #yourtable)
SELECT s.SerialNumber+1 AS GapStart,
e.SerialNumber-1 AS GapEnd
FROM CTE s
INNER JOIN CTE e ON s.rn+1=e.rn
AND s.SerialNumber+1<e.SerialNumber
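[Editor's note: the same ROW_NUMBER pairing method sketched in Python against SQLite (window functions need SQLite 3.25+); sample data mirrors the thread's example, the rest is my own sketch.]

```python
import sqlite3

# ROW_NUMBER adjacency method: number the rows in order, join each
# row to the next one, and report pairs that aren't consecutive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (SerialNumber INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO yourtable VALUES (?)",
                 [(n,) for n in (1, 2, 3, 5, 6, 7, 8, 9)])

gaps = conn.execute("""
    WITH CTE AS (
        SELECT SerialNumber,
               ROW_NUMBER() OVER (ORDER BY SerialNumber) AS rn
          FROM yourtable)
    SELECT s.SerialNumber + 1 AS GapStart,
           e.SerialNumber - 1 AS GapEnd
      FROM CTE s
      JOIN CTE e ON s.rn + 1 = e.rn
               AND s.SerialNumber + 1 < e.SerialNumber
""").fetchall()

print(gaps)  # [(4, 4)]
```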
____________________________________________________
Deja View - The strange feeling that somewhere, sometime you've optimised this query before
How to get the best help on a forum
http://www.sqlservercentral.com/articles/Best+Practices/61537
January 23, 2008 at 6:09 am
Nice.... just be a bit careful... haven't tested it but it looks like if the serial numbers are large, that'll take a while to resolve.
--Jeff Moden
Change is inevitable... Change for the better is not.
January 23, 2008 at 6:47 am
Why go for the hard way, and what's so non-elegant about an outer join..?
To find the actual gaps - i.e. the numbers that are really missing - an outer join and a numbers table will do very nicely...
select n.number as 'missingProdNumbers'
from myNumbersTable n
left join myProductsTable p
on n.number = p.productNumber
where p.productNumber is null
Very straightforward, and very simple. 🙂
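[Editor's note: the numbers-table outer join sketched in Python against SQLite. The table and column names follow the post; the sample data is mine, with products 4 and 7 missing.]

```python
import sqlite3

# Numbers-table approach: every candidate number that fails to match
# a product row is a missing serial number.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myNumbersTable (number INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE myProductsTable (productNumber INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO myNumbersTable VALUES (?)",
                 [(n,) for n in range(1, 11)])
conn.executemany("INSERT INTO myProductsTable VALUES (?)",
                 [(n,) for n in (1, 2, 3, 5, 6, 8, 9, 10)])

missing = conn.execute("""
    SELECT n.number AS missingProdNumbers
      FROM myNumbersTable n
      LEFT JOIN myProductsTable p
        ON n.number = p.productNumber
     WHERE p.productNumber IS NULL
""").fetchall()

print(sorted(m[0] for m in missing))  # [4, 7]
```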
/Kenneth
January 23, 2008 at 7:04 am
Absolutely agree... if the numbers table is large enough... if it's not, then you can do something like the following...
... Also, in 2k5, you really don't need a recursive CTE to create a large list of numbers... the following will create a list of numbers with a million rows in about 844 milliseconds... and, it allows you some range control over what the numbers will be...
SET Statistics TIME ON
DECLARE @Bitbucket INT --Just for testing Tally creation speed... remove this for prod
--===== Declare some local variables that could be parameters in a proc
DECLARE @StartNumber INT
DECLARE @EndNumber INT
--===== Set those "parameters" for demonstration purposes
SET @StartNumber = 1000000 --Inclusive
SET @EndNumber = 2000000 --Inclusive
; WITH cTally AS
(-----------------------------------------------------------------------------
--==== High performance CTE equivalent of a Tally or Numbers table
SELECT TOP (@EndNumber-@StartNumber+1)
ROW_NUMBER() OVER (ORDER BY t1.Object_ID) + @StartNumber - 1 AS N
FROM Master.sys.ALL_Columns t1
CROSS JOIN Master.sys.ALL_Columns t2
)-----------------------------------------------------------------------------
SELECT @Bitbucket = N FROM cTally --Do your outer join with table being checked here
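[Editor's note: a portable sketch of the same range-controlled number list, in Python against SQLite. SQLite has no sys.all_columns to cross join, so this uses a recursive CTE instead, and the range is shrunk so the demo runs instantly; only the start/end-inclusive idea is taken from the post.]

```python
import sqlite3

# Generate an inclusive run of numbers start..end on the fly,
# analogous to the cTally CTE above but via WITH RECURSIVE.
start, end = 1000, 1010  # both inclusive, as in the post's parameters

conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    WITH RECURSIVE cTally(N) AS (
        SELECT :start
        UNION ALL
        SELECT N + 1 FROM cTally WHERE N < :end
    )
    SELECT N FROM cTally
""", {"start": start, "end": end}).fetchall()

ns = [r[0] for r in rows]
print(len(ns), min(ns), max(ns))  # 11 1000 1010
```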
--Jeff Moden
Change is inevitable... Change for the better is not.
January 23, 2008 at 7:06 am
And, I haven't done the comparison lately, but I believe the first method I showed will beat the pants off most outer join methods...
--Jeff Moden
Change is inevitable... Change for the better is not.
January 23, 2008 at 7:10 am
ahem...
With a large enough set of numbers - wouldn't you want to be able to leverage an index on your tally table? Although I conceptually like the CTE - it essentially returns an unindexed heap of a temp table. Again - great for smallish stuff, but surely it doesn't match performance of a "real" tally table?
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
January 23, 2008 at 7:12 am
Sounds like we might need a test to find out....
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
January 23, 2008 at 7:26 am
I was thinking the same thing... 😉
Also, Matt... you're the Ninja on Regex... is there any way to use Regex for such a thing as this import?
--Jeff Moden
Change is inevitable... Change for the better is not.
January 23, 2008 at 7:31 am
I don't think it would be efficient. It would be something like a recursive back-reference, and then you'd be doing math on string values, etc... A tally table is both much simpler and (gut feeling) will steamroll right over it.
I will put some thought into it though.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
January 23, 2008 at 7:45 am
create table dbo.SerialNumbers (ID int primary key)
go
insert into dbo.serialnumbers
select number
from common.dbo.numbers
where number between 1 and 100
go
delete from dbo.serialnumbers
where id in (55, 56, 71, 80, 99)
Opened a separate connection.
set statistics time on
set statistics io on
SELECT GapStart =
       (SELECT ISNULL(MAX(b.ID),0)+1
          FROM dbo.serialnumbers b
         WHERE b.ID < a.ID),
       GapEnd = ID - 1
  FROM dbo.serialnumbers a
 WHERE a.ID - 1 NOT IN
       (SELECT ID
          FROM dbo.serialnumbers)
   AND a.ID - 1 > 0
Estimated Cost of the final query: 0.0163148
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
(4 row(s) affected)
Table 'SerialNumbers'. Scan count 6, logical reads 12, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Next query (separate connection):
select number
from common.dbo.numbers
left outer join dbo.serialnumbers
on number = id
where number between 1 and 100
and id is null
Estimated Cost: 0.0128453
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
(5 row(s) affected)
Table 'SerialNumbers'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Numbers'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Conclusion:
The join to a Numbers table performs better, because of fewer reads. But the margin is tiny on such a small number set.
Increased the table to 10,000 rows, deleted another five semi-random records.
Cost on first query rises to: 0.108881; run time increases to 11 ms, Scan count 10, logical reads 54
Cost on numbers table query rises to: 0.012861; run time still 1 ms, Scan count 1, logical reads 2
On more complex tables, with more rows, the costs would be affected, but the numbers table join would still be less expensive.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
January 23, 2008 at 7:55 am
Decided to also try:
select number
from common.dbo.numbers
left outer join dbo.serialnumbers
on number = id
where number >=
(select min(id)
from dbo.serialnumbers)
and number <=
(select max(id)
from dbo.serialnumbers)
and id is null
Because that way I'm not assuming I already know the range of numbers to test.
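[Editor's note: that self-bounding variant, sketched in Python against SQLite. The range to check comes from MIN/MAX of the serial table itself rather than a hard-coded BETWEEN; the sample data is mine - serials 3 through 10 with 5 and 6 deleted.]

```python
import sqlite3

# Bound the numbers table by the serial table's own MIN and MAX,
# so no prior knowledge of the valid range is required.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (number INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE serialnumbers (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO numbers VALUES (?)",
                 [(n,) for n in range(1, 21)])
conn.executemany("INSERT INTO serialnumbers VALUES (?)",
                 [(n,) for n in (3, 4, 7, 8, 9, 10)])

missing = conn.execute("""
    SELECT number
      FROM numbers
      LEFT OUTER JOIN serialnumbers
        ON number = id
     WHERE number >= (SELECT MIN(id) FROM serialnumbers)
       AND number <= (SELECT MAX(id) FROM serialnumbers)
       AND id IS NULL
""").fetchall()

print(sorted(m[0] for m in missing))  # [5, 6]
```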
Still using 10,000 as top ID in dbo.SerialNumbers, still missing 10 semi-random rows.
Cost increases to: 0.097948
SQL Server parse and compile time:
CPU time = 11 ms, elapsed time = 11 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
(10 row(s) affected)
Table 'SerialNumbers'. Scan count 3, logical reads 23, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Numbers'. Scan count 1, logical reads 19, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 9 ms.
Still lower cost than the more complex query, but significantly higher.
select number
from common.dbo.numbers
left outer join dbo.serialnumbers
on number = id
where number >= 1
and number <=
(select max(id)
from dbo.serialnumbers)
and id is null
Cost: 0.0868469 (lower)
SQL Server parse and compile time:
CPU time = 1 ms, elapsed time = 1 ms.
(10 row(s) affected)
Table 'SerialNumbers'. Scan count 2, logical reads 21, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Numbers'. Scan count 1, logical reads 19, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 8 ms.
Since the first query assumed that the lowest number was 1, this version does too (more fair). Cost goes down, as well as reads and execution time. Still lower cost than the more complex query.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
January 23, 2008 at 9:06 am
The cost numbers aren't the most significant thing to compare in this case.
What you want to look at is the scan count, where scan count = 1 is the best you can get.
/Kenneth
January 23, 2008 at 7:40 pm
I gotta agree with Kenneth... and, I'll throw in that CPU time matters as well. I thought the example code I posted that didn't use a Tally table would really do the trick... it does if you wanna drive the disk nuts 😛
Instead of messing around with a 100 or even 10,000 rows, I thought I'd put the two methods to a real life test with 6 Million rows. Takes a couple of minutes to setup... here's the setup code for both the Monster Tally table and the Monster Serial Number table...
--===== Setup for speed and to prevent blocking
    SET NOCOUNT ON
    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
--===== Create a 6 million row Tally table...
--===== Create and populate the Tally table on the fly
 SELECT TOP 6000000
        IDENTITY(INT,1,1) AS N
   INTO dbo.Tally
   FROM Master.dbo.SysColumns sc1,
        Master.dbo.SysColumns sc2
--===== Add a Primary Key to maximize performance
  ALTER TABLE dbo.Tally
    ADD CONSTRAINT PK_Tally_N 
        PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
--=============================================================================
--      Create an experimental table to simulate the table being examined
--      Again... 6 million rows...
--=============================================================================
--===== Create the experimental temp table and populate with Serial #'s on the fly
     -- This works because SysColumns always has at least 4000 entries
     -- even in a new database and 4000*4000 = 16,000,000
 SELECT TOP 6000000 SerialNumber = IDENTITY(INT, 1, 1)
   INTO dbo.yourtable
   FROM Master.dbo.SysColumns sc1,
        Master.dbo.SysColumns sc2
-- --===== Like any good table, our experimental table needs a Primary Key
  ALTER TABLE dbo.yourtable
        ADD PRIMARY KEY CLUSTERED (SerialNumber)
     -- This deletes a "monster" range just to see how it's handled.
 DELETE dbo.yourtable
  WHERE SerialNumber BETWEEN 5000000 AND 5500000
     -- This deletes every third row
 DELETE dbo.yourtable
  WHERE SerialNumber %3 = 0
... here's the two methods running one right after the other with IO and CPU times turned on...
--      Test the two methods with IO and CPU measurements turned on
--=============================================================================
SET STATISTICS IO ON
SET STATISTICS TIME ON
--===== Calculated Gaps
 SELECT GapStart = (SELECT ISNULL(MAX(b.SerialNumber),0)+1
                      FROM dbo.yourtable b 
                     WHERE b.SerialNumber < a.SerialNumber),
        GapEnd = SerialNumber - 1  
   FROM dbo.yourtable a
  WHERE a.SerialNumber - 1 NOT IN (SELECT SerialNumber FROM dbo.yourtable)
    AND a.SerialNumber - 1 > 0
PRINT REPLICATE('=',100)
--===== Tally table gaps
 SELECT t.N
   FROM dbo.Tally t WITH (NOLOCK)
   LEFT OUTER JOIN dbo.yourtable y
     ON t.N = y.SerialNumber
  WHERE y.SerialNumber IS NULL
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
... and here's the results... Tally table wins by a mile!
====================================================================================================
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Table 'yourtable'. Scan count 1833335, logical reads 5520639, physical reads 0, read-ahead reads 0.
SQL Server Execution Times:
CPU time = 28297 ms, elapsed time = 44224 ms.
====================================================================================================
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Table 'Tally'. Scan count 1, logical reads 9649, physical reads 1, read-ahead reads 8293.
Table 'yourtable'. Scan count 1, logical reads 8846, physical reads 0, read-ahead reads 0.
SQL Server Execution Times:
CPU time = 5578 ms, elapsed time = 18606 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
--Jeff Moden
Change is inevitable... Change for the better is not.