January 22, 2008 at 8:14 pm
I have a database with products and their (unique) serial numbers. I would like a list of the serial numbers not present, or the serial numbers that border the gaps.
For instance, consider a list of:
1, 2, 3, 5, 6, 7, 8, 9
What ways could I discover "4" or "3,5"?
I'm thinking of creating a table with the complete numerical sequence and performing an OUTER JOIN (which has its own side benefits), but there has to be a more elegant solution.
Ideas?
January 22, 2008 at 8:25 pm
There is... assuming that your serial numbers are actually integers...
SELECT GapStart = (SELECT ISNULL(MAX(b.SerialNumber),0)+1
                     FROM #yourtable b
                    WHERE b.SerialNumber < a.SerialNumber),
       GapEnd = SerialNumber - 1
  FROM #yourtable a
 WHERE a.SerialNumber - 1 NOT IN (SELECT SerialNumber FROM #yourtable)
   AND a.SerialNumber - 1 > 0
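[Editor's note: for anyone who wants to verify the correlated-subquery logic outside SQL Server, here is the same method sketched in Python against SQLite (ISNULL becomes IFNULL); the table name and sample data follow the thread's example, everything else is my own sketch.]

```python
import sqlite3

# Sketch of the correlated-subquery gap finder, run against SQLite.
# Sample data is the thread's 1,2,3,5,6,7,8,9 list -- 4 is missing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (SerialNumber INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO yourtable VALUES (?)",
                 [(n,) for n in (1, 2, 3, 5, 6, 7, 8, 9)])

gaps = conn.execute("""
    SELECT (SELECT IFNULL(MAX(b.SerialNumber), 0) + 1
              FROM yourtable b
             WHERE b.SerialNumber < a.SerialNumber) AS GapStart,
           a.SerialNumber - 1 AS GapEnd
      FROM yourtable a
     WHERE a.SerialNumber - 1 NOT IN (SELECT SerialNumber FROM yourtable)
       AND a.SerialNumber - 1 > 0
""").fetchall()

print(gaps)  # [(4, 4)] -- the single-number gap at 4
```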
--Jeff Moden
Change is inevitable... Change for the better is not.
January 23, 2008 at 5:46 am
Here's another way
WITH CTE AS (
SELECT SerialNumber,
ROW_NUMBER() OVER(ORDER BY SerialNumber) AS rn
FROM #yourtable)
SELECT s.SerialNumber+1 AS GapStart,
e.SerialNumber-1 AS GapEnd
FROM CTE s
INNER JOIN CTE e ON s.rn+1=e.rn
AND s.SerialNumber+1<e.SerialNumber
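[Editor's note: the same ROW_NUMBER pairing method sketched in Python against SQLite (window functions need SQLite 3.25+); sample data mirrors the thread's example, the rest is my own sketch.]

```python
import sqlite3

# ROW_NUMBER adjacency method: number the rows in order, join each
# row to the next one, and report pairs that aren't consecutive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (SerialNumber INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO yourtable VALUES (?)",
                 [(n,) for n in (1, 2, 3, 5, 6, 7, 8, 9)])

gaps = conn.execute("""
    WITH CTE AS (
        SELECT SerialNumber,
               ROW_NUMBER() OVER (ORDER BY SerialNumber) AS rn
          FROM yourtable)
    SELECT s.SerialNumber + 1 AS GapStart,
           e.SerialNumber - 1 AS GapEnd
      FROM CTE s
      JOIN CTE e ON s.rn + 1 = e.rn
               AND s.SerialNumber + 1 < e.SerialNumber
""").fetchall()

print(gaps)  # [(4, 4)]
```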
____________________________________________________
Deja View - The strange feeling that somewhere, sometime you've optimised this query before
How to get the best help on a forum
http://www.sqlservercentral.com/articles/Best+Practices/61537
January 23, 2008 at 6:09 am
Nice.... just be a bit careful... haven't tested it but it looks like if the serial numbers are large, that'll take a while to resolve.
--Jeff Moden
Change is inevitable... Change for the better is not.
January 23, 2008 at 6:47 am
Why go for the hard way, and what's so non-elegant about an outer join..?
To find the actual gaps - i.e. the numbers that are really missing - an outer join and a numbers table will do very nicely...
select n.number as 'missingProdNumbers'
from myNumbersTable n
left join myProductsTable p
on n.number = p.productNumber
where p.productNumber is null
Very straightforward, and very simple. 🙂
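[Editor's note: the numbers-table outer join sketched in Python against SQLite. The table and column names follow the post; the sample data is mine, with products 4 and 7 missing.]

```python
import sqlite3

# Numbers-table approach: every candidate number that fails to match
# a product row is a missing serial number.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myNumbersTable (number INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE myProductsTable (productNumber INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO myNumbersTable VALUES (?)",
                 [(n,) for n in range(1, 11)])
conn.executemany("INSERT INTO myProductsTable VALUES (?)",
                 [(n,) for n in (1, 2, 3, 5, 6, 8, 9, 10)])

missing = conn.execute("""
    SELECT n.number AS missingProdNumbers
      FROM myNumbersTable n
      LEFT JOIN myProductsTable p
        ON n.number = p.productNumber
     WHERE p.productNumber IS NULL
""").fetchall()

print(sorted(m[0] for m in missing))  # [4, 7]
```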
/Kenneth
January 23, 2008 at 7:04 am
Absolutely agree... if the numbers table is large enough... if it's not, then you can do something like the following...
... Also, in 2k5, you really don't need a recursive CTE to create a large list of numbers... the following will create a list of numbers with a million rows in about 844 milliseconds... and, it allows you some range control over what the numbers will be...
SET Statistics TIME ON
DECLARE @Bitbucket INT --Just for testing Tally creation speed... remove this for prod
--===== Declare some local variables that could be parameters in a proc
DECLARE @StartNumber INT
DECLARE @EndNumber INT
--===== Set those "parameters" for demonstration purposes
SET @StartNumber = 1000000 --Inclusive
SET @EndNumber = 2000000 --Inclusive
; WITH cTally AS
(-----------------------------------------------------------------------------
--==== High performance CTE equivalent of a Tally or Numbers table
SELECT TOP (@EndNumber-@StartNumber+1)
ROW_NUMBER() OVER (ORDER BY t1.Object_ID) + @StartNumber - 1 AS N
FROM Master.sys.ALL_Columns t1
CROSS JOIN Master.sys.ALL_Columns t2
)-----------------------------------------------------------------------------
SELECT @Bitbucket = N FROM cTally --Do your outer join with table being checked here
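[Editor's note: a portable sketch of the same range-controlled number list, in Python against SQLite. SQLite has no sys.all_columns to cross join, so this uses a recursive CTE instead, and the range is shrunk so the demo runs instantly; only the start/end-inclusive idea is taken from the post.]

```python
import sqlite3

# Generate an inclusive run of numbers start..end on the fly,
# analogous to the cTally CTE above but via WITH RECURSIVE.
start, end = 1000, 1010  # both inclusive, as in the post's parameters

conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    WITH RECURSIVE cTally(N) AS (
        SELECT :start
        UNION ALL
        SELECT N + 1 FROM cTally WHERE N < :end
    )
    SELECT N FROM cTally
""", {"start": start, "end": end}).fetchall()

ns = [r[0] for r in rows]
print(len(ns), min(ns), max(ns))  # 11 1000 1010
```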
--Jeff Moden
Change is inevitable... Change for the better is not.
January 23, 2008 at 7:06 am
And, I haven't done the comparison lately, but I believe the first method I showed will beat the pants off most outer join methods...
--Jeff Moden
Change is inevitable... Change for the better is not.
January 23, 2008 at 7:10 am
ahem...
With a large enough set of numbers - wouldn't you want to be able to leverage an index on your tally table? Although I conceptually like the CTE - it essentially returns an unindexed heap of a temp table. Again - great for smallish stuff, but surely it doesn't match performance of a "real" tally table?
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
January 23, 2008 at 7:12 am
Sounds like we might need a test to find out....
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
January 23, 2008 at 7:26 am
I was thinking the same thing... 😉
Also, Matt... you're the Ninja on Regex... is there any way to use Regex for such a thing as this import?
--Jeff Moden
Change is inevitable... Change for the better is not.
January 23, 2008 at 7:31 am
I don't think it would be efficient. It would be something like a recursive back-reference, and then you'd be doing math on string values, etc... A tally table is both much simpler and (gut feeling) will steamroll right over it.
I will put some thought into it though.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
January 23, 2008 at 7:45 am
create table dbo.SerialNumbers (ID int primary key)
go
insert into dbo.serialnumbers
select number
from common.dbo.numbers
where number between 1 and 100
go
delete from dbo.serialnumbers
where id in (55, 56, 71, 80, 99)
Opened a separate connection.
set statistics time on
set statistics io on
SELECT GapStart =
       (SELECT ISNULL(MAX(b.ID),0)+1
          FROM dbo.serialnumbers b
         WHERE b.ID < a.ID),
       GapEnd = ID - 1
  FROM dbo.serialnumbers a
 WHERE a.ID - 1 NOT IN
       (SELECT ID
          FROM dbo.serialnumbers)
   AND a.ID - 1 > 0
Estimated Cost of the final query: 0.0163148
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
(4 row(s) affected)
Table 'SerialNumbers'. Scan count 6, logical reads 12, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Next query (separate connection):
select number
from common.dbo.numbers
left outer join dbo.serialnumbers
on number = id
where number between 1 and 100
and id is null
Estimated Cost: 0.0128453
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
(5 row(s) affected)
Table 'SerialNumbers'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Numbers'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Conclusion:
The join to a Numbers table performs better, because of fewer reads. But the margin is tiny on such a small number set.
Increased the table to 10,000 rows, deleted another five semi-random records.
Cost on first query rises to: 0.108881; run time increases to 11 ms, Scan count 10, logical reads 54
Cost on numbers table query rises to: 0.012861; run time still 1 ms, Scan count 1, logical reads 2
On more complex tables, with more rows, the costs would be affected, but the numbers table join would still be less expensive.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
January 23, 2008 at 7:55 am
Decided to also try:
select number
from common.dbo.numbers
left outer join dbo.serialnumbers
on number = id
where number >=
(select min(id)
from dbo.serialnumbers)
and number <=
(select max(id)
from dbo.serialnumbers)
and id is null
Because that way I'm not assuming I already know the range of numbers to test.
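[Editor's note: that self-bounding variant, sketched in Python against SQLite. The range to check comes from MIN/MAX of the serial table itself rather than a hard-coded BETWEEN; the sample data is mine - serials 3 through 10 with 5 and 6 deleted.]

```python
import sqlite3

# Bound the numbers table by the serial table's own MIN and MAX,
# so no prior knowledge of the valid range is required.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (number INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE serialnumbers (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO numbers VALUES (?)",
                 [(n,) for n in range(1, 21)])
conn.executemany("INSERT INTO serialnumbers VALUES (?)",
                 [(n,) for n in (3, 4, 7, 8, 9, 10)])

missing = conn.execute("""
    SELECT number
      FROM numbers
      LEFT OUTER JOIN serialnumbers
        ON number = id
     WHERE number >= (SELECT MIN(id) FROM serialnumbers)
       AND number <= (SELECT MAX(id) FROM serialnumbers)
       AND id IS NULL
""").fetchall()

print(sorted(m[0] for m in missing))  # [5, 6]
```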
Still using 10,000 as top ID in dbo.SerialNumbers, still missing 10 semi-random rows.
Cost increases to: 0.097948
SQL Server parse and compile time:
CPU time = 11 ms, elapsed time = 11 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
(10 row(s) affected)
Table 'SerialNumbers'. Scan count 3, logical reads 23, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Numbers'. Scan count 1, logical reads 19, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 9 ms.
Still lower cost than the more complex query, but significantly higher.
select number
from common.dbo.numbers
left outer join dbo.serialnumbers
on number = id
where number >= 1
and number <=
(select max(id)
from dbo.serialnumbers)
and id is null
Cost: 0.0868469 (lower)
SQL Server parse and compile time:
CPU time = 1 ms, elapsed time = 1 ms.
(10 row(s) affected)
Table 'SerialNumbers'. Scan count 2, logical reads 21, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Numbers'. Scan count 1, logical reads 19, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 8 ms.
Since the first query assumed that the lowest number was 1, this version does too (more fair). Cost goes down, as well as reads and execution time. Still lower cost than the more complex query.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
January 23, 2008 at 9:06 am
The cost numbers aren't the most significant thing to compare in this case.
What you want to look at is the scan count, where scan count = 1 is the best you can get.
/Kenneth
January 23, 2008 at 7:40 pm
I gotta agree with Kenneth... and, I'll throw in that CPU time matters as well. I thought the example code I posted that didn't use a Tally table would really do the trick... it does if you wanna drive the disk nuts 😛
Instead of messing around with a 100 or even 10,000 rows, I thought I'd put the two methods to a real life test with 6 Million rows. Takes a couple of minutes to setup... here's the setup code for both the Monster Tally table and the Monster Serial Number table...
--===== Setup for speed and to prevent blocking
    SET NOCOUNT ON
    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
--===== Create a 6 million row Tally table...
--===== Create and populate the Tally table on the fly
 SELECT TOP 6000000
        IDENTITY(INT,1,1) AS N
   INTO dbo.Tally
   FROM Master.dbo.SysColumns sc1,
        Master.dbo.SysColumns sc2
--===== Add a Primary Key to maximize performance
  ALTER TABLE dbo.Tally
    ADD CONSTRAINT PK_Tally_N 
        PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
--=============================================================================
--      Create an experimental table to simulate the table being examined
--      Again... 6 million rows...
--=============================================================================
--===== Create the experimental temp table and populate with Serial #'s on the fly
     -- This works because SysColumns always has at least 4000 entries
     -- even in a new database and 4000*4000 = 16,000,000
 SELECT TOP 6000000 SerialNumber = IDENTITY(INT, 1, 1)
   INTO dbo.yourtable
   FROM Master.dbo.SysColumns sc1,
        Master.dbo.SysColumns sc2
-- --===== Like any good table, our experimental table needs a Primary Key
  ALTER TABLE dbo.yourtable
        ADD PRIMARY KEY CLUSTERED (SerialNumber)
     -- This deletes a "monster" range just to see how it's handled.
 DELETE dbo.yourtable
  WHERE SerialNumber BETWEEN 5000000 AND 5500000
     -- This deletes every third row
 DELETE dbo.yourtable
  WHERE SerialNumber %3 = 0
... here's the two methods running one right after the other with IO and CPU times turned on...
--      Test the two methods with IO and CPU measurements turned on
--=============================================================================
SET STATISTICS IO ON
SET STATISTICS TIME ON
--===== Calculated Gaps
 SELECT GapStart = (SELECT ISNULL(MAX(b.SerialNumber),0)+1
                      FROM dbo.yourtable b 
                     WHERE b.SerialNumber < a.SerialNumber),
        GapEnd = SerialNumber - 1  
   FROM dbo.yourtable a
  WHERE a.SerialNumber - 1 NOT IN (SELECT SerialNumber FROM dbo.yourtable)
    AND a.SerialNumber - 1 > 0
PRINT REPLICATE('=',100)
--===== Tally table gaps
 SELECT t.N
   FROM dbo.Tally t WITH (NOLOCK)
   LEFT OUTER JOIN dbo.yourtable y
     ON t.N = y.SerialNumber
  WHERE y.SerialNumber IS NULL
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
... and here's the results... Tally table wins by a mile!
====================================================================================================
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Table 'yourtable'. Scan count 1833335, logical reads 5520639, physical reads 0, read-ahead reads 0.
SQL Server Execution Times:
CPU time = 28297 ms, elapsed time = 44224 ms.
====================================================================================================
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Table 'Tally'. Scan count 1, logical reads 9649, physical reads 1, read-ahead reads 8293.
Table 'yourtable'. Scan count 1, logical reads 8846, physical reads 0, read-ahead reads 0.
SQL Server Execution Times:
CPU time = 5578 ms, elapsed time = 18606 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
--Jeff Moden
Change is inevitable... Change for the better is not.