Find missing numbers in sequential column

  • I'm pulling data from a legacy system to SQL Server.  Let's say I have a table with one column named Manifest and it is populated from this data.  In our legacy system, this is a sequential number.  So I should have sequential numbers in my table.

    I'm having a problem with missing data.  To track it down, I want to display the Manifest numbers that are missing in the sequence.  If the data in my column is, for example:

    423704

    423703

    423702

    423701

    423699

    423698

    423697

    423696

    423695

    Then I want the results of the query to be:

    423700

    Is this doable?  TIA......Cindy

     

  • from table a left outer join table b on a.key = b.key -1 where b.key is null

    gets you last row BEFORE a break

    reverse to get last row AFTER a break in sequence and merge these two to get empty ranges

     

  • Create a number table with numbers 1 thru whatever

     

    then perform a left join

    Select *

    From MyTable

    right Join numbers on mytable.pk = number

    where mytable.pk is null

      and number < (select max(mytable.pk) from mytable)

    This will create a number table with 1 millon sequential numbers

     CREATE TABLE Numbers(Number INT NOT NULL PRIMARY KEY)

     GO

     INSERT Numbers

     SELECT HundredThousand * 100000 + Tenthousand * 10000 + thousand  * 1000 + Hundred * 100 + Ten * 10 + Unit + 1

       FROM (

      SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL  

      SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL

      SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL

      SELECT 9) Units (Unit)

      CROSS JOIN

            (

      SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL

      SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL

      SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL

      SELECT 9) Tens(Ten)

      CROSS JOIN

            (

      SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL

      SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL

      SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL

      SELECT 9) Hundreds(Hundred)

      CROSS JOIN

            (

      SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL

      SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL

      SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL

      SELECT 9) thousands(thousand)

      CROSS JOIN

            (

      SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL

      SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL

      SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL

      SELECT 9) Tenthousands(Tenthousand)

      CROSS JOIN

            (

      SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL

      SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL

      SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL

      SELECT 9) Hundredthousands(Hundredthousand)

     GO

  • Great idea.  And thanks for the sql code.

  • This will create a number table with 1 millon sequential numbers

     CREATE TABLE Numbers(Number INT NOT NULL PRIMARY KEY)

     GO

     INSERT Numbers

     SELECT HundredThousand * 100000 + Tenthousand * 10000 + thousand  * 1000 + Hundred * 100 + Ten * 10 + Unit + 1

       FROM (

      SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL  

      SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL

      SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL

      SELECT 9) Units (Unit)

      CROSS JOIN

            (

      SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL

      SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL

      SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL

      SELECT 9) Tens(Ten)

      CROSS JOIN

            (

      SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL

      SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL

      SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL

      SELECT 9) Hundreds(Hundred)

      CROSS JOIN

            (

      SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL

      SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL

      SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL

      SELECT 9) thousands(thousand)

      CROSS JOIN

            (

      SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL

      SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL

      SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL

      SELECT 9) Tenthousands(Tenthousand)

      CROSS JOIN

            (

      SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL

      SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL

      SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL

      SELECT 9) Hundredthousands(Hundredthousand)

     GO

    Why that complicated?

    SELECT TOP 1000000 Num = IDENTITY(INT, 1, 1)

      INTO #Numbers

     FROM Northwind..sysobjects a1

      CROSS JOIN Northwind..sysobjects a2

      CROSS JOIN Northwind..sysobjects a3

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/[/url]

  • Just for fun, here's another variation on the first answer:

    SELECT

       Key+1

    FROM

       MyTable 

    WHERE

        --The next # is missing

       Key+1 NOT IN (SELECT Key FROM MyTable)

    AND

       --We haven't reached the max of values

       Key+1 < (SELECT Max(Key) FROM MyTable)

  • Ok, just for fun, the following code will report ranges of both present and missing ID's.  Why do it this way?  Because if you have a huge number of missing ID's in a given range (ie. ID's 500,000 through 700,000 are missing), who wants a list of over 200,000 ID's?  Another advantage of doing it this way is you don't need to build a huge temporary numbers table if the MaxID in the table being examined is huge (picky ol' DBA's get really twisted when you run TempDB out of space creating a numbers table with a couple of hundred million entries in it ) and it run's almost as fast (nothing runs faster than comparing to a numbers table, though).

    To summarize the code below...

    1. It creates an experimental table of a million ID's to simulate a table you may have to run the code on (#yourtable) [Takes about 6 seconds to do this on my Desktop PC].
    2. It deletes various ranges of records including some individual records.
    3. It calculates the "Missing" and "Present" ranges from 1 to the Max ID of the table (1 million, in this case).  [Takes about 7 seconds to do this on my Desktop PC]

    Lemme know wat'cha think...  sorry it'll be so hard to figure out what the code is doing ... my documentation skills aren't what they used to be...

    --===== Setup for speed and to prevent blocking

        SET NOCOUNT ON

        SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED

    --=============================================================================

    --      Create an experimental table to simulate the table being examined

    --=============================================================================

    --===== If the experimental temp table exists, drop it so we can do repeat runs

         -- if we want to "play".

         IF OBJECT_ID('TempDB..#yourtable') IS NOT NULL

            DROP TABLE #yourtable

    --===== Create the experimental temp table and populate with IDs on the fly

         -- This always works because SYSCOLUMNS always has at least 256 entries

         -- even in a new database and 256*256*256 > 1,000,000

     SELECT TOP 1000000 ID = IDENTITY(INT, 1, 1)

       INTO #yourtable

       FROM dbo.SYSCOLUMNS sc1,

            dbo.SYSCOLUMNS sc2,

            dbo.SYSCOLUMNS sc3

    --===== Like any good table, our experimental table needs a Primary Key

      ALTER TABLE #yourtable

            ADD PRIMARY KEY CLUSTERED (ID)

    --===== Ok, we have an experimental table with a million IDs from 1 to 1,000,000.

         -- Let's remove some of the rows and see if we can find if they're missing

         -- This deletes a handful of individual rows and 3 very small ranges

     DELETE #yourtable

      WHERE ID IN (1,2,3,250,251,2000,4000,4002,4004,900001,900002,900003)

         -- This deletes a "monster" range just to see how it's handled.

     DELETE #yourtable

      WHERE ID BETWEEN 500000 AND 700000

    --===== At this point, the experiment is ready to rock

    --=============================================================================

    --      Create a report of all ranges of ID's, big or small, from 1 to the max

    --      ID present in the table.  Status of each range will be 'Missing' or

    --      'Present' and the MinID and MaxID of each range will also be reported.

    --      This is done without loops and without a Tally (numbers) table.  The

    --      reason for doing it without a numbers table is because if the max ID

    --      for the table is very large (hundreds of millions), it's not a good

    --      idea to put a table that large in TempDB (although some do) and the

    --      DBA's frequently deny table creation privs to make a working table in

    --      the DB.  The temp tables used below should stay fairly small unless

    --      someone did something really stupid like deleting every other row.

    --      AND, for this million row example, it only takes about 7 seconds to

    --      run on my little ol' desktop PC.  It should run MUCH faster on a server

    --      quality machine.

    --=============================================================================

    --===== If the working temp tables exist, drop them

         -- Conditionally drop the table that keeps track of "MinIDs" in a range

         IF OBJECT_ID('TempDB..#MyMins') IS NOT NULL

            DROP TABLE #MyMins

         -- Conditionally drop the table that keeps track of "MaxIDs" in a range

         IF OBJECT_ID('TempDB..#MyMaxes') IS NOT NULL

            DROP TABLE #MyMaxes

         -- Conditionally drop the table that holds the final report on ranges of IDs

         IF OBJECT_ID('TempDB..#Report') IS NOT NULL

            DROP TABLE #Report

    --===== Find all of the "MinIDs" for ranges that are present and put them in a table

     SELECT IDENTITY(INT,1,1) AS RowNum,

            CAST(ID AS INT) AS ID --Cast is necessary because ID is an IDENTITY column

       INTO #MyMins

       FROM #yourtable

      WHERE ID NOT IN (SELECT t1.ID

                         FROM #yourtable t1,

                              #yourtable t2

                        WHERE t1.ID=t2.ID+1)

      ORDER BY ID --This is necessary so we can match up with the MaxIDs

    --===== Find all of the "MaxIDs" for ranges that are present and put them in a table

     SELECT IDENTITY(INT,1,1) AS RowNum,

            CAST(ID AS INT) AS ID --Cast is necessary because ID is an IDENTITY column

       INTO #MyMaxes

       FROM #yourtable

      WHERE ID NOT IN (SELECT t1.ID

                         FROM #yourtable t1,

                              #yourtable t2

                        WHERE t1.ID+1=t2.ID)

      ORDER BY ID --This is necessary so we can match up with the MinIDs

    --===== Insert all of the "Present" ranges into the report table using the Min and Max

         -- tables to derive the ranges by matching on the RowNum of each table.  This

         -- step creates the Report table on the fly and the RowNum is very necessary for

         -- deriving the missing ID's in the steps that follow.

     SELECT IDENTITY(INT,1,1) AS RowNum,

            m.ID AS MinID,

            x.ID AS MaxID,

            'Present' AS Status

       INTO #Report

       FROM #MyMins m,

            #MyMaxes x

      WHERE m.RowNum = x.RowNum

    --===== Now, derive and insert all of the missing ranges of ID's save, perhaps, one.

     INSERT INTO #Report (MinID,MaxID,Status)

     SELECT r1.MaxID+1 AS MinID,

            r2.MinID-1 AS MaxID,

            'Missing' AS Status

       FROM #Report r1,

            #Report r2

      WHERE r1.RowNum+1 = r2.RowNum

    --===== If ID #1 is missing, we wouldn't have caught that so far.  So let's check

         -- now and add it's missing range if it's missing

         IF (SELECT MIN(MinID) FROM #Report)>1

             INSERT INTO #Report (MinID,MaxID,Status)

             SELECT 1 AS MinID,

                    MIN(MinID)-1 AS MaxID,

                    'Missing' AS Status

               FROM #Report

    --===== All set... display the report (could be used as a result set, as well)

     SELECT MinID,MaxID,Status

       FROM #REPORT

      ORDER BY MinID

    Now, WOULD SOMEONE GIVE ME AN EXAMPLE AS TO WHY THIS EVER NEEDS TO BE DONE?  I could see doing it maybe as a check for transmitted data but not much more.  And the old saw about a legacy system never having a row with a sequential column deleted just doesn't sit quite right with me.  I agree it's not a good idea to physically delete records but, it happens and no one should care about reusing the ID when there are so many to choose from.  So, why would anyone need to do this (find missing ID's)?  Surely you don't intend to reuse them

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Still just for fun   I think the following might be a bit simpler (I use Jeff's code to create #yourtable)...

     

         IF OBJECT_ID('TempDB..#MyMins') IS NOT NULL

            DROP TABLE #MyMins

         -- Conditionally drop the table that keeps track of "MaxIDs" in a range

         IF OBJECT_ID('TempDB..#MyMaxes') IS NOT NULL

            DROP TABLE #MyMaxes

         create table #MyMins(RowNum int identity(1,1), MinID int)

         create table #MyMaxes(RowNum int identity(1,1), MaxID int)

         insert #MyMins (MinID)

         select t1.ID

         from #yourtable t1 left join #yourtable t2

         on t1.ID = t2.ID + 1

         where t2.ID is null

         order by t1.ID

         insert #MyMaxes (MaxID)

         select t1.ID

         from #yourtable t1 left join #yourtable t2

         on t1.ID = t2.ID - 1

         where t2.ID is null

         order by t1.ID

         select t1.MinID as PresentFrom, t2.MaxID as PresentTo from #MyMins t1 inner join #MyMaxes t2 on

         t1.RowNum = t2.RowNum

         select isnull(t2.MaxID, 0) + 1 as MissingFrom, isnull(t1.MinID, 1000001) - 1 as MissingTo from #MyMins t1 full outer join #MyMaxes t2 on

         t1.RowNum = t2.RowNum + 1

         where t2.MaxID != 1000000

     

  • This is way too much fun.

    I have an unnatural aversion to temp tables and an equally unnatural propensity for clever querying.

    SELECT
    MinRange = (SELECT ISNULL(MAX(subA.TheKey),0)+1 FROM MyValues subA WHERE subA.TheKey < A.TheKey),
    MaxRange = TheKey - 1
    FROM MyValues A
    WHERE
    A.TheKey - 1 NOT IN
    (SELECT TheKey FROM MyValues)
  • You're right...that is very clever.  I'm glad you guys are having fun with this.  Meanwhile, I used the "create a table of Numbers" approach and am quite pleased with the results.

    Thanks for all your help.........cindy

  • Very cool, jratwork... I have a similar un-natural aversion to correlated subqueries and I had my doubts when I saw the "<" in the correlated subquery... but I just ran this against the million row temp table I made... it ran in less than 3 seconds and correctly found all of the missing ranges. Like I said, Very cool.

    There is, however, a fly in the ointment... if ID #1 is present, the first row returned is always

    MinRange MaxRange

    -------- --------

        1       0

    I added "the fix" to the last line of the code...

     SELECT MinRange = (SELECT ISNULL(MAX(suba.ID),0)+1

                          FROM #yourtable suba

                         WHERE suba.ID < a.ID),

           MaxRange = ID - 1

      FROM #yourtable a

     WHERE a.ID - 1 NOT IN (SELECT ID FROM #yourtable)

       AND a.ID - 1 > 0

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I had exactly the same doubts when I saw the statement, but I haven't found anything faster... I give up, you beat me... this time...

  • I am humbled by your veteran approval.

    I should have caught that one off scenario but I had focused on dealing with a missing id of 1.  Thank you.

    Until next time.

    JR

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply