Help with row_number()

  • So I have a table with 1 million + rows.  It serves as a mapping table between two other tables and consists of 2 columns, ClientID and RegionID.  I found out that I have some duplicate rows in there, and my cleanup script requires a unique "ID" column in order to get rid of the dupes.  At first I tried using the GUI to add a new autoincrement column but that timed out.  Then I tried generating a change script and running that but I had another problem (timeout and something else i cant remember).  So finally I decided to just add a column with no data, and then use row_number to populate the column.  However, I'm having some issues.  I'm trying to do this:

     

    INSERT INTO 'table' (IDcolumn)

         SELECT row_number() over(order by ClientID, RegionID) FROM 'table'

    I keep getting a message saying 'Cannot insert value NULL into ClientID'.  Why is it trying to do anything with the other 2 columns?? What am I missing?? Thanks much

  • There was a  post on SQLServerCentral a while back that might be useful to you.  I have used it with a good deal of success.  Search the site for the script with the title:  "Remove Duplicate Values from a Table"

  • Thanks.. That might work for me, although, I might need to tweak it a bit since the problem I'm having is not just dupes in one table, but dupes in 2 tables that are related.  Kinda hard to explain but I'll see if I can't work something out with that script.  Thanks.  Would happen to have any suggestions for my row_number() problem though would yoU?

  • >>Would happen to have any suggestions for my row_number() problem though would yoU?

    You're doing an INSERT, when you should be doing an UPDATE.

    UPDATE

    t

    SET IDColumn = dt.RowNumber

    From YourTable as t

    Inner Join

    (

      Select row_number() over(order by ClientID, RegionID) As RowNumber,

        ClientID,

        RegionID

      From YourTable

    ) dt

      On (t.ClientID = dt.ClientID And

         t.RegionID = dt.RegionID)

  • I don't see how the update can solve the problem.  The update needs to know which row to update (unique id of sort some).  And since there are duplicates in the table, I suspect they will both be updated to the same value!!

     

    I think you don't have any choice but adding and identity column, or maby a GUI.  You'll have to find a way to not have the process timeout (maybe using another application than SSMS assuming you can change the setting to unlimited).

  • Or another much less costly solution :

    CREATE TABLE #WorkTable (ClientID, RegionID)

    CREATE UNIQUE CLUSTERED INDEX IX_WorkTable ON #WorkTable (ClientID, RegionID)

    Insert into #WorkTable (ClientID, RegionID)

    Select CliendID, RegionID FROM dbo.ClientsRegions GROUP BY ClientID, RegionID HAVING COUNT(*) > 1

    DELETE CR FROM dbo.ClientsRegions CR INNER JOIN #WorkTable WT ON CR.ClientID = WT.ClientID AND CR.RegionID = WT.RegionID

    Insert into dbo.ClientsRegions (ClientID, RegionID))

    SELECT ClientID, RegionID FROM #WorkTable

    DROP TABLE #WorkTable

  • Excellent... Thanks PW, it worked flawlessly.  And Ninja, in my case it didn't really matter what row was updated, just as long as each row was updated with a unique number, which is exactly what the suggestion that PW provided accomplished.  Now that every row has a unique id, i can run my script to remove the dupes.  If it doesn't work as I had planned though, your suggestion looks like it might work too.. I'll give it a try if mine fails.  Thanks to both of you

  • I don't wanna be a pain in the *** but that update statement ain't cutting it for me (sql 2000). I don't think this feature was changed in 2k5 but then again I can't test to be 100% sure. I'd make sure that you still don't have duplicates in the table (group by all 3 columns). I'll be glad to be wrong but I'm afraid I'm not!

    DECLARE @Demo TABLE (ClientID INTRegionID INTUID INT)

    INSERT INTO @Demo (ClientIDRegionIDUID)

    SELECT 11, NULL

    UNION ALL

    SELECT 11, NULL

    UNION ALL

    SELECT 21, NULL

    UNION ALL

    SELECT 12, NULL

    SELECT D.ClientIDD.RegionIDdtUID.UID FROM @Demo D

    INNER JOIN

    (

    SELECT AS ClientIDAS RegionIDAS UID

    UNION ALL

    SELECT 112

    UNION ALL

    SELECT 213

    UNION ALL

    SELECT 124

    dtUID

    ON D.ClientID dtUID.ClientID AND D.RegionID dtUID.RegionID

    /*

    ClientID    RegionID    UID         

    ----------- ----------- ----------- 

    1           1           1

    1           1           2

    1           1           1

    1           1           2

    2           1           3

    1           2           4

    (6 ligne(s) affectÉe(s))

    */

    UPDATE SET D.UID dtUID.UID FROM @Demo D

    INNER JOIN

    (

    SELECT AS ClientIDAS RegionIDAS UID

    UNION ALL

    SELECT 112

    UNION ALL

    SELECT 213

    UNION ALL

    SELECT 124

    dtUID

    ON D.ClientID dtUID.ClientID AND D.RegionID dtUID.RegionID

    /*

    (4 ligne(s) affectÉe(s))

    */

    SELECT ClientIDRegionIDUID FROM @Demo

    /*

    ClientID    RegionID    UID         

    ----------- ----------- ----------- 

    1           1           2

    1           1           2

    2           1           3

    1           2           4

    (4 ligne(s) affectÉe(s))

    */

  • So I just spent 10 minutes writing a response explaining it and somehow lost all my text.... Well, I really don't want to rewrite it, so heres a quick summary.  I have 3 tables: Regions (RegionID, type, value), Schools (ClientID, etc....), and Schools_Regions (ClientID, RegionID) that serves as a mapping between the two.  Due to a bulk insert that didn't work perfectly, I had many dupes in my Regions table.  The way thats supposed to work is only 1 RegionID with the same type/value that way I can have 50 different rows in my Schools_regions table all referencing that one RegionID.  So basically I had a bunch of dupes all over the place.. I wrote a script that would get rid of them all, but I needed a unique ID column in my Schools_Regions table in order for it to work.  That is what the row_number() update accomplished.  Once I had that I ran this script and it worked fine: 

     

     -- Get All Data with row number

     SELECT ROW_NUMBER() OVER(PARTITION BY [value] ORDER BY [Value]) AS RN, RegionID, [value]

     INTO #TempRegions

     FROM Regions;

     -- Delete all the records we want to keep

     DELETE FROM #TempRegions WHERE RN = 1;

     -- look at the data we want to smoke

     SELECT * FROM #TempRegions;

     -- Lok at referenced dupes on the link server

     DELETE FROM Schools_Regions WHERE ID IN

      (

      SELECT ID

      FROM Schools_Regions

      WHERE RegionID IN

       (

        SELECT RegionID FROM #TempRegions

      &nbsp

      

     

     DELETE FROM Regions

     WHERE RegionID IN

      (

       SELECT RegionID FROM #TempRegions

      

     

     -- Clean House

     DROP Table #Tempregions

     

     

  • That makes more sens... too bad I don't have sql 2k5 to test it.

  • If you want to mess around with it I suggest that you grab the express version.  http://msdn.microsoft.com/vstudio/express/sql/ .  Lots of cool new stuff

  • Thanx for the info.  Will do that someday, again.  But not today .

  • Whenever I see something create a temp table and then delete unwanted rows, I wonder if the final table could be generated in one step.  The ROW_NUMBER function can't be used in a WHERE clause, but you can use it in the field list of a derived table subquery and filter the row numbers in the main query.

    And if you can generate the table with the RegionID's to be deleted directly, you don't need the temp table.  Use the ROW_NUMBER subquery right in the DELETE statement.  If the School_Regions table has a foreign key it won't like the delete, but you can temporarily disable it.

    ALTER

    TABLE Schools_Regions NOCHECK CONSTRAINT ALL

    DELETE r FROM Regions r

    INNER JOIN (

        SELECT RegionID, ROW_NUMBER() OVER(PARTITION BY [value] ORDER BY [value]) AS RN FROM dbo.Regions

    ) x ON r.RegionID = x.RegionID AND x.RN > 1

    DELETE FROM Schools_Regions

    WHERE RegionID NOT IN (SELECT RegionID FROM Regions)

    ALTER TABLE Schools_Regions WITH CHECK CHECK CONSTRAINT ALL

  • You're absolutely right.  That looks like it would work without having to create the temp table.  I wouldn't have even thought of doing it that way. 

  • I don't understand the 2nd delete in this code.  Were are deleting duplicates.  So 1 row will remain which means that there won't be orphaned rows.  Other than that it's a very nice idea.

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply