Help with row_number()

Question

Help with row_number()

El Guapo

Old Hand

Points: 366
More actions
December 11, 2006 at 12:06 pm

#171220

So I have a table with 1 million + rows. It serves as a mapping table between two other tables and consists of 2 columns, ClientID and RegionID. I found out that I have some duplicate rows in there, and my cleanup script requires a unique "ID" column in order to get rid of the dupes. At first I tried using the GUI to add a new autoincrement column but that timed out. Then I tried generating a change script and running that but I had another problem (timeout and something else i cant remember). So finally I decided to just add a column with no data, and then use row_number to populate the column. However, I'm having some issues. I'm trying to do this:

INSERT INTO 'table' (IDcolumn)
SELECT row_number() over(order by ClientID, RegionID) FROM 'table'
I keep getting a message saying 'Cannot insert value NULL into ClientID'. Why is it trying to do anything with the other 2 columns?? What am I missing?? Thanks much

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply

jmcgarvey Default port Points: 1489 More actions · Answer 1

There was a post on SQLServerCentral a while back that might be useful to you. I have used it with a good deal of success. Search the site for the script with the title: "Remove Duplicate Values from a Table"

El Guapo Old Hand Points: 366 More actions · Answer 2

Thanks.. That might work for me, although, I might need to tweak it a bit since the problem I'm having is not just dupes in one table, but dupes in 2 tables that are related. Kinda hard to explain but I'll see if I can't work something out with that script. Thanks. Would happen to have any suggestions for my row_number() problem though would yoU?

PW-201837 SSC-Insane Points: 20805 More actions · Answer 3

>>Would happen to have any suggestions for my row_number() problem though would yoU?

You're doing an INSERT, when you should be doing an UPDATE.

UPDATE

t

SET IDColumn = dt.RowNumber

From YourTable as t

Inner Join

(

Select row_number() over(order by ClientID, RegionID) As RowNumber,

ClientID,

RegionID

From YourTable

) dt

On (t.ClientID = dt.ClientID And

t.RegionID = dt.RegionID)

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 4

I don't see how the update can solve the problem. The update needs to know which row to update (unique id of sort some). And since there are duplicates in the table, I suspect they will both be updated to the same value!!

I think you don't have any choice but adding and identity column, or maby a GUI. You'll have to find a way to not have the process timeout (maybe using another application than SSMS assuming you can change the setting to unlimited).

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 5

Or another much less costly solution :

CREATE TABLE #WorkTable (ClientID, RegionID)

CREATE UNIQUE CLUSTERED INDEX IX_WorkTable ON #WorkTable (ClientID, RegionID)

Insert into #WorkTable (ClientID, RegionID)

Select CliendID, RegionID FROM dbo.ClientsRegions GROUP BY ClientID, RegionID HAVING COUNT(*) > 1

DELETE CR FROM dbo.ClientsRegions CR INNER JOIN #WorkTable WT ON CR.ClientID = WT.ClientID AND CR.RegionID = WT.RegionID

Insert into dbo.ClientsRegions (ClientID, RegionID))

SELECT ClientID, RegionID FROM #WorkTable

DROP TABLE #WorkTable

El Guapo Old Hand Points: 366 More actions · Answer 6

Excellent... Thanks PW, it worked flawlessly. And Ninja, in my case it didn't really matter what row was updated, just as long as each row was updated with a unique number, which is exactly what the suggestion that PW provided accomplished. Now that every row has a unique id, i can run my script to remove the dupes. If it doesn't work as I had planned though, your suggestion looks like it might work too.. I'll give it a try if mine fails. Thanks to both of you

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 7

I don't wanna be a pain in the *** but that update statement ain't cutting it for me (sql 2000). I don't think this feature was changed in 2k5 but then again I can't test to be 100% sure. I'd make sure that you still don't have duplicates in the table (group by all 3 columns). I'll be glad to be wrong but I'm afraid I'm not!

DECLARE @Demo TABLE (ClientID INT, RegionID INT, UID INT)

INSERT INTO @Demo (ClientID, RegionID, UID)

SELECT 1, 1, NULL

UNION ALL

SELECT 1, 1, NULL

UNION ALL

SELECT 2, 1, NULL

UNION ALL

SELECT 1, 2, NULL

SELECT D.ClientID, D.RegionID, dtUID.UID FROM @Demo D

INNER JOIN

(

SELECT 1 AS ClientID, 1 AS RegionID, 1 AS UID

UNION ALL

SELECT 1, 1, 2

UNION ALL

SELECT 2, 1, 3

UNION ALL

SELECT 1, 2, 4

) dtUID

ON D.ClientID = dtUID.ClientID AND D.RegionID = dtUID.RegionID

/*

ClientID RegionID UID

----------- ----------- -----------

1 1 1

1 1 2

1 1 1

1 1 2

2 1 3

1 2 4

(6 ligne(s) affectÉe(s))

*/

UPDATE D SET D.UID = dtUID.UID FROM @Demo D

INNER JOIN

(

SELECT 1 AS ClientID, 1 AS RegionID, 1 AS UID

UNION ALL

SELECT 1, 1, 2

UNION ALL

SELECT 2, 1, 3

UNION ALL

SELECT 1, 2, 4

) dtUID

ON D.ClientID = dtUID.ClientID AND D.RegionID = dtUID.RegionID

/*

(4 ligne(s) affectÉe(s))

*/

SELECT ClientID, RegionID, UID FROM @Demo

/*

ClientID RegionID UID

----------- ----------- -----------

1 1 2

2 1 3

1 2 4

(4 ligne(s) affectÉe(s))

*/

El Guapo Old Hand Points: 366 More actions · Answer 8

So I just spent 10 minutes writing a response explaining it and somehow lost all my text.... Well, I really don't want to rewrite it, so heres a quick summary. I have 3 tables: Regions (RegionID, type, value), Schools (ClientID, etc....), and Schools_Regions (ClientID, RegionID) that serves as a mapping between the two. Due to a bulk insert that didn't work perfectly, I had many dupes in my Regions table. The way thats supposed to work is only 1 RegionID with the same type/value that way I can have 50 different rows in my Schools_regions table all referencing that one RegionID. So basically I had a bunch of dupes all over the place.. I wrote a script that would get rid of them all, but I needed a unique ID column in my Schools_Regions table in order for it to work. That is what the row_number() update accomplished. Once I had that I ran this script and it worked fine:

-- Get All Data with row number

SELECT ROW_NUMBER() OVER(PARTITION BY [value] ORDER BY [Value]) AS RN, RegionID, [value]

INTO #TempRegions

FROM Regions;

-- Delete all the records we want to keep

DELETE FROM #TempRegions WHERE RN = 1;

-- look at the data we want to smoke

SELECT * FROM #TempRegions;

-- Lok at referenced dupes on the link server

DELETE FROM Schools_Regions WHERE ID IN

(

SELECT ID

FROM Schools_Regions

WHERE RegionID IN

(

SELECT RegionID FROM #TempRegions

&nbsp

DELETE FROM Regions

WHERE RegionID IN

(

SELECT RegionID FROM #TempRegions

-- Clean House

DROP Table #Tempregions

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 9

That makes more sens... too bad I don't have sql 2k5 to test it.

El Guapo Old Hand Points: 366 More actions · Answer 10

If you want to mess around with it I suggest that you grab the express version. http://msdn.microsoft.com/vstudio/express/sql/ . Lots of cool new stuff

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 11

Thanx for the info. Will do that someday, again. But not today .

Scott Coleman One Orange Chip Points: 27452 More actions · Answer 12

Whenever I see something create a temp table and then delete unwanted rows, I wonder if the final table could be generated in one step. The ROW_NUMBER function can't be used in a WHERE clause, but you can use it in the field list of a derived table subquery and filter the row numbers in the main query.

And if you can generate the table with the RegionID's to be deleted directly, you don't need the temp table. Use the ROW_NUMBER subquery right in the DELETE statement. If the School_Regions table has a foreign key it won't like the delete, but you can temporarily disable it.

ALTER

TABLE Schools_Regions NOCHECK CONSTRAINT ALL

DELETE r FROM Regions r

INNER JOIN (

SELECT RegionID, ROW_NUMBER() OVER(PARTITION BY [value] ORDER BY [value]) AS RN FROM dbo.Regions

) x ON r.RegionID = x.RegionID AND x.RN > 1

DELETE FROM Schools_Regions

WHERE RegionID NOT IN (SELECT RegionID FROM Regions)

ALTER TABLE Schools_Regions WITH CHECK CHECK CONSTRAINT ALL

El Guapo Old Hand Points: 366 More actions · Answer 13

You're absolutely right. That looks like it would work without having to create the temp table. I wouldn't have even thought of doing it that way.

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 14

I don't understand the 2nd delete in this code. Were are deleting duplicates. So 1 row will remain which means that there won't be orphaned rows. Other than that it's a very nice idea.