December 11, 2006 at 12:06 pm
So I have a table with 1 million+ rows. It serves as a mapping table between two other tables and consists of 2 columns, ClientID and RegionID. I found out that I have some duplicate rows in there, and my cleanup script requires a unique "ID" column in order to get rid of the dupes. At first I tried using the GUI to add a new autoincrement column, but that timed out. Then I tried generating a change script and running that, but I had another problem (timeout and something else I can't remember). So finally I decided to just add a column with no data, and then use row_number to populate the column. However, I'm having some issues. I'm trying to do this:
INSERT INTO YourTable (IDcolumn)
SELECT ROW_NUMBER() OVER (ORDER BY ClientID, RegionID) FROM YourTable
I keep getting a message saying 'Cannot insert the value NULL into ClientID'. Why is it trying to do anything with the other 2 columns?? What am I missing?? Thanks much
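(For reference, the empty column itself went on with a plain ALTER TABLE, roughly like this; the real table and column names are different:)
-- roughly how the empty ID column was added (names are placeholders)
ALTER TABLE dbo.YourTable ADD IDcolumn INT NULL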
December 11, 2006 at 12:40 pm
There was a post on SQLServerCentral a while back that might be useful to you. I have used it with a good deal of success. Search the site for the script with the title: "Remove Duplicate Values from a Table"
December 11, 2006 at 12:45 pm
Thanks.. That might work for me, although I might need to tweak it a bit, since the problem I'm having is not just dupes in one table, but dupes in 2 tables that are related. Kinda hard to explain, but I'll see if I can't work something out with that script. Thanks. Would you happen to have any suggestions for my row_number() problem, though?
December 11, 2006 at 1:20 pm
>>Would you happen to have any suggestions for my row_number() problem, though?
You're doing an INSERT when you should be doing an UPDATE. An INSERT always appends new rows, so your statement was trying to add a million fresh rows with only IDcolumn supplied; that's why SQL Server complained about NULL in ClientID.
UPDATE t
SET IDColumn = dt.RowNumber
FROM YourTable AS t
INNER JOIN
(
    SELECT ROW_NUMBER() OVER (ORDER BY ClientID, RegionID) AS RowNumber,
           ClientID,
           RegionID
    FROM YourTable
) dt
    ON (t.ClientID = dt.ClientID
    AND t.RegionID = dt.RegionID)
December 11, 2006 at 1:30 pm
I don't see how the update can solve the problem. The update needs to know which row to update (a unique id of some sort). And since there are duplicates in the table, I suspect they will both be updated to the same value!!
I think you don't have any choice but adding an identity column, or maybe a GUID. You'll have to find a way to keep the process from timing out (maybe using another application than SSMS, assuming you can change the timeout setting to unlimited).
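Something along these lines ought to do it (table name borrowed from the other posts; just a sketch):
-- IDENTITY numbers every existing row automatically in one statement
ALTER TABLE dbo.ClientsRegions ADD ID INT IDENTITY(1,1) NOT NULL
-- or a GUID: NOT NULL plus a DEFAULT fills the existing rows too
ALTER TABLE dbo.ClientsRegions ADD RowGuid UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID()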
December 11, 2006 at 1:36 pm
Or another, much less costly solution:
-- Collect each duplicated pair exactly once (column types assumed INT)
CREATE TABLE #WorkTable (ClientID INT, RegionID INT)
CREATE UNIQUE CLUSTERED INDEX IX_WorkTable ON #WorkTable (ClientID, RegionID)

INSERT INTO #WorkTable (ClientID, RegionID)
SELECT ClientID, RegionID FROM dbo.ClientsRegions GROUP BY ClientID, RegionID HAVING COUNT(*) > 1

-- Remove every copy of the duplicated pairs...
DELETE CR FROM dbo.ClientsRegions CR INNER JOIN #WorkTable WT ON CR.ClientID = WT.ClientID AND CR.RegionID = WT.RegionID

-- ...then put a single copy of each back
INSERT INTO dbo.ClientsRegions (ClientID, RegionID)
SELECT ClientID, RegionID FROM #WorkTable

DROP TABLE #WorkTable
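A quick sanity check once it has run (should return nothing):
-- any pair still appearing more than once?
SELECT ClientID, RegionID, COUNT(*) AS Cnt
FROM dbo.ClientsRegions
GROUP BY ClientID, RegionID
HAVING COUNT(*) > 1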
December 11, 2006 at 2:04 pm
Excellent... Thanks PW, it worked flawlessly. And Ninja, in my case it didn't really matter which row was updated, just as long as each row was updated with a unique number, which is exactly what PW's suggestion accomplished. Now that every row has a unique id, I can run my script to remove the dupes. If it doesn't work as I had planned, though, your suggestion looks like it might work too. I'll give it a try if mine fails. Thanks to both of you
December 11, 2006 at 2:27 pm
I don't wanna be a pain in the *** but that update statement ain't cutting it for me (SQL 2000). I don't think this behavior was changed in 2k5, but then again I can't test to be 100% sure. I'd make sure that you don't still have duplicates in the table (group by all 3 columns). I'll be glad to be wrong, but I'm afraid I'm not!
-- Demo table: the (1,1) pair appears twice
DECLARE @Demo TABLE (ClientID INT, RegionID INT, UID INT)
INSERT INTO @Demo (ClientID, RegionID, UID)
SELECT 1, 1, NULL UNION ALL
SELECT 1, 1, NULL UNION ALL
SELECT 2, 1, NULL UNION ALL
SELECT 1, 2, NULL
-- The join matches each (1,1) duplicate against BOTH numbered rows: 6 rows out of 4
SELECT D.ClientID, D.RegionID, dtUID.UID FROM @Demo D
INNER JOIN
(
    SELECT 1 AS ClientID, 1 AS RegionID, 1 AS UID UNION ALL
    SELECT 1, 1, 2 UNION ALL
    SELECT 2, 1, 3 UNION ALL
    SELECT 1, 2, 4
) dtUID
ON D.ClientID = dtUID.ClientID AND D.RegionID = dtUID.RegionID
/*
ClientID RegionID UID
----------- ----------- -----------
1 1 1
1 1 2
1 1 1
1 1 2
2 1 3
1 2 4
(6 row(s) affected)
*/
-- Now the UPDATE: each (1,1) row is matched twice, so the duplicates
-- both end up with the same UID
UPDATE D SET D.UID = dtUID.UID FROM @Demo D
INNER JOIN
(
    SELECT 1 AS ClientID, 1 AS RegionID, 1 AS UID UNION ALL
    SELECT 1, 1, 2 UNION ALL
    SELECT 2, 1, 3 UNION ALL
    SELECT 1, 2, 4
) dtUID
ON D.ClientID = dtUID.ClientID AND D.RegionID = dtUID.RegionID
/*
(4 row(s) affected)
*/
SELECT ClientID, RegionID, UID FROM @Demo
/*
ClientID RegionID UID
----------- ----------- -----------
1 1 2
1 1 2
2 1 3
1 2 4
(4 row(s) affected)
*/
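For anyone on 2k5: I can't test this myself, but updating through a CTE should hand every physical row its own number, duplicates included. A sketch against the same @Demo table:
-- SQL 2005 only: each row, duplicate or not, gets a distinct ROW_NUMBER
WITH Numbered AS
(
    SELECT UID, ROW_NUMBER() OVER (ORDER BY ClientID, RegionID) AS RN
    FROM @Demo
)
UPDATE Numbered SET UID = RN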
December 11, 2006 at 3:02 pm
So I just spent 10 minutes writing a response explaining it and somehow lost all my text... Well, I really don't want to rewrite it, so here's a quick summary. I have 3 tables: Regions (RegionID, type, value), Schools (ClientID, etc....), and Schools_Regions (ClientID, RegionID), which serves as a mapping between the two. Due to a bulk insert that didn't work perfectly, I had many dupes in my Regions table. The way it's supposed to work, there is only 1 RegionID with a given type/value, so that I can have 50 different rows in my Schools_Regions table all referencing that one RegionID. So basically I had a bunch of dupes all over the place. I wrote a script that would get rid of them all, but I needed a unique ID column in my Schools_Regions table in order for it to work. That is what the row_number() update accomplished. Once I had that, I ran this script and it worked fine:
-- Get all data with a row number per duplicate [value]
SELECT ROW_NUMBER() OVER (PARTITION BY [value] ORDER BY [value]) AS RN, RegionID, [value]
INTO #TempRegions
FROM Regions;

-- Delete all the records we want to keep
DELETE FROM #TempRegions WHERE RN = 1;

-- Look at the data we want to smoke
SELECT * FROM #TempRegions;

-- Remove the referencing dupes in the link table
DELETE FROM Schools_Regions WHERE ID IN
(
    SELECT ID
    FROM Schools_Regions
    WHERE RegionID IN
    (
        SELECT RegionID FROM #TempRegions
    )
);

-- Remove the duplicate regions themselves
DELETE FROM Regions
WHERE RegionID IN
(
    SELECT RegionID FROM #TempRegions
);

-- Clean house
DROP TABLE #TempRegions
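And just to be sure it did what I wanted, a check like this should come back empty now (same [value] column the script partitions on):
-- no duplicate values should remain in Regions
SELECT [value], COUNT(*) AS Cnt
FROM Regions
GROUP BY [value]
HAVING COUNT(*) > 1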
December 12, 2006 at 7:21 am
That makes more sense... too bad I don't have SQL 2k5 to test it.
December 12, 2006 at 7:26 am
If you want to mess around with it, I suggest you grab the Express version: http://msdn.microsoft.com/vstudio/express/sql/ . Lots of cool new stuff.
December 12, 2006 at 7:57 am
Thanks for the info. Will do that someday, but not today.
December 12, 2006 at 8:56 am
Whenever I see something create a temp table and then delete unwanted rows, I wonder if the final table could be generated in one step. The ROW_NUMBER function can't be used in a WHERE clause, but you can use it in the field list of a derived-table subquery and filter the row numbers in the main query.
And if you can generate the list of RegionIDs to be deleted directly, you don't need the temp table. Use the ROW_NUMBER subquery right in the DELETE statement. If the Schools_Regions table has a foreign key, it won't like the delete, but you can temporarily disable it.
ALTER TABLE Schools_Regions NOCHECK CONSTRAINT ALL

DELETE r FROM Regions r
INNER JOIN (
    SELECT RegionID, ROW_NUMBER() OVER (PARTITION BY [value] ORDER BY [value]) AS RN
    FROM dbo.Regions
) x ON r.RegionID = x.RegionID AND x.RN > 1

DELETE FROM Schools_Regions
WHERE RegionID NOT IN (SELECT RegionID FROM Regions)

ALTER TABLE Schools_Regions WITH CHECK CHECK CONSTRAINT ALL
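Re-enabling with WITH CHECK validates the surviving rows as it goes; on 2005 you can confirm the constraint came back trusted with the catalog views (a sketch):
-- is_not_trusted = 0 means the constraint was re-validated
SELECT name, is_not_trusted
FROM sys.foreign_keys
WHERE parent_object_id = OBJECT_ID('dbo.Schools_Regions')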
December 12, 2006 at 9:00 am
You're absolutely right. That looks like it would work without having to create the temp table. I wouldn't have even thought of doing it that way.
December 12, 2006 at 9:08 am
I don't understand the 2nd delete in this code. We are deleting duplicates, so 1 row will remain, which means there won't be orphaned rows. Other than that, it's a very nice idea.