January 19, 2011 at 3:42 am
Hi,
I'm trying to select specific rows from this table:
CREATE TABLE MVTS (idMvt1 int,
idMvt2 int,
idMvt3 int);
INSERT INTO MVTS
SELECT 271, 204, 136
UNION ALL
SELECT 271, 202, 136
UNION ALL
SELECT 268, 253, 204
UNION ALL
SELECT 268, 253, 202
UNION ALL
SELECT 268, 256, 204
UNION ALL
SELECT 268, 256, 202
the output expected is the first row and the fourth row
to explain the logic let's imagine we have a second table A to put in the selected id's
next w'll iterate throw MVTS :
- first row 271, 204, 136 ; none of this ids exists in A
So w'll select this row and put the 3 ids in A
- second row 271 exist already in A so w'll ignore this row
- third row 204 exists
- fourth row 268, 253, 202 none of this ids exists in A
So w'll select this row and put the 3 ids in A
in the end we should have only the first and the fourth row
I'm looking for a performant T-sql solution without Cursor and iteration because the average amount of row is by millions
thx
January 19, 2011 at 4:44 am
moadh.bs (1/19/2011)
Hi,I'm trying to select specific rows from this table:
CREATE TABLE MVTS (idMvt1 int,
idMvt2 int,
idMvt3 int);
INSERT INTO MVTS
SELECT 271, 204, 136
UNION ALL
SELECT 271, 202, 136
UNION ALL
SELECT 268, 253, 204
UNION ALL
SELECT 268, 253, 202
UNION ALL
SELECT 268, 256, 204
UNION ALL
SELECT 268, 256, 202
the output expected is the first row and the fourth row
to explain the logic let's imagine we have a second table A to put in the selected id's
next w'll iterate throw MVTS :
- first row 271, 204, 136 ; none of this ids exists in A
So w'll select this row and put the 3 ids in A
- second row 271 exist already in A so w'll ignore this row
- third row 204 exists
- fourth row 268, 253, 202 none of this ids exists in A
So w'll select this row and put the 3 ids in A
in the end we should have only the first and the fourth row
I'm looking for a performant T-sql solution without Cursor and iteration because the average amount of row is by millions
thx
Unless you put an autonumbering column on this MVTS table, there is no guarantee as to what the order of rows will be which, of course, could change the outcome.
Also, you've done a great job of explaining what you want done but not the business reason. It may help someone come up with an alternative if they know the business reason. It's also a fairly odd request and knowing the business reason for it would be good "payment" in fair trade for anyone who solves the problem. 🙂 It'll also let us know that it's not homework or a contest on another forum. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
January 19, 2011 at 5:13 am
You'r right Jeff, I will need an auto-increment column to ensure the order. But let's presume that's the case.
I tried to simplify the problem because it's a little difficult for me to explain the business reason (especially in English 😉 ). it's an accounting issue where I try to find match between entries to get a picture of a transaction.
Anyway, I'm at the point of a set of candidate of entry (the columns of MVTS), and at the end I'm gonna tag this entries matched together with a same number. That's why if an entry has been already selected, we have to ignore the others combinations including it.
this is a part of "Lettrage 1 à 2" mean combining one entry with 2 others.
I have to be able to go till combining 1 to 4.
January 19, 2011 at 5:30 am
It's an interesting problem because the supposed "set-based" methods that immediately come to mind are all going to be Triangular Joins which will be much slower than any kind of the recursion including a Cursor.
--Jeff Moden
Change is inevitable... Change for the better is not.
January 20, 2011 at 8:17 pm
I've not been able to dedicate much more time thinking about this than I originally did. Did you ever get this problem figured out?
--Jeff Moden
Change is inevitable... Change for the better is not.
January 20, 2011 at 11:32 pm
Probably a self-referencing CTE should do it! Have u tried that?
January 21, 2011 at 1:23 am
Well the solution If we use a While or a cursor is obvious, I could post this if you want. but I figured out that a hybrid solution .Net & Sql worked best for me, even if we normally prefer coding this kind of process in sql server in order to enhance the maintainability. I'm aware that CTE could perform recursive processing but I'll be curious to see how we could resolve this problem with it.
January 22, 2011 at 7:26 am
Sorry for leading you into a wrong path, i figured out a simple UNPIVOT could do what we need. From the requirement, i am guessing that you need to find the unique values in all the rows and columns into a single column. A DISTINCT coupled with UNPIVOT can do this.
Here's how:
SELECT DISTINCT ValInCol
FROM
( SELECT idMvt1,idMvt2,idMvt3 FROM MVTS ) AS PIVOT_TABLE
UNPIVOT
( ValInCol FOR ColName IN (idMvt1,idMvt2,idMvt3) ) AS PIVOT_HANDLE
I am yet to run this over a million row to find out how this scales. I will do it when time permits 🙂 Meanwhile, if u could u run this piece ( if i had understood the requirement ;-)) over a million rows and post the results, that would be awesome..
January 22, 2011 at 7:44 am
Ok, i had some time to peek into this. Here are the results:
Test Data:
USE TempDB
GO
SET NOCOUNT ON
IF OBJECT_ID(N'MVTS',N'U') IS NOT NULL
DROP TABLE MVTS
CREATE TABLE MVTS ( idMvt1 int,
idMvt2 int,
idMvt3 int);
INSERT INTO MVTS
SELECT 271, 204, 136
UNION ALL
SELECT 271, 202, 136
UNION ALL
SELECT 268, 253, 204
UNION ALL
SELECT 268, 253, 202
UNION ALL
SELECT 268, 256, 204
UNION ALL
SELECT 268, 256, 202
INSERT INTO MVTS
SELECT (( ABS(CHECKSUM(NEWID())) % 100) + 200 ) AS idMvt1,
(( ABS(CHECKSUM(NEWID())) % 100) + 100 ) AS idMvt2,
(( ABS(CHECKSUM(NEWID())) % 100) + 300 ) AS idMvt3
FROM dbo.Tally T1,dbo.Tally T2
WHERE T1.RN < 1001 AND T2.RN < 1001
Code:
SELECT DISTINCT ValInCol
FROM
( SELECT idMvt1,idMvt2,idMvt3 FROM MVTS ) AS PIVOT_TABLE
UNPIVOT
( ValInCol FOR ColName IN (idMvt1,idMvt2,idMvt3) ) AS PIVOT_HANDLE
ORDER BY ValInCol
Test Harness:
DECLARE @Bitbucket INT
SET NOCOUNT ON
SET STATISTICS TIME ON
SELECT DISTINCT @Bitbucket = ValInCol
FROM
( SELECT idMvt1,idMvt2,idMvt3 FROM MVTS ) AS PIVOT_TABLE
UNPIVOT
( ValInCol FOR ColName IN (idMvt1,idMvt2,idMvt3) ) AS PIVOT_HANDLE
ORDER BY ValInCol
SET STATISTICS TIME OFF
Results:
SQL Server Execution Times:
CPU time = 655 ms, elapsed time = 431 ms.
SQL Server Execution Times:
CPU time = 671 ms, elapsed time = 428 ms.
SQL Server Execution Times:
CPU time = 607 ms, elapsed time = 422 ms.
Hope this helps! 😎
January 22, 2011 at 9:56 am
It's not just unique values the OP is trying to find. If it was, the OP probably would have been able to do it himself.
The goal is to find rows where none of the 3 numbers in any given row has been used by any previous row regardless of which position in the row the number appears. Take a look again at the OPs original post. It's not a simple "unique" problem... 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
January 22, 2011 at 12:11 pm
How does table A look like?
N 56°04'39.16"
E 12°55'05.25"
January 22, 2011 at 6:08 pm
In the interest of seeing a SQL solution even though you have a CLR solution...
I offer this
First just the solution (which assumes you have added an "ID" column)
SELECT *
FROM mvts
WHERE EXISTS (
SELECT minid
FROM (
SELECT VALUE,MIN(id) AS minid
FROM mvts
UNPIVOT
( VALUE FOR [COLUMN] IN (idMvt1,idMvt2,idMvt3) ) upvt
GROUP BY VALUE
) g
WHERE minid = id
GROUP BY minid
HAVING COUNT(*)=3 )
ORDER BY id
And here is the full test bed with some sample results from my PC...
--= build a test bed
IF OBJECT_ID('dbo.mvts') IS NULL
CREATE TABLE mvts (id INT IDENTITY (1, 1) PRIMARY KEY, idMvt1 INT, idMvt2 INT, idMvt3 INT);
--= build the test data one column at a time
--= this assumes numbers are less than ten thousand
--= but can be anything in reality
--= this way I was sure of duplicates over 1 million rows
INSERT
mvts (idMvt1)
SELECT
N % 10000
FROM
helpers.dbo.Tally AS T -- I have 1M rows in mine.
ORDER BY
NEWID ();
--= randomly populate the second column
;WITH cte AS (SELECT TOP 100 PERCENT idMvt2,ROW_NUMBER() OVER (ORDER BY NEWID()) rownum FROM mvts ORDER BY NEWID())
UPDATE cte
SET
idMvt2= rownum % 10000;
--= randomly populate the third column
;WITH cte AS (SELECT TOP 100 PERCENT idMvt3,ROW_NUMBER() OVER (ORDER BY NEWID()) rownum FROM mvts ORDER BY NEWID())
UPDATE cte
SET
idMvt3= rownum % 10000;
--= remove any rows with duplicate numbers
DELETE
FROM
mvts
WHERE
idMvt1 = idMvt2 OR
idMvt1 = idMvt3 OR
idMvt2 = idMvt3;
--= now for the real query
--= unpivots the columns then groups by number to find the lowest id (the first occurence of each number)
--= then groups that result by id to find rows that contain three first occurrences
--= then all that remains is to use that result to pick up the "values" from the idMvt? columns in mvts
SELECT id, idMvt1, idMvt2, idMvt3
FROM mvts
WHERE EXISTS (
SELECT minid
FROM (
SELECT VALUE,MIN(id) AS minid
FROM mvts
UNPIVOT
( VALUE FOR [COLUMN] IN (idMvt1,idMvt2,idMvt3) ) upvt
GROUP BY VALUE
) g
WHERE minid = id
GROUP BY minid
HAVING COUNT(*)=3 )
ORDER BY id
-- sample results for 1M rows with numbers up to 1000
--(122 row(s) affected)
--Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
--Table 'mvts'. Scan count 5, logical reads 3819, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
--Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
-- SQL Server Execution Times:
-- CPU time = 1450 ms, elapsed time = 388 ms.
-- sample results for 1M rows with numbers up to 10000
--(1141 row(s) affected)
--Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
--Table 'mvts'. Scan count 5, logical reads 6835, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
--Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
-- SQL Server Execution Times:
-- CPU time = 2060 ms, elapsed time = 922 ms.
-- sample results for 1M rows with numbers up to 100000
--(11305 row(s) affected)
--Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
--Table 'mvts'. Scan count 5, logical reads 37327, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
--Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
-- SQL Server Execution Times:
-- CPU TIME = 5616 ms, elapsed TIME = 3518 ms.
MM
select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);
January 22, 2011 at 8:12 pm
@jeff, u are absolutely right.. i mis-read the requirement; from morning (its morning in INDIA now, as i write) i was thinking about that only.. So i sat and put together my piece of code to find the unique row . Hope this turns out to be correct
; WITH ROW_NUMBERED_LIST AS
(
SELECT ROW_NUM = ROW_NUMBER() OVER (ORDER BY (SELECT 0)) ,
idMvt1,idMvt2,idMvt3
FROM MVTS
),
Unpivoted_List AS
(
SELECT ROW_NUM , ValInCol
FROM
( SELECT ROW_NUM ,idMvt1,idMvt2,idMvt3 FROM ROW_NUMBERED_LIST ) AS PIVOT_TABLE
UNPIVOT
( ValInCol FOR ColName IN (idMvt1,idMvt2,idMvt3) ) AS PIVOT_HANDLE
--ORDER BY ROW_NUM
) ,
Ranked_List AS
(
SELECT ROW_NUM , ValInCol,
RANKING = ROW_NUMBER() OVER( PARTITION BY ValInCol ORDER BY ROW_NUM )
FROM UNPIVOTED_LIST
) ,
UNIQUE_IDS AS
(
SELECT ROW_NUM
FROM Ranked_List
GROUP BY ROW_NUM
HAVING COUNT( DISTINCT RANKING ) = 1
)
SELECT R_N_LST.ROW_NUM, R_N_LST.idMvt1,R_N_LST.idMvt2,R_N_LST.idMvt3
FROM ROW_NUMBERED_LIST R_N_LST
INNER JOIN UNIQUE_IDS UNIQ
ON R_N_LST.ROW_NUM = UNIQ.ROW_NUM
January 22, 2011 at 8:36 pm
Modified version (assumes ID column is already present in the table)
DECLARE @Bitbucket INT
SET NOCOUNT ON
SET STATISTICS TIME ON
;WITH Unpivoted_List AS
(
SELECT ROW_NUM , ValInCol
FROM
( SELECT ID AS ROW_NUM ,idMvt1,idMvt2,idMvt3 FROM MVTS ) AS PIVOT_TABLE
UNPIVOT
( ValInCol FOR ColName IN (idMvt1,idMvt2,idMvt3) ) AS PIVOT_HANDLE
--ORDER BY ROW_NUM
) ,
Ranked_List AS
(
SELECT ROW_NUM , ValInCol,
RANKING = ROW_NUMBER() OVER( PARTITION BY ValInCol ORDER BY ROW_NUM )
FROM UNPIVOTED_LIST
) ,
UNIQUE_IDS AS
(
SELECT ROW_NUM
FROM Ranked_List
GROUP BY ROW_NUM
HAVING COUNT( * ) = 3
)
SELECT @Bitbucket = R_N_LST.ID --, R_N_LST.idMvt1,R_N_LST.idMvt2,R_N_LST.idMvt3
FROM MVTS R_N_LST
INNER JOIN UNIQUE_IDS UNIQ
ON R_N_LST.ID = UNIQ.ROW_NUM
SET STATISTICS TIME OFF
January 23, 2011 at 9:48 am
mister.magoo (1/22/2011)
In the interest of seeing a SQL solution even though you have a CLR solution...I offer this
First just the solution (which assumes you have added an "ID" column)
SELECT *
FROM mvts
WHERE EXISTS (
SELECT minid
FROM (
SELECT VALUE,MIN(id) AS minid
FROM mvts
UNPIVOT
( VALUE FOR [COLUMN] IN (idMvt1,idMvt2,idMvt3) ) upvt
GROUP BY VALUE
) g
WHERE minid = id
GROUP BY minid
HAVING COUNT(*)=3 )
ORDER BY id
And here is the full test bed with some sample results from my PC...
--= build a test bed
IF OBJECT_ID('dbo.mvts') IS NULL
CREATE TABLE mvts (id INT IDENTITY (1, 1) PRIMARY KEY, idMvt1 INT, idMvt2 INT, idMvt3 INT);
--= build the test data one column at a time
--= this assumes numbers are less than ten thousand
--= but can be anything in reality
--= this way I was sure of duplicates over 1 million rows
INSERT
mvts (idMvt1)
SELECT
N % 10000
FROM
helpers.dbo.Tally AS T -- I have 1M rows in mine.
ORDER BY
NEWID ();
--= randomly populate the second column
;WITH cte AS (SELECT TOP 100 PERCENT idMvt2,ROW_NUMBER() OVER (ORDER BY NEWID()) rownum FROM mvts ORDER BY NEWID())
UPDATE cte
SET
idMvt2= rownum % 10000;
--= randomly populate the third column
;WITH cte AS (SELECT TOP 100 PERCENT idMvt3,ROW_NUMBER() OVER (ORDER BY NEWID()) rownum FROM mvts ORDER BY NEWID())
UPDATE cte
SET
idMvt3= rownum % 10000;
--= remove any rows with duplicate numbers
DELETE
FROM
mvts
WHERE
idMvt1 = idMvt2 OR
idMvt1 = idMvt3 OR
idMvt2 = idMvt3;
--= now for the real query
--= unpivots the columns then groups by number to find the lowest id (the first occurence of each number)
--= then groups that result by id to find rows that contain three first occurrences
--= then all that remains is to use that result to pick up the "values" from the idMvt? columns in mvts
SELECT id, idMvt1, idMvt2, idMvt3
FROM mvts
WHERE EXISTS (
SELECT minid
FROM (
SELECT VALUE,MIN(id) AS minid
FROM mvts
UNPIVOT
( VALUE FOR [COLUMN] IN (idMvt1,idMvt2,idMvt3) ) upvt
GROUP BY VALUE
) g
WHERE minid = id
GROUP BY minid
HAVING COUNT(*)=3 )
ORDER BY id
-- sample results for 1M rows with numbers up to 1000
--(122 row(s) affected)
--Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
--Table 'mvts'. Scan count 5, logical reads 3819, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
--Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
-- SQL Server Execution Times:
-- CPU time = 1450 ms, elapsed time = 388 ms.
-- sample results for 1M rows with numbers up to 10000
--(1141 row(s) affected)
--Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
--Table 'mvts'. Scan count 5, logical reads 6835, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
--Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
-- SQL Server Execution Times:
-- CPU time = 2060 ms, elapsed time = 922 ms.
-- sample results for 1M rows with numbers up to 100000
--(11305 row(s) affected)
--Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
--Table 'mvts'. Scan count 5, logical reads 37327, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
--Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
-- SQL Server Execution Times:
-- CPU TIME = 5616 ms, elapsed TIME = 3518 ms.
Dang... thought you might have had it, Magoo. But it doesn't work on the original data. See the comments in the data setup below...
DROP TABLE MVTS
CREATE TABLE MVTS
(
ID INT IDENTITY(1,1),
idMvt1 int,
idMvt2 int,
idMvt3 int
);
INSERT INTO MVTS
(idMvt1,idMvt2,idMvt3)
SELECT 271, 204, 136 UNION ALL --1 This will always show up because it's the "first"
SELECT 271, 202, 136 UNION ALL --2 This should NOT show because 271 and 136 is contained in 1 (which is the only "active" row at this point)
SELECT 268, 253, 204 UNION ALL --3 This should NOT show because 204 is contained in 1 (which is the only "active" row at this point)
SELECT 268, 253, 202 UNION ALL --4 This SHOULD show because even though 268 and 253 are in the previous row,
-- none of the elements live in 1 which is the only active row at this point.
SELECT 268, 256, 204 UNION ALL --5 Should NOT show because at least 1 element lives in 1 or 4 which are "active"
SELECT 268, 256, 202 --6 Should NOT show because at least 1 element lives in 4 which is "active"
--===== Magoo's code
SELECT *
FROM mvts
WHERE EXISTS (
SELECT minid
FROM (
SELECT VALUE,MIN(id) AS minid
FROM mvts
UNPIVOT
( VALUE FOR [COLUMN] IN (idMvt1,idMvt2,idMvt3) ) upvt
GROUP BY VALUE
) g
WHERE minid = id
GROUP BY minid
HAVING COUNT(*)=3 )
ORDER BY id
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 15 posts - 1 through 15 (of 75 total)
You must be logged in to reply to this topic. Login to reply