November 17, 2010 at 10:03 am
I have a test procedure that select rows based on uniqueness in a column. It is simple to process using a loop, recursive, etc. But, without using recursive table constructors, or a loop, is proving quite challenging.
Here's some sample data:
[font="Courier New"]create TABLE t_input(
c1INT,
c2 INT,
c3 INT,
sq INT IDENTITY (1,1) ) ;
insert t_input ( c1, c2, c3) values
( 1, 2, 3 ),
( 4, 1, 6),
( 1, 7, 8),
( 9, 2, 10),
(11,12, 6),
(13,14, 15),
( 9,20, 21),
(22,23, 3);[/font]
Here's the trival proc:
...
BEGIN
DECLARE zzz INT;
DECLARE cnt INT;
select cnt=count(*) FROM t_input;
set zzz := 1;
create TABLE #t1(
c1INT,
c2 INT,
c3 INT);
CREATE UNIQUE INDEX ix1 on #t1 ( c1 ) ;
CREATE UNIQUE INDEX ix2 on #t1 ( c2 ) ;
CREATE UNIQUE INDEX ix3 on #t1 ( c3 ) ;
while ( zzz <= cnt )
BEGIN
insert into #t1
select z.*
FROM t_input as z
WHERE not exists ( SELECT * FROM #t1 WHERE #t1.c1 = z.c1 ) and
not exists ( SELECT * FROM #t1 WHERE #t1.c2 = z.c2 ) and
not exists ( SELECT * FROM #t1 WHERE #t1.c3 = z.c3 )
and z.sq = zzz;
set zzz := zzz + 1;
end;
select * FROM #t1;
...
Expected output:
[font="Courier New"]c1c2c3sq
-----------------------------
1231
4162
1314156
920217
[/font]
It should: select row#1, as it is first, nothing to stop it.
select #2, the "1" in the second column is unique to column2.
discard #3, the "1" in c1 is a dupe of #1.
discard #4, the "2" in c2 is a dupe of #1.
discard #5, the "6" in c3 is a dupe of #2.
select #6
select #7, though the "9" in c1 is a dupe of #4, we discarded #4, so we dont consider it a dupe.
discard #8
I've been trying to come up with a single statement that answers this. Problem is, it must run on version back to 2K.
Closest I've come is:
select q.* FROM t_input as q
where q.sq in (
select sq FROM (
select c1, MIN(sq) as sq from t_input as t1
where not exists ( select sq FROM t_input as t2 WHERE t2.c2 = t1.c2 and t2.sq < t1.sq )
group by c1
union all
select c2, MIN(sq) as sq from t_input as t1
where not exists ( select sq FROM t_input as t2 WHERE t2.c3 = t1.c3 and t2.sq < t1.sq )
group by c2
union all
select c3, MIN(sq) as sq from t_input as t1
group by c3
) as z
group by sq having COUNT(*)=3 );
But that fails because you will be tempted to remedy this by putting in a set of “where not exists” clauses within each inner select, but this just does one more step – you will need to have “where not exists” clauses within those, ad infinitum.
Honor Super Omnia-
Jason Miller
November 17, 2010 at 10:51 am
This can be done as a set based operation. It is not trivial but this looks like a homework problem so I think you should try to work it out on your own.
The probability of survival is inversely proportional to the angle of arrival.
November 17, 2010 at 11:22 am
A few things in this make me question the proc in general.
Jason Miller-476791 (11/17/2010)
CREATE UNIQUE INDEX ix1 on #t1 ( c1 ) ;CREATE UNIQUE INDEX ix2 on #t1 ( c2 ) ;
CREATE UNIQUE INDEX ix3 on #t1 ( c3 ) ;
I realize this is a temp table but every column being unique outside of combinations seems like the data itself is nonrelational. Thus, what's the point of the data?
Moving on...
insert into #t1
select z.*
FROM t_input as z
WHERE not exists ( SELECT * FROM #t1 WHERE #t1.c1 = z.c1 ) and
not exists ( SELECT * FROM #t1 WHERE #t1.c2 = z.c2 ) and
not exists ( SELECT * FROM #t1 WHERE #t1.c3 = z.c3 )
and z.sq = zzz;
set zzz := zzz + 1;
What business rule are you trying to handle here? This is quite odd.
I've been trying to come up with a single statement that answers this. Problem is, it must run on version back to 2K.
Which can be done. The only way I know to approach it is insanely complex, but it can be done.
So, to bring us back to the beginning... what is the business purpose of this? It almost seems like the approach itself is poor, and should be rebuilt from the ground up.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
November 17, 2010 at 11:31 am
Craig Farrell (11/17/2010)
A few things in this make me question the proc in general.Jason Miller-476791 (11/17/2010)
CREATE UNIQUE INDEX ix1 on #t1 ( c1 ) ;CREATE UNIQUE INDEX ix2 on #t1 ( c2 ) ;
CREATE UNIQUE INDEX ix3 on #t1 ( c3 ) ;
What business rule are you trying to handle here? This is quite odd.
Basically the columns are order numbers that come in from different streams. This process removes dupes for a matching process.
Craig Farrell (11/17/2010)It almost seems like the approach itself is poor, and should be rebuilt from the ground up.
I tend to agree with you. Unfortunately, I don't have that luxury. I am given the set and told it takes too long (believe it or not, my solution there is much quicker).
Honor Super Omnia-
Jason Miller
November 17, 2010 at 12:28 pm
Alright, another business logic question, then.
The way you're approaching it, with this dataset:
( 1, 2, 3 ),
( 4, 1, 6),
( 1, 7, 8),
( 9, 2, 10),
(11,12, 6),
(13,14, 15),
( 9,20, 21),
(22,23, 3);
The '1' in Rows one and two repeat in different columns. This is acceptable I assume.
However, the 7 in column two will never be reported, it's completely removed from the result set and will never appear. This is desired?
I'm still curious why these are all on the same row if they've got no relation in the first place.
Anyway, the method to deal with this in your way in a single rowset would be like the query that follows. Please note, this result is still a hair off, just presenting the core of the methodology. The problem is that in your data Row 1 excludes row 4 from the second column. However, this process doesn't realize that directly, so row 4 excludes row 8 (9,20,21).
To deal with that, you'd have to include the logic of the process before it into the subselect under it. It won't be pretty, and if I find some more spare time I'll try to include that, but it'll get a bit wild. I want to make sure I have the rest of the assumptions accurate.
SELECT
t2.*
FROM
#t_input AS t2
JOIN
(SELECT
i1,
MIN( z) AS MinRow
from
#t_input AS t
GROUP BY
i1
) AS i1drv
ONt2.i1 = i1drv.i1
AND t2.z = i1drv.Minrow
JOIN
(SELECT
i2,
MIN( z) AS MinRow
from
#t_input AS t
GROUP BY
i2
) AS i2drv
ONt2.i2 = i2drv.i2
AND t2.z = i2drv.Minrow
JOIN
(SELECT
i3,
MIN( z) AS MinRow
from
#t_input AS t
GROUP BY
i3
) AS i3drv
ONt2.i3 = i3drv.i3
AND t2.z = i3drv.Minrow
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
November 17, 2010 at 12:53 pm
>The '1' in Rows one and two repeat in different columns. This is acceptable I assume.
Yes that is fine.
>However, the 7 in column two will never be reported, it's completely removed from the result set and will never appear. This is desired?
Yes, that is what I'm told.
>I'm still curious why these are all on the same row if they've got no relation in the first place.
There is a reason from the market streams, but I dont quite know that much of the puzzle.
>Anyway, ...To deal with that, you'd have to include the logic of the process before it into the subselect under it.
I think that'll fail because you will be tempted to remedy this by putting in sets where/join clauses within each inner select, but this just does one more step – you will need to have clauses within those, ad infinitum.
(I have to get the most mileage out of that statement, I owe credit to the author jw)
The process is trivial by eye, but this is happening on a larger scale... 10-20 million rows every hour. And this is just one small part of a larger process.
Honor Super Omnia-
Jason Miller
November 17, 2010 at 1:40 pm
CELKO (11/17/2010)
How about this version?SELECT * FROM ThreeStreams;
1231
4162
NULL 783
9NULL 104
1112NULL 5
1314156
NULL 20217
2223NULL 8
I just nulled out the later dups in each column, based on the sequence number
Close.
The 7th row is excluded/nulled. It should not be. The "9" in the first column of row 7 is acceptable because we are not considering row 4 due to its exclusion on the duplication of "2" on column 2.
Honor Super Omnia-
Jason Miller
November 17, 2010 at 3:57 pm
Jason Miller-476791 (11/17/2010)
I think that'll fail because you will be tempted to remedy this by putting in sets where/join clauses within each inner select, but this just does one more step – you will need to have clauses within those, ad infinitum.(I have to get the most mileage out of that statement, I owe credit to the author jw)
I believe you're correct. The requirements to this pretty much fall into the 1% of actual procedural loop work. Each next line must be computed based on the efforts of all the work done previously, with each next row possibly using a different dataset, which breaks the rules. The very definition of when you need iterative processing.
Someone may have a tricksy way out of this, but I don't see it. If it wasn't for those cross column rule modifications I have a way to deal with this, but not bi directionally where colA affects ColB results and ColB affects ColA's results.
Sorry. Good luck with this one.
I have an idea with the serial update, but I can't remember the mechanics of a previous row's editing of a self join's effect, if the updates are re-read on a per serial component evaluation. It'd basically just RBAR for you, but you would have less control. In this case, I wouldn't suggest doing that, but it might be a fun intellectual exercise.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
November 18, 2010 at 4:31 am
Not very efficient and SQL Server 2005/2008 only
WITH CTE AS (
SELECT c1,
c2,
c3,
ROW_NUMBER() OVER(PARTITION BY c1 ORDER BY sq) AS c1_rn,
ROW_NUMBER() OVER(PARTITION BY c2 ORDER BY sq) AS c2_rn,
ROW_NUMBER() OVER(PARTITION BY c3 ORDER BY sq) AS c3_rn,
sq
FROM t_input)
SELECT x.sq,
x.c1,
x.c2,
x.c3
FROM CTE x
WHERE NOT EXISTS (SELECT * FROM CTE y1
WHERE y1.sq<x.sq
AND y1.c1_rn=1 AND y1.c2_rn=1 AND y1.c3_rn=1
AND (y1.c1=x.c1 OR y1.c2=x.c2 OR y1.c3=x.c3))
ORDER BY x.sq;
____________________________________________________
Deja View - The strange feeling that somewhere, sometime you've optimised this query before
How to get the best help on a forum
http://www.sqlservercentral.com/articles/Best+Practices/61537Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply