April 29, 2011 at 3:14 pm
create table orders(
oid int,
item int
)
insert into orders (oid, itemid)
values (1,1111)
insert into orders (oid, itemid)
values (1,2222)
insert into orders (oid, itemid)
values (1,3333)
insert into orders (oid, itemid)
values (2,1111)
insert into orders (oid, itemid)
values (2,2222)
insert into orders (oid, itemid)
values (3,1111)
insert into orders (oid, itemid)
values (3,2222)
insert into orders (oid, itemid)
values (3,3333)
insert into orders (oid, itemid)
values (4,2222)
insert into orders (oid, itemid)
values (4,4444)
insert into orders (oid, itemid)
values (4,5555)
insert into orders (oid, itemid)
values (4,6666)
I'm trying to figure out how to write a query that would return oid 1 and oid 3 only as those are the only two orders that have exactly the same items on the order.
The specific items and the number of different items on the order will be unknown ahead of time and the goal is to find only orders that exactly match the same item list as other orders. Qty of each item doesn't matter.
Ideally, it would be nice to specify the number of distinct items that should match as a minimum (IE: all orders that are exactly the same and have at least 3 items, all orders that are exactly the same and have at least 6 items, etc.)
April 29, 2011 at 3:48 pm
Easiest way I know to do this is a dirty trick using FOR XML.
IF OBJECT_ID ( 'tempdb..#orders') is not null
drop table #orders
create table #orders(
oid int,
itemid int
)
insert into #orders (oid, itemid)
values (1,1111)
insert into #orders (oid, itemid)
values (1,2222)
insert into #orders (oid, itemid)
values (1,3333)
insert into #orders (oid, itemid)
values (2,1111)
insert into #orders (oid, itemid)
values (2,2222)
insert into #orders (oid, itemid)
values (3,1111)
insert into #orders (oid, itemid)
values (3,2222)
insert into #orders (oid, itemid)
values (3,3333)
insert into #orders (oid, itemid)
values (4,2222)
insert into #orders (oid, itemid)
values (4,4444)
insert into #orders (oid, itemid)
values (4,5555)
insert into #orders (oid, itemid)
values (4,6666)
select * from #orders
;WITH cte AS (
select DISTINCT
OID,
(SELECT CONVERT( VARCHAR(100), itemID) + ',' FROM #orders AS o2 WHERE o2.oid = o1.oid ORDER BY o2.ItemID FOR XML PATH('')) AS xmlItems
FROM
#orders AS o1
)
SELECT
c1.oid,
c2.oid
FROM
cte AS c1
JOIN
cte AS c2
ONc1.xmlItems = c2.xmlItems
WHERE
c1.oid <> c2.oid
AND c1.oid < c2.oid
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
April 30, 2011 at 2:29 am
removed. doesn't work under a different test condition...
April 30, 2011 at 4:27 am
LutzM (4/30/2011)
removed. doesn't work under a different test condition...
Well, Lutz, if it makes you feel any better, neither does mine. I forgot the order by in the correlated subquery for XML. It needs (between the WHERE and the FOR) ORDER BY ItemID.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
April 30, 2011 at 11:06 am
How accurate does it have to be, i.e. can it occasionally return false positives? Something like this can be very fast even for large results, but it is possible to get two checksums that are the same while the values differnet (though not the reverse):
With cks(OID, Cksum) as
( select OID, Checksum_agg(Itemid) as CKsum
from orders
group by OID
)
select c1.oid, 'Matches =>', c2.oid
from cks c1
inner join cks c2 on c1.cksum=c2.cksum and c1.OID<c2.OID
oid(No column name)oid
1Matches =>3
April 30, 2011 at 3:58 pm
Ferguson (4/30/2011)
How accurate does it have to be, i.e. can it occasionally return false positives? Something like this can be very fast even for large results, but it is possible to get two checksums that are the same while the values differnet (though not the reverse):With cks(OID, Cksum) as
( select OID, Checksum_agg(Itemid) as CKsum
from orders
group by OID
)
select c1.oid, 'Matches =>', c2.oid
from cks c1
inner join cks c2 on c1.cksum=c2.cksum and c1.OID<c2.OID
oid(No column name)oid
1Matches =>3
This is very nice! To catch any false positives, the output of the this CTE could perhaps be used as input to Craig Farrell's query (including the order by clause of course!) - so it would be fast and there would be no false positives.
I didn't know about checksum_agg function. Is there an alphabetical list of all SQL functions somewhere?
May 1, 2011 at 10:52 am
Oh, be real careful, folks. CheckSum_Agg produces duplicates far more than you would ever imagine. Let's do a "little" test. Let's build a million rows of data with approximately 100,000 OID's and approximately 1,000 different items.
/****************************************************************************************
Purpose:
Create a voluminous test table with various types of highly randomized data.
--Jeff Moden
****************************************************************************************/
--===== Conditionally drop the test table to make reruns easier
IF OBJECT_ID('dbo.JBMTest','U') IS NOT NULL
DROP TABLE dbo.JBMTest
;
--===== Create and populate a 1,000,000 row test table.
-- "SomeID" has a range of 1 to 1,000,000 unique numbers
-- "OID" has a range of 1 to 100,000 non-unique numbers
-- "ItemID" has a range of 1 to 1000
SELECT TOP 1000000
SomeID = IDENTITY(INT,1,1),
OID = ABS(CHECKSUM(NEWID()))%100000+1,
ItemID = ABS(CHECKSUM(NEWID()))%1000+1
INTO dbo.JBMTest
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
;
--===== Add a typical Primary Key.
ALTER TABLE dbo.JBMTest
ADD CONSTRAINT PK_JBMTest PRIMARY KEY CLUSTERED (SomeID)
;
Now, using a slightly modified version of the CheckSum_Agg code, let's list the OID and the CheckSum_Agg for each OID...
With cks(OID, Cksum) as
( select OID, Checksum_agg(Itemid) as CKsum
from dbo.JBMTest
group by OID
)
SELECT * FROM cks ORDER BY CkSum, OID
;
In theory, any two OID's that have the same CkSum value should have the same ItemId's. Pick any two OID's that have the same CkSum in the list above and put them in the "IN" portion of the following code and see how often that's simply not the case...
SELECT ItemId, OID
FROM dbo.JBMTest
WHERE OID IN (161,283) --<===== Change these two numbers to two OIDs having the same CkSum
ORDER BY ItemID, OID
;
--Jeff Moden
Change is inevitable... Change for the better is not.
May 1, 2011 at 12:07 pm
CELKO (4/30/2011)
...but this should be pretty fast.
It's not. On a lousy million rows, your code will generate 500,000,000,000 internal rows and take a month of Sundays to run because you built in a nasty bit of procedural code known as a Triangular Join which is as bad as a "Bubble Sort". That's once you fix the spelling errors in the code which keeps it from running as is.
Before you go whacking on people's naming conventions and all the hooie about UPC codes, you need to clean up your act first.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 2, 2011 at 1:32 pm
Todd,
This should give you what you want. It gives each order and its items for only orders that have the same items.
; WITH CTEItems AS
( SELECT oid, itemid
, COUNT(*) OVER (PARTITION BY oid) AS ItemCount
FROM orders
)
--SELECT * FROM CTEItems
SELECT C1.oid, C1.itemid
FROM CTEItems C1
INNER JOIN CTEItems C2 ON
C1.ItemCount = C2.ItemCount
AND C1.itemid = C2.itemid
AND C1.oid <> C2.oid
Todd Fifield
May 2, 2011 at 2:06 pm
tfifield (5/2/2011)
Todd,This should give you what you want. It gives each order and its items for only orders that have the same items.
Todd, that can get fouled up by slightly different item lists.
IE: oid 1 with 111, 222, 333 and oid 2 with 111, 222, and 444.
You'll have the same counts, and your join will still return 2 of the 3 rows, possibly showing poor data results.
To do it the way you're describing, you'd need to nearly crossjoin the table to itself on itemids, take the count of THAT result per oid pairing, then return to the table and compare the counts of the pair to each independent piece, confirming they all match still. It gets nasty fast.
It would end up looking something like this:
;WITH cte AS (
SELECT
o1.oid AS oid1,
o2.oid AS oid2,
COUNT(*) AS cntForPair
FROM
#orders AS o1
JOIN
#orders AS o2
ONo1.oid <> o2.oid
AND o1.itemID = o2.itemID
GROUP BY
o1.oid,
o2.oid
),
OrderAgg AS (
SELECT
oid,
COUNT(*) AS oCount
FROM
#orders
GROUP BY
oid
)
--select * from orderagg
SELECT
cte.*
FROM
cte
JOIN
orderAgg AS oa1
ONcte.oid1 = oa1.oid
JOIN
orderAgg AS oa2
ONcte.oid2 = oa2.oid
WHERE
cte.cntForPair = oa1.oCount
AND cte.cntForPair = oa2.oCount
AND cte.oid1 < cte.oid2
EDIT: This has been going longer then I expected it, so I edited my original answer post to include the ORDER BY to avoid issues.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
May 3, 2011 at 6:10 am
See this topic. We hammered the subjet to death...
Using Adventureworks
SELECTx.PurchaseOrderID AS LowID,
y.PurchaseOrderID AS HighID,
MIN(x.ProductItems) AS ProductItems
INTO#Stage
FROM(
SELECTPurchaseOrderID,
ProductID,
COUNT(*) OVER (PARTITION BY PurchaseOrderID) AS ProductItems
FROMPurchasing.PurchaseOrderDetail
) AS x
INNER JOIN(
SELECTPurchaseOrderID,
ProductID,
COUNT(*) OVER (PARTITION BY PurchaseOrderID) AS ProductItems
FROMPurchasing.PurchaseOrderDetail
) AS y ON y.ProductID = x.ProductID
WHEREx.PurchaseOrderID < y.PurchaseOrderID
GROUP BYx.PurchaseOrderID,
y.PurchaseOrderID
HAVINGCOUNT(*) = MIN(x.ProductItems)
AND COUNT(*) = MIN(y.ProductItems)
CREATE INDEX IX_Low ON #Stage (LowID) INCLUDE (HighID, ProductItems)
CREATE INDEX IX_High ON #Stage (HighID) INCLUDE (LowID, ProductItems)
;WITH cteDuplicate
AS (
SELECTs.LowID AS theGrp,
s.LowID AS PurchaseOrderID,
MIN(s.ProductItems) AS ProductItems
,1 S SeqIDFROM#Stage AS s
WHERENOT EXISTS (SELECT * FROM #Stage AS x WHERE x.HighID = s.LowID)
GROUP BYs.LowID
UNION ALL
SELECTd.theGrp,
f.ID AS PurchaseOrderID,
d.ProductItems
,d.SeqID + 1 AS SeqIDFROMcteDuplicate AS d
CROSS APPLY(
SELECTs.HighID,
ROW_NUMBER() OVER (ORDER BY s.HighID) AS SeqID
FROM#Stage AS s
WHEREs.LowID = d.PurchaseOrderID
) AS f(ID, SeqID)
WHEREf.SeqID = 1
)
SELECTDENSE_RANK() OVER (ORDER BY theGrp) AS theGroup,
ProductItems,
SeqID,
COUNT(*) OVER (PARTITION BY theGrp) AS PurchaseOrderItems,
PurchaseOrderID
INTO#Temp
FROMcteDuplicate
OPTION(MAXRECURSION 0)
DROP TABLE#Stage
SELECTw.ProductItems,
STUFF(w.Data, 1, 2, '') AS Products,
w.PurchaseOrderItems,
STUFF(f.Data, 1, 2, '') AS PurchaseOrders
FROM(
SELECTt.theGroup,
t.ProductItems,
(
SELECT', ' + CAST(pod.ProductID AS VARCHAR(12))
FROMPurchasing.PurchaseOrderDetail AS pod
WHEREpod.PurchaseOrderID = t.PurchaseOrderID
ORDER BYpod.ProductID
FOR XMLPATH('')
) AS Data,
t.PurchaseOrderItems
FROM#Temp AS t
WHEREt.SeqID = 1
) AS w
CROSS APPLY(
SELECT', ' + CAST(t.PurchaseOrderID AS VARCHAR(12))
FROM#Temp AS t
WHEREt.theGroup = w.theGroup
ORDER BYt.PurchaseOrderID
FOR XMLPATH('')
) AS f(Data)
ORDER BYw.ProductItems,
w.PurchaseOrderItems
DROP TABLE#Temp
N 56°04'39.16"
E 12°55'05.25"
May 3, 2011 at 6:55 am
Here's a basic idea that works with the sample data that was supplied...
CREATE FUNCTION dbo.fnOrders
(
@oid int
)
RETURNS nvarchar(4000)
AS BEGIN
DECLARE @orders_out nvarchar(4000)
SELECT @orders_out = ISNULL(@orders_out,'') + ISNULL(CAST(itemid AS nvarchar),'') + ';' FROM orders WHERE oid = @oid
RETURN @orders_out
END
GO
...and then use that function in this query:
SELECT DISTINCT
oid,
dbo.fnOrders(oid) AS orders_out
INTO
#temp
FROM
orders
SELECT
oid,
orders_out
FROM
#temp
WHERE
orders_out IN
(
SELECT orders_out
FROM #temp
GROUP BY orders_out
HAVING COUNT(*) > 1
)
DROP TABLE #temp
May 3, 2011 at 7:59 am
Here's another version using APPLY without any sneaky tricks.
DROP table #orders
create table #orders(
oid int,
itemid int
)
insert into #orders (oid, itemid)
values (1,1111)
insert into #orders (oid, itemid)
values (1,2222)
insert into #orders (oid, itemid)
values (1,3333)
insert into #orders (oid, itemid)
values (2,1111)
insert into #orders (oid, itemid)
values (2,2222)
insert into #orders (oid, itemid)
values (3,1111)
insert into #orders (oid, itemid)
values (3,2232)
insert into #orders (oid, itemid)
values (3,3333)
insert into #orders (oid, itemid)
values (4,2222)
insert into #orders (oid, itemid)
values (4,4444)
insert into #orders (oid, itemid)
values (4,5555)
insert into #orders (oid, itemid)
values (4,6666)
insert into #orders (oid, itemid)
values (5,1111)
insert into #orders (oid, itemid)
values (5,2232)
insert into #orders (oid, itemid)
values (5,3333)
SELECT oid, MatchingOid
FROM (
SELECT o1.oid, o1.itemid, o1.ItemRows, MatchingOid = o2.oid, o3.ItemMatch
FROM (SELECT oid, itemid, ItemRows = COUNT(*) OVER (PARTITION BY oid) FROM #orders) o1
INNER JOIN (SELECT oid, ItemRows = COUNT(*) FROM #orders GROUP BY oid) o2 ON o1.ItemRows = o2.ItemRows AND o1.oid <> o2.oid
CROSS APPLY (SELECT ItemMatch = COUNT(*) FROM #orders WHERE oid = o2.oid AND itemid = o1.itemid) o3
) d
GROUP BY oid, MatchingOid
HAVING MAX(ItemRows) = SUM(ItemMatch)
Run the inner select to see how it works.
For better assistance in answering your questions, please read this[/url].
Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]
May 3, 2011 at 6:28 pm
ChrisM@home (5/3/2011)
Here's another version using APPLY without any sneaky tricks.
Chris, run an estimated execution plan of your code against the following 10,000 rows of data and look at the row counts for the arrows...
/****************************************************************************************
Purpose:
Create a voluminous test table with various types of highly randomized data.
--Jeff Moden
****************************************************************************************/
--===== Conditionally drop the test table to make reruns easier
IF OBJECT_ID('tempdb..#orders','U') IS NOT NULL
DROP TABLE #orders
;
--===== Create and populate a 10,000 row test table.
-- "SomeID" has a range of 1 to 10,000 unique numbers
-- "OID" has a range of 1 to 100,000 non-unique numbers
-- "ItemID" has a range of 1 to 1000
SELECT TOP 10000
SomeID = IDENTITY(INT,1,1),
OID = ABS(CHECKSUM(NEWID()))%100000+1,
ItemID = ABS(CHECKSUM(NEWID()))%1000+1
INTO #orders
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
;
--===== Add a typical Primary Key.
ALTER TABLE #orders
ADD PRIMARY KEY CLUSTERED (SomeID)
;
--Jeff Moden
Change is inevitable... Change for the better is not.
May 3, 2011 at 6:48 pm
tfifield (5/2/2011)
Todd,This should give you what you want. It gives each order and its items for only orders that have the same items.
; WITH CTEItems AS
( SELECT oid, itemid
, COUNT(*) OVER (PARTITION BY oid) AS ItemCount
FROM orders
)
--SELECT * FROM CTEItems
SELECT C1.oid, C1.itemid
FROM CTEItems C1
INNER JOIN CTEItems C2 ON
C1.ItemCount = C2.ItemCount
AND C1.itemid = C2.itemid
AND C1.oid <> C2.oid
Todd Fifield
Try it with more data and an ORDER BY ItemId, OID and see what happens when you use the following data, Todd.
/****************************************************************************************
Purpose:
Create a voluminous test table with various types of highly randomized data.
--Jeff Moden
****************************************************************************************/
--===== Conditionally drop the test table to make reruns easier
IF OBJECT_ID('orders','U') IS NOT NULL
DROP TABLE orders
;
--===== Create and populate a 10,000 row test table.
-- "SomeID" has a range of 1 to 10,000 unique numbers
-- "OID" has a range of 1 to 100,000 non-unique numbers
-- "ItemID" has a range of 1 to 1000
SELECT TOP 10000
SomeID = IDENTITY(INT,1,1),
OID = ABS(CHECKSUM(NEWID()))%100000+1,
ItemID = ABS(CHECKSUM(NEWID()))%1000+1
INTO orders
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
;
--===== Add a typical Primary Key.
ALTER TABLE orders
ADD PRIMARY KEY CLUSTERED (SomeID)
;
Yes, I agree. A DISTINCT will fix that. Then, try it against just a million rows and see what happens. 😛 Here's the code to gen a million rows...
/****************************************************************************************
Purpose:
Create a voluminous test table with various types of highly randomized data.
--Jeff Moden
****************************************************************************************/
--===== Conditionally drop the test table to make reruns easier
IF OBJECT_ID('orders','U') IS NOT NULL
DROP TABLE orders
;
--===== Create and populate a 1,000,000 row test table.
-- "SomeID" has a range of 1 to 1,000,000 unique numbers
-- "OID" has a range of 1 to 100,000 non-unique numbers
-- "ItemID" has a range of 1 to 1000
SELECT TOP 1000000
SomeID = IDENTITY(INT,1,1),
OID = ABS(CHECKSUM(NEWID()))%100000+1,
ItemID = ABS(CHECKSUM(NEWID()))%1000+1
INTO orders
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
;
--===== Add a typical Primary Key.
ALTER TABLE orders
ADD PRIMARY KEY CLUSTERED (SomeID)
;
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 15 posts - 1 through 15 (of 28 total)
You must be logged in to reply to this topic. Login to reply