Find all orders that have EXACTLY the same items

Question

Post reply

Find all orders that have EXACTLY the same items

Todd Mathews

Grasshopper

Points: 22
More actions
April 29, 2011 at 3:14 pm

#236634

create table orders(
oid int,
item int
)
insert into orders (oid, itemid)
values (1,1111)
insert into orders (oid, itemid)
values (1,2222)
insert into orders (oid, itemid)
values (1,3333)
insert into orders (oid, itemid)
values (2,1111)
insert into orders (oid, itemid)
values (2,2222)
insert into orders (oid, itemid)
values (3,1111)
insert into orders (oid, itemid)
values (3,2222)
insert into orders (oid, itemid)
values (3,3333)
insert into orders (oid, itemid)
values (4,2222)
insert into orders (oid, itemid)
values (4,4444)
insert into orders (oid, itemid)
values (4,5555)
insert into orders (oid, itemid)
values (4,6666)
I'm trying to figure out how to write a query that would return oid 1 and oid 3 only as those are the only two orders that have exactly the same items on the order.
The specific items and the number of different items on the order will be unknown ahead of time and the goal is to find only orders that exactly match the same item list as other orders. Qty of each item doesn't matter.
Ideally, it would be nice to specify the number of distinct items that should match as a minimum (IE: all orders that are exactly the same and have at least 3 items, all orders that are exactly the same and have at least 6 items, etc.)

Viewing 15 posts - 1 through 15 (of 28 total)

You must be logged in to reply to this topic. Login to reply

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 1

Easiest way I know to do this is a dirty trick using FOR XML.

IF OBJECT_ID ( 'tempdb..#orders') is not null

drop table #orders

create table #orders(

oid int,

itemid int

)

insert into #orders (oid, itemid)

values (1,1111)

insert into #orders (oid, itemid)

values (1,2222)

insert into #orders (oid, itemid)

values (1,3333)

insert into #orders (oid, itemid)

values (2,1111)

insert into #orders (oid, itemid)

values (2,2222)

insert into #orders (oid, itemid)

values (3,1111)

insert into #orders (oid, itemid)

values (3,2222)

insert into #orders (oid, itemid)

values (3,3333)

insert into #orders (oid, itemid)

values (4,2222)

insert into #orders (oid, itemid)

values (4,4444)

insert into #orders (oid, itemid)

values (4,5555)

insert into #orders (oid, itemid)

values (4,6666)

select * from #orders

;WITH cte AS (

select DISTINCT

OID,

(SELECT CONVERT( VARCHAR(100), itemID) + ',' FROM #orders AS o2 WHERE o2.oid = o1.oid ORDER BY o2.ItemID FOR XML PATH('')) AS xmlItems

FROM

#orders AS o1

)

SELECT

c1.oid,

c2.oid

FROM

cte AS c1

JOIN

cte AS c2

ONc1.xmlItems = c2.xmlItems

WHERE

c1.oid <> c2.oid

AND c1.oid < c2.oid

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

LutzM SSC Guru Points: 107049 More actions · Answer 2

removed. doesn't work under a different test condition...

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 3

LutzM (4/30/2011)
removed. doesn't work under a different test condition...

Well, Lutz, if it makes you feel any better, neither does mine. I forgot the order by in the correlated subquery for XML. It needs (between the WHERE and the FOR) ORDER BY ItemID.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

Ferguson SSCertifiable Points: 5837 More actions · Answer 4

How accurate does it have to be, i.e. can it occasionally return false positives? Something like this can be very fast even for large results, but it is possible to get two checksums that are the same while the values differnet (though not the reverse):

With cks(OID, Cksum) as

( select OID, Checksum_agg(Itemid) as CKsum

from orders

group by OID

)

select c1.oid, 'Matches =>', c2.oid

from cks c1

inner join cks c2 on c1.cksum=c2.cksum and c1.OID<c2.OID

oid(No column name)oid

1Matches =>3

SunitaBeck SSC Veteran Points: 277 More actions · Answer 5

Ferguson (4/30/2011)
How accurate does it have to be, i.e. can it occasionally return false positives? Something like this can be very fast even for large results, but it is possible to get two checksums that are the same while the values differnet (though not the reverse):
With cks(OID, Cksum) as
( select OID, Checksum_agg(Itemid) as CKsum
from orders
group by OID
)
select c1.oid, 'Matches =>', c2.oid
from cks c1
inner join cks c2 on c1.cksum=c2.cksum and c1.OID<c2.OID
oid(No column name)oid
1Matches =>3

This is very nice! To catch any false positives, the output of the this CTE could perhaps be used as input to Craig Farrell's query (including the order by clause of course!) - so it would be fast and there would be no false positives.

I didn't know about checksum_agg function. Is there an alphabetical list of all SQL functions somewhere?

Jeff Moden SSC Guru Points: 1004441 More actions · Answer 6

Oh, be real careful, folks. CheckSum_Agg produces duplicates far more than you would ever imagine. Let's do a "little" test. Let's build a million rows of data with approximately 100,000 OID's and approximately 1,000 different items.

/****************************************************************************************

Purpose:

Create a voluminous test table with various types of highly randomized data.

--Jeff Moden

****************************************************************************************/

--===== Conditionally drop the test table to make reruns easier

IF OBJECT_ID('dbo.JBMTest','U') IS NOT NULL

DROP TABLE dbo.JBMTest

;

--===== Create and populate a 1,000,000 row test table.

-- "SomeID" has a range of 1 to 1,000,000 unique numbers

-- "OID" has a range of 1 to 100,000 non-unique numbers

-- "ItemID" has a range of 1 to 1000

SELECT TOP 1000000

SomeID = IDENTITY(INT,1,1),

OID = ABS(CHECKSUM(NEWID()))%100000+1,

ItemID = ABS(CHECKSUM(NEWID()))%1000+1

INTO dbo.JBMTest

FROM sys.all_columns ac1

CROSS JOIN sys.all_columns ac2

;

--===== Add a typical Primary Key.

ALTER TABLE dbo.JBMTest

ADD CONSTRAINT PK_JBMTest PRIMARY KEY CLUSTERED (SomeID)

;

Now, using a slightly modified version of the CheckSum_Agg code, let's list the OID and the CheckSum_Agg for each OID...

With cks(OID, Cksum) as

( select OID, Checksum_agg(Itemid) as CKsum

from dbo.JBMTest

group by OID

)

SELECT * FROM cks ORDER BY CkSum, OID

;

In theory, any two OID's that have the same CkSum value should have the same ItemId's. Pick any two OID's that have the same CkSum in the list above and put them in the "IN" portion of the following code and see how often that's simply not the case...

SELECT ItemId, OID

FROM dbo.JBMTest

WHERE OID IN (161,283) --<===== Change these two numbers to two OIDs having the same CkSum

ORDER BY ItemID, OID

;

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004441 More actions · Answer 7

CELKO (4/30/2011)
...but this should be pretty fast.

It's not. On a lousy million rows, your code will generate 500,000,000,000 internal rows and take a month of Sundays to run because you built in a nasty bit of procedural code known as a Triangular Join which is as bad as a "Bubble Sort". That's once you fix the spelling errors in the code which keeps it from running as is.

Before you go whacking on people's naming conventions and all the hooie about UPC codes, you need to clean up your act first.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

tfifield SSCrazy Eights Points: 9655 More actions · Answer 8

Todd,

This should give you what you want. It gives each order and its items for only orders that have the same items.

; WITH CTEItems AS

( SELECT oid, itemid

, COUNT(*) OVER (PARTITION BY oid) AS ItemCount

FROM orders

)

--SELECT * FROM CTEItems

SELECT C1.oid, C1.itemid

FROM CTEItems C1

INNER JOIN CTEItems C2 ON

C1.ItemCount = C2.ItemCount

AND C1.itemid = C2.itemid

AND C1.oid <> C2.oid

Todd Fifield

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 9

tfifield (5/2/2011)
Todd,
This should give you what you want. It gives each order and its items for only orders that have the same items.

Todd, that can get fouled up by slightly different item lists.

IE: oid 1 with 111, 222, 333 and oid 2 with 111, 222, and 444.

You'll have the same counts, and your join will still return 2 of the 3 rows, possibly showing poor data results.

To do it the way you're describing, you'd need to nearly crossjoin the table to itself on itemids, take the count of THAT result per oid pairing, then return to the table and compare the counts of the pair to each independent piece, confirming they all match still. It gets nasty fast.

It would end up looking something like this:

;WITH cte AS (

SELECT

o1.oid AS oid1,

o2.oid AS oid2,

COUNT(*) AS cntForPair

FROM

#orders AS o1

JOIN

#orders AS o2

ONo1.oid <> o2.oid

AND o1.itemID = o2.itemID

GROUP BY

o1.oid,

o2.oid

),

OrderAgg AS (

SELECT

oid,

COUNT(*) AS oCount

FROM

#orders

GROUP BY

oid

)

--select * from orderagg

SELECT

cte.*

FROM

cte

JOIN

orderAgg AS oa1

ONcte.oid1 = oa1.oid

JOIN

orderAgg AS oa2

ONcte.oid2 = oa2.oid

WHERE

cte.cntForPair = oa1.oCount

AND cte.cntForPair = oa2.oCount

AND cte.oid1 < cte.oid2

EDIT: This has been going longer then I expected it, so I edited my original answer post to include the ORDER BY to avoid issues.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

SwePeso SSC-Dedicated Points: 39757 More actions · Answer 10

See this topic. We hammered the subjet to death...

http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/c4d4f72c-35a7-43f9-8dbc-5517a14de5f2

Using Adventureworks

SELECTx.PurchaseOrderID AS LowID,

y.PurchaseOrderID AS HighID,

MIN(x.ProductItems) AS ProductItems

INTO#Stage

FROM(

SELECTPurchaseOrderID,

ProductID,

COUNT(*) OVER (PARTITION BY PurchaseOrderID) AS ProductItems

FROMPurchasing.PurchaseOrderDetail

) AS x

INNER JOIN(

SELECTPurchaseOrderID,

ProductID,

COUNT(*) OVER (PARTITION BY PurchaseOrderID) AS ProductItems

FROMPurchasing.PurchaseOrderDetail

) AS y ON y.ProductID = x.ProductID

WHEREx.PurchaseOrderID < y.PurchaseOrderID

GROUP BYx.PurchaseOrderID,

y.PurchaseOrderID

HAVINGCOUNT(*) = MIN(x.ProductItems)

AND COUNT(*) = MIN(y.ProductItems)

CREATE INDEX IX_Low ON #Stage (LowID) INCLUDE (HighID, ProductItems)

CREATE INDEX IX_High ON #Stage (HighID) INCLUDE (LowID, ProductItems)

;WITH cteDuplicate

AS (

SELECTs.LowID AS theGrp,

s.LowID AS PurchaseOrderID,

MIN(s.ProductItems) AS ProductItems

,1 S SeqIDFROM#Stage AS s

WHERENOT EXISTS (SELECT * FROM #Stage AS x WHERE x.HighID = s.LowID)

GROUP BYs.LowID

UNION ALL

SELECTd.theGrp,

f.ID AS PurchaseOrderID,

d.ProductItems

,d.SeqID + 1 AS SeqIDFROMcteDuplicate AS d

CROSS APPLY(

SELECTs.HighID,

ROW_NUMBER() OVER (ORDER BY s.HighID) AS SeqID

FROM#Stage AS s

WHEREs.LowID = d.PurchaseOrderID

) AS f(ID, SeqID)

WHEREf.SeqID = 1

)

SELECTDENSE_RANK() OVER (ORDER BY theGrp) AS theGroup,

ProductItems,

SeqID,

COUNT(*) OVER (PARTITION BY theGrp) AS PurchaseOrderItems,

PurchaseOrderID

INTO#Temp

FROMcteDuplicate

OPTION(MAXRECURSION 0)

DROP TABLE#Stage

SELECTw.ProductItems,

STUFF(w.Data, 1, 2, '') AS Products,

w.PurchaseOrderItems,

STUFF(f.Data, 1, 2, '') AS PurchaseOrders

FROM(

SELECTt.theGroup,

t.ProductItems,

(

SELECT', ' + CAST(pod.ProductID AS VARCHAR(12))

FROMPurchasing.PurchaseOrderDetail AS pod

WHEREpod.PurchaseOrderID = t.PurchaseOrderID

ORDER BYpod.ProductID

FOR XMLPATH('')

) AS Data,

t.PurchaseOrderItems

FROM#Temp AS t

WHEREt.SeqID = 1

) AS w

CROSS APPLY(

SELECT', ' + CAST(t.PurchaseOrderID AS VARCHAR(12))

FROM#Temp AS t

WHEREt.theGroup = w.theGroup

ORDER BYt.PurchaseOrderID

FOR XMLPATH('')

) AS f(Data)

ORDER BYw.ProductItems,

w.PurchaseOrderItems

DROP TABLE#Temp

N 56°04'39.16"
E 12°55'05.25"

WILLIAM MITCHELL SSChampion Points: 13766 More actions · Answer 11

Here's a basic idea that works with the sample data that was supplied...

CREATE FUNCTION dbo.fnOrders

(

@oid int

)

RETURNS nvarchar(4000)

AS BEGIN

DECLARE @orders_out nvarchar(4000)

SELECT @orders_out = ISNULL(@orders_out,'') + ISNULL(CAST(itemid AS nvarchar),'') + ';' FROM orders WHERE oid = @oid

RETURN @orders_out

END

GO

...and then use that function in this query:

SELECT DISTINCT

oid,

dbo.fnOrders(oid) AS orders_out

INTO

#temp

FROM

orders

SELECT

oid,

orders_out

FROM

#temp

WHERE

orders_out IN

(

SELECT orders_out

FROM #temp

GROUP BY orders_out

HAVING COUNT(*) > 1

)

DROP TABLE #temp

ChrisM@home SSC-Insane Points: 24260 More actions · Answer 12

Here's another version using APPLY without any sneaky tricks.

DROP table #orders

create table #orders(

oid int,

itemid int

)

insert into #orders (oid, itemid)

values (1,1111)

insert into #orders (oid, itemid)

values (1,2222)

insert into #orders (oid, itemid)

values (1,3333)

insert into #orders (oid, itemid)

values (2,1111)

insert into #orders (oid, itemid)

values (2,2222)

insert into #orders (oid, itemid)

values (3,1111)

insert into #orders (oid, itemid)

values (3,2232)

insert into #orders (oid, itemid)

values (3,3333)

insert into #orders (oid, itemid)

values (4,2222)

insert into #orders (oid, itemid)

values (4,4444)

insert into #orders (oid, itemid)

values (4,5555)

insert into #orders (oid, itemid)

values (4,6666)

insert into #orders (oid, itemid)

values (5,1111)

insert into #orders (oid, itemid)

values (5,2232)

insert into #orders (oid, itemid)

values (5,3333)

SELECT oid, MatchingOid

FROM (

SELECT o1.oid, o1.itemid, o1.ItemRows, MatchingOid = o2.oid, o3.ItemMatch

FROM (SELECT oid, itemid, ItemRows = COUNT(*) OVER (PARTITION BY oid) FROM #orders) o1

INNER JOIN (SELECT oid, ItemRows = COUNT(*) FROM #orders GROUP BY oid) o2 ON o1.ItemRows = o2.ItemRows AND o1.oid <> o2.oid

CROSS APPLY (SELECT ItemMatch = COUNT(*) FROM #orders WHERE oid = o2.oid AND itemid = o1.itemid) o3

) d

GROUP BY oid, MatchingOid

HAVING MAX(ItemRows) = SUM(ItemMatch)

Run the inner select to see how it works.

[font="Arial"]^{Low-hanging fruit picker and defender of the moggies}[/font]

For better assistance in answering your questions, please read this[/url].

Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

Jeff Moden SSC Guru Points: 1004441 More actions · Answer 13

ChrisM@home (5/3/2011)
Here's another version using APPLY without any sneaky tricks.

Chris, run an estimated execution plan of your code against the following 10,000 rows of data and look at the row counts for the arrows...

/****************************************************************************************

Purpose:

Create a voluminous test table with various types of highly randomized data.

--Jeff Moden

****************************************************************************************/

--===== Conditionally drop the test table to make reruns easier

IF OBJECT_ID('tempdb..#orders','U') IS NOT NULL

DROP TABLE #orders

;

--===== Create and populate a 10,000 row test table.

-- "SomeID" has a range of 1 to 10,000 unique numbers

-- "OID" has a range of 1 to 100,000 non-unique numbers

-- "ItemID" has a range of 1 to 1000

SELECT TOP 10000

SomeID = IDENTITY(INT,1,1),

OID = ABS(CHECKSUM(NEWID()))%100000+1,

ItemID = ABS(CHECKSUM(NEWID()))%1000+1

INTO #orders

FROM sys.all_columns ac1

CROSS JOIN sys.all_columns ac2

;

--===== Add a typical Primary Key.

ALTER TABLE #orders

ADD PRIMARY KEY CLUSTERED (SomeID)

;

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004441 More actions · Answer 14

tfifield (5/2/2011)
Todd,
This should give you what you want. It gives each order and its items for only orders that have the same items.
; WITH CTEItems AS
( SELECT oid, itemid
, COUNT(*) OVER (PARTITION BY oid) AS ItemCount
FROM orders
)
--SELECT * FROM CTEItems
SELECT C1.oid, C1.itemid
FROM CTEItems C1
INNER JOIN CTEItems C2 ON
C1.ItemCount = C2.ItemCount
AND C1.itemid = C2.itemid
AND C1.oid <> C2.oid
Todd Fifield

Try it with more data and an ORDER BY ItemId, OID and see what happens when you use the following data, Todd.

/****************************************************************************************

Purpose:

Create a voluminous test table with various types of highly randomized data.

--Jeff Moden

****************************************************************************************/

--===== Conditionally drop the test table to make reruns easier

IF OBJECT_ID('orders','U') IS NOT NULL

DROP TABLE orders

;

--===== Create and populate a 10,000 row test table.

-- "SomeID" has a range of 1 to 10,000 unique numbers

-- "OID" has a range of 1 to 100,000 non-unique numbers

-- "ItemID" has a range of 1 to 1000

SELECT TOP 10000

SomeID = IDENTITY(INT,1,1),

OID = ABS(CHECKSUM(NEWID()))%100000+1,

ItemID = ABS(CHECKSUM(NEWID()))%1000+1

INTO orders

FROM sys.all_columns ac1

CROSS JOIN sys.all_columns ac2

;

--===== Add a typical Primary Key.

ALTER TABLE orders

ADD PRIMARY KEY CLUSTERED (SomeID)

;

Yes, I agree. A DISTINCT will fix that. Then, try it against just a million rows and see what happens. 😛 Here's the code to gen a million rows...

/****************************************************************************************

Purpose:

Create a voluminous test table with various types of highly randomized data.

--Jeff Moden

****************************************************************************************/

--===== Conditionally drop the test table to make reruns easier

IF OBJECT_ID('orders','U') IS NOT NULL

DROP TABLE orders

;

--===== Create and populate a 1,000,000 row test table.

-- "SomeID" has a range of 1 to 1,000,000 unique numbers

-- "OID" has a range of 1 to 100,000 non-unique numbers

-- "ItemID" has a range of 1 to 1000

SELECT TOP 1000000

SomeID = IDENTITY(INT,1,1),

OID = ABS(CHECKSUM(NEWID()))%100000+1,

ItemID = ABS(CHECKSUM(NEWID()))%1000+1

INTO orders

FROM sys.all_columns ac1

CROSS JOIN sys.all_columns ac2

;

--===== Add a typical Primary Key.

ALTER TABLE orders

ADD PRIMARY KEY CLUSTERED (SomeID)

;

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)