Find all orders that have EXACTLY the same items

  • create table orders(

    oid int,

    item int

    )

    insert into orders (oid, itemid)

    values (1,1111)

    insert into orders (oid, itemid)

    values (1,2222)

    insert into orders (oid, itemid)

    values (1,3333)

    insert into orders (oid, itemid)

    values (2,1111)

    insert into orders (oid, itemid)

    values (2,2222)

    insert into orders (oid, itemid)

    values (3,1111)

    insert into orders (oid, itemid)

    values (3,2222)

    insert into orders (oid, itemid)

    values (3,3333)

    insert into orders (oid, itemid)

    values (4,2222)

    insert into orders (oid, itemid)

    values (4,4444)

    insert into orders (oid, itemid)

    values (4,5555)

    insert into orders (oid, itemid)

    values (4,6666)

    I'm trying to figure out how to write a query that would return oid 1 and oid 3 only as those are the only two orders that have exactly the same items on the order.

    The specific items and the number of different items on the order will be unknown ahead of time and the goal is to find only orders that exactly match the same item list as other orders. Qty of each item doesn't matter.

    Ideally, it would be nice to specify the number of distinct items that should match as a minimum (IE: all orders that are exactly the same and have at least 3 items, all orders that are exactly the same and have at least 6 items, etc.)

  • Easiest way I know to do this is a dirty trick using FOR XML.

    IF OBJECT_ID ( 'tempdb..#orders') is not null

    drop table #orders

    create table #orders(

    oid int,

    itemid int

    )

    insert into #orders (oid, itemid)

    values (1,1111)

    insert into #orders (oid, itemid)

    values (1,2222)

    insert into #orders (oid, itemid)

    values (1,3333)

    insert into #orders (oid, itemid)

    values (2,1111)

    insert into #orders (oid, itemid)

    values (2,2222)

    insert into #orders (oid, itemid)

    values (3,1111)

    insert into #orders (oid, itemid)

    values (3,2222)

    insert into #orders (oid, itemid)

    values (3,3333)

    insert into #orders (oid, itemid)

    values (4,2222)

    insert into #orders (oid, itemid)

    values (4,4444)

    insert into #orders (oid, itemid)

    values (4,5555)

    insert into #orders (oid, itemid)

    values (4,6666)

    select * from #orders

    ;WITH cte AS (

    select DISTINCT

    OID,

    (SELECT CONVERT( VARCHAR(100), itemID) + ',' FROM #orders AS o2 WHERE o2.oid = o1.oid ORDER BY o2.ItemID FOR XML PATH('')) AS xmlItems

    FROM

    #orders AS o1

    )

    SELECT

    c1.oid,

    c2.oid

    FROM

    cte AS c1

    JOIN

    cte AS c2

    ONc1.xmlItems = c2.xmlItems

    WHERE

    c1.oid <> c2.oid

    AND c1.oid < c2.oid


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • removed. doesn't work under a different test condition...



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • LutzM (4/30/2011)


    removed. doesn't work under a different test condition...

    Well, Lutz, if it makes you feel any better, neither does mine. I forgot the order by in the correlated subquery for XML. It needs (between the WHERE and the FOR) ORDER BY ItemID.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • How accurate does it have to be, i.e. can it occasionally return false positives? Something like this can be very fast even for large results, but it is possible to get two checksums that are the same while the values differnet (though not the reverse):

    With cks(OID, Cksum) as

    ( select OID, Checksum_agg(Itemid) as CKsum

    from orders

    group by OID

    )

    select c1.oid, 'Matches =>', c2.oid

    from cks c1

    inner join cks c2 on c1.cksum=c2.cksum and c1.OID<c2.OID

    oid(No column name)oid

    1Matches =>3

  • Ferguson (4/30/2011)


    How accurate does it have to be, i.e. can it occasionally return false positives? Something like this can be very fast even for large results, but it is possible to get two checksums that are the same while the values differnet (though not the reverse):

    With cks(OID, Cksum) as

    ( select OID, Checksum_agg(Itemid) as CKsum

    from orders

    group by OID

    )

    select c1.oid, 'Matches =>', c2.oid

    from cks c1

    inner join cks c2 on c1.cksum=c2.cksum and c1.OID<c2.OID

    oid(No column name)oid

    1Matches =>3

    This is very nice! To catch any false positives, the output of the this CTE could perhaps be used as input to Craig Farrell's query (including the order by clause of course!) - so it would be fast and there would be no false positives.

    I didn't know about checksum_agg function. Is there an alphabetical list of all SQL functions somewhere?

  • Oh, be real careful, folks. CheckSum_Agg produces duplicates far more than you would ever imagine. Let's do a "little" test. Let's build a million rows of data with approximately 100,000 OID's and approximately 1,000 different items.

    /****************************************************************************************

    Purpose:

    Create a voluminous test table with various types of highly randomized data.

    --Jeff Moden

    ****************************************************************************************/

    --===== Conditionally drop the test table to make reruns easier

    IF OBJECT_ID('dbo.JBMTest','U') IS NOT NULL

    DROP TABLE dbo.JBMTest

    ;

    --===== Create and populate a 1,000,000 row test table.

    -- "SomeID" has a range of 1 to 1,000,000 unique numbers

    -- "OID" has a range of 1 to 100,000 non-unique numbers

    -- "ItemID" has a range of 1 to 1000

    SELECT TOP 1000000

    SomeID = IDENTITY(INT,1,1),

    OID = ABS(CHECKSUM(NEWID()))%100000+1,

    ItemID = ABS(CHECKSUM(NEWID()))%1000+1

    INTO dbo.JBMTest

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    ;

    --===== Add a typical Primary Key.

    ALTER TABLE dbo.JBMTest

    ADD CONSTRAINT PK_JBMTest PRIMARY KEY CLUSTERED (SomeID)

    ;

    Now, using a slightly modified version of the CheckSum_Agg code, let's list the OID and the CheckSum_Agg for each OID...

    With cks(OID, Cksum) as

    ( select OID, Checksum_agg(Itemid) as CKsum

    from dbo.JBMTest

    group by OID

    )

    SELECT * FROM cks ORDER BY CkSum, OID

    ;

    In theory, any two OID's that have the same CkSum value should have the same ItemId's. Pick any two OID's that have the same CkSum in the list above and put them in the "IN" portion of the following code and see how often that's simply not the case...

    SELECT ItemId, OID

    FROM dbo.JBMTest

    WHERE OID IN (161,283) --<===== Change these two numbers to two OIDs having the same CkSum

    ORDER BY ItemID, OID

    ;

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • CELKO (4/30/2011)


    ...but this should be pretty fast.

    It's not. On a lousy million rows, your code will generate 500,000,000,000 internal rows and take a month of Sundays to run because you built in a nasty bit of procedural code known as a Triangular Join which is as bad as a "Bubble Sort". That's once you fix the spelling errors in the code which keeps it from running as is.

    Before you go whacking on people's naming conventions and all the hooie about UPC codes, you need to clean up your act first.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Todd,

    This should give you what you want. It gives each order and its items for only orders that have the same items.

    ; WITH CTEItems AS

    ( SELECT oid, itemid

    , COUNT(*) OVER (PARTITION BY oid) AS ItemCount

    FROM orders

    )

    --SELECT * FROM CTEItems

    SELECT C1.oid, C1.itemid

    FROM CTEItems C1

    INNER JOIN CTEItems C2 ON

    C1.ItemCount = C2.ItemCount

    AND C1.itemid = C2.itemid

    AND C1.oid <> C2.oid

    Todd Fifield

  • tfifield (5/2/2011)


    Todd,

    This should give you what you want. It gives each order and its items for only orders that have the same items.

    Todd, that can get fouled up by slightly different item lists.

    IE: oid 1 with 111, 222, 333 and oid 2 with 111, 222, and 444.

    You'll have the same counts, and your join will still return 2 of the 3 rows, possibly showing poor data results.

    To do it the way you're describing, you'd need to nearly crossjoin the table to itself on itemids, take the count of THAT result per oid pairing, then return to the table and compare the counts of the pair to each independent piece, confirming they all match still. It gets nasty fast.

    It would end up looking something like this:

    ;WITH cte AS (

    SELECT

    o1.oid AS oid1,

    o2.oid AS oid2,

    COUNT(*) AS cntForPair

    FROM

    #orders AS o1

    JOIN

    #orders AS o2

    ONo1.oid <> o2.oid

    AND o1.itemID = o2.itemID

    GROUP BY

    o1.oid,

    o2.oid

    ),

    OrderAgg AS (

    SELECT

    oid,

    COUNT(*) AS oCount

    FROM

    #orders

    GROUP BY

    oid

    )

    --select * from orderagg

    SELECT

    cte.*

    FROM

    cte

    JOIN

    orderAgg AS oa1

    ONcte.oid1 = oa1.oid

    JOIN

    orderAgg AS oa2

    ONcte.oid2 = oa2.oid

    WHERE

    cte.cntForPair = oa1.oCount

    AND cte.cntForPair = oa2.oCount

    AND cte.oid1 < cte.oid2

    EDIT: This has been going longer then I expected it, so I edited my original answer post to include the ORDER BY to avoid issues.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • See this topic. We hammered the subjet to death...

    http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/c4d4f72c-35a7-43f9-8dbc-5517a14de5f2

    Using Adventureworks

    SELECTx.PurchaseOrderID AS LowID,

    y.PurchaseOrderID AS HighID,

    MIN(x.ProductItems) AS ProductItems

    INTO#Stage

    FROM(

    SELECTPurchaseOrderID,

    ProductID,

    COUNT(*) OVER (PARTITION BY PurchaseOrderID) AS ProductItems

    FROMPurchasing.PurchaseOrderDetail

    ) AS x

    INNER JOIN(

    SELECTPurchaseOrderID,

    ProductID,

    COUNT(*) OVER (PARTITION BY PurchaseOrderID) AS ProductItems

    FROMPurchasing.PurchaseOrderDetail

    ) AS y ON y.ProductID = x.ProductID

    WHEREx.PurchaseOrderID < y.PurchaseOrderID

    GROUP BYx.PurchaseOrderID,

    y.PurchaseOrderID

    HAVINGCOUNT(*) = MIN(x.ProductItems)

    AND COUNT(*) = MIN(y.ProductItems)

    CREATE INDEX IX_Low ON #Stage (LowID) INCLUDE (HighID, ProductItems)

    CREATE INDEX IX_High ON #Stage (HighID) INCLUDE (LowID, ProductItems)

    ;WITH cteDuplicate

    AS (

    SELECTs.LowID AS theGrp,

    s.LowID AS PurchaseOrderID,

    MIN(s.ProductItems) AS ProductItems

    ,1 S SeqIDFROM#Stage AS s

    WHERENOT EXISTS (SELECT * FROM #Stage AS x WHERE x.HighID = s.LowID)

    GROUP BYs.LowID

    UNION ALL

    SELECTd.theGrp,

    f.ID AS PurchaseOrderID,

    d.ProductItems

    ,d.SeqID + 1 AS SeqIDFROMcteDuplicate AS d

    CROSS APPLY(

    SELECTs.HighID,

    ROW_NUMBER() OVER (ORDER BY s.HighID) AS SeqID

    FROM#Stage AS s

    WHEREs.LowID = d.PurchaseOrderID

    ) AS f(ID, SeqID)

    WHEREf.SeqID = 1

    )

    SELECTDENSE_RANK() OVER (ORDER BY theGrp) AS theGroup,

    ProductItems,

    SeqID,

    COUNT(*) OVER (PARTITION BY theGrp) AS PurchaseOrderItems,

    PurchaseOrderID

    INTO#Temp

    FROMcteDuplicate

    OPTION(MAXRECURSION 0)

    DROP TABLE#Stage

    SELECTw.ProductItems,

    STUFF(w.Data, 1, 2, '') AS Products,

    w.PurchaseOrderItems,

    STUFF(f.Data, 1, 2, '') AS PurchaseOrders

    FROM(

    SELECTt.theGroup,

    t.ProductItems,

    (

    SELECT', ' + CAST(pod.ProductID AS VARCHAR(12))

    FROMPurchasing.PurchaseOrderDetail AS pod

    WHEREpod.PurchaseOrderID = t.PurchaseOrderID

    ORDER BYpod.ProductID

    FOR XMLPATH('')

    ) AS Data,

    t.PurchaseOrderItems

    FROM#Temp AS t

    WHEREt.SeqID = 1

    ) AS w

    CROSS APPLY(

    SELECT', ' + CAST(t.PurchaseOrderID AS VARCHAR(12))

    FROM#Temp AS t

    WHEREt.theGroup = w.theGroup

    ORDER BYt.PurchaseOrderID

    FOR XMLPATH('')

    ) AS f(Data)

    ORDER BYw.ProductItems,

    w.PurchaseOrderItems

    DROP TABLE#Temp


    N 56°04'39.16"
    E 12°55'05.25"

  • Here's a basic idea that works with the sample data that was supplied...

    CREATE FUNCTION dbo.fnOrders

    (

    @oid int

    )

    RETURNS nvarchar(4000)

    AS BEGIN

    DECLARE @orders_out nvarchar(4000)

    SELECT @orders_out = ISNULL(@orders_out,'') + ISNULL(CAST(itemid AS nvarchar),'') + ';' FROM orders WHERE oid = @oid

    RETURN @orders_out

    END

    GO

    ...and then use that function in this query:

    SELECT DISTINCT

    oid,

    dbo.fnOrders(oid) AS orders_out

    INTO

    #temp

    FROM

    orders

    SELECT

    oid,

    orders_out

    FROM

    #temp

    WHERE

    orders_out IN

    (

    SELECT orders_out

    FROM #temp

    GROUP BY orders_out

    HAVING COUNT(*) > 1

    )

    DROP TABLE #temp

  • Here's another version using APPLY without any sneaky tricks.

    DROP table #orders

    create table #orders(

    oid int,

    itemid int

    )

    insert into #orders (oid, itemid)

    values (1,1111)

    insert into #orders (oid, itemid)

    values (1,2222)

    insert into #orders (oid, itemid)

    values (1,3333)

    insert into #orders (oid, itemid)

    values (2,1111)

    insert into #orders (oid, itemid)

    values (2,2222)

    insert into #orders (oid, itemid)

    values (3,1111)

    insert into #orders (oid, itemid)

    values (3,2232)

    insert into #orders (oid, itemid)

    values (3,3333)

    insert into #orders (oid, itemid)

    values (4,2222)

    insert into #orders (oid, itemid)

    values (4,4444)

    insert into #orders (oid, itemid)

    values (4,5555)

    insert into #orders (oid, itemid)

    values (4,6666)

    insert into #orders (oid, itemid)

    values (5,1111)

    insert into #orders (oid, itemid)

    values (5,2232)

    insert into #orders (oid, itemid)

    values (5,3333)

    SELECT oid, MatchingOid

    FROM (

    SELECT o1.oid, o1.itemid, o1.ItemRows, MatchingOid = o2.oid, o3.ItemMatch

    FROM (SELECT oid, itemid, ItemRows = COUNT(*) OVER (PARTITION BY oid) FROM #orders) o1

    INNER JOIN (SELECT oid, ItemRows = COUNT(*) FROM #orders GROUP BY oid) o2 ON o1.ItemRows = o2.ItemRows AND o1.oid <> o2.oid

    CROSS APPLY (SELECT ItemMatch = COUNT(*) FROM #orders WHERE oid = o2.oid AND itemid = o1.itemid) o3

    ) d

    GROUP BY oid, MatchingOid

    HAVING MAX(ItemRows) = SUM(ItemMatch)

    Run the inner select to see how it works.


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

  • ChrisM@home (5/3/2011)


    Here's another version using APPLY without any sneaky tricks.

    Chris, run an estimated execution plan of your code against the following 10,000 rows of data and look at the row counts for the arrows...

    /****************************************************************************************

    Purpose:

    Create a voluminous test table with various types of highly randomized data.

    --Jeff Moden

    ****************************************************************************************/

    --===== Conditionally drop the test table to make reruns easier

    IF OBJECT_ID('tempdb..#orders','U') IS NOT NULL

    DROP TABLE #orders

    ;

    --===== Create and populate a 10,000 row test table.

    -- "SomeID" has a range of 1 to 10,000 unique numbers

    -- "OID" has a range of 1 to 100,000 non-unique numbers

    -- "ItemID" has a range of 1 to 1000

    SELECT TOP 10000

    SomeID = IDENTITY(INT,1,1),

    OID = ABS(CHECKSUM(NEWID()))%100000+1,

    ItemID = ABS(CHECKSUM(NEWID()))%1000+1

    INTO #orders

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    ;

    --===== Add a typical Primary Key.

    ALTER TABLE #orders

    ADD PRIMARY KEY CLUSTERED (SomeID)

    ;

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • tfifield (5/2/2011)


    Todd,

    This should give you what you want. It gives each order and its items for only orders that have the same items.

    ; WITH CTEItems AS

    ( SELECT oid, itemid

    , COUNT(*) OVER (PARTITION BY oid) AS ItemCount

    FROM orders

    )

    --SELECT * FROM CTEItems

    SELECT C1.oid, C1.itemid

    FROM CTEItems C1

    INNER JOIN CTEItems C2 ON

    C1.ItemCount = C2.ItemCount

    AND C1.itemid = C2.itemid

    AND C1.oid <> C2.oid

    Todd Fifield

    Try it with more data and an ORDER BY ItemId, OID and see what happens when you use the following data, Todd.

    /****************************************************************************************

    Purpose:

    Create a voluminous test table with various types of highly randomized data.

    --Jeff Moden

    ****************************************************************************************/

    --===== Conditionally drop the test table to make reruns easier

    IF OBJECT_ID('orders','U') IS NOT NULL

    DROP TABLE orders

    ;

    --===== Create and populate a 10,000 row test table.

    -- "SomeID" has a range of 1 to 10,000 unique numbers

    -- "OID" has a range of 1 to 100,000 non-unique numbers

    -- "ItemID" has a range of 1 to 1000

    SELECT TOP 10000

    SomeID = IDENTITY(INT,1,1),

    OID = ABS(CHECKSUM(NEWID()))%100000+1,

    ItemID = ABS(CHECKSUM(NEWID()))%1000+1

    INTO orders

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    ;

    --===== Add a typical Primary Key.

    ALTER TABLE orders

    ADD PRIMARY KEY CLUSTERED (SomeID)

    ;

    Yes, I agree. A DISTINCT will fix that. Then, try it against just a million rows and see what happens. 😛 Here's the code to gen a million rows...

    /****************************************************************************************

    Purpose:

    Create a voluminous test table with various types of highly randomized data.

    --Jeff Moden

    ****************************************************************************************/

    --===== Conditionally drop the test table to make reruns easier

    IF OBJECT_ID('orders','U') IS NOT NULL

    DROP TABLE orders

    ;

    --===== Create and populate a 1,000,000 row test table.

    -- "SomeID" has a range of 1 to 1,000,000 unique numbers

    -- "OID" has a range of 1 to 100,000 non-unique numbers

    -- "ItemID" has a range of 1 to 1000

    SELECT TOP 1000000

    SomeID = IDENTITY(INT,1,1),

    OID = ABS(CHECKSUM(NEWID()))%100000+1,

    ItemID = ABS(CHECKSUM(NEWID()))%1000+1

    INTO orders

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    ;

    --===== Add a typical Primary Key.

    ALTER TABLE orders

    ADD PRIMARY KEY CLUSTERED (SomeID)

    ;

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 28 total)

You must be logged in to reply to this topic. Login to reply