preventing duplication in inner join

  • I have two tables that I want to join: One that contains items for sale, and one that contains items to buy. I want to join these tables on a common ProductID number column that identifies the product for sale or purchase. Here is a sample query:

    SELECT * FROM BuyListings INNER JOIN SellListings ON BuyListings.ProductID=SellListings.ProductID

    For any given product, there are a certain number of items for sale, and a certain number of items to buy. I want to uniquely match each sale item from the SellListings table to an item in the BuyListings table.

    SAMPLE DATA:

    BuyListings Table

    BuyListingID    ProductID

    1                  1

    2                  1

    3                  1

    4                  2

    5                  3

    SellListings Table

    SellListingID     ProductID

    10                  1

    11                  1

    12                  3

    12                  3

    Desired Result Set

    BuyListingID     SellListingID        ProductID

    1                    10                    1

    2                    11                    1

    5                    12                    3

    To clarify, for each ProductID, if there is both a BuyListing and SellListing available, I want to join them in the result set.  I want to remove that BuyListing and SellListing so they can't be put into the result set again. Then if there is a BuyListing and SellListing still available, repeat the process. If (for a single ProductID) there are more BuyListings then SellListings (or vice versa) the extras are ignored.

    Notice that BuyListingID 2 does not join to SellListingID 10 because it is already used in the result set. Also, BuyListingID 3 does not join to SellListingID 10 or 11, because there are already BuyListings joined to those SellListings.

  • SELECT DISTINCT BuyListingID, SellListingID, ProductID 

    FROM BuyListings INNER JOIN SellListings ON BuyListings.ProductID=SellListings.ProductID

    HTH

     


    * Noel

  • Gee....it's that simple?

    Thanks for the help!

  • Uh-oh, that didn't work. Because DISTINCT only eliminates duplicate rows, I still get the following result set (based on the sample data above):

    BuyListingID   SellListingID    ProductID

    1                 10                 1

    1                 11                 1

    2                 10                 1

    2                 11                 1

    3                 10                 1

    3                 11                 1

    5                 12                 3

    5                 13                 3

  • pls give schema and some records so that we can make out what ur asking for

    <a href="http://www.websolsoftware.com"> For IT jobs click here</a>

    *Sukhoi*[font="Arial Narrow"][/font]

  • how are buyinglisting and selllisting dependant on each other ?

    <a href="http://www.websolsoftware.com"> For IT jobs click here</a>

    *Sukhoi*[font="Arial Narrow"][/font]

  • I suspect you'll need to use a cursor to accomplish this in sql since ( if I've understood what you want :S ) you need to continually monitor the results set. Looks like something that might be better done in code, but o/w a cursor and a temp table should suffice.

    jon

  • Hi!

    Notice that BuyListingID 2 does not join to SellListingID 10 because it is already used in the result set. Also, BuyListingID 3 does not join to SellListingID 10 or 11, because there are already BuyListings joined to those SellListings

    What you want is simple,

    It is not INNER JOIN but combination EXISTS (semi-join) with TOP 1 :

    SELECT

        BL.BuyListingId

        ,SellListingId = (

          SELECT TOP 1

            SellListingId FROM SellListings WHERE ProductId = BL.ProductId

        )

        ,ProductId

      FROM

        BuyListings BL

      WHERE

        EXISTS(

          SELECT * FROM SellListings WHERE ProductId = BL.ProductId

        )

    I can't see a practical purpose in such statement - but get!

    Good Luck!

     

  • Cursor is IMHO not necessary... and if the BuyListingID and SellListingID would always be precisely as they are in the example - meaning that the ID is identity column with increment 1 and all listings of a certain product follow immediately one after another - then it could be doable even without temporary tables. However, if it can happen, that buy listing with ID 1 is for product A, ID 2 for product B and ID 3 again for product A, then temporary table is needed, to order the tables by product and assign sequence numbers correctly - if this is the case, we can discuss it later, because solution is basically the same, just with pre-ordering the rows. BTW, I suppose, that the 2 identical rows in table SellListings are typing error and SellListingID really identifies the row.

    Solution is in marking each line with certain product (in both tables) with numbers from 1 to x and then join both tables on 2 columns - productID, and this additional Sequence number column.

    To get the right result from supplied values, following query works fine:

    SELECT BuyListingID, SellListingID, buy.productID

    FROM

    (SELECT *, BuyListingID + 1 - (select min(BuyListingID) from BuyListings where productID = bl.productID) AS seq_no

    FROM BuyListings bl) AS buy

    JOIN

    (SELECT *, SellListingID + 1 - (select min(SellListingID) from SellListings where productID = bl.productID) AS seq_no

    FROM SellListings bl) AS sell

    ON buy.productID = sell.productID AND buy.seq_no = sell.seq_no

    HTH, Vladan

    PS: I am aware that this query is far from perfect and most probably would need some ORDER BY clauses in the subqueries... but I'm at work, so my time is limited and I just wanted to show the principial idea behind solution quickly, using supplied sample. We can look at the problem in greater detail later, when we know more about the real structure of data.

  • I have a similar task, where I'm matching two lists and want the best matches but can have only one unique match for an item from either list.  You don't say whether you have any criteria to choose one match over another, but I will assume there is something (most recent posting, sale/asking price?) that you can put into an expression for ranking.

    Create a temp table, create unique indexes on BuyListingID and SellListingID using WITH IGNORE_DUP_KEY.  Join the two tables to enumerate every possible match, and insert the results in the temp table ordered by match score.  Only row will be saved for any BuyListingID or SellListingID, the others are ignored.  I've found this technique to be much simpler than trying to create a query to identify and delete all the duplicates.

    CREATE TABLE #Results (

       BuyListingID INT NOT NULL,

       SellListingID INT NOT NULL,

       ProductID INT NOT NULL)

    CREATE UNIQUE NONCLUSTERED INDEX IX_Results_Buy ON #Results (BuyListingID) WITH IGNORE_DUP_KEY

    CREATE UNIQUE NONCLUSTERED INDEX IX_Results_Sell ON #Results (SellListingID) WITH IGNORE_DUP_KEY

    INSERT INTO #Results (BuyListingID, SellListingID, ProductID)

    SELECT BuyListingID, SellListingID, b.ProductID

    FROM BuyListings b

    INNER JOIN SellListings s ON b.ProductID = s.ProductID

    ORDER BY <ranking expression>

    You'll get a warning message about duplicates ignored, but the results will be what you're looking for.

  • Hello,

    I got to look at the problem again and realized, that - if I understand everything correctly - this query should work even if the products are not ordered, as I mentioned with the first solution... and there is no need for temporary tables, inserts and similar.

    SELECT BuyListingID, SellListingID, buy.productID

    FROM

    (SELECT *, (select count(*) from BuyListings where productID = bl.productID AND BuyListingID <= bl.BuyListingID) AS seq_no FROM BuyListings bl) AS buy

    JOIN

    (SELECT *, (select count(*) from SellListings where productID = sl.productID AND SellListingID <= sl.SellListingID) AS seq_no FROM SellListings sl) AS sell

    ON buy.productID = sell.productID AND buy.seq_no = sell.seq_no

    This query should always return correct results... e.g. with this example, where the first one fails:

    create table BuyListings (BuyListingID int, ProductID int)

    insert into BuyListings values (1,1)

    insert into BuyListings values (2,2)

    insert into BuyListings values (3,1)

    insert into BuyListings values (4,1)

    insert into BuyListings values (5,3)

    insert into BuyListings values (6,3)

    create table SellListings (SellListingID int, ProductID int)

    insert into SellListings values (10,1)

    insert into SellListings values (11,1)

    insert into SellListings values (12,3)

    Result:

    BuyListingID SellListingID productID  

    ------------ ------------- -----------

               1            10           1

               3            11           1

               5            12           3

    I also apologize for the confusing aliases in my previous attempt - it worked OK, as they were in different select statements, but giving identical alias to two different tables within one complex query is not the best practice :-).

  • if u could clarify the other way that u need

    it based on product id it would have been simpler to understand

    productid buylistingid selllistingid

     

    becoz u say other way problem looks confused

     

     

    <a href="http://www.websolsoftware.com"> For IT jobs click here</a>

    *Sukhoi*[font="Arial Narrow"][/font]

  • I agree with Vladan's approach.  For a more complex case, however, the ListingID's won't be in

    continuous sequence.  For this technique to work, however, they at least have to unique in each table. 

    1A. The first intermediate result calculates the sequence number for each BuyListing.

    You can save this in a #temp table, or use the SQL as a sub-query (see below step #3).

      SELECT ROWB.BuyListingID, ROWB.ProductID, COUNT(ALLB.BuyListingID) AS seq_no

      FROM BuyListings AS ROWB,

           BuyListings AS ALLB

      WHERE ALLB.ProductID = ROWB.BuyListingID

        AND ALLB.BuyListingID <= ROWB.BuyListingID

      GROUP BY ROWB.ProductID, ROWB.BuyListingID

    1B. Intermediate result set #1:

      BuyListingID    ProductID seq_no

      1                  1        1

      2                  1        2

      3                  1        3

      4                  2        1

      5                  3        1

    2A. Repeat exactly for the SellListing table:

     

      SELECT ROWS.SellListingID, ROWS.ProductID, COUNT(ALLS.SellListingID) AS seq_no

      FROM SellListings AS ROWS,

           SellListings AS ALLS

      WHERE ALLS.ProductID = ROWS.SellListingID

        AND ALLS.SellListingID <= ROWS.SellListingID

      GROUP BY ROWS.ProductID, ROWS.SellListingID

    2B. Intermediate result set #2:

      SellListingID    ProductID seq_no

      10                  1       1

      11                  1       2

      12                  3       1

    3A. Now join the intermediate result sets on ProductID and seq_no:

     

      SELECT INTB.BuyListingID, INTS.SellListingID, INTB.ProductID, INTB.seq_no

      FROM <IntermediateBuyResult> AS INTB

           <IntermediateSellResult> AS INTS

      WHERE INTB.ProductID = INTS.ProductID

        AND INTB.seq_no = INTS.seq_no

    3B. Final result set:

      BuyListingID     SellListingID        ProductID seq_no

      1                    10                    1      1

      2                    11                    1      2

      5                    12                    3      1


    Regards,

    Bob Monahon

  • Hi Bob,

    I agree absolutely, and in fact I already used almost the same SQL in my second post :-). I just didn't have enough time when I was writing the first one, so I decided to hit the basics only and return to this question later. Of course, if the Listing ID's are not unique, then it is hard to make any query work... Anyway, I'm glad that you came up with the same solution independently; it makes me more confident, that it really works

  • Vladan and Bob have a great solution, unless:

    The listing ID's are not unique

    You want to use more complicated criteria to choose the best match

    You have very large tables

    The solution I posted using unique indexes works nicely with 50+ million row tables, but if I tried something with multiple self-joins my co-workers would string me up before I lived long enough to see the results.

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply