Cant see the wood for the trees ! ...help on query please

  • "trying to sort some "ugly" data from 3rd pty provider"

    Hi..I was wondering if anyone could assist me in sorting the following code please...I need solutions for SQL 2000 but would be interested to see if there are better alternatives in 2005.

    (all code required is posted below to assist with testing)

    I think I may have spent far too long trying to sort this problem out and now cant “see the wood for the trees”...maybe some either eyes may see it entirely differently

    I was not sure how to approach this and am keen to learn ...please be gentle 🙂 and any advice will be gratefully received.

    The table structure is from a 3rd pty app and I import the data into SQL.

    Given the following table structure

    [ItemID] [int]

    [PLNO] [int]

    [StartDate] [datetime]

    [EndDate] [datetime]

    [DateEntered] [datetime]

    [EnteredID] [int]

    [Cost] [decimal](9, 4)

    I need to extract a Distinct ItemId and Cost, where

    StartDate <= Today
    EndDate >= Today

    Max( date entered)

    Max( EnteredID)

    Max(StartDate)

    The results should be

    ItemIdCost

    1230.90

    32112.72

    99918.28

    There is no primary key and duplicate entries are allowed....you may well ask why the data is like it is....so did I but only received an arrogant shoulder shrug

    The code I have gives acceptable performance ...extracting 50K rows from 1M+ in around 10 secs on an ageing SQL2000 box

    But when I have a record such as this it doesnt return expected results:

    SELECT '321', '3','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51792','2.5000' UNION ALL

    (this line is commented in the code below)

    Results are:

    ItemIdCost

    1230.90

    3212.50

    32112.72

    99918.28

    The row in bold (321/2.50) should NOT return becasue PLNO = 3....

    Has anyone got any ideas please....?

    Also are there any better methods in SQL2005?

    Kind regards Graham

    Note:

    SELECT '999','4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','50336','18.2800' UNION ALL

    SELECT '999','4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','50336','18.2800' UNION ALL

    The above are duplicates and ARE intended .. I need to extract one row

    SELECT '123', '4','Jun 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51792','0.0000' UNION ALL

    SELECT '123', '4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51792','0.9000' UNION ALL

    From the above I need to select the second row becasue it has a later StartingDate

    USE [tempdb]

    GO

    --- If table(s) already exists, drop

    IF OBJECT_ID('tempdb..#phexample', 'U') IS NOT NULL

    DROP TABLE #phexample

    IF OBJECT_ID('tempdb..#table1', 'U') IS NOT NULL

    DROP TABLE #table1

    IF OBJECT_ID('tempdb..#table2', 'U') IS NOT NULL

    DROP TABLE #table2

    CREATE TABLE [#phexample]

    (

    [ItemID] [int] NOT NULL

    , [PLNO] [int] NOT NULL

    , [StartDate] [datetime] NOT NULL

    , [EndDate] [datetime] NOT NULL

    , [DateEntered] [datetime] NOT NULL

    , [EnteredID] [int] NOT NULL

    , [Cost] [decimal](9, 4) NOT NULL

    )

    ON [PRIMARY]

    GO

    --- Insert the test data into the test table(s)

    INSERT INTO #phexample (

    ItemID,

    PLNO,

    StartDate,

    EndDate,

    DateEntered,

    EnteredID,

    Cost

    )

    SELECT '123', '4','Jun 1 2009 12:00AM','Dec 31 2010 12:00AM','Apr 1 2009 12:00AM','46234','0.4300' UNION ALL

    SELECT '123', '4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','46014','0.4400' UNION ALL

    SELECT '123', '4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51707','0.6500' UNION ALL

    SELECT '123', '4','May 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51792','0.7200' UNION ALL

    SELECT '123', '4','Jun 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51792','0.0000' UNION ALL

    SELECT '123', '4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51792','0.9000' UNION ALL

    SELECT '123', '4','Aug 1 2010 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','50251','0.5000' UNION ALL

    SELECT '123', '3','May 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','792','0.5000' UNION ALL

    SELECT '123', '3','Jun 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','792','0.0000' UNION ALL

    SELECT '123', '3','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','792','0.9000' UNION ALL

    SELECT '123', '3','Aug 1 2010 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','251','0.5000' UNION ALL

    SELECT '999','4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','50336','18.2800' UNION ALL

    SELECT '999','4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','50336','18.2800' UNION ALL

    SELECT '321', '4','Jan 1 2008 12:00AM','Dec 31 2008 12:00AM','Nov 1 2007 12:00AM','7460','8.7200' UNION ALL

    SELECT '321', '4','May 1 2008 12:00AM','Dec 31 2008 12:00AM','May 1 2008 12:00AM','34012','9.0900' UNION ALL

    SELECT '321', '4','Jun 1 2008 12:00AM','Dec 31 2008 12:00AM','May 1 2008 12:00AM','13801','11.3200' UNION ALL

    SELECT '321', '4','Jan 1 2008 12:00AM','Dec 31 2010 12:00AM','Nov 1 2008 12:00AM','13801','10.0000' UNION ALL

    SELECT '321', '4','Jun 1 2008 12:00AM','Dec 31 2010 12:00AM','May 1 2008 12:00AM','34010','11.0600' UNION ALL

    SELECT '321', '4','Jan 1 2009 12:00AM','Dec 31 2010 12:00AM','Oct 1 2008 12:00AM','1956','12.6600' UNION ALL

    SELECT '321', '4','Jan 1 2009 12:00AM','Dec 31 2010 12:00AM','Oct 1 2008 12:00AM','17343','12.6600' UNION ALL

    SELECT '321', '4','May 1 2008 12:00AM','Dec 31 2010 12:00AM','Oct 1 2008 12:00AM','8144','8.9600' UNION ALL

    SELECT '321', '4','Jun 1 2008 12:00AM','Dec 31 2010 12:00AM','Nov 1 2008 12:00AM','34010','11.0600' UNION ALL

    SELECT '321', '4','Jun 1 2009 12:00AM','Dec 31 2010 12:00AM','Apr 1 2009 12:00AM','46234','10.4300' UNION ALL

    SELECT '321', '4','Jun 1 2009 12:00AM','Dec 31 2010 12:00AM','Apr 1 2009 12:00AM','46234','10.4300' UNION ALL

    SELECT '321', '4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','46014','12.4400' UNION ALL

    SELECT '321', '4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51707','12.6500' UNION ALL

    SELECT '321', '4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51792','12.7200' UNION ALL

    SELECT '321', '3','Jun 3 2009 12:00AM','Dec 31 2010 12:00AM','Oct 1 2008 12:00AM','400','2.6600' UNION ALL

    SELECT '321', '4','Jun 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51792','12.0000' UNION ALL

    SELECT '321', '4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51792','12.7200' UNION ALL

    ---SELECT '321', '3','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51792','2.5000' UNION ALL ---this causes problems

    SELECT '321', '4','Aug 1 2010 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','50251','13.0000'

    SELECT DISTINCT --- extract results into temp table ---removing dupes with DISTINCT

    #phexample.ItemID

    , #phexample.Cost

    , #phexample.StartDate

    INTO [#table1]

    FROM ( SELECT #phexample.ItemID

    , #phexample.DateEntered

    , MAX(#phexample.EnteredID) AS MaxEnteredID -- find max EnteredID from records below

    FROM ( SELECT MAX(DateEntered) AS MaxDateEntered ---- find last date entered from records below

    , ItemID

    FROM ( SELECT ItemID

    , DateEntered

    , EnteredID

    FROM #phexample

    WHERE ( StartDate <= GETDATE() ) ----- limit records by todays date
    AND ( EndDate >= GETDATE() )

    AND PLNO = 4

    ) DT1

    GROUP BY ItemID

    ) DT2

    INNER JOIN #phexample

    ON DT2.ItemID = #phexample.ItemID

    AND DT2.MaxDateEntered = #phexample.DateEntered

    GROUP BY #phexample.ItemID

    , #phexample.DateEntered

    ) DT3

    INNER JOIN #phexample

    ON DT3.ItemID = #phexample.ItemID

    AND DT3.DateEntered = #phexample.DateEntered

    AND DT3.MaxEnteredID = #phexample.EnteredID

    GO

    SELECT ItemID

    , MAX(StartDate) AS MaxStartDate ---find latest start date from records extacted above

    INTO [#table2]

    FROM #table1

    GROUP BY ItemID

    GO

    SELECT #table1.ItemID

    , #table1.Cost

    FROM #table1

    INNER JOIN #table2

    ON #table1.ItemID = #table2.ItemID

    AND #table1.StartDate = #table2.MaxStartDate

    ---END

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • First of all, I must say (bowing deeply with gratitude) thank you for all the setup data. It makes life real easy when trying to help someone because we can concentrate on the problem instead of asking a thousand questions about the data and the underlying tables. Well done.

    You certainly have the right idea in isolating just the data you want to work with instead of working with a monster table. I did similar in the following. I tested it with the "troublesome row" uncommented and it produces the answer you were looking for. If I had to do the same to a million rows instead of a small subset of a million rows, I might try a different 2k (quirky update) method depending on performance. Here's a 2k solution...

    --===== Do a little "Divide'n'Conquer" by isolating

    -- just the rows we need to work with.

    IF OBJECT_ID('TempDB..#WorkTable','U') IS NOT NULL

    DROP TABLE #WorkTable

    ;

    SELECT DISTINCT

    ItemID,

    PLNO,

    StartDate,

    DateEntered,

    EnteredID,

    Cost

    INTO #WorkTable

    FROM #phexample

    WHERE StartDate = GETDATE()

    ;

    --===== Solve the problem using SQL Server 2000 style code

    -- (ie. ROW_NUMBER is not available). No longer any

    -- need for date comparisons because only the correct

    -- rows are in the work table.

    SELECT wt.ItemID, wt.Cost

    FROM

    (--==== Find the MAX PLNO for each ItemID.

    -- Shouldn't find MAX dates here because a

    -- lower PLNO may have a larger date.

    SELECT ItemID,

    MAX(PLNO) AS MaxPLNO

    FROM #WorkTable

    GROUP BY ItemID

    ) mp

    INNER JOIN

    (--==== Ok, Find the MAX dates and ID's for each

    -- ItemID/PLNO combo

    SELECT ItemID,

    PLNO,

    MAX(DateEntered) AS MaxDateEntered,

    MAX(EnteredID) AS MaxEnteredID,

    MAX(StartDate) AS MaxStartDate

    FROM #WorkTable

    GROUP BY ItemID, PLNO

    ) msd

    ON mp.ItemID = msd.ItemID

    AND mp.MaxPLNO = msd.PLNO

    INNER JOIN #WorkTable wt --This join puts it all together

    ON msd.ItemID = wt.ItemID

    AND msd.PLNO = wt.PLNO

    AND msd.MaxDateEntered = wt.DateEntered

    AND msd.MaxEnteredID = wt.EnteredID

    AND msd.MaxStartDate = wt.StartDate

    ;

    The same problem in 2k5 has a much easier solution thanks to ROW_NUMBER...

    WITH cteSequenced AS

    (

    SELECT ROW_NUMBER() OVER (PARTITION BY ItemID

    ORDER BY ItemID DESC,

    PLNO DESC,

    DateEntered DESC,

    EnteredID DESC,

    StartDate DESC)

    AS Sequence,

    ItemID,

    Cost

    FROM #phexample

    WHERE StartDate = GETDATE()

    )

    SELECT s.ItemID, s.Cost

    FROM cteSequenced s

    WHERE s.Sequence = 1

    Heh... ya just gotta love the power they built into ROW_NUMBER. Let me know if you have any questions...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Mr Moden...first of all thank you for your generous comments on the set up data...:blush:...It did take some effort to create and double check before posting but as your article suggests...you certainly get better and faster response...many thanks to you Sir.

    Have tested your code and I think that I didnt make myself clear enough that I only required PLNO =4 (it was in the code but not in the post comments and notes...apologies)

    if I change the commented code to:

    SELECT '321', '5','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2010 12:00AM','51792','2.5000' UNION ALL ---...it was giving me incorrect answer.

    So I added PLN0 = 4 to the initial SELECT DISTINCT and removed the "mp" SELECT...this all appears to work as expected...still to test in anger on production though.

    --===== Do a little "Divide'n'Conquer" by isolating

    -- just the rows we need to work with.

    IF OBJECT_ID('TempDB..#WorkTable','U') IS NOT NULL

    DROP TABLE #WorkTable

    ;

    SELECT DISTINCT

    ItemID,

    PLNO,

    StartDate,

    DateEntered,

    EnteredID,

    Cost

    INTO #WorkTable

    FROM #phexample

    WHERE StartDate = GETDATE()

    --- filter for PLNO will cut down on rows

    AND PLNO = 4

    ;

    --===== Solve the problem using SQL Server 2000 style code

    -- (ie. ROW_NUMBER is not available). No longer any

    -- need for date comparisons because only the correct

    -- rows are in the work table.

    SELECT wt.ItemID, wt.Cost

    FROM

    -- (--==== Find the MAX PLNO for each ItemID.

    -- -- Shouldn't find MAX dates here because a

    -- -- lower PLNO may have a larger date.

    -- SELECT ItemID,

    -- MAX(PLNO) AS MaxPLNO

    -- FROM #WorkTable

    -- GROUP BY ItemID

    -- ) mp

    -- INNER JOIN

    (--==== Ok, Find the MAX dates and ID's for each

    -- ItemID/PLNO combo

    SELECT ItemID,

    -- PLNO,

    MAX(DateEntered) AS MaxDateEntered,

    MAX(EnteredID) AS MaxEnteredID,

    MAX(StartDate) AS MaxStartDate

    FROM #WorkTable

    GROUP BY ItemID --, PLNO

    ) msd

    -- ON mp.ItemID = msd.ItemID

    -- AND mp.MaxPLNO = msd.PLNO

    INNER JOIN #WorkTable wt --This join puts it all together

    ON msd.ItemID = wt.ItemID

    -- AND msd.PLNO = wt.PLNO

    AND msd.MaxDateEntered = wt.DateEntered

    AND msd.MaxEnteredID = wt.EnteredID

    AND msd.MaxStartDate = wt.StartDate

    ;

    A quick test on initial SELECT DISTINCT returns 55K rows from 1M+ table in approx 2secs. on production box at work,..so not poor at all

    I have separate index on each column...is this preferable or would it be better to have one index that covers all cols?...dont worry I can always test 🙂

    Update....hmmm just found another issue with the data... entries with all columns identical only the cost is different !!!

    suppose I will have to make an executive decision on which is correct :hehe:

    I havent had time yet to play with 2005 code...but will do later on with a full set of data....looks powerful.

    Once again, sincere thanks for your reply...and I did learn a new technique that will assist me in several other areas..creating all the MAX fields in one go and linking back to the original data...I always seemed to go step by step and eventually lose my way :crying:

    Kind regards Graham

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • Heh... "The devil's in the data", I always say. Sorry I missed the thing on PLNO = 4. I didn't spend much time with your code other than to get the results.

    On the cost thing... My executive decision would be to show both the lowest and highest costs in the case of ties and let someone else make the decision. 😛

    And thanks for posting the changes you made. Considering that there's just a single PLNO that you want to check, the changes do look correct. Thanks for the feedback, Graham.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Sorry... almost forgot... I suspect that the indexing you have will suffice although a covering index on just the columns in the first copy to the working table would actually derive a bit more performance.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • On the cost thing... My executive decision would be to show both the lowest and highest costs in the case of ties and let someone else make the decision. 😛

    .

    Heh..can just imagine the conversation:

    Very Important Person...” what the *** is going on here.. why have I got two sets of figures?”

    gah...”well...your team have entered two costs and I dont know which is correct..can you tell me please?”

    VIP...”NO...you are the IT GIT...sort it!”

    gah..”hmmm....perhaps you could ask your team to double check before inputting the data?”

    VIP...”NO...they are FAR too busy to do that...you are the IT GIT...sort it!”

    gah ...” well...we could ask the Application Vendor to amend their program to prevent such errors”

    VIP...” you are the IT GIT...sort it!”

    gah...” here is the capex form for your signature”

    VIP...”How much !!!!!...you are the IT GIT..did you recommend this software?”

    gah..”No ...Mr Very Important Person Sir...you did

    :w00t::w00t::w00t:

    more seriously...thanks for 2k5 code...I LIKE A LOT !!!!

    had been looking at ROW_NUMBER() this afternoon referring to another of your posts today...but could not seem to grasp the concept until I saw ORDER BY...DESC

    regards gah

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • Heh... I like the last line of that conversation except that's normally where I break out the pork chops with the bone in them. 😛 Getting back to the descision on that... there's always AVG. :hehe: Actually, is there a way to get the entry time on the date entered column? If you can't change the program to do it, an "override" trigger would do it.

    Yeah... I agree... ROW_NUMBER is the berries. It makes life pretty simple. Now, if they could just make the windowed aggregate functions like SUM() OVER work correctly.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Getting back to the descision on that... there's always AVG. :hehe: Actually, is there a way to get the entry time on the date entered column? If you can't change the program to do it, an "override" trigger would do it.

    The data and app are not ours...its a 3rd pty software house on a Progress 9.1E db....I only rip and strip the data into SQL for reporting...therefore can only work with what is presented via ODBC...and only way to add triggers/cols etc is by throwing money at them...:crazy:

    Actually the entry time is already there (EnteredID) as number of secs past midnight !!

    This probably worked ok when the data entry was manual...no one could possibly enter two records within a single second. However they then allowed users to import data directly from excel spreadsheets...complete with errors and duplicates and little data validation....so now we can have hundreds of imported records with same date/time :rolleyes:

    As for AVG...nice idea, but I think it will cause more problems because any end user drilling into one my reports will no longer be able to xref to the cost data in the app.

    Thanks for your interest and inputs...it has been very much appreciated.

    Kind regards gah

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply