August 9, 2009 at 6:58 am
"trying to sort some "ugly" data from 3rd pty provider"
Hi..I was wondering if anyone could assist me in sorting the following code please...I need solutions for SQL 2000 but would be interested to see if there are better alternatives in 2005.
(all code required is posted below to assist with testing)
I think I may have spent far too long trying to sort this problem out and now cant “see the wood for the trees”...maybe some either eyes may see it entirely differently
I was not sure how to approach this and am keen to learn ...please be gentle 🙂 and any advice will be gratefully received.
The table structure is from a 3rd pty app and I import the data into SQL.
Given the following table structure
[ItemID] [int]
[PLNO] [int]
[StartDate] [datetime]
[EndDate] [datetime]
[DateEntered] [datetime]
[EnteredID] [int]
[Cost] [decimal](9, 4)
I need to extract a Distinct ItemId and Cost, where
StartDate <= Today
EndDate >= Today
Max( date entered)
Max( EnteredID)
Max(StartDate)
The results should be
ItemIdCost
1230.90
32112.72
99918.28
There is no primary key and duplicate entries are allowed.
The code I have gives acceptable performance ...extracting 50K rows from 1M+ in around 10 secs on an ageing SQL2000 box
But when I have a record such as this it doesnt return expected results:
SELECT '321', '3','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51792','2.5000' UNION ALL
(this line is commented in the code below)
Results are:
ItemIdCost
1230.90
3212.50
32112.72
99918.28
The row in bold (321/2.50) should NOT return becasue PLNO = 3....
Has anyone got any ideas please....?
Also are there any better methods in SQL2005?
Kind regards Graham
Note:
SELECT '999','4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','50336','18.2800' UNION ALL
SELECT '999','4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','50336','18.2800' UNION ALL
The above are duplicates and ARE intended .. I need to extract one row
SELECT '123', '4','Jun 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51792','0.0000' UNION ALL
SELECT '123', '4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51792','0.9000' UNION ALL
From the above I need to select the second row becasue it has a later StartingDate
USE [tempdb]
GO
--- If table(s) already exists, drop
IF OBJECT_ID('tempdb..#phexample', 'U') IS NOT NULL
DROP TABLE #phexample
IF OBJECT_ID('tempdb..#table1', 'U') IS NOT NULL
DROP TABLE #table1
IF OBJECT_ID('tempdb..#table2', 'U') IS NOT NULL
DROP TABLE #table2
CREATE TABLE [#phexample]
(
[ItemID] [int] NOT NULL
, [PLNO] [int] NOT NULL
, [StartDate] [datetime] NOT NULL
, [EndDate] [datetime] NOT NULL
, [DateEntered] [datetime] NOT NULL
, [EnteredID] [int] NOT NULL
, [Cost] [decimal](9, 4) NOT NULL
)
ON [PRIMARY]
GO
--- Insert the test data into the test table(s)
INSERT INTO #phexample (
ItemID,
PLNO,
StartDate,
EndDate,
DateEntered,
EnteredID,
Cost
)
SELECT '123', '4','Jun 1 2009 12:00AM','Dec 31 2010 12:00AM','Apr 1 2009 12:00AM','46234','0.4300' UNION ALL
SELECT '123', '4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','46014','0.4400' UNION ALL
SELECT '123', '4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51707','0.6500' UNION ALL
SELECT '123', '4','May 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51792','0.7200' UNION ALL
SELECT '123', '4','Jun 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51792','0.0000' UNION ALL
SELECT '123', '4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51792','0.9000' UNION ALL
SELECT '123', '4','Aug 1 2010 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','50251','0.5000' UNION ALL
SELECT '123', '3','May 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','792','0.5000' UNION ALL
SELECT '123', '3','Jun 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','792','0.0000' UNION ALL
SELECT '123', '3','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','792','0.9000' UNION ALL
SELECT '123', '3','Aug 1 2010 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','251','0.5000' UNION ALL
SELECT '999','4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','50336','18.2800' UNION ALL
SELECT '999','4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','50336','18.2800' UNION ALL
SELECT '321', '4','Jan 1 2008 12:00AM','Dec 31 2008 12:00AM','Nov 1 2007 12:00AM','7460','8.7200' UNION ALL
SELECT '321', '4','May 1 2008 12:00AM','Dec 31 2008 12:00AM','May 1 2008 12:00AM','34012','9.0900' UNION ALL
SELECT '321', '4','Jun 1 2008 12:00AM','Dec 31 2008 12:00AM','May 1 2008 12:00AM','13801','11.3200' UNION ALL
SELECT '321', '4','Jan 1 2008 12:00AM','Dec 31 2010 12:00AM','Nov 1 2008 12:00AM','13801','10.0000' UNION ALL
SELECT '321', '4','Jun 1 2008 12:00AM','Dec 31 2010 12:00AM','May 1 2008 12:00AM','34010','11.0600' UNION ALL
SELECT '321', '4','Jan 1 2009 12:00AM','Dec 31 2010 12:00AM','Oct 1 2008 12:00AM','1956','12.6600' UNION ALL
SELECT '321', '4','Jan 1 2009 12:00AM','Dec 31 2010 12:00AM','Oct 1 2008 12:00AM','17343','12.6600' UNION ALL
SELECT '321', '4','May 1 2008 12:00AM','Dec 31 2010 12:00AM','Oct 1 2008 12:00AM','8144','8.9600' UNION ALL
SELECT '321', '4','Jun 1 2008 12:00AM','Dec 31 2010 12:00AM','Nov 1 2008 12:00AM','34010','11.0600' UNION ALL
SELECT '321', '4','Jun 1 2009 12:00AM','Dec 31 2010 12:00AM','Apr 1 2009 12:00AM','46234','10.4300' UNION ALL
SELECT '321', '4','Jun 1 2009 12:00AM','Dec 31 2010 12:00AM','Apr 1 2009 12:00AM','46234','10.4300' UNION ALL
SELECT '321', '4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','46014','12.4400' UNION ALL
SELECT '321', '4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51707','12.6500' UNION ALL
SELECT '321', '4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51792','12.7200' UNION ALL
SELECT '321', '3','Jun 3 2009 12:00AM','Dec 31 2010 12:00AM','Oct 1 2008 12:00AM','400','2.6600' UNION ALL
SELECT '321', '4','Jun 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51792','12.0000' UNION ALL
SELECT '321', '4','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51792','12.7200' UNION ALL
---SELECT '321', '3','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','51792','2.5000' UNION ALL ---this causes problems
SELECT '321', '4','Aug 1 2010 12:00AM','Dec 31 2010 12:00AM','Jun 1 2009 12:00AM','50251','13.0000'
SELECT DISTINCT --- extract results into temp table ---removing dupes with DISTINCT
#phexample.ItemID
, #phexample.Cost
, #phexample.StartDate
INTO [#table1]
FROM ( SELECT #phexample.ItemID
, #phexample.DateEntered
, MAX(#phexample.EnteredID) AS MaxEnteredID -- find max EnteredID from records below
FROM ( SELECT MAX(DateEntered) AS MaxDateEntered ---- find last date entered from records below
, ItemID
FROM ( SELECT ItemID
, DateEntered
, EnteredID
FROM #phexample
WHERE ( StartDate <= GETDATE() ) ----- limit records by todays date
AND ( EndDate >= GETDATE() )
AND PLNO = 4
) DT1
GROUP BY ItemID
) DT2
INNER JOIN #phexample
ON DT2.ItemID = #phexample.ItemID
AND DT2.MaxDateEntered = #phexample.DateEntered
GROUP BY #phexample.ItemID
, #phexample.DateEntered
) DT3
INNER JOIN #phexample
ON DT3.ItemID = #phexample.ItemID
AND DT3.DateEntered = #phexample.DateEntered
AND DT3.MaxEnteredID = #phexample.EnteredID
GO
SELECT ItemID
, MAX(StartDate) AS MaxStartDate ---find latest start date from records extacted above
INTO [#table2]
FROM #table1
GROUP BY ItemID
GO
SELECT #table1.ItemID
, #table1.Cost
FROM #table1
INNER JOIN #table2
ON #table1.ItemID = #table2.ItemID
AND #table1.StartDate = #table2.MaxStartDate
---END
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
August 9, 2009 at 11:02 am
First of all, I must say (bowing deeply with gratitude) thank you for all the setup data. It makes life real easy when trying to help someone because we can concentrate on the problem instead of asking a thousand questions about the data and the underlying tables. Well done.
You certainly have the right idea in isolating just the data you want to work with instead of working with a monster table. I did similar in the following. I tested it with the "troublesome row" uncommented and it produces the answer you were looking for. If I had to do the same to a million rows instead of a small subset of a million rows, I might try a different 2k (quirky update) method depending on performance. Here's a 2k solution...
--===== Do a little "Divide'n'Conquer" by isolating
-- just the rows we need to work with.
IF OBJECT_ID('TempDB..#WorkTable','U') IS NOT NULL
DROP TABLE #WorkTable
;
SELECT DISTINCT
ItemID,
PLNO,
StartDate,
DateEntered,
EnteredID,
Cost
INTO #WorkTable
FROM #phexample
WHERE StartDate = GETDATE()
;
--===== Solve the problem using SQL Server 2000 style code
-- (ie. ROW_NUMBER is not available). No longer any
-- need for date comparisons because only the correct
-- rows are in the work table.
SELECT wt.ItemID, wt.Cost
FROM
(--==== Find the MAX PLNO for each ItemID.
-- Shouldn't find MAX dates here because a
-- lower PLNO may have a larger date.
SELECT ItemID,
MAX(PLNO) AS MaxPLNO
FROM #WorkTable
GROUP BY ItemID
) mp
INNER JOIN
(--==== Ok, Find the MAX dates and ID's for each
-- ItemID/PLNO combo
SELECT ItemID,
PLNO,
MAX(DateEntered) AS MaxDateEntered,
MAX(EnteredID) AS MaxEnteredID,
MAX(StartDate) AS MaxStartDate
FROM #WorkTable
GROUP BY ItemID, PLNO
) msd
ON mp.ItemID = msd.ItemID
AND mp.MaxPLNO = msd.PLNO
INNER JOIN #WorkTable wt --This join puts it all together
ON msd.ItemID = wt.ItemID
AND msd.PLNO = wt.PLNO
AND msd.MaxDateEntered = wt.DateEntered
AND msd.MaxEnteredID = wt.EnteredID
AND msd.MaxStartDate = wt.StartDate
;
The same problem in 2k5 has a much easier solution thanks to ROW_NUMBER...
WITH cteSequenced AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY ItemID
ORDER BY ItemID DESC,
PLNO DESC,
DateEntered DESC,
EnteredID DESC,
StartDate DESC)
AS Sequence,
ItemID,
Cost
FROM #phexample
WHERE StartDate = GETDATE()
)
SELECT s.ItemID, s.Cost
FROM cteSequenced s
WHERE s.Sequence = 1
Heh... ya just gotta love the power they built into ROW_NUMBER. Let me know if you have any questions...
--Jeff Moden
Change is inevitable... Change for the better is not.
August 9, 2009 at 1:06 pm
Mr Moden...first of all thank you for your generous comments on the set up data...:blush:...It did take some effort to create and double check before posting but as your article suggests...you certainly get better and faster response...many thanks to you Sir.
Have tested your code and I think that I didnt make myself clear enough that I only required PLNO =4 (it was in the code but not in the post comments and notes...apologies)
if I change the commented code to:
SELECT '321', '5','Aug 1 2009 12:00AM','Dec 31 2010 12:00AM','Jun 1 2010 12:00AM','51792','2.5000' UNION ALL ---...it was giving me incorrect answer.
So I added PLN0 = 4 to the initial SELECT DISTINCT and removed the "mp" SELECT...this all appears to work as expected...still to test in anger on production though.
--===== Do a little "Divide'n'Conquer" by isolating
-- just the rows we need to work with.
IF OBJECT_ID('TempDB..#WorkTable','U') IS NOT NULL
DROP TABLE #WorkTable
;
SELECT DISTINCT
ItemID,
PLNO,
StartDate,
DateEntered,
EnteredID,
Cost
INTO #WorkTable
FROM #phexample
WHERE StartDate = GETDATE()
--- filter for PLNO will cut down on rows
AND PLNO = 4
;
--===== Solve the problem using SQL Server 2000 style code
-- (ie. ROW_NUMBER is not available). No longer any
-- need for date comparisons because only the correct
-- rows are in the work table.
SELECT wt.ItemID, wt.Cost
FROM
-- (--==== Find the MAX PLNO for each ItemID.
-- -- Shouldn't find MAX dates here because a
-- -- lower PLNO may have a larger date.
-- SELECT ItemID,
-- MAX(PLNO) AS MaxPLNO
-- FROM #WorkTable
-- GROUP BY ItemID
-- ) mp
-- INNER JOIN
(--==== Ok, Find the MAX dates and ID's for each
-- ItemID/PLNO combo
SELECT ItemID,
-- PLNO,
MAX(DateEntered) AS MaxDateEntered,
MAX(EnteredID) AS MaxEnteredID,
MAX(StartDate) AS MaxStartDate
FROM #WorkTable
GROUP BY ItemID --, PLNO
) msd
-- ON mp.ItemID = msd.ItemID
-- AND mp.MaxPLNO = msd.PLNO
INNER JOIN #WorkTable wt --This join puts it all together
ON msd.ItemID = wt.ItemID
-- AND msd.PLNO = wt.PLNO
AND msd.MaxDateEntered = wt.DateEntered
AND msd.MaxEnteredID = wt.EnteredID
AND msd.MaxStartDate = wt.StartDate
;
A quick test on initial SELECT DISTINCT returns 55K rows from 1M+ table in approx 2secs. on production box at work,..so not poor at all
I have separate index on each column...is this preferable or would it be better to have one index that covers all cols?...dont worry I can always test 🙂
Update....hmmm just found another issue with the data... entries with all columns identical only the cost is different !!!
suppose I will have to make an executive decision on which is correct :hehe:
I havent had time yet to play with 2005 code...but will do later on with a full set of data....looks powerful.
Once again, sincere thanks for your reply...and I did learn a new technique that will assist me in several other areas..creating all the MAX fields in one go and linking back to the original data...I always seemed to go step by step and eventually lose my way :crying:
Kind regards Graham
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
August 9, 2009 at 1:46 pm
Heh... "The devil's in the data", I always say. Sorry I missed the thing on PLNO = 4. I didn't spend much time with your code other than to get the results.
On the cost thing... My executive decision would be to show both the lowest and highest costs in the case of ties and let someone else make the decision. 😛
And thanks for posting the changes you made. Considering that there's just a single PLNO that you want to check, the changes do look correct. Thanks for the feedback, Graham.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 9, 2009 at 1:49 pm
Sorry... almost forgot... I suspect that the indexing you have will suffice although a covering index on just the columns in the first copy to the working table would actually derive a bit more performance.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 9, 2009 at 3:00 pm
On the cost thing... My executive decision would be to show both the lowest and highest costs in the case of ties and let someone else make the decision. 😛
.
Heh..can just imagine the conversation:
Very Important Person...” what the *** is going on here.. why have I got two sets of figures?”
gah...”well...your team have entered two costs and I dont know which is correct..can you tell me please?”
VIP...”NO...you are the IT GIT...sort it!”
gah..”hmmm....perhaps you could ask your team to double check before inputting the data?”
VIP...”NO...they are FAR too busy to do that...you are the IT GIT...sort it!”
gah ...” well...we could ask the Application Vendor to amend their program to prevent such errors”
VIP...” you are the IT GIT...sort it!”
gah...” here is the capex form for your signature”
VIP...”How much !!!!!...you are the IT GIT..did you recommend this software?”
gah..”No ...Mr Very Important Person Sir...you did”
:w00t::w00t::w00t:
more seriously...thanks for 2k5 code...I LIKE A LOT !!!!
had been looking at ROW_NUMBER() this afternoon referring to another of your posts today...but could not seem to grasp the concept until I saw ORDER BY...DESC
regards gah
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
August 9, 2009 at 4:25 pm
Heh... I like the last line of that conversation except that's normally where I break out the pork chops with the bone in them. 😛 Getting back to the descision on that... there's always AVG. :hehe: Actually, is there a way to get the entry time on the date entered column? If you can't change the program to do it, an "override" trigger would do it.
Yeah... I agree... ROW_NUMBER is the berries. It makes life pretty simple. Now, if they could just make the windowed aggregate functions like SUM() OVER work correctly.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 10, 2009 at 11:41 am
Getting back to the descision on that... there's always AVG. :hehe: Actually, is there a way to get the entry time on the date entered column? If you can't change the program to do it, an "override" trigger would do it.
The data and app are not ours...its a 3rd pty software house on a Progress 9.1E db....I only rip and strip the data into SQL for reporting...therefore can only work with what is presented via ODBC...and only way to add triggers/cols etc is by throwing money at them...:crazy:
Actually the entry time is already there (EnteredID) as number of secs past midnight !!
This probably worked ok when the data entry was manual...no one could possibly enter two records within a single second. However they then allowed users to import data directly from excel spreadsheets...complete with errors and duplicates and little data validation....so now we can have hundreds of imported records with same date/time :rolleyes:
As for AVG...nice idea, but I think it will cause more problems because any end user drilling into one my reports will no longer be able to xref to the cost data in the app.
Thanks for your interest and inputs...it has been very much appreciated.
Kind regards gah
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
Viewing 8 posts - 1 through 7 (of 7 total)
You must be logged in to reply to this topic. Login to reply