Assistance On Reporting with some Dirty Data

Question

Assistance On Reporting with some Dirty Data

matthew.brown-746365

SSC Enthusiast

Points: 122
More actions
June 27, 2012 at 4:46 pm

#258206

Hi Everyone, hoping someone here can assist with this issue my team and I is having. Hopefully I posted this in the correct area.
SQL Server 2000, SQL Query Analyzer 8 (hopefully we'll be upgrading shortly, completely out of my hands, my team does more data mining and report building).
We have a start point for a file (Ord_StartDate), and an end point for a file (Ord_EndDate). Now, one of our clients is interfacing into our system, and either them or our own internal teams can put a file on Hold then Resume it.
Each Hold has a Received and Sent timestamp, if it's from the Client, then Hold Received (Hold_Rcv) gets the timestamp and if we put it on hold it's marked sent (Hold_Snt). Same thing if it's resumed, Resume_Snt/Resume_Rcv.
Now to compound this, mostly them (but of course our teams as well) are hitting multiple holds, and or multiple resumes. As an example, we could have a Hold_Snt on 6/1 @ 11:35am, Hold_Rcv is null, another Hold_Snt on 6/6 @ 3:30pm, Hold_Rcv on 6/6 @ 4:30pm, then finally a Resume_Snt on 6/7 @ 2:00pm.
My goal is to figure out the total Turn Time between Ord_StartDate and Ord_EndDate, backing out any "Hold Time." We've basically come to the point that Hold Time will be defined as the first Hold (snt/rcv doesn't matter) to the First Resume. Then, if another Hold is set after the 1st Resume, starting hold time up again until the next Resume.
I'm losing sleep over this, but at this point, I'm thinking of using a few simple Union queries to get all of the timestamps for the holds/resumes in one column, so it looks like something like this:
OrderID | Type | Timestamp
1 | Hold Snt | 2012-06-01 11:35
1 | Hold Snt | 2012-06-06 15:30
1 | Hold Rcv | 2012-06-06 16:30
1 | Res Snt | 2012-06-07 14:00
1 | Hold Rcv | 2012-06-18 10:30
1 | Res Rcv | 2012-06-24 8:30
Then of course more OrderID's and so on. I'm thinking if I do this, then at least all of my dates are in chronological order. The issue then becomes, in SQL, how do I find the difference between the first Hold and first Resume. And to compound, the 2nd Hold/Resume and the 3rd, etc. In my head, once I have them all in order, I then break them out and hardcode up to say 5 different Hold/Resume scenarios in different columns (Hold 1|Resume 1|Turn Time 1|Hold 2|Resume 2|Turn Time 2|etc). Then I should be able to use my calculations on each column. I'm willing to say if there are more than 5 (I may break out to 10 depending how this works) Hold/Resume possible combos, then forget about them.
How do I get these out to separate columns like I want? Or am I going about this completely the wrong way? Any guidance, assistance, etc would be appreciated. Thank you very very much!!!
-matt

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 1

Yikes, alright, that's complex and I kinda understand the issue.

For starters, can you take a quick breeze-through on the first link in my signature? It'll walk you through what's the easiest way for us to help you in schema and sample data setups so we have something to work against.

Now, the majority of the work you're going to be doing here to get this organized will actually be down at the T-SQL level, not the report, and a lot of the tricks used for things like 'data islanding' (local term to the forum) using the row_number() function and the like are not going to be available. So, we'll have to go old-school to get this figured out... most likely with a Serial Update query but we'll get there eventually.

So, to help complement the description you gave above, can you provide us with some sample data/schema/etc so we can see what you see for your example? It'll help us ask the direct questions we'll need to so we can help nail down exactly what you're looking for in our own heads and can get you started on the code.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

matthew.brown-746365 SSC Enthusiast Points: 122 More actions · Answer 2

Wow, I did everything I'm NOT supposed to do. I'm sorry Craig, thank you for your patience!

Here's just a real basic table where our Order information comes from, it has many more fields than this, but for sake of this example I think this is all we need:

CREATE TABLE #ORDERS

(OrderId int,

StartDate datetime,

EndDate datetime)

INSERT INTO #ORDERS VALUES (12345,'2012-05-28 9:00','2012-06-28 18:30')

OrderId is the primary and Order Number, plus the Start and End time, the difference between those 2 dates is the main point of this excercise.

Here's a sample table that I'm pulling the Hold and Resume data from. The Hold "Task" (as we call them) is TaskId 1001, and the Resume is 1002. As you can see, each one can have a timestamp for the Snt And/Or Rcv. The OrderId is obviously the OrderId that the task is tied to, and the TaskId column would be my primary key column.

CREATE TABLE #TASKS

(TaskId int,

OrderId int,

TaskTypeId varchar(50),

SntDt datetime,

RcvDt datetime)

INSERT INTO #TASKS VALUES (1,12345,1001,'2012-06-01 11:35',null)

INSERT INTO #TASKS VALUES (2,12345,1001,'2012-06-06 15:30','2012-06-06 16:30')

INSERT INTO #TASKS VALUES (4,12345,1001,null,'2012-06-18 10:30')

INSERT INTO #TASKS VALUES (3,12345,1002,'2012-06-07 14:00',null)

INSERT INTO #TASKS VALUES (5,12345,1002,null,'2012-06-24 8:30')

SELECT * FROM #TASKS

DROP TABLE #TASKS

Hopefully that helps show you what I'm working with. And of course, there's however many different Orders as well.

To reiterate my goal, using this example, I would need to calculate the Turn Time between the first Hold and the Next chronological Resume (so '2012-06-01 11:35' to '2012-06-07 14:00') then also calculate any subsequent Hold/Resume times that occure on the Order after that.

Once I have that total "Hold Time" - i'll then back that out of my Turn Time i'm calculating for the order Start Time and Complete Time.

My original thought was to either get all of the Holds in one column, in chronological order, and the Resumes in another also in chronological order.

SELECT

orders.OrderId 'Order Num'

,CASE WHEN TaskTypeId = 1001 THEN SntDt END 'Hold'

,CASE WHEN TaskTypeId = 1002 THEN SntDt END 'Resume'

FROM

#ORDERS orders

INNER JOIN #TASKS tasks ON orders.OrderId = tasks.OrderId

UNION

SELECT

orders.OrderId 'Order Num'

,CASE WHEN TaskTypeId = 1001 THEN RcvDt END 'Hold'

,CASE WHEN TaskTypeId = 1002 THEN RcvDt END 'Resume'

FROM

#ORDERS orders

INNER JOIN #TASKS tasks ON orders.OrderId = tasks.OrderId

DROP TABLE #TASKS

DROP TABLE #ORDERS

Then once I have all of this, I could possibly do a Select within a Select and somehow Hard Code a column for the 1st Hold Resume, 2nd, etc, all the way to say 5th Combination. That should catch every Hold and Resume combo we want to keep. If, chronologically, there is a Hold, then another 2 Holds, then a Resume, we will go from the First Hold to the First Resume, "throwing out" the Holds that are in between.

When I couldn't figure that out, I figured put them all in one column, chronologically, but I'm stuck there too. And here we are. I think my main issue is I really can't wrap my head around any one way to do this. And I think being stuck on an older SQL Server/Query Analyzer isn't making this any easier for me.

Thanks again for your patience Craig! Hopefully someone has an idea or 2. I've been building in SQL for over 4 years now, but have never properly been schooled on this, AND my team has very strict access, which doesn't help either.

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 3

matthew.brown-746365 (6/29/2012)
Wow, I did everything I'm NOT supposed to do. I'm sorry Craig, thank you for your patience!

Not a problem. It's hard to know what to do when you're first getting here since you're very close to what you're seeing and it makes sense to you. That article just helps to explain how to make sure we do to. :w00t: Very nice work.

Here's just a real basic table where our Order information comes from, it has many more fields than this, but for sake of this example I think this is all we need:

You got it. Isolate it down to just the issue and the rest of the code won't get confused. Just keep in mind if you come here for optimization help you don't want to actually shrink it down. In those cases we need the kitchen sink.

To reiterate my goal, using this example, I would need to calculate the Turn Time between the first Hold and the Next chronological Resume (so '2012-06-01 11:35' to '2012-06-07 14:00') then also calculate any subsequent Hold/Resume times that occure on the Order after that.
Once I have that total "Hold Time" - i'll then back that out of my Turn Time i'm calculating for the order Start Time and Complete Time.

How do you want to caculate Hold Time? Hours, Minutes, Seconds? Months? If something enters the next designation (IE: 28 hours) is that 1 day or 2 for your purposes?

My original thought was to either get all of the Holds in one column, in chronological order, and the Resumes in another also in chronological order.
Then once I have all of this, I could possibly do a Select within a Select and somehow Hard Code a column for the 1st Hold Resume, 2nd, etc, all the way to say 5th Combination.

That only works for THIS combination. Any time you start creating non-normalized structures to determine these calculations you end up with the potential for a user to hose you up later by using +1 more iteration of whatever it is you prepped for. While it can work, it ends up shooting you in the foot. However, you're on the right track for what I have in mind.

That should catch every Hold and Resume combo we want to keep. If, chronologically, there is a Hold, then another 2 Holds, then a Resume, we will go from the First Hold to the First Resume, "throwing out" the Holds that are in between.

Excellent clarification. What happens on two resumes before the next hold? Use the first one or the last one? You mention the first resume but I just want to make sure.

When I couldn't figure that out, I figured put them all in one column, chronologically, but I'm stuck there too. And here we are. I think my main issue is I really can't wrap my head around any one way to do this. And I think being stuck on an older SQL Server/Query Analyzer isn't making this any easier for me.

It's not, we're gonna end up goin' old-school. What you'd really rather use here are windowing functions (Row_Number() with partitioning and ordering, in particular) to get out of this mess. It's not available in 2k.

Thanks again for your patience Craig! Hopefully someone has an idea or 2. I've been building in SQL for over 4 years now, but have never properly been schooled on this, AND my team has very strict access, which doesn't help either.

My pleasure. Hit up those clarifications above and I have a few ideas... but I'm going to warn you now, this is going to get UGLY. But it's doable.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

matthew.brown-746365 SSC Enthusiast Points: 122 More actions · Answer 4

Not a problem. It's hard to know what to do when you're first getting here since you're very close to what you're seeing and it makes sense to you. That article just helps to explain how to make sure we do to. :w00t: Very nice work.

Thank you!

How do you want to caculate Hold Time? Hours, Minutes, Seconds? Months? If something enters the next designation (IE: 28 hours) is that 1 day or 2 for your purposes?

I'm measuring everything in Business Hours, which are 9:00am to 9:00pm. With the business hours being exactly 12 hours, we consider 12 hours 1 business day. So if something hits 28 hours, I'll actually consider that 2.17 business days (two point something, my math could be off of course).

As far as how we're doing Business Hours, we have a function set up where we state the start time, end time, starting business hour, ending business hour, and it takes weekends and holiday into consideration, shooting out the hours.

That only works for THIS combination. Any time you start creating non-normalized structures to determine these calculations you end up with the potential for a user to hose you up later by using +1 more iteration of whatever it is you prepped for. While it can work, it ends up shooting you in the foot. However, you're on the right track for what I have in mind.

You are absolutely 100% right, most of my life in this position has been working around exceptions (i'm sure we all deal with this). In my mind, I limited this to 5 occurences so that I saw an end and kept my sanity 🙂

With that being said, I've never ONCE been successful at dynamic columns in SQL, they hurt my brain but I'm willing to give them a shot if that's where you are heading!

Excellent clarification. What happens on two resumes before the next hold? Use the first one or the last one? You mention the first resume but I just want to make sure.

In that scenario, we'll use the First Resume, ignoring the Second One. I haven't hit any scenarios with that when i'm fishing for examples, but you are correct that could happen, unfortunately.

It's not, we're gonna end up goin' old-school. What you'd really rather use here are windowing functions (Row_Number() with partitioning and ordering, in particular) to get out of this mess. It's not available in 2k.

Yes, Row_Number() is a function I DREAM about having 😀 Yet it's something so simple.

My pleasure. Hit up those clarifications above and I have a few ideas... but I'm going to warn you now, this is going to get UGLY. But it's doable.

I'll do my best to keep up. It's an ugly situation, no ones fault except teams that I will not mention!!!;-)

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 5

Running through an iteration on the methodology and I've run into a bit of a confusion.

I'd thought SntDt and RcvDt would simply be directly associated to if it was a 1001, 1002. I remember now it's associated to who issued it.

However, what's the rules for the 'ruling' result when the same line contains both? IE:

INSERT INTO #TASKS VALUES (2,12345,1001,'2012-06-06 15:30','2012-06-06 16:30')

Is it the minimum value, or does Snt/Rcv get the priority position in this possibility? I'm going to work off the assumption you want the MIN.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 6

End to end code. I just got SLAMMED at work before the weekend starts up so I need to clear a few things, luckily I comment as I go. I'll poke back at this either over the weekend or Monday and see if you have any questions. This code is successful in an 8.0 style db in 2k5, so it SHOULD be end to end for you. Fire it up in its own window and you'll see what the results look like, and they can be pared down.

Please, let me (and others) know if you have any questions. Even if I'm not around someone else should be able to answer as well.

EDIT: I should mention I believe you can swap the query out to a FAST_Forward read_only because I removed the internal cursor updates and went against the source table, but I didn't have time to rewrite and test removing that setting. As long as you use it against a #tmp you're fine. Just never run cursors off of main tables until you're very familiar with the impact you're going to level against it.

IF OBJECT_ID('tempdb..#Orders') IS NOT NULL

DROP TABLE #ORDERS

IF OBJECT_ID('tempdb..#Tasks') IS NOT NULL

DROP TABLE #TASKS

CREATE TABLE #ORDERS

(OrderId int,

StartDate datetime,

EndDate datetime)

INSERT INTO #ORDERS VALUES (12345,'2012-05-28 9:00','2012-06-28 18:30')

CREATE TABLE #TASKS

(TaskId int,

OrderId int,

TaskTypeId varchar(50),

SntDt datetime,

RcvDt datetime)

INSERT INTO #TASKS VALUES (1,12345,1001,'2012-06-01 11:35',null)

INSERT INTO #TASKS VALUES (2,12345,1001,'2012-06-06 15:30','2012-06-06 16:30')

INSERT INTO #TASKS VALUES (4,12345,1001,null,'2012-06-18 10:30')

INSERT INTO #TASKS VALUES (3,12345,1002,'2012-06-07 14:00',null)

INSERT INTO #TASKS VALUES (5,12345,1002,null,'2012-06-24 8:30')

/*

Step 1: Find the patterns

RBAR or Serial Update. RBAR first. KISS for now.

*/

IF OBJECT_ID('tempdb..#WorkingTable') IS NOT NULL

DROP TABLE #WorkingTable

CREATE TABlE #WorkingTable

(TaskIDINT PRIMARY KEY NONCLUSTERED,

OrderIDINT,

TaskTypeIDVARCHAR(50),

TaskDateDATETIME,

GroupValINT

)

CREATE CLUSTERED INDEX idx_c_WorkingTable ON #WorkingTable (OrderID, GroupVal)

DECLARE @WrkGroupVal INT,

@LastEvent VARCHAR(20),

@LastOrderIDINT,

@TaskIDINT,

@OrderIDINT,

@TaskTypeIDVARCHAR(50),

@TaskDateDATETIME,

@GroupValINT

SELECT@WrkGroupVal = 0,

@LastOrderID = -1,

@LastEvent = 'Nothing'

-- First, lets get the data cleaned up...

INSERT INTO

#WorkingTable

SELECT

TaskID,

OrderID,

TaskTypeID,

CASE WHEN RcvDt > SntDt THEN SntDt

WHEN SntDt >= RcvDt THEN RcvDt

ELSE ISNULL( SntDt, RcvDT)

END AS TaskDate,

NULL AS GroupVal

FROM

#Tasks

-- NEVER USE DYNAMIC OR FOR UPDATE CURSOR OPTIONS ON ANYTHING BUT A LOCALLY GENERATED TEMP TABLE.

-- YOU WILL HOSE ACCESS TO YOUR REAL TABLES HORRIBLY!!!

DECLARE Looper CURSOR FOR SELECT * FROM #WorkingTable ORDER BY OrderID, TaskDate FOR UPDATE

OPEN Looper

FETCH NEXT FROM Looper INTO @TaskID, @OrderID, @TaskTypeID, @TaskDate, @GroupVal

WHILE @@FETCH_STATUS = 0

BEGIN

-- Change our grouping on passing from a Resume to a Hold.

IF @LastEvent = 'Resume' AND @TaskTypeID = '1001' /*Hold*/

SET @WrkGroupVal = @WrkGroupVal + 1

-- When we cross over to a new order, reset the grouping value to be assigned.

-- Do this after the above event detection just to ease the number of things we

-- check at once.

IF @OrderID <> @LastOrderID

BEGIN

SET @WrkGroupVal = 0

SET @LastOrderID = @OrderID

END

SET @LastEvent = CASE @TaskTypeID WHEN '1001' THEN 'Hold' WHEN '1002' THEN 'Resume' ELSE '' END

UPDATE #WorkingTable

SET GroupVal = @WrkGroupVal

WHERE TaskID = @TaskID

FETCH NEXT FROM Looper INTO @TaskID, @OrderID, @TaskTypeID, @TaskDate, @GroupVal

END

CLOSE Looper

DEALLOCATE Looper

-- So, now we have a working set of groups. With more OrderIDs we could test the 'reset' of the values.

-- We also have no code that deals with an 'unreleased' hold or an 'unheld' release. Well, oh well.

--SELECT * FROM #WorkingTable

-- Using another temp table here to help it be clear in what's occurring.

IF OBJECT_ID('tempdb..#WrkTblPivot') IS NOT NULL

DROP TABLE #WrkTblPivot

CREATE TABLE #WrkTblPivot

(OrderIDINT,

GroupValINT,

HoldDtDATETIME,

ResumeDTDATETIME)

/* Next up, turn these into a usable time comparison*/

INSERT INTO #WrkTblPivot

SELECT

Header.OrderID,

Header.GroupVal,

(SELECT MIN( TaskDate)

FROM #WorkingTable

WHEREOrderID = Header.OrderID

AND GroupVal = Header.GroupVal

AND TaskTypeID = '1001') AS HoldDt

,(SELECT MIN( TaskDate)

FROM #WorkingTable

WHEREOrderID = Header.OrderID

AND GroupVal = Header.GroupVal

AND TaskTypeID = '1002') AS ResumeDt

FROM

(SELECT DISTINCT

OrderID,

GroupVal

FROM

#WorkingTable

) AS Header

-- Organize this to your order and find the difference.

SELECT

o.OrderID,

o.StartDate,

o.EndDate,

DATEDIFF( n, o.StartDate, o.EndDate) / 60. AS TotalOrderTimeInHours,

( DATEDIFF( n, o.StartDate, o.EndDate) / 60.) - drv.HoursTaken AS OrderTimeWithoutHoldsInHours

FROM

#Orders AS o

JOIN

(-- Get your diff in hours... using minutes for the decimal positioning.

SELECT

OrderID,

-- Use the decimal here to force a decimal conversion. 60 is an int, 60. is a decimal.

SUM( DATEDIFF( n, HoldDt, ResumeDt) / 60.) AS HoursTaken

FROM

#WrkTblPivot AS wtp

GROUP BY

OrderID

) AS drv

Ono.OrderID = drv.OrderID

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

matthew.brown-746365 SSC Enthusiast Points: 122 More actions · Answer 7

Craig, wow! Thank you so much, I wasn't expecting to get nearly this kind of effort from anyone (I don't mean that as an insult, just I know everyone here has lives as well). Very grateful for this!

I'm also getting slammed, beginning of the month is always such a party here. Running through this, I think i'm following most of it as well, and it IS running without any problems, against my temp tables of course.

I should have more time either today or tomorrow to really sit down and go through all of this.

And to answer your other question, yes I personally would use the Minimum in that case (snt and rcv both not null on same line).

Wow wow wow, again, thank you! I'll be in touch soon!

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 8

matthew.brown-746365 (7/2/2012)
Craig, wow! Thank you so much, I wasn't expecting to get nearly this kind of effort from anyone (I don't mean that as an insult, just I know everyone here has lives as well). Very grateful for this!

No insult taken. 🙂 I'm very glad this was what you were basically needing. Happy to help, not a lot of old-schoolers who remember the old methods anymore... though, that's probably a good thing.

Running through this, I think i'm following most of it as well, and it IS running without any problems, against my temp tables of course.

Good, I'm glad to see you're taking the time to understand it. For optimization pieces of that could be condensed into single statements, but it would be a lot harder to follow. It'll depend on if you need performance over maintenance... and I tend to lean towards maintenance in example code that isn't performance based changes.

Wow wow wow, again, thank you! I'll be in touch soon!

😎 Reactions like that are one of the reasons I love being able to hang around here and help occassionally. Come on back with any confusion or questions you may end up with.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA