Stuck with a unique JOIN scenario

Question

Stuck with a unique JOIN scenario

Steven Aiello

SSChasing Mays

Points: 604
More actions
November 12, 2015 at 2:01 pm

#322503

I have a scenario where I need to return values in a very unique way given the schema that is in place that I can't modify.
I have the following tables:
The Workflow table will contain workflow records. From a business perspective Workflows are like decision trees that can contain/kickoff zero/one/multiple child workflows. But each workflow, whether it's a child or parent, will contain a record in this table. The other gotcha is that there is no limit to the number of children a workflow can contain. One parent workflow can contain multiple children, which also contain children, etc., etc. And the parent workflow will have it's own steps (WorkflowDetails) that do its' own thing.
CREATE TABLE [dbo].[Workflow] (
Id int IDENTITY(1,1) NOT NULL,
Name nvarchar(30) NULL,
RequestedOn datetime NULL,
InternalStatus nchar(2) NULL,
LastUpdatedDateTime datetime NULL
)
Example Data:
Id Name RequestedOn InternalStatusLastUpdatedDateTime
20379Child3 2015-11-06 13:53:07.043 FN2015-11-06 13:53:10.663
20378Child2 2015-11-06 13:53:04.640 FN2015-11-06 13:53:05.873
20377Child1 2015-11-06 13:51:46.217FN2015-11-06 13:53:03.580
20376Parent1 2015-11-06 13:51:26.187FN2015-11-06 13:53:11.770
20375Parent2 2015-11-05 19:53:59.110FN2015-11-05 19:54:01.530
18217 Child4 2015-08-18 21:09:27.797 AB 2015-08-18 21:11:04.573
18216 Parent3 2015-8-18 21:09:24.753 AB 2015-08-18 21:11:04.513
The WorkflowDetails table will contain all of the steps within every Workflow.
CREATE TABLE dbo.WorkflowDetails (
ID int IDENTITY(1,1) NOT NULL,
ExecutionID int NULL, --This is Workflow.Id from above
Action nvarchar(100) NULL
)
WorkflowDetails example data:
Id ExecutionIDAction
30508420376Finish Marker1
30508320376Create Contract
30508220376Create Contract Details
30508120379Set Workflow output variable3
30508020379VM Details
30507920378Rule1
30507820379Decision1
30507720379Start
280376 18217 Choice1
280375 18217 Gate2
280370 18216 CallChildWorkFlow1
280369 18216 Decision4
The only way to map what workflows are tied to one another in the parent/child relationship is with another table called WorkflowRelations
CREATE TABLE [dbo].[WorkflowRelations](
[ExecutionId] [int] NOT NULL,
[ParentExecutionId] [int] NOT NULL,
[TopParentExecutionId] [int] NOT NULL
)
Sample data from WorkflowRelations:
ExecutionId ParentExecutionIdTopParentExecutionId
20377 20376 20376
20378 20376 20376
20379 20376 20376
18217 18216 18216
Things get even more interesting because there is a Request table that houses those processes that kick off the main workflow.
CREATE TABLE Request(
id int IDENTITY(1,1) NOT NULL,
user_id nvarchar(50) NULL,
productId nvarchar(50) NULL,
execution_id int NULL,
orderNum int NULL,
createDate datetime NULL
)
The Request.execution_id is always the parent workflow execution Id.
Example data from Request:
iduser_idproductId execution_id orderNum createDate
1289testacctVMSetup 20376 1285 2015-11-06 08:51:23.493
1259 testacct SecurityBadge 18216 1259 2015-08-18 17:09:24.067
So now my issue....For each Request.execution_id that has the potential to have it's own workflowdetails, child workflowdetails, child of child workflowdetails, etc. I need to return one row that contains the following:
For 20376:
RequestNo - which is Request.orderNum + '-' + Request.execution_id = 1285-20376
Request.user_id = testacct
Request.productId = VMSetup
Request.createDate = 2015-11-06 08:51:23.493
Workflow.Name = Parent1 - Because I only want to track the parent Workflow....if it exists as the engine may have died after the Request was generated and before the Workflow kicked off
Workflow.InternalStatus = FN - This is the latest status of Parent1. Potential values are as follows:
WHEN 'FN' THEN 'Finished'
WHEN 'FE' THEN 'Finished with Errors'
WHEN 'AB' THEN 'Aborted'
WHEN 'EX' THEN 'Running'
WHEN 'PA' THEN 'Paused'
WHEN 'SL' THEN 'Waiting'
ELSE 'Other'
WorkflowDetails.ExecutionId = 20376 - The issue is I want the last WorkflowDetails.ExecutionId that exists for the entire chain. In this case since the full workflow finished, it's the last step of Parent1 (20376). But, if for some reason the workflow failed in Child2 at WorkflowDetails.Id = 305079 and everything above 305079 didn't happen due to this, I'd need 20378 to be the value that is returned here.
WorkflowDetails.Action - name associated with WorkflowDetails.ExecutionId
The previous person that wrote the query used a CURSOR and I'm trying to rewrite using a set-based approach because the cursor took over 4 minutes to run with the amount of data.
Here is the script that I have but I'm not getting the correct child workflow.
DECLARE @ParamStartDate datetime = NULL, @ParamEndDate datetime= NULL-- Input dates
, @ServiceId nvarchar(max)= NULL-- To filter by Service
, @ParentWFName nvarchar(max)= NULL-- To filter by Parent WF Name
, @ReportType nvarchar(30) = 'Debug'
DECLARE @StartDate datetime = COALESCE(@ParamStartDate,DATEADD(Month,-3,GETDATE()))
DECLARE @TempEndDate date = COALESCE(@ParamEndDate,GETDATE())
DECLARE @EndDate datetime = cast(cast(@TempEndDate as nvarchar(10)) + ' 23:59:59' as datetime)
---==================================================================================================
-- DECLARE Table to store results from processing each requests getting details from the executed WFs
DECLARE @WFdetails TABLE(ParentWF_ExecutionID int, LastWF_Name nvarchar(100), LastWF_InternalStatus nchar(2)
, LastWF_ExecutionID int, LastWF_ActionName nvarchar(100))
DECLARE @WFName nvarchar(100)
DECLARE @WFId int
DECLARE @TotalRequestsReported int
-- =================
-- GET THE LIST OF REQUESTS
SELECT
RD.id, RD.user_id, RD.productId, RD.execution_id, RD.orderNum, RD.createDate
INTO #RequestDetailedList
FROM [dbo].[Request] RD with (nolock)
LEFT OUTER JOIN Workflow SWE with (nolock) on
RD.execution_id = SWE.id --In case there are any requests that don't have an associated Workflow
WHERE createDate >= @StartDate
AND createDate <= @EndDate
AND (@ServiceName IS NULL OR RD.productId=@ServiceId) --If we wanted to target a specific product or request
AND (@ParentWFName IS NULL OR SWE.Name=@ParentWFName)
order by id desc
--- Removed cursor and went with set based approach below
-- DECLARE cursor to iterate on Requests
/*DECLARE WFLIST CURSOR FOR
SELECT execution_id from #RequestDetailedList with (nolock)
-- Iterate on the list of requests getting details from the WF execution
OPEN WFLIST
FETCH NEXT FROM WFLIST
INTO @WFId
WHILE @@FETCH_STATUS = 0
BEGIN
-- FOR EACH EXECUTION_ID from the Request list, get the last executed action
INSERT INTO @WFdetails
SELECT TOP 1 @WFId as ParentWF_ExecutionID
, SWE.Name as LastWF_Name
, SWE.InternalStatus as LastWF_InternalStatus
, SWED.ExecutionID as LastWF_ExecutionID
, Action as LastWF_ActionName
FROM [dbo].WorkflowDetails SWED with (nolock)
INNER join [dbo].[Workflow] SWE with (nolock)
on SWED.ExecutionID = SWE.ID
where SWED.ExecutionID in (
SELECT exid FROM dbo.udf_get_related_workflows2(@WFId) --this UDF returns a table containing the parent Workflow and all children under parent
) order by SWED.Id desc
FETCH NEXT FROM WFLIST
INTO @WFId
END
CLOSE WFLIST
DEALLOCATE WFLIST
*/
INSERT INTO @WFdetails
SELECT A.ParentWF_ExecutionID
, A.LastWF_Name
, A.LastWF_InternalStatus
, A.LastWF_ExecutionID
, SWED.Action as LastWF_ActionName
FROM
[dbo].WorkflowDetails SWED with (nolock)
INNER JOIN
(
SELECT RDL.execution_id as ParentWF_ExecutionID
, SWE.Name as LastWF_Name
, SWE.InternalStatus as LastWF_InternalStatus
, SWED.ExecutionID as LastWF_ExecutionID
, MAX(SWED.Id) AS MaxSWEDId
FROM [dbo].WorkflowDetails SWED with (nolock)
INNER JOIN #RequestDetailedList RDL on SWED.ExecutionID = RDL.execution_id
INNER join [dbo].Workflow SWE with (nolock)
on SWED.ExecutionID = SWE.ID
where SWED.ExecutionID in (
SELECT exid FROM dbo.udf_get_related_workflows2(RDL.execution_id)
)
GROUP BY RDL.execution_id
, SWE.Name
, SWE.InternalStatus
, SWED.ExecutionID ) A ON A.MaxSWEDId = SWED.ID
order by SWED.Id desc
-- CREATE the final result with details for each request
select cast(RDL.order_number as nvarchar(20)) + '-' + cast(RDL.execution_id as nvarchar(20)) as RequestNo
, RDL.user_logon as Request_SubmittedBy
, RDL.product_name as Request_ServiceName
, RDL.createDate as Request_CreateDate
, SWE.Name as ParentWF_Name
, SWE.InternalStatus as ParentWF_InternalStatus
, LastWF.*
, (CASE LastWF_InternalStatus -- Based on the internal status and the last action executed, the request get the real final status
WHEN 'FN' THEN 'Finished Unexpectedly'
WHEN 'FE' THEN 'Finished with Errors'
WHEN 'AB' THEN 'Aborted'
WHEN 'EX' THEN 'Running'
WHEN 'PA' THEN 'Paused'
WHEN 'SL' THEN 'Waiting'
ELSE 'Other'
END) as ServiceRealStatus
, CAST(YEAR(RDL.createDate) as varchar(4)) + '-' + RIGHT('0'+CAST(MONTH(RDL.createDate) as varchar(2)),2) AS Request_MonthYear
into #Results
from @WFdetails LastWF
left JOIN #RequestDetailedList RDL with (nolock)
on RDL.execution_id = LastWF.ParentWF_ExecutionID
left join[dbo].Workflow SWE with (nolock)
on SWE.ID=RDL.execution_id
order by Request_CreateDate DESC
SET @TotalRequestsReported = @@ROWCOUNT
-- RETURN THE EXPECTED RESULTS BASED ON THE REPORT TYPE
-- ============================================================
IF @ReportType = 'Debug'
SELECT * FROM #Results ORDER BY ServiceRealStatus
Any thoughts on my set-based query to return the correct data so that I can find the max(SWED.Id) for the parent/child in the Request and then associate it back to the original parent Workflow and Request? If you use my data and query above, the data for 18216 doesn't come out correctly. It's only finding the max WorkflowDetailsId for 18216 where it needs to pull in the max WorkflowDetailsId for 18217 that is associated with it.

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply

SoHelpMeCodd SSCertifiable Points: 5871 More actions · Answer 1

I am sure what you wrote took a considerable amount of time to write. But I am afraid at least I cannot quickly determine a solution to your problem without seeking many clarifications. To avoid a need to seek many clarifications, I think readers (or at least I) would be more able to address your problem, if you minimized the use of English and maximized the use of T-SQL. For example, if you were to provide readers with no more than one runnable script, we may be able to help. That script should include ALTER TABLE ... CONSTRAINT ... declarations (instead of writing "This is Workflow.Id from above" or writing "another in the parent/child relationship ", or not identifying a primary key in table WorkflowRelations, etc.), should include INSERT statements (instead of "Sample data from" or "Example data"), should omit columns or objects that are not needed in the T-SQL (such as the Id column for the Workflow table, @ReportType, etc.), and should include complete DDL for all objects (such as SWExecute and udf_get_related_workflows2). The script you offer should be followed by the actual results (using the scripted INSERT data) and the expected results. If you can run that script in a new, empty database, you are assured that we can run the same script, which means we will know your foundation :). I realize what I just wrote is a lot of work, but in addition to helping us, it will help you boil a problem down to its barest form;-).

Looking at the overall T-SQL, I am not certain the use of a cursor is the cause of your performance concern. It appears there are other problems with the T-SQL. Because I cannot understand your problem (please accept my apologies), I will instead offer you some general performance tips.

Table variables

DECLARE @WFdetails TABLE

(ParentWF_ExecutionID int,

LastWF_Name nvarchar(100),

LastWF_InternalStatus nchar(2),

LastWF_ExecutionID int,

LastWF_ActionName nvarchar(100))

are convenient, but SQL Server 2008 always considers a table variable to contain merely one row (one page of IO). When a table variable is used in a JOIN

from @WFdetails LastWF

left JOIN #RequestDetailedList RDL

there is a distinct risk that SQL Server will incorrectly assume the table variable is less costly (has fewer rows) than the table being JOINed (#RequestDetailedList). You can avoid that risk by instead using a Common Table Expression (CTE). You already have most of the CTE written, starting with SELECT A.ParentWF_ExecutionID By using a CTE instead of a table variable, you will be allowing SQL Server to see statistics that estimate the actual number of rows (as opposed to SQL Server assuming just one row) The same applies to #RequestDetailedList, even though SQL Server will recompile this batch (to get the current statistics for #RequestDetailedList) at certain thresholds (triggered by data inserts into #RequestDetailedList).

It may appear INSERT INTO @WFdetails has SELECTed an order by SWED.Id desc, but relational database tables (and table variables) never have a defined order. You must always SELECT the order, even if it appears you had INSERTed an "order". That is true even if you run a simple SELECT * FROM @WFdetails (without an ORDER BY clause) and you see results in the expected order. Unsure whether this relates to unexpected results, but I doubt this is a performance concern.

Short-circuiting predicates (i.e. WHERE clauses and JOIN clauses), such as AND (@ServiceName IS NULL OR RD.productId=@ServiceId) --If we wanted to target a specific product or request

AND (@ParentWFName IS NULL OR SWE.Name=@ParentWFName)

will frequently introduce poor performance. See Dynamic Search Conditions in T-SQL. In particular section "4. Static SQL with OPTION (RECOMPILE)" Given there are only 4 permutations of the above predicates (the permutations are @ServiceName IS NULL AND @ParentWFName IS NULL, @ServiceName IS NOT NULL AND @ParentWFName IS NULL, @ServiceName IS NULL AND @ParentWFName IS NOT NULL, and @ServiceName IS NOT NULL AND @ParentWFName IS NOT NULL), you could move those variables into 4 control-of-flow IF statements. Doing so would allow SQL Server to compile 4 execution plans, each tailored for each of the 4 permutations.

Using UDF's in predicates, such aswhere SWED.ExecutionID in (

SELECT exid FROM [PMGSPE].dbo.udf_get_related_workflows2(@WFId) --this UDF returns a table containing the parent Workflow and all children under parent

is also risky (or not, depending upon its DDL). I suspect a CTE could be used to replace the UDF, JOINed with SWED.ExecutionID. By using a CTE instead of a UDF, a future developer (or you) would be able to consider all objects accessed by this batch/stored procedure (without having to inspect the DDL for the UDF).

Given that the previous author used the NOLOCK hint on certain tables, I suspect those tables are where your performance concern actually exists (as opposed to the cursor's rather simple SELECT). I would consider whether the NOLOCK'ed predicates

RDL.execution_id = LastWF.ParentWF_ExecutionID and SWE.ID=RDL.execution_id

and

on SWED.ExecutionID = SWE.ID

where SWED.ExecutionID in (

SELECT exid FROM dbo.udf_get_related_workflows2(RDL.execution_id)

) and WHERE createDate >= @StartDate

AND createDate <= @EndDate

AND (@ServiceName IS NULL OR RD.productId=@ServiceId) --If we wanted to target a specific product or request

AND (@ParentWFName IS NULL OR SWE.Name=@ParentWFName)

are indexed, and create indexes (to test performance) if they are not.

If you are instead interested in getting a quick answer (about where a performance problem exists), BCP OUT the data (or backup the database), script out the tables (if not using a backup), find an otherwise very inactive SQL Server, create or restore a test database (if not restored, run the scripts), BCP IN the data, open SSMS query window, select @@SPID, configure a SQL Server Profiler trace to collect the SP:STmtCompleted event (filtered by the @@SPID), start the SQL Server Profiler trace, run the batch (or stored procedure), stop the trace, and find the maximum duration in the trace. I suspect you will find the longest duration will be spent inserting into #RequestDetailedList, inserting into @WFdetails, or selecting into #Results (i.e. not the cursor's select from #RequestDetailedList). The answer depends upon how many rows are in each table, the table variable or the temporary table, and what SQL Server knows or doesn't know about the expected row counts within them.

It may be tempting (because it is extremely quick) to use SSMS to generate a Estimated Query Plan for the T-SQL as written. Unfortunately, an Estimated Query Plan will not reveal missing indexes and will not display accurate cost estimates. Those will not happen because the T-SQL as written uses temporary tables and table variables (which are not populated until run time, well after the estimated plan is created). By replacing them with CTEs, the estimated plan will be more accurate and a missing index might be suggested (that would not be suggested when table variables and temporary tables are used). There is still the problem of variable values not being known to SQL Server at the time when the estimated plan is compiled and displayed, but that can be partially overcome by replacing variables with a typical value, expressed as a literal value (such as WHERE SWE.Name='A Typical Parent Workflow Name' instead of WHERE SWE.Name=@ParentWFName). With the T-SQL as written, inspecting its estimated query plans will be misleading. In contrast, a trace of the T-SQL as written (when run against realistic data) will focus your efforts, by more reliably exposing needed performance fixes.

In short, I think you risk wasting precious time by rewriting the cursor, because I think the batch's salient performance problem does not concern the cursor:cool:

Steven Aiello SSChasing Mays Points: 604 More actions · Answer 2

Steven Aiello

SSChasing Mays

Points: 604

November 13, 2015 at 7:19 am

#1839604

Thanks for the suggestions.

MadAdmin SSChampion Points: 11261 More actions · Answer 3

I see scalar UDF's in the where clause operating on a column and not a variable.

Convert the UDF to a TVF, then cross apply onto the TVF.

Not just any TVF, a single statement TVF.

Catch-all queries done right [/url]
Gail Shaw's Performance Blog[/url]

serg-52 SSCrazy Eights Points: 9913 More actions · Answer 4

Quite formally you may first refactor cursor part of code to

INSERT INTO @WFdetails

select RDL.execution_id as ParentWF_ExecutionID,

zzz.LastWF_Name,

...

from #RequestDetailedList RDL

cross apply (

SELECT TOP 1 SWE.Name as

, SWE.InternalStatus as LastWF_InternalStatus

, SWED.ExecutionID as LastWF_ExecutionID

, Action as LastWF_ActionName

FROM [dbo].WorkflowDetails SWED with (nolock)

INNER join [dbo].[Workflow] SWE with (nolock)

on SWED.ExecutionID = SWE.ID

where SWED.ExecutionID in (

SELECT exid FROM dbo.udf_get_related_workflows2(RDL.execution_id) --this UDF returns a table containing the parent Workflow and all children under parent

) order by SWED.Id desc

) zzz

and then try to optimize it.

Steven Aiello SSChasing Mays Points: 604 More actions · Answer 5

Thanks Serg and MadAdmin....I took your suggestions and did a similar approach based on the specific outcome I was looking for and I got it to work.

Much appreciated!