Performance difference for stored proc in SSIS vs SSMS (and other questions)

Question

Performance difference for stored proc in SSIS vs SSMS (and other questions)

dbodell

Right there with Babe

Points: 742
More actions
May 8, 2019 at 3:44 pm

#3638379

I have a humongous data set that I am updating from another humongous data set. We are talking about hundreds of millions of rows being updated. Apart from partitioning these tables (which are processed by single year batches) I have tuned the update query and indexed the tables so that the execution plan looks very reasonable. However, when I run the stored procedure as a SQL task step in SSIS it hangs for hours and I usually have to abort it. I tried putting the code directly in the task instead of doing an EXEC for the stored proc, but that didn't help performance either. Then I decided to run the stored procedure manually in SSMS and it only took 20 minutes.
Why would a stored procedure perform so miserably as a result of being called by SSIS vs being executed manually (same server in both cases .. I am running it manually while on the same server)?
Second question: As I mentioned, I am considering partitioning the tables being used because this data is processed in yearly batches. However, any given year could have hundreds of millions of rows. I've been reading about partitioning and have done it once or twice on large tables but never something this large. In some previous situations the partitioning did improve my ETL process while in others it did not. I'm a bit cautious about doing these huge ones because you cannot undo it and if it isn't going to help, I at least don't want it to hurt anything.
Finally, would it be a good idea to use a clustered columnstore index? Would a filtered non-clustered index be an alternative?
Thanks in advance for your guidance.

Viewing 13 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 1

Why do "hundred of millions of rows" need to be updated? If we knew what the table was for, what the structure was, and the need to do the update, we might be able to help better. I recommend you post the CREATE TABLE statement for the table being sure to include all constraints, keys (FK's, as well), and any other amplifying information especially what the table is being updated for and what the source of data is for the update.

As for why SSIS is taking so long, I can only imagine because I don't use it and will have to defer to others as to what you may have done wrong or what the problem is with SSIS for this problem. Of course, they're facing the exact same problem the rest of us are facing... we simply don't have enough information from you to help you.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Sue_H SSC Guru Points: 90860 More actions · Answer 2

dbodell wrote:

Then I decided to run the stored procedure manually in SSMS and it only took 20 minutes. Why would a stored procedure perform so miserably as a result of being called by SSIS vs being executed manually (same server in both cases .. I am running it manually while on the same server)?

I think the best article on this subject is this one by Erland Sommarskog:

Slow in the Application, Fast in SSMS? Understanding Performance Mysteries

Sue

dbodell Right there with Babe Points: 742 More actions · Answer 3

Jeff, thanks for the reply. So this data mart holds tables that store records of visits to healthcare facilities for all of Ohio. The Ohio Health Association sends out these mongo csv files once per quarter and I have to process them. I ingest the data just fine and it doesn't take long . It is a single delete of all the records for a given year out of each table followed by batched inserts of replacement records for that year for each table. In some cases I drop non-clustered indexes first. I have to process up to two years' worth of data each quarter and I do them one year at a time. I'd be happy to explain more about this part but the part that is horrible is when I have to read all of the CPT code records that were ingested for that year and then use an "algorithm" and sorting to come up with a ranking that is commonly used in healthcare for difficulty level of the procedure. Every single visit record for that year must be updated with the resulting rank. Hence the "hundred of millions of rows" being updated. I have attached my file and index definitions and the stored procedure, which is pretty simple. I also noted the number of records in each table.

Again, I just want to understand why it would perform adequately in SSMS but not in SSIS.

This environment is SQL Server 2016 and all of the data resides on the same server, in the same database.

I hope this and the attachments helps in clarifying what I am doing.

begin

    update oha.fact_OHAEncounter
    set Rank_id = isnull(rnk.Rank_ID, -1)
    from oha.fact_OHAEncounter                 as ENC
        --- ONLY PROCESS ENCOUNTER THAT HAVE BEEN STAGED
        inner join stg.fact_OHAEncounter_stage as stg
--- all columns on index
            on stg.RowID = ENC.RowId
               and stg.FileYear = ENC.FileYear
               and stg.FileType = ENC.FileType
        left join
        (
            select CPT.OHAEncounterKey
                 , max(rnk.Rank) as HighRank
            from oha.fact_OHACPT                 as CPT
                inner join stg.fact_OHACPT_stage as stg
                    on stg.RowID = CPT.RowID
                       and stg.FileYear = CPT.FileYear
                       and stg.FileType = CPT.FileType
                left join dss.dim_CPT4           as DCPT
                    on CPT.OHACPTCode_id = DCPT.CPT_ID
                left join dss.dim_Rank           as rnk
                    on rnk.Rank_ID = DCPT.Rank_ID
            where CPT.RowID > 0  --- to use the index
                  and CPT.FileType in ( 'OP_', 'OPC', 'OPL', 'OPO' ) -- CPT.RowId > 0

            group by CPT.OHAEncounterKey
        )                                      as DATA
            on ENC.OHAEncounterkey = DATA.OHAEncounterKey
        left join dss.dim_Rank                 as rnk
            on DATA.HighRank = rnk.Rank
    where ENC.RowId > 0
          and ENC.FileYear >= '2017'
          and ENC.FileType in ( 'OP_', 'OPC', 'OPL', 'OPO' );
end;

/*************************************************************
table for the visits (encounters) this is the table being updated -- 182,906,201 records
*************************************************************


CREATE TABLE [oha].[fact_OHAEncounter](
[RowId] [int] NOT NULL,
[Filename] [nvarchar](100) NOT NULL,
[FileYear] [nvarchar](4) NOT NULL,
[FileType] [nvarchar](3) NOT NULL,
[OHAHospital_ID] [int] NOT NULL,
[OHAEncounterTypeCode_ID] [int] NOT NULL,
[OHAPlaceOfServiceCode_id] [int] NOT NULL,
[AdmissionDate] [date] NULL,
[AdmissionDateKey] [int] NOT NULL,
[OHAAdmissionSourceCode_id] [int] NULL,
[OHAAdmissionTypeCode_ID] [int] NOT NULL,
[Age] [smallint] NULL,
[CountyFipsCode] [varchar](5) NULL,
[DischargeDate] [date] NULL,
[DischargeDateKey] [int] NOT NULL,
[OHADischargeStatusCode_ID] [int] NOT NULL,
[DateOfBirth] [date] NULL,
[OHAEthnicityCode_ID] [int] NOT NULL,
[OHALanguageCode_ID] [int] NOT NULL,
[OHALengthOfStayDays] [int] NULL,
[MSDRG_ID] [int] NOT NULL,
[OHAMSDrgProductLineCode] [varchar](3) NULL,
[OHAMdcCode] [varchar](3) NULL,
[OHAPayerGroupCode_ID] [int] NOT NULL,
[OHARaceCode_ID] [int] NOT NULL,
[OHASexCode_ID] [int] NOT NULL,
[OHATotalCharges] [money] NOT NULL,
[OHAPostalCode_ID] [int] NOT NULL,
[CancerSite_ID] [int] NULL,
[JointReplacementType] [varchar](40) NULL,
[DiabetesFlag] [bit] NULL,
[MajorCardiacFlag] [bit] NULL,
[Rank_id] [int] NULL,
[Package Name] [nvarchar](64) NOT NULL,
[Execution Start Time] [datetime] NOT NULL,
[Machine Name] [nvarchar](15) NOT NULL,
[Task Name] [nvarchar](64) NOT NULL,
[User Name] [nvarchar](64) NOT NULL,
[OHAEncounterkey] [nvarchar](20) NOT NULL,
[EDFlag] [tinyint] NOT NULL,
[SurgeryFlag] [tinyint] NULL,
[MinorSurgeryFlag] [tinyint] NULL,
[BariatricFlag] [tinyint] NULL,
[ObservationFlag] [tinyint] NULL,
[RecurringFlag] [tinyint] NULL,
[OnOffSiteType] [varchar](50) NULL,
[TraumaFlag] [tinyint] NULL,
 CONSTRAINT [PK_fact_OHAEncounter] PRIMARY KEY CLUSTERED 
(
[OHAEncounterkey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 85) ON [FG2009]
) ON [FG2009]
GO



--- non-clustered indexes on this table
CREATE NONCLUSTERED INDEX [KDX_OHADischargeDateKey_cover] ON [oha].[fact_OHAEncounter]
(
[DischargeDateKey] ASC
)
INCLUDE ( [OHAEncounterTypeCode_ID],
[MSDRG_ID],
[OHAEncounterkey]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [FG2009]
GO

---- non-clustered index specifically for the RANK update process
CREATE NONCLUSTERED INDEX [KDX_KEYCOLUMNS_RANK] ON [oha].[fact_OHAEncounter]
(
[RowId] ASC,
[FileYear] ASC,
[FileType] ASC
)
INCLUDE ( [Rank_id]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 95) ON [FG2015]
GO

/*******************************************************************
table for the CPT codes -- this is the table that is used to determine rank -- 858,045,215 rows
*******************************************************************/



CREATE TABLE [oha].[fact_OHACPT](
[IdentityRow] [bigint] IDENTITY(1,1) NOT NULL,
[RowID] [int] NOT NULL,
[FileName] [nvarchar](100) NULL,
[FileYear] [nvarchar](4) NOT NULL,
[FileType] [nvarchar](3) NOT NULL,
[OHACPTCode_id] [int] NOT NULL,
[OHACPTDate] [date] NOT NULL,
[OHACPTUnitOfService] [varchar](50) NULL,
[Package Name] [nvarchar](64) NOT NULL,
[Execution Start Time] [datetime] NOT NULL,
[Machine Name] [nvarchar](15) NOT NULL,
[Task Name] [nvarchar](64) NOT NULL,
[User Name] [nvarchar](64) NOT NULL,
[OHAEncounterKey] [nvarchar](20) NOT NULL,
 CONSTRAINT [PK_fact_OHACPT] PRIMARY KEY CLUSTERED 
(
[FileYear] ASC,
[IdentityRow] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 85) ON [FG2013]
) ON [FG2013]
GO

--- non-clustered index on this table
SET ANSI_PADDING ON
GO

/****** Object:  Index [KDX_ROWID_FILETYPE_COVER]    Script Date: 5/8/2019 1:47:33 PM ******/
CREATE NONCLUSTERED INDEX [KDX_ROWID_FILETYPE_COVER] ON [oha].[fact_OHACPT]
(
[RowID] ASC,
[FileType] ASC
)
INCLUDE ( [FileYear],
[OHACPTCode_id],
[OHAEncounterKey]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [FG2013]
GO

/**************************************************
dimension dim_CPT4 used in process

*************************************************/


CREATE TABLE [dss].[dim_CPT4](
[CPT_ID] [int] IDENTITY(0,1) NOT NULL,
[CPT4Code] [varchar](50) NOT NULL,
[CPT4CodeName] [varchar](100) NULL,
[CPT4Catg] [varchar](100) NULL,
[Package Name] [nvarchar](64) NOT NULL,
[Execution Start Time] [datetime] NOT NULL,
[Machine Name] [nvarchar](15) NOT NULL,
[Task Name] [nvarchar](64) NOT NULL,
[User Name] [nvarchar](64) NOT NULL,
[Rank_ID] [int] NULL,
[CPT4KPNCategory] [varchar](100) NULL,
 CONSTRAINT [PK_dim_CPT4] PRIMARY KEY CLUSTERED 
(
[CPT_ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [FG2010]
) ON [FG2010]
GO

--- non clustered index using business key
SET ANSI_PADDING ON
GO

/****** Object:  Index [IDX_CPT4CODE_COVER]    Script Date: 5/8/2019 1:52:01 PM ******/
CREATE UNIQUE NONCLUSTERED INDEX [IDX_CPT4CODE_COVER] ON [dss].[dim_CPT4]
(
[CPT4Code] ASC
)
INCLUDE ( [CPT_ID],
[CPT4CodeName],
[CPT4KPNCategory]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [FG2011]
GO

/**************************************************
dimension dim_Rank used in process

*************************************************/


CREATE TABLE [dss].[dim_Rank](
[Rank_ID] [int] IDENTITY(-1,1) NOT NULL,
[Rank] [int] NOT NULL,
[CPT4ServiceLine] [varchar](254) NULL,
[CPT4SubServiceLine] [varchar](254) NULL,
[Package name] [nvarchar](64) NOT NULL,
[Execution start time] [datetime] NOT NULL,
[Machine name] [nvarchar](15) NOT NULL,
[Task name] [nvarchar](64) NOT NULL,
[User name] [nvarchar](64) NOT NULL,
[CPT4ProcedureLevel] [varchar](254) NULL,
 CONSTRAINT [PK_dim_Rank] PRIMARY KEY CLUSTERED 
(
[Rank_ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [FG2010]
) ON [FG2010]
GO

--- nonclustered index for business key

/****** Object:  Index [KDX_RANK_COVER]    Script Date: 5/8/2019 1:54:06 PM ******/
CREATE NONCLUSTERED INDEX [KDX_RANK_COVER] ON [dss].[dim_Rank]
(
[Rank] ASC
)
INCLUDE ( [Rank_ID]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [FG2010]
GO

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 4

Something is seriously wrong with my connections to SSC... I don't see any files attached.

@sue... are you seeing any? This might be associated with the same problem as me not being able to get to the BBCode link.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Sue_H SSC Guru Points: 90860 More actions · Answer 5

Hey Jeff -

I don't think there are any attachments. There are scripts posted by the OP, not any attachments. All the DDL and the query are there in two different scripts posted. The query and then another set of all the DDL. Do you see those? Two different script boxes or whatever you want to call that.

Sue

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 6

Thanks, Sue. Thanks for confirming what I was seeing. As a bit of a sidebar, I don't even know how to do attachments on this site since the migration.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jonathan AC Roberts SSCoach Points: 18408 More actions · Answer 7

I've got no idea why the stored procedure would be slower when called from SSIS instead of SSMS. But when I get a SQL statement like that I try to split it to get better performance. In the case of that update I would try rewriting it like this:

SELECT CPT.OHAEncounterKey, 
       MAX(rnk.Rank) AS HighRank
  INTO #DATA
  FROM oha.fact_OHACPT AS CPT
 INNER JOIN stg.fact_OHACPT_stage AS stg
         ON stg.RowID = CPT.RowID
        AND stg.FileYear = CPT.FileYear
        AND stg.FileType = CPT.FileType
  LEFT JOIN dss.dim_CPT4 AS DCPT ON CPT.OHACPTCode_id = DCPT.CPT_ID
  LEFT JOIN dss.dim_Rank AS rnk ON rnk.Rank_ID = DCPT.Rank_ID
 WHERE CPT.RowID > 0  --- to use the index
   AND CPT.FileType IN('OP_', 'OPC', 'OPL', 'OPO') -- CPT.RowId > 0
 GROUP BY CPT.OHAEncounterKey
--
CREATE CLUSTERED INDEX IX_#DATA ON #DATA(OHAEncounterKey);
--
UPDATE oha.fact_OHAEncounter
   SET Rank_id = isnull(rnk.Rank_ID, -1)
  FROM oha.fact_OHAEncounter AS ENC
   --- ONLY PROCESS ENCOUNTER THAT HAVE BEEN STAGED
 INNER JOIN stg.fact_OHAEncounter_stage AS stg --- all columns on index
         ON stg.RowID = ENC.RowId
        AND stg.FileYear = ENC.FileYear
        AND stg.FileType = ENC.FileType
  LEFT JOIN #DATA AS DATA 
         ON ENC.OHAEncounterkey = DATA.OHAEncounterKey
  LEFT JOIN dss.dim_Rank AS rnk ON DATA.HighRank = rnk.Rank
 WHERE ENC.RowId > 0
   AND ENC.FileYear >= '2017'
   AND ENC.FileType IN('OP_', 'OPC', 'OPL', 'OPO');

Sue_H SSC Guru Points: 90860 More actions · Answer 8

I just realized after you mentioned it, there was some button on the toolbar for attachments but I don't see it anymore. And I have both toolbars displaying - the upper and lower one. That could be a bug. I'll try playing with it a bit later to see if it's a PEBKAC or not.

Sue

dmbaker SSCertifiable Points: 5144 More actions · Answer 9

Regarding why it's slower in SSIS...SSIS needs to "validate" your query (in this case a stored procedure). It has to get metadata about the query in order to validate that the fields you're using match up with the flow and the target. I'm betting that it's the validation step that's taking a long time, and the actual execution of the stored proc is about the same as what you'd see in SSMS. I'm not sure what SSIS is actually doing behind the scenes...is it executing the proc to get metadata, then executing the proc again to actually run it? I don't know but that seems to be what happens, I've seen this same thing happen and when running in debug mode (in Visual Studio), you can see (via the "progress" tab) that it seems to be "stuck" on validating...it'll get to some percentage in the validation then sit for a bit, then eventually continue.

Not sure if that's what's happening to you but I'm betting it is.

dmbaker SSCertifiable Points: 5144 More actions · Answer 10

I guess the follow-up question is "what can I do about this?" I don't know the answer to that unfortunately. I don't use procs that update millions of rows (usually) so though this is an issue for me it's not been a showstopper. If anyone else can find an answer please post it!

pstallin Grasshopper Points: 17 More actions · Answer 11

A few years ago I was processing a large number of records (millions?, may have been an update), and I discovered that if I ran it in the development environment (BIDS/VS?), that it did lots of logging in the Progress tab. The longer it ran, the slower it ran, until after a day, it was crawling. If instead I scheduled the job and let the scheduler kick it off the job finished in less than an hour. I'm afraid I don't recall what version I was running, but probably 2008.

andy.gray SSC Enthusiast Points: 108 More actions · Answer 12

How are you running the stored procedure in SSIS - in an Execute SQL Task?

As someone said, it is probably retrieving metadata.

I found something similar with an OLEDB Source a couple of weeks ago. If I chose the source type as "Table or View" and selected the view from the list, I could see SSIS running "SELECT * FROM <view>" four times . The first three times took a few minutes each, the final time took a few seconds just like it did if I ran it in SSMS.

When I changed the source type to be SQL and entered the same query "SELECT * FROM <view>", SSIS ran it once taking a few seconds.