May 8, 2019 at 3:44 pm
I have a humongous data set that I am updating from another humongous data set. We are talking about hundreds of millions of rows being updated. Apart from partitioning these tables (which are processed by single year batches) I have tuned the update query and indexed the tables so that the execution plan looks very reasonable. However, when I run the stored procedure as a SQL task step in SSIS it hangs for hours and I usually have to abort it. I tried putting the code directly in the task instead of doing an EXEC for the stored proc, but that didn't help performance either. Then I decided to run the stored procedure manually in SSMS and it only took 20 minutes.
Why would a stored procedure perform so miserably as a result of being called by SSIS vs being executed manually (same server in both cases .. I am running it manually while on the same server)?
Second question: As I mentioned, I am considering partitioning the tables being used because this data is processed in yearly batches. However, any given year could have hundreds of millions of rows. I've been reading about partitioning and have done it once or twice on large tables but never something this large. In some previous situations the partitioning did improve my ETL process while in others it did not. I'm a bit cautious about doing these huge ones because you cannot undo it and if it isn't going to help, I at least don't want it to hurt anything.
Finally, would it be a good idea to use a clustered columnstore index? Would a filtered non-clustered index be an alternative?
Thanks in advance for your guidance.
May 8, 2019 at 5:25 pm
Why do "hundred of millions of rows" need to be updated? If we knew what the table was for, what the structure was, and the need to do the update, we might be able to help better. I recommend you post the CREATE TABLE statement for the table being sure to include all constraints, keys (FK's, as well), and any other amplifying information especially what the table is being updated for and what the source of data is for the update.
As for why SSIS is taking so long, I can only imagine because I don't use it and will have to defer to others as to what you may have done wrong or what the problem is with SSIS for this problem. Of course, they're facing the exact same problem the rest of us are facing... we simply don't have enough information from you to help you.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 8, 2019 at 5:57 pm
Then I decided to run the stored procedure manually in SSMS and it only took 20 minutes. Why would a stored procedure perform so miserably as a result of being called by SSIS vs being executed manually (same server in both cases .. I am running it manually while on the same server)?
I think the best article on this subject is this one by Erland Sommarskog:
Slow in the Application, Fast in SSMS? Understanding Performance Mysteries
Sue
May 8, 2019 at 6:02 pm
Jeff, thanks for the reply. So this data mart holds tables that store records of visits to healthcare facilities for all of Ohio. The Ohio Health Association sends out these mongo csv files once per quarter and I have to process them. I ingest the data just fine and it doesn't take long . It is a single delete of all the records for a given year out of each table followed by batched inserts of replacement records for that year for each table. In some cases I drop non-clustered indexes first. I have to process up to two years' worth of data each quarter and I do them one year at a time. I'd be happy to explain more about this part but the part that is horrible is when I have to read all of the CPT code records that were ingested for that year and then use an "algorithm" and sorting to come up with a ranking that is commonly used in healthcare for difficulty level of the procedure. Every single visit record for that year must be updated with the resulting rank. Hence the "hundred of millions of rows" being updated. I have attached my file and index definitions and the stored procedure, which is pretty simple. I also noted the number of records in each table.
Again, I just want to understand why it would perform adequately in SSMS but not in SSIS.
This environment is SQL Server 2016 and all of the data resides on the same server, in the same database.
I hope this and the attachments helps in clarifying what I am doing.
begin
update oha.fact_OHAEncounter
set Rank_id = isnull(rnk.Rank_ID, -1)
from oha.fact_OHAEncounter as ENC
--- ONLY PROCESS ENCOUNTER THAT HAVE BEEN STAGED
inner join stg.fact_OHAEncounter_stage as stg
--- all columns on index
on stg.RowID = ENC.RowId
and stg.FileYear = ENC.FileYear
and stg.FileType = ENC.FileType
left join
(
select CPT.OHAEncounterKey
, max(rnk.Rank) as HighRank
from oha.fact_OHACPT as CPT
inner join stg.fact_OHACPT_stage as stg
on stg.RowID = CPT.RowID
and stg.FileYear = CPT.FileYear
and stg.FileType = CPT.FileType
left join dss.dim_CPT4 as DCPT
on CPT.OHACPTCode_id = DCPT.CPT_ID
left join dss.dim_Rank as rnk
on rnk.Rank_ID = DCPT.Rank_ID
where CPT.RowID > 0 --- to use the index
and CPT.FileType in ( 'OP_', 'OPC', 'OPL', 'OPO' ) -- CPT.RowId > 0
group by CPT.OHAEncounterKey
) as DATA
on ENC.OHAEncounterkey = DATA.OHAEncounterKey
left join dss.dim_Rank as rnk
on DATA.HighRank = rnk.Rank
where ENC.RowId > 0
and ENC.FileYear >= '2017'
and ENC.FileType in ( 'OP_', 'OPC', 'OPL', 'OPO' );
end;
/*************************************************************
table for the visits (encounters) this is the table being updated -- 182,906,201 records
*************************************************************
CREATE TABLE [oha].[fact_OHAEncounter](
[RowId] [int] NOT NULL,
[Filename] [nvarchar](100) NOT NULL,
[FileYear] [nvarchar](4) NOT NULL,
[FileType] [nvarchar](3) NOT NULL,
[OHAHospital_ID] [int] NOT NULL,
[OHAEncounterTypeCode_ID] [int] NOT NULL,
[OHAPlaceOfServiceCode_id] [int] NOT NULL,
[AdmissionDate] [date] NULL,
[AdmissionDateKey] [int] NOT NULL,
[OHAAdmissionSourceCode_id] [int] NULL,
[OHAAdmissionTypeCode_ID] [int] NOT NULL,
[Age] [smallint] NULL,
[CountyFipsCode] [varchar](5) NULL,
[DischargeDate] [date] NULL,
[DischargeDateKey] [int] NOT NULL,
[OHADischargeStatusCode_ID] [int] NOT NULL,
[DateOfBirth] [date] NULL,
[OHAEthnicityCode_ID] [int] NOT NULL,
[OHALanguageCode_ID] [int] NOT NULL,
[OHALengthOfStayDays] [int] NULL,
[MSDRG_ID] [int] NOT NULL,
[OHAMSDrgProductLineCode] [varchar](3) NULL,
[OHAMdcCode] [varchar](3) NULL,
[OHAPayerGroupCode_ID] [int] NOT NULL,
[OHARaceCode_ID] [int] NOT NULL,
[OHASexCode_ID] [int] NOT NULL,
[OHATotalCharges] [money] NOT NULL,
[OHAPostalCode_ID] [int] NOT NULL,
[CancerSite_ID] [int] NULL,
[JointReplacementType] [varchar](40) NULL,
[DiabetesFlag] [bit] NULL,
[MajorCardiacFlag] [bit] NULL,
[Rank_id] [int] NULL,
[Package Name] [nvarchar](64) NOT NULL,
[Execution Start Time] [datetime] NOT NULL,
[Machine Name] [nvarchar](15) NOT NULL,
[Task Name] [nvarchar](64) NOT NULL,
[User Name] [nvarchar](64) NOT NULL,
[OHAEncounterkey] [nvarchar](20) NOT NULL,
[EDFlag] [tinyint] NOT NULL,
[SurgeryFlag] [tinyint] NULL,
[MinorSurgeryFlag] [tinyint] NULL,
[BariatricFlag] [tinyint] NULL,
[ObservationFlag] [tinyint] NULL,
[RecurringFlag] [tinyint] NULL,
[OnOffSiteType] [varchar](50) NULL,
[TraumaFlag] [tinyint] NULL,
CONSTRAINT [PK_fact_OHAEncounter] PRIMARY KEY CLUSTERED
(
[OHAEncounterkey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 85) ON [FG2009]
) ON [FG2009]
GO
--- non-clustered indexes on this table
CREATE NONCLUSTERED INDEX [KDX_OHADischargeDateKey_cover] ON [oha].[fact_OHAEncounter]
(
[DischargeDateKey] ASC
)
INCLUDE ( [OHAEncounterTypeCode_ID],
[MSDRG_ID],
[OHAEncounterkey]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [FG2009]
GO
---- non-clustered index specifically for the RANK update process
CREATE NONCLUSTERED INDEX [KDX_KEYCOLUMNS_RANK] ON [oha].[fact_OHAEncounter]
(
[RowId] ASC,
[FileYear] ASC,
[FileType] ASC
)
INCLUDE ( [Rank_id]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 95) ON [FG2015]
GO
/*******************************************************************
table for the CPT codes -- this is the table that is used to determine rank -- 858,045,215 rows
*******************************************************************/
CREATE TABLE [oha].[fact_OHACPT](
[IdentityRow] [bigint] IDENTITY(1,1) NOT NULL,
[RowID] [int] NOT NULL,
[FileName] [nvarchar](100) NULL,
[FileYear] [nvarchar](4) NOT NULL,
[FileType] [nvarchar](3) NOT NULL,
[OHACPTCode_id] [int] NOT NULL,
[OHACPTDate] [date] NOT NULL,
[OHACPTUnitOfService] [varchar](50) NULL,
[Package Name] [nvarchar](64) NOT NULL,
[Execution Start Time] [datetime] NOT NULL,
[Machine Name] [nvarchar](15) NOT NULL,
[Task Name] [nvarchar](64) NOT NULL,
[User Name] [nvarchar](64) NOT NULL,
[OHAEncounterKey] [nvarchar](20) NOT NULL,
CONSTRAINT [PK_fact_OHACPT] PRIMARY KEY CLUSTERED
(
[FileYear] ASC,
[IdentityRow] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 85) ON [FG2013]
) ON [FG2013]
GO
--- non-clustered index on this table
SET ANSI_PADDING ON
GO
/****** Object: Index [KDX_ROWID_FILETYPE_COVER] Script Date: 5/8/2019 1:47:33 PM ******/
CREATE NONCLUSTERED INDEX [KDX_ROWID_FILETYPE_COVER] ON [oha].[fact_OHACPT]
(
[RowID] ASC,
[FileType] ASC
)
INCLUDE ( [FileYear],
[OHACPTCode_id],
[OHAEncounterKey]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [FG2013]
GO
/**************************************************
dimension dim_CPT4 used in process
*************************************************/
CREATE TABLE [dss].[dim_CPT4](
[CPT_ID] [int] IDENTITY(0,1) NOT NULL,
[CPT4Code] [varchar](50) NOT NULL,
[CPT4CodeName] [varchar](100) NULL,
[CPT4Catg] [varchar](100) NULL,
[Package Name] [nvarchar](64) NOT NULL,
[Execution Start Time] [datetime] NOT NULL,
[Machine Name] [nvarchar](15) NOT NULL,
[Task Name] [nvarchar](64) NOT NULL,
[User Name] [nvarchar](64) NOT NULL,
[Rank_ID] [int] NULL,
[CPT4KPNCategory] [varchar](100) NULL,
CONSTRAINT [PK_dim_CPT4] PRIMARY KEY CLUSTERED
(
[CPT_ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [FG2010]
) ON [FG2010]
GO
--- non clustered index using business key
SET ANSI_PADDING ON
GO
/****** Object: Index [IDX_CPT4CODE_COVER] Script Date: 5/8/2019 1:52:01 PM ******/
CREATE UNIQUE NONCLUSTERED INDEX [IDX_CPT4CODE_COVER] ON [dss].[dim_CPT4]
(
[CPT4Code] ASC
)
INCLUDE ( [CPT_ID],
[CPT4CodeName],
[CPT4KPNCategory]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [FG2011]
GO
/**************************************************
dimension dim_Rank used in process
*************************************************/
CREATE TABLE [dss].[dim_Rank](
[Rank_ID] [int] IDENTITY(-1,1) NOT NULL,
[Rank] [int] NOT NULL,
[CPT4ServiceLine] [varchar](254) NULL,
[CPT4SubServiceLine] [varchar](254) NULL,
[Package name] [nvarchar](64) NOT NULL,
[Execution start time] [datetime] NOT NULL,
[Machine name] [nvarchar](15) NOT NULL,
[Task name] [nvarchar](64) NOT NULL,
[User name] [nvarchar](64) NOT NULL,
[CPT4ProcedureLevel] [varchar](254) NULL,
CONSTRAINT [PK_dim_Rank] PRIMARY KEY CLUSTERED
(
[Rank_ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [FG2010]
) ON [FG2010]
GO
--- nonclustered index for business key
/****** Object: Index [KDX_RANK_COVER] Script Date: 5/8/2019 1:54:06 PM ******/
CREATE NONCLUSTERED INDEX [KDX_RANK_COVER] ON [dss].[dim_Rank]
(
[Rank] ASC
)
INCLUDE ( [Rank_ID]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [FG2010]
GO
May 8, 2019 at 6:45 pm
Something is seriously wrong with my connections to SSC... I don't see any files attached.
@sue... are you seeing any? This might be associated with the same problem as me not being able to get to the BBCode link.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 8, 2019 at 6:53 pm
Hey Jeff -
I don't think there are any attachments. There are scripts posted by the OP, not any attachments. All the DDL and the query are there in two different scripts posted. The query and then another set of all the DDL. Do you see those? Two different script boxes or whatever you want to call that.
Sue
May 8, 2019 at 6:57 pm
Thanks, Sue. Thanks for confirming what I was seeing. As a bit of a sidebar, I don't even know how to do attachments on this site since the migration.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 8, 2019 at 7:01 pm
I've got no idea why the stored procedure would be slower when called from SSIS instead of SSMS. But when I get a SQL statement like that I try to split it to get better performance. In the case of that update I would try rewriting it like this:
SELECT CPT.OHAEncounterKey,
MAX(rnk.Rank) AS HighRank
INTO #DATA
FROM oha.fact_OHACPT AS CPT
INNER JOIN stg.fact_OHACPT_stage AS stg
ON stg.RowID = CPT.RowID
AND stg.FileYear = CPT.FileYear
AND stg.FileType = CPT.FileType
LEFT JOIN dss.dim_CPT4 AS DCPT ON CPT.OHACPTCode_id = DCPT.CPT_ID
LEFT JOIN dss.dim_Rank AS rnk ON rnk.Rank_ID = DCPT.Rank_ID
WHERE CPT.RowID > 0 --- to use the index
AND CPT.FileType IN('OP_', 'OPC', 'OPL', 'OPO') -- CPT.RowId > 0
GROUP BY CPT.OHAEncounterKey
--
CREATE CLUSTERED INDEX IX_#DATA ON #DATA(OHAEncounterKey);
--
UPDATE oha.fact_OHAEncounter
SET Rank_id = isnull(rnk.Rank_ID, -1)
FROM oha.fact_OHAEncounter AS ENC
--- ONLY PROCESS ENCOUNTER THAT HAVE BEEN STAGED
INNER JOIN stg.fact_OHAEncounter_stage AS stg --- all columns on index
ON stg.RowID = ENC.RowId
AND stg.FileYear = ENC.FileYear
AND stg.FileType = ENC.FileType
LEFT JOIN #DATA AS DATA
ON ENC.OHAEncounterkey = DATA.OHAEncounterKey
LEFT JOIN dss.dim_Rank AS rnk ON DATA.HighRank = rnk.Rank
WHERE ENC.RowId > 0
AND ENC.FileYear >= '2017'
AND ENC.FileType IN('OP_', 'OPC', 'OPL', 'OPO');
May 8, 2019 at 7:08 pm
I just realized after you mentioned it, there was some button on the toolbar for attachments but I don't see it anymore. And I have both toolbars displaying - the upper and lower one. That could be a bug. I'll try playing with it a bit later to see if it's a PEBKAC or not.
Sue
May 13, 2019 at 12:29 pm
Regarding why it's slower in SSIS...SSIS needs to "validate" your query (in this case a stored procedure). It has to get metadata about the query in order to validate that the fields you're using match up with the flow and the target. I'm betting that it's the validation step that's taking a long time, and the actual execution of the stored proc is about the same as what you'd see in SSMS. I'm not sure what SSIS is actually doing behind the scenes...is it executing the proc to get metadata, then executing the proc again to actually run it? I don't know but that seems to be what happens, I've seen this same thing happen and when running in debug mode (in Visual Studio), you can see (via the "progress" tab) that it seems to be "stuck" on validating...it'll get to some percentage in the validation then sit for a bit, then eventually continue.
Not sure if that's what's happening to you but I'm betting it is.
May 13, 2019 at 12:32 pm
I guess the follow-up question is "what can I do about this?" I don't know the answer to that unfortunately. I don't use procs that update millions of rows (usually) so though this is an issue for me it's not been a showstopper. If anyone else can find an answer please post it!
May 14, 2019 at 1:59 pm
A few years ago I was processing a large number of records (millions?, may have been an update), and I discovered that if I ran it in the development environment (BIDS/VS?), that it did lots of logging in the Progress tab. The longer it ran, the slower it ran, until after a day, it was crawling. If instead I scheduled the job and let the scheduler kick it off the job finished in less than an hour. I'm afraid I don't recall what version I was running, but probably 2008.
May 14, 2019 at 2:08 pm
How are you running the stored procedure in SSIS - in an Execute SQL Task?
As someone said, it is probably retrieving metadata.
I found something similar with an OLEDB Source a couple of weeks ago. If I chose the source type as "Table or View" and selected the view from the list, I could see SSIS running "SELECT * FROM <view>" four times . The first three times took a few minutes each, the final time took a few seconds just like it did if I ran it in SSMS.
When I changed the source type to be SQL and entered the same query "SELECT * FROM <view>", SSIS ran it once taking a few seconds.
Viewing 13 posts - 1 through 12 (of 12 total)
You must be logged in to reply to this topic. Login to reply