December 28, 2014 at 3:58 pm
Hey everyone,
I am currently trying to optimize an ETL process that pulls around 150 tables from an OLAP-esque system through a linked server. It's the Allscripts TouchWorks EMR database, if that makes a difference. I am having a difficult time fitting an incremental load into a reasonable window of time. For instance, one table with around 40 columns and a total of about 50 million rows takes around an hour just to pull one day's worth of records when filtering on a single datetime field. The query is:
SELECT *
FROM OPENQUERY([linkedserver],
'
SELECT *
FROM Works.dbo.Visit
WHERE StartDttm BETWEEN ''2014-12-23'' AND ''2014-12-24''
'
)
That query ultimately returns only about 30k records. However, while trying different approaches, I decided to test out bcp with the following command:
bcp Works.dbo.Visit out f:\etl\visit.dat -n -S linkedserver
This extracted the entire table to a flat file, but it only took about 15 minutes... add another 3-4 minutes to load that into the database on my end. Now, my ability to see what's happening on the remote server's end is very limited, but I know that their server has much higher specs than mine: 96GB of memory vs. my 32GB, twice the cores I've got, etc. So how is it faster to pull the entire table over?
I figured that even in the worst-case scenario of a full table scan, picking the 30k rows that match a single criterion out of the table would certainly run faster than bringing the whole thing over. This isn't isolated to a single table, either. I finished running a test of the incremental extract/load vs. a full extract/load via bcp (200GB of data): the incremental takes about 7 hours, while bcp takes about 2.
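For reference, the load back into my database is nothing fancy -- roughly a native-format bcp in like this (local server and staging table names simplified, trusted connection assumed):
bcp Staging.dbo.Visit in f:\etl\visit.dat -n -S mylocalserver -T -b 100000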
Any ideas on what I'm experiencing or what a better approach would be?
December 28, 2014 at 5:34 pm
bantrim (12/28/2014)
Yes. The BCP thing is an excellent idea but you're using the wrong method. Instead of BCPing out the whole table, use the QUERYOUT option with a query that's nearly identical to the one you used for OpenQuery.
The OpenQuery thing might be able to execute more quickly. Does the StartDttm column on the remote table have an index on it?
You could also make the BCP go faster if it logged into the remote server directly using a "Trusted" connection so that you don't need to use the login info in clear text.
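Something along these lines, for example (using the same server name you had in your bcp command, with -T assuming the trusted connection works for you):
bcp "SELECT * FROM Works.dbo.Visit WHERE StartDttm BETWEEN '2014-12-23' AND '2014-12-24'" queryout f:\etl\visit_incr.dat -n -S linkedserver -T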
Last but not least, you've reinvented the poor man's version of replication. It might be easier if you set up replication on the remote server.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 28, 2014 at 6:19 pm
Jeff,
Thank you for your reply. I haven't tested the queryout option against the incremental query, but I guess I had assumed that I would run into the same issue as I did with the original query. Certainly, if I could get bcp's speed (or close to it) in the query, I would be a happy guy.
Oh, and StartDttm doesn't have an index on it.
As for replication... well, it's kind of a long story. We have our EMR hosted by another medical organization... one with a very underfunded IS department. I've worked my way down many options, including replication, and my requests for getting anything set up on their end have failed. Actually, I know that TouchWorks updates the OLAP DB by exporting the transaction log from the OLTP environment each day, so the easiest option, I figured, would be to do an initial load and then get a copy of said log each night. No dice...
Thanks for the suggestions! I'll run a comparison with queryout and report back.
December 28, 2014 at 6:54 pm
bantrim (12/28/2014)
I don't know if the QueryOut option will actually help against the linked server. Any way you could "go direct" with the BCP? And is there any way that you could "push" an index onto the column so it stands a chance of performing well?
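Even a single-column nonclustered index on their side would do it -- something along these lines (index name is just for illustration):
CREATE NONCLUSTERED INDEX IX_Visit_StartDttm ON dbo.Visit (StartDttm);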
--Jeff Moden
Change is inevitable... Change for the better is not.
December 28, 2014 at 9:31 pm
Jeff Moden (12/28/2014)
I'm not 100% sure of the actual arrangement, but I do know that, by connecting via an IP address I was given, I can execute directly against their server, which is what I am using with BCP. Unfortunately, my requests to have any indexes built have been denied as well. Our permissions are locked down to the extent that I can't even see execution plans.
I don't want to make this too confusing, but another reason why I began to lean towards the BCP full-table pull is that, through many days of querying, I have found that the datetime fields (i.e. CreateDttm, UpdateDttm, etc.) in the tables aren't 100% reliable, in that they don't always get updated/populated when a change/insert occurs. Beyond this, since we are "sharing" this EMR with another medical facility, all of our data is mashed together, but we are only allowed to pull data for patients who have been to one of our facilities. So, if a patient goes to their facilities for years and then comes to ours one day, I will need to pull that patient's entire history for all tables.
All of these things combined are starting to make me think maybe I just need to find a way to make the full refresh approach work.
December 28, 2014 at 11:48 pm
I do understand about not being able to see what you need. The full table pulls using BCP are probably the best way to go for the short haul.
What I think needs to happen is that people from the various groups need to understand that the data has to be shared in order for everyone to do their job. If it were me, I believe I'd see if I could start a regular users' meeting with the people that control the data and see if there are ways to make this a little easier for everyone, instead of the every-man-for-himself type of thing that you're currently having to go through. The good part is that you at least have access to pull the data you need, and setting up a couple of jobs to pull the data, maybe in parallel, would be a good idea. If you know you need all of the data sometimes, then I'd just go and pull all of the data all the time.
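For the parallel pulls, even something as simple as a little cmd script that kicks off a few bcp exports at once would do for a start. Just a sketch -- the extra table names here are made up for illustration:
start "" bcp Works.dbo.Visit out f:\etl\visit.dat -n -S linkedserver -T
start "" bcp Works.dbo.Encounter out f:\etl\encounter.dat -n -S linkedserver -T
start "" bcp Works.dbo.Patient out f:\etl\patient.dat -n -S linkedserver -T
Each START returns immediately, so the exports run side by side.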
--Jeff Moden
Change is inevitable... Change for the better is not.
December 29, 2014 at 12:17 pm
Jeff Moden (12/28/2014)
I couldn't agree more. This is good advice. I'm just putting the finishing touches on a 4-thread process to do just that, and I just met with the director on our side about getting such a user group formed so that both sides can come together.
Oh, and I did finish testing the queryout vs. openquery comparison. There really didn't seem to be much (if any) improvement, though it's hard to say, since the time it takes to extract is so variable. It was worth a shot!
Thanks again Jeff
December 31, 2014 at 2:20 pm
Do you have access to SSIS? You could query the data with a data source component in the data flow task. It is much faster than a linked server, I'll tell you that.
----------------------------------------------------
December 31, 2014 at 2:22 pm
MMartin1 (12/31/2014)
Yes, I use SSIS quite a bit, but it's much slower than using bcp for this purpose.
December 31, 2014 at 4:21 pm
One of the tweaks I make is to alter the max rows per batch and/or the commit size when transferring millions of rows with the OLE DB destination in an SSIS data flow. This optimises the ETL, though you do have to play with it a bit to get just the right setting. Plus, here you can handle problem rows in real time rather than importing to a staging table and then doing your checks, practically speaking (which translates to time saved, of course).
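For a rough idea of what those two settings map to, they're the same knobs you'd tune as ROWS_PER_BATCH and BATCHSIZE on a plain BULK INSERT -- sketch only, with the staging table and file path made up:
BULK INSERT dbo.Visit_Stage
FROM 'f:\etl\visit.dat'
WITH (DATAFILETYPE = 'native', ROWS_PER_BATCH = 1000000, BATCHSIZE = 100000, TABLOCK);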
----------------------------------------------------
December 31, 2014 at 4:37 pm
MMartin1 (12/31/2014)
But how does this compare to using bcp to extract the table? I played with the rows per batch and commit size, but the fastest I could get was around 100k rows/second, while with bcp I can get around 400k rows/second.
January 1, 2015 at 4:32 pm
bantrim (12/31/2014)
As I mentioned, you could do transformations in the ETL pipeline with SSIS and bypass a staging table, making it a huge time saver. An expensive part of any ETL is the reads and writes. Plus, there are added benefits to SSIS, a big one being logging. I am offering these thoughts to you, not condemning bcp, as it is a viable alternative as well.
----------------------------------------------------