"poor man's replication" ETL process design

  • In the words of Mr. Moden himself, I am designing a poor man's replication, but before I go much further into developing this process, I wanted to throw this out there to either confirm my approach or find some better ideas.

    Here are some facts about the source system I need to replicate on our end:

    1.) The tables of concern have anywhere from 20 million to 350 million rows

    2.) Though many of the tables have Created and Updated datetime fields, they are not reliable

    3.) The source database is located on a server with another company entirely (healthcare industry), and access to it is strictly through a linked server

    4.) The server it resides on is a "catch all" server that is used for many purposes, so traffic on it can be variable.

    So, after a lot of testing, I have come up with the following approach:

    1.) Generate a HASHBYTES expression for each of the tables that concatenates all columns, pipe delimited, using SHA1 (e.g. HASHBYTES('SHA1', ISNULL(CAST(ID AS varchar(255)), '') + '|' + ISNULL(CAST(EntryName AS varchar(255)), '') + '|' + ...))

    2.) Add a RecordHash binary(20) field to each destination table, populating it with the HASHBYTES result during the initial load.

    3.) Using bcp, pull the PK field(s) plus the HASHBYTES result for the entire source table each night.

    4.) Load the results into a temp table and left join it to my table on the PK, WHERE (src.RecordHash <> dest.RecordHash) OR (dest.ID IS NULL), selecting just the IDs for the deltas into a separate temp table with an additional field, PulledFlag char(1) DEFAULT 'N'.

    5.) Pull 500 IDs at a time, concatenating them into a comma-delimited list, and use the list to build a dynamic query that pulls the full records for those 500 IDs into yet another temp table (e.g. SELECT * FROM OPENQUERY(<server>, 'SELECT *, HASHBYTES(''SHA1'', .....) FROM dbo.<table> WHERE ID IN (1234,2345,3456,4567,5678,....)'))

    6.) Flag those 500 records as PulledFlag = 'Y' and repeat step 5 for the next 500 where PulledFlag = 'N', until NOT EXISTS (SELECT '1' FROM #tmp WHERE PulledFlag = 'N')

    7.) Once finished, execute a MERGE statement to update/insert the records from the temp table into the destination table. (A rough sketch of steps 4 and 7 follows.)
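
    To make the shape of this concrete, here is a minimal sketch of steps 4 and 7. The destination table dbo.Dest, the temp table names, and every column except ID, EntryName, and RecordHash are hypothetical stand-ins:

    SELECT src.ID,
           CAST('N' AS char(1)) AS PulledFlag
    INTO   #Deltas
    FROM   #SrcIDHash src                       -- the bcp'd PK + hash from the source
    LEFT JOIN dbo.Dest dst ON dst.ID = src.ID
    WHERE  src.RecordHash <> dst.RecordHash     -- changed rows
       OR  dst.ID IS NULL;                      -- brand-new rows

    -- ...steps 5 and 6 fill #DeltaRows 500 IDs at a time (sketched under the questions below)...

    MERGE dbo.Dest AS tgt
    USING #DeltaRows AS src
       ON tgt.ID = src.ID
    WHEN MATCHED THEN
        UPDATE SET tgt.EntryName  = src.EntryName,   -- one assignment per column
                   tgt.RecordHash = src.RecordHash
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (ID, EntryName, RecordHash)
        VALUES (src.ID, src.EntryName, src.RecordHash);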

    First, the questions I have (besides whether anyone has a better idea):

    1.) In testing, despite my ID temp table having only 3000 IDs or so, attempting to pull the source records by joining the temp table across the linked server took a very long time (50 minutes for 3000 records). Pulling by explicitly listing the IDs in dynamic SQL took only 1-2 seconds per 500 rows, which is why I designed step 5 the way I did (see the sketch after this list). However, I don't like it for obvious reasons, and wonder if there is a cleaner, simpler way to handle this.

    2.) To make this as efficient as possible, the destination tables should have a clustered index on the PK field(s), and then a nonclustered index on the RecordHash field, right? Would adding the PK field(s) to the nonclustered index change anything, or is that already done in the background? Having a nonclustered index like this would be faster for the comparison than a clustered index on the PK and RecordHash, right?

    3.) Is there any reason to worry about indexing the temp table that holds the PK + Hash, or is this moot since the whole thing has to be processed anyway? In testing it didn't seem to make a difference; I just wanted to make sure.
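
    Regarding question 1, here is roughly what the batched pull in step 5 looks like. This is only a sketch; [SrcServer], dbo.SourceTable, and #DeltaRows are placeholders, and the quotes have to be doubled once for the dynamic SQL and again for the OPENQUERY pass-through string, which is a big part of why it feels hokey:

    DECLARE @IDList varchar(max),
            @sql    nvarchar(max);

    -- Concatenate the next 500 unpulled IDs; STUFF() strips the leading comma.
    SELECT @IDList = STUFF(( SELECT ',' + CAST(d.ID AS varchar(20))
                             FROM ( SELECT TOP (500) ID
                                    FROM   #Deltas
                                    WHERE  PulledFlag = 'N'
                                    ORDER  BY ID ) d
                             FOR XML PATH('') ), 1, 1, '');

    SET @sql = N'SELECT *
                 FROM OPENQUERY([SrcServer],
                      ''SELECT *,
                               HASHBYTES(''''SHA1'''',
                                         ISNULL(CAST(ID AS varchar(255)), '''''''') + ''''|''''
                                       + ISNULL(CAST(EntryName AS varchar(255)), '''''''')) AS RecordHash
                        FROM dbo.SourceTable
                        WHERE ID IN (' + @IDList + N')'')';

    INSERT INTO #DeltaRows    -- created beforehand to match the source columns plus RecordHash
    EXEC (@sql);

    UPDATE d                  -- step 6: flag the pulled batch, then loop while any 'N' remain
    SET    PulledFlag = 'Y'
    FROM   #Deltas d
    JOIN   #DeltaRows r ON r.ID = d.ID
    WHERE  d.PulledFlag = 'N';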

    To anyone who has read all of this, thank you for your time. I understand setting up replication would be better in the long run, but getting the company with the source to cooperate is taking a long time and we need this solution in place very soon.

    Any and all suggestions are appreciated!

  • The clustered index key is always added to the non-clustered index.

    I have to say, you've done a remarkable job. You tried the "obvious" method and that took 50 minutes. Rather than putting up with that, you pressed on to find a better way. If you're doing 500 rows in 1 or 2 seconds, your 3000 rows should be done in 6 to 12 seconds instead of 50 minutes.

    I wouldn't balk at your CSV hack. It looks like it's doing a pretty good job.

    The HASHBYTES approach is clever, to say the least. How long does that part actually take? It seems to me that you would have to hash every row of the remote table, and that could take quite a while.

    An index on the temp table might help but for such a small number of rows, it might take longer to build the CI than the time it would save.

    How many total GB is there in the remote tables that you need to pull from?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff,

    Thank you, both for your input and for your compliments. To answer your questions:

    1.) The total size of all tables comes out to around 900 GB over 110 tables, around 2.5 billion rows total.

    2.) The HASHBYTES part does take a considerable amount of time. For example, pulling all columns for a 50 million record table took 7.5 minutes, while pulling just ID and Hash took about 6.5 minutes, compared to about 2 minutes for just IDs, and 9 minutes for all columns plus Hash. The real time saver comes in with the import (300 MB file vs. 90 GB file).
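
    For the load side, it's a plain BULK INSERT along these lines (the path and options here are just illustrative; with bcp's native format you'd use DATAFILETYPE = 'native' instead):

    BULK INSERT #SrcIDHash
    FROM 'D:\Extracts\Encounter_IDHash.dat'
    WITH (DATAFILETYPE    = 'char',
          FIELDTERMINATOR = '|',
          ROWTERMINATOR   = '\n',
          TABLOCK);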

    I had said in my previous thread that a full truncate/reload seemed like a good approach, but what killed that idea was the reindexing afterward.

    Also, about indexing, having a nonclustered index on RecordHash would be better than adding it to the clustered index, right? I'm really trying to cut the comparison time as much as possible. Starting to think I'm trying to get blood from a turnip though!

    Thanks for your help -- I wanted to let you know we have a meeting set up with the other company's director to get a joint user group going. Thanks for that suggestion too!

  • It may be that having the clustered index on the hash would work faster on the lookups, especially if it were made unique by including the PK columns after the hash.

    One of the things you might explore in the user meeting is to demonstrate how effectively your "poor man's replication" works and how much more effective it could be if the hash values were somehow persisted in the source system (I think it would be easiest to use a persisted computed column on each table, but "sister" tables could be used). That would make your initial downloads for comparison a whole lot faster for everyone because the hash would already have been calculated, at little expense, on the remote side of the house.
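
    Just to sketch both ideas (all names here are hypothetical, and the hash expression must be deterministic, e.g. explicit CONVERT styles on any date columns, for PERSISTED to work):

    -- Source side: persist the hash once so nightly pulls don't recompute it.
    ALTER TABLE dbo.SourceTable
        ADD RecordHash AS HASHBYTES('SHA1',
                ISNULL(CAST(ID AS varchar(255)), '') + '|'
              + ISNULL(CAST(EntryName AS varchar(255)), '')) PERSISTED;

    -- Destination side: the unique clustered index led by the hash
    -- (the existing clustered index would have to be dropped first).
    CREATE UNIQUE CLUSTERED INDEX CIX_Dest_Hash_ID
        ON dbo.Dest (RecordHash, ID);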

    --Jeff Moden



  • 1) you REALLY need to read Thomas Kejser's blog post on hashing in SQL Server: http://kejser.org/exploring-hash-functions-in-sql-server/. I believe he has others.

    2) I have used what you are doing for almost 20 years now at a number of clients (quite a few of them healthcare like yours). Works great, mostly. One thing that can really cause problems is collisions, but avoiding that is essentially impossible. You can only reduce the probability.

    3) Consider moving the extract and computations out of SQL Server. Purpose-built and coded hashing can be much more efficient than in SQL Server, and you also avoid hitting your main production box unnecessarily.

    4) Pursuing getting the host system to provide you with just the changed records is THE KEY to making your refresh fly. Second would be allowing you to create objects on their system that would let you pass in something like a table variable. That could provide a number of perf benefits.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Jeff Moden (1/11/2015)


    It may be that having the clustered index on the hash would work faster on the lookups, especially if it were made unique by including the PK columns after the hash.

    One of the things you might explore in the user meeting is to demonstrate how effectively your "poor man's replication" works and how much more effective it could be if the hash values were somehow persisted in the source system (I think it would be easiest to use a persisted computed column on each table, but "sister" tables could be used). That would make your initial downloads for comparison a whole lot faster for everyone because the hash would already have been calculated, at little expense, on the remote side of the house.

    Jeff,

    I'll have to do some testing on having the clustered index set up like that.

    I do really like the idea of having the hash persisted in the source, as that would certainly cut a huge chunk of the processing time out of this whole thing, especially knowing that they are attempting (not so well) to do a similar thing on their own end for their BI product.

    They're running this on SQL Server 2008 R2, so I would think that enabling CDC might be an option, but at this point it seems like too simple a panacea, and they certainly MUST have tried it and backed out for a valid reason, right? 😉 Again, I'm not 100% sure about the process their system goes through on a nightly basis to refresh, other than that they import the transaction log from the (AllScripts) OLTP system and apply it...and NO, I can't get a copy of that log each night.....because reasons. :angry:
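
    For what it's worth, enabling CDC on their side would only take something like this (their DBAs would run it; the database and table names are just illustrative):

    USE SourceDb;
    GO
    EXEC sys.sp_cdc_enable_db;
    GO
    EXEC sys.sp_cdc_enable_table
         @source_schema = N'dbo',
         @source_name   = N'Encounter',
         @role_name     = NULL;    -- NULL = no gating role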

  • TheSQLGuru (1/11/2015)


    1) you REALLY need to read Thomas Kejser's blog post on hashing in SQL Server: http://kejser.org/exploring-hash-functions-in-sql-server/. I believe he has others.

    2) I have used what you are doing for almost 20 years now at a number of clients (quite a few of them healthcare like yours). Works great, mostly. One thing that can really cause problems is collisions, but avoiding that is essentially impossible. You can only reduce the probability.

    3) Consider moving the extract and computations out of SQL Server. Purpose-built and coded hashing can be much more efficient than in SQL Server, and you also avoid hitting your main production box unnecessarily.

    4) Pursuing getting the host system to provide you with just the changed records is THE KEY to making your refresh fly. Second would be allowing you to create objects on their system that would let you pass in something like a table variable. That could provide a number of perf benefits.

    Mr. Guru,

    Thank you for the link! I've got it bookmarked now and will be reading it after this post. I was worried about collisions, which is why I moved to the SHA1 algorithm...not perfect, I know...but hopefully better than MD5? I ran a snapshot comparison to make sure the hash always changed when any value changed, and it did. I also checked all hashes in each table for duplicates and didn't find any. That's not to say the second I turn my back, there won't be tons of them 😀
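
    The duplicate check was nothing fancy, just a GROUP BY along these lines against each destination table (the table name is a placeholder):

    SELECT RecordHash, COUNT(*) AS HashCount
    FROM   dbo.Dest
    GROUP  BY RecordHash
    HAVING COUNT(*) > 1;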

    I have the extract set up with bcp, since that seemed to be the fastest option I had come across so far. The computations (hash) are done in the query, since I have no access to the system other than through SQL, and it's much faster to do it on their end than to bring the data over. However, your idea does intrigue me, and I wonder if (through bettering our diplomatic relations) I may be able to get them to run an .exe each night to do the extraction/computation and pick that up in the ETL.

    Like I mentioned to Jeff, I would think that CDC would be an option, but talks on that have gotten nowhere. I'm trying to win small victories in getting our access expanded...currently trying to just get enough rights to see execution plans on their side...but yes, I would be ecstatic if I could create objects or pass in variables, so that step 5 would be much less "hokey".

    If I may ask, what healthcare systems did you deal with? So far, AllScripts Touchworks, GE Healthcare IDX (Centricity), Cerner Millennium, and CPSI's...uhhhh...CPSI...are what I've dealt with mostly.

  • TheSQLGuru (1/11/2015)


    1) you REALLY need to read Thomas Kejser's blog post on hashing in SQL Server: http://kejser.org/exploring-hash-functions-in-sql-server/. I believe he has others.


    Ahh!! Thought that link looked familiar. Yes, I read through that page recently. Not the whole thing....you've gotta take that in doses. I love how in depth it is.

  • Jeff Moden (1/11/2015)


    It may be that having the clustered index on the hash would work faster on the lookups, especially if it were a unique index that was guaranteed to be unique by including the PK columns after the hash.


    Okay, so here's something interesting/confusing. One table, Encounter, has a unique clustered index on ID (ASC), CIX_Encounter_ID, and I added a nonclustered index on RecordHash (ASC), NIX_Encounter_RecordHash. In running this code, the execution plan chose the clustered index:

    SELECT tmp.ID
    FROM #Encounter_IDHash tmp
    LEFT JOIN astw.Encounter enc ON tmp.ID = enc.ID
    WHERE tmp.RecordHash <> enc.RecordHash OR enc.ID IS NULL

    Where #Encounter_IDHash is the temp table holding just the ID and Hashbytes fields from the source.

    This ran in about 9 minutes.

    Just for fun, I created another nonclustered index on ID (ASC), RecordHash (ASC), called NIX_Encounter_IDHash. Rerunning the above SQL caused it to pick this index, and it ran in about 6 minutes.

    Shouldn't NIX_Encounter_RecordHash (which implicitly includes ID as the clustered key) and NIX_Encounter_IDHash be identical? Why didn't the optimizer use it in the first scenario?
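
    For reference, here's the DDL for the three indexes in play, reconstructed from the descriptions above:

    CREATE UNIQUE CLUSTERED INDEX CIX_Encounter_ID ON astw.Encounter (ID);
    CREATE NONCLUSTERED INDEX NIX_Encounter_RecordHash ON astw.Encounter (RecordHash);
    CREATE NONCLUSTERED INDEX NIX_Encounter_IDHash ON astw.Encounter (ID, RecordHash);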

  • Okay, if anyone reading this is interested, the results are in.

    The benchmarked table held just over 53 million records and contained 37 columns, totaling around 10 GB. The following results are averages over 10 consecutive runs (with the cache cleared between runs).

    Time to pull only IDs through bcp: 00:02:17

    Time to pull ID + Hash through bcp: 00:06:44

    Time to pull all columns through bcp: 00:08:12

    Time to pull all columns + Hash through bcp: 00:09:21

    Size of flat file for all columns + Hash: 10.7 GB

    Size of flat file for just IDs + Hash: 2.3 GB (surprised by that one)

    Time to load all columns + Hash through BULK INSERT: 00:17:43

    Time to load ID + Hash through BULK INSERT: 00:01:50

    Time to find deltas for temp table with all columns: 00:34:35

    Time to find deltas for temp table with ID + Hash: 00:06:16

    Time to pull all columns + Hash for deltas (157,643 records, 500 at a time): 00:01:23

    Time to update destination table with deltas (same for both scenarios): 00:06:02

    Alternatively, instead of updating the table with deltas, I tested reloading the full extract (all columns plus Hash) and reindexing the newly loaded table. However, the reindexing alone took a little over an hour.

    So, in summary...

    Total time to update destination table with 0.3% total table changed (157,643 records) via pulling all columns: 01:07:49

    Total time to update destination table with 0.3% total table changed (157,643 records) via the ID + Hash method: 00:22:15

    As always, your mileage may vary based on any number of variables, but the ID + Hash method definitely wins in this case.

  • Most of my healthcare work has been directly for service providers or insurers, although I have worked directly with Centricity and a few others, especially on data movement stuff. LOTS of work on the EDI data sets.

    I prefer to roll my own "change capture stuff" personally, and have a system for that. CDC is a BEAST from a performance and space perspective and is often unnecessary.

    Glad you have things working more quickly.

    Best,
    Kevin G. Boles

  • bantrim (1/11/2015)


    Okay, so here's something interesting/confusing. One table, Encounter, has a unique clustered index on ID (ASC), CIX_Encounter_ID, and I added a nonclustered index on RecordHash (ASC), NIX_Encounter_RecordHash. In running this code, the execution plan chose the clustered index:

    SELECT tmp.ID
    FROM #Encounter_IDHash tmp
    LEFT JOIN astw.Encounter enc ON tmp.ID = enc.ID
    WHERE tmp.RecordHash <> enc.RecordHash OR enc.ID IS NULL

    Where #Encounter_IDHash is the temp table holding just the ID and Hashbytes fields from the source.

    This ran in about 9 minutes.

    Just for fun, I created another nonclustered index on ID (ASC), RecordHash (ASC), called NIX_Encounter_IDHash. Rerunning the above SQL caused it to pick this index, and it ran in about 6 minutes.

    Shouldn't NIX_Encounter_RecordHash (which implicitly includes ID as the clustered key) and NIX_Encounter_IDHash be identical? Why didn't the optimizer use it in the first scenario?

    I'm confused... especially by the last sentence where you mention "NIX_Encounter_IDHash ". That sentence appears to be the only place where you've mentioned that index.

    The reason the NIX version ran faster is that it has far fewer columns in it than the CIX does, meaning it packs more rows per page and therefore has fewer pages to read to do the same job.

    Also, are you sure that you need the "OR enc.ID IS NULL"? Isn't the purpose of the query just to find those things in the Temp Table that don't have a matching hash?

    --Jeff Moden



  • Jeff Moden (1/11/2015)


    I'm confused... especially by the last sentence where you mention "NIX_Encounter_IDHash ". That sentence appears to be the only place where you've mentioned that index.

    The reason the NIX version ran faster is that it has far fewer columns in it than the CIX does, meaning it packs more rows per page and therefore has fewer pages to read to do the same job.

    Also, are you sure that you need the "OR enc.ID IS NULL"? Isn't the purpose of the query just to find those things in the Temp Table that don't have a matching hash?

    Sorry for the confusion. NIX_Encounter_IDHash is the index I mentioned in the third paragraph...the one with ID and RecordHash. The reason I was confused is that I figured, with a clustered index on ID and a nonclustered index on RecordHash (which includes ID as well), the nonclustered index would be used in my query that joins on ID and checks the RecordHash fields for inequality, but it didn't get used. The optimizer used the clustered index. Only when I added the second nonclustered index (NIX_Encounter_IDHash) did the query use it rather than the clustered index.

    I use the OR enc.ID IS NULL check to also pull the IDs for new records. Since I'm handling the load with a MERGE statement, it's easier to pull both new and changed records in the same pool.

    I hope this wasn't too confusing!

  • Ah... understood on the OR.

    I'm thinking that a UNIQUE NIX on the ID and Hash (and that combination should be unique because the ID is unique) might be worth trying.
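
    Something like this, using your names from the earlier posts (you'd drop the existing non-unique version first):

    DROP INDEX NIX_Encounter_IDHash ON astw.Encounter;
    CREATE UNIQUE NONCLUSTERED INDEX NIX_Encounter_IDHash
        ON astw.Encounter (ID, RecordHash);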

    --Jeff Moden



  • Jeff Moden (1/11/2015)


    Ah... understood on the OR.

    I'm thinking that a UNIQUE NIX on the ID and Hash (and that combination should be unique because the ID is unique) might be worth trying.

    Good idea; that certainly couldn't hurt, and it will probably give a significant boost on the larger tables. Thanks!
