December 23, 2015 at 1:53 pm
I have two client databases (same schema, different data) that I need to merge into a single database. These are large databases with millions of records and over 500 tables with lots of foreign keys.
My strategy is to first identify the primary tables and insert their missing records into the destination db. Next, merge any table referencing a primary table, swapping in the newly created IDs as FKs. Repeat for each secondary table.
I know this is not going to be easy, and I would imagine merging same-schema databases must be a common task. Are there any tools that will do this for me? I need to ensure the tool does not keep the FKs from the source db but swaps them for the newly created ones. I am also concerned about how a tool will be able to process millions of records. Finally, how would a tool automatically tier the tables so they are merged in the right sequence?
I have looked at the Redgate and ApexSQL data compare/diff tools, but I did not see any specifics on how they deal with table relationships (swapping FKs and merging in the right sequence).
December 23, 2015 at 6:45 pm
Do the tables use IDENTITY columns for the PK of the tables?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 23, 2015 at 8:40 pm
Yes
December 24, 2015 at 6:11 pm
titsiros (12/23/2015)
Yes
It's going to be a bugger then. One of the two clients' IDs are all going to have to change, and you'll need to ensure that the reference tables are exactly the same or you'll need to update those IDs as well. This is no mean task. Are you sure that it actually needs to be done?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 24, 2015 at 6:55 pm
Yeah, I know it's going to be quite a task. The PKs of the inserted records will get new IDs, which is what makes it tricky: all the referring tables then have to reference those new records. I understand what needs to be done, but I figured I'm not reinventing the wheel here, so isn't there a tool that does this? So far I've checked ApexSQL Data Diff, and even though it can find new records that need to make it over to the other database, I don't think it deals with swapping the FK IDs to the new ones.
Any ideas? I'm hoping that I won't have to write all the scripts from scratch (500 tables), but below is the general idea.
-- Merge the primary table by inserting all records not found in the target
MERGE dbtarget.dbo.categories AS t
USING dbsource.dbo.categories AS s
    ON (s.CATEGORY_NAME = t.CATEGORY_NAME)
WHEN NOT MATCHED BY TARGET THEN
    INSERT (field1, ..., fieldN)
    VALUES (s.field1, ..., s.fieldN);

-- Merge a secondary table, swapping the source FK IDs for the new target ones (tP.CATEGORY_ID)
MERGE dbtarget.dbo.product AS t
USING (
    SELECT sF.PRODUCT_ID, sF.PRODUCT_NAME, tP.CATEGORY_ID, field1, ..., fieldN
    FROM dbsource.dbo.product sF
    INNER JOIN dbsource.dbo.categories sP ON sP.CATEGORY_ID = sF.CATEGORY_ID
    INNER JOIN dbtarget.dbo.categories tP ON tP.CATEGORY_NAME = sP.CATEGORY_NAME
) AS s
    ON (s.PRODUCT_NAME = t.PRODUCT_NAME)
WHEN NOT MATCHED BY TARGET THEN
    INSERT (field1, ..., fieldN)
    VALUES (s.field1, ..., s.fieldN);
December 24, 2015 at 7:58 pm
It's most likely because I've never had to do such a thing, but I don't know of any tool designed to do something like this. There are tools like Redgate's SQL Compare and Data Compare that can help you identify what may need to be changed, as well as some commonalities, but nothing that will search out IDENTITY columns and their dependencies to even make recommendations as to what needs to be done.
Apologies for not being able to help more, but combining such databases is one of those tasks that I've not personally had to do on a large scale. If I did, I'd be a proc-writing monster that exploited FKs and the like. Even with that, there are some places where people didn't use DRI even though there may be a relationship between tables. Even if there were a tool to do what you ask, that design failure would still require a bit of data prestidigitation on the part of you and your team.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 26, 2015 at 7:57 pm
The reason I am asking for such a tool is that I'm sure combining client databases with the same schema must be a fairly common task when there are mergers, etc. ApexSQL and Redgate don't seem to deal with the new-FK issue.
December 28, 2015 at 9:30 am
titsiros (12/26/2015)
The reason I am asking for such a tool is that I'm sure combining client databases with the same schema must be a fairly common task when there are mergers, etc. ApexSQL and Redgate don't seem to deal with the new-FK issue.
In all my years working with SQL, I have NEVER had to combine two databases that used the same schema. They have always used different schemas. Perhaps this isn't as common as you think.
Even if it were common, the complexity of creating and testing a general solution would be very daunting. It's much more likely that a solution exists for the specific database that you are using, so you should probably contact the vendor for your specific database design to see if they have a utility.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
December 28, 2015 at 9:58 am
I would think different schemas would be more difficult. Same-schema merges should be common when merging departmental databases, or in company mergers where both companies have been using the same software, or even in data warehouses that need to merge data from different locations. How do they do it?
December 28, 2015 at 10:11 am
I'm assuming the scenario is that you have two clients in separate SQL Server instances, each having overlapping sequential integer identifiers, so you need to assign new identifiers as the data is consolidated into the new database while simultaneously maintaining referential integrity between primary key and foreign key tables. OK, I've actually done something similar in the past, using a process like the one I'm describing below.
Rather than letting SQL Server assign new identity values during the initial load process, SET IDENTITY_INSERT ON for each table with an identity and have the SELECT statements for the load compute new IDs by adding a fixed amount. The goal here is to bump up the primary and foreign key values in each table in such a way that they stay in sync referentially but don't overlap between clients.
For example:
When loading data for client #1, select (ID + 10,000,000) as the ID.
When loading data for client #2, select (ID + 20,000,000) as the ID.
So, CustomerID 3,782,312 for client #1 now becomes 13,782,312, and CustomerID 3,782,312 for client #2 now becomes 23,782,312. An identifier of type INT has a maximum value of 2,147,483,647, so you have plenty of room to pad.
After the load process has completed, SET IDENTITY_INSERT OFF for each table so that SQL Server will resume incrementing the identity columns going forward.
Also, because the identifiers in the new database have values that are similar to the old database(s), it will simplify the process of post-load data validation.
Another thing that will simplify this process would be to re-create foreign key constraints in the new database only after all the data has been loaded, so you don't have to be concerned about the order in which the tables are loaded.
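For illustration, here's a minimal sketch of that load pattern. The table and column names (dbo.Customers, CustomerID, CustomerName) and the database names (dbsource1, dbsource2, dbtarget) are placeholders, not names from the actual schema:

-- Allow explicit values in the identity column of the target table
SET IDENTITY_INSERT dbtarget.dbo.Customers ON;

-- Client #1: shift every PK by a fixed offset
INSERT INTO dbtarget.dbo.Customers (CustomerID, CustomerName)
SELECT CustomerID + 10000000, CustomerName
FROM dbsource1.dbo.Customers;

-- Client #2: use a different, non-overlapping offset
INSERT INTO dbtarget.dbo.Customers (CustomerID, CustomerName)
SELECT CustomerID + 20000000, CustomerName
FROM dbsource2.dbo.Customers;

SET IDENTITY_INSERT dbtarget.dbo.Customers OFF;

-- Child tables get the same offset applied to their FK columns, e.g.
-- SELECT OrderID + 10000000, CustomerID + 10000000, ... FROM dbsource1.dbo.Orders;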
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
December 28, 2015 at 10:28 am
I'm not aware of a tool that will automate what I've described above, but if you script out INSERT ... SELECT statements for each of the 500 tables using SSDT or Redgate, you can use a text search/replace tool to stub in the computed expression wherever the SELECT statements reference identifier columns. Prepping the scripts would require a couple of days of manual effort. The most time-consuming step is the follow-up QA on the data and application integration testing, all of which could take weeks, but you'll have to do that regardless of which load process you ultimately choose.
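As a rough starting point (a sketch, not a polished tool), something like the following metadata query, run in the source database, could generate the INSERT ... SELECT skeletons; "dbsource" and "dbtarget" are placeholder database names, and you'd still post-process the generated SELECT lists to add the "+ offset" expression on the identifier columns:

SELECT 'INSERT INTO dbtarget.' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name) + ' ('
     + STUFF((SELECT ', ' + QUOTENAME(c.name)
              FROM sys.columns c
              WHERE c.object_id = t.object_id
              ORDER BY c.column_id
              FOR XML PATH('')), 1, 2, '')
     + ') SELECT '
     + STUFF((SELECT ', ' + QUOTENAME(c.name)
              FROM sys.columns c
              WHERE c.object_id = t.object_id
              ORDER BY c.column_id
              FOR XML PATH('')), 1, 2, '')
     + ' FROM dbsource.' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name) + ';'
FROM sys.tables t
JOIN sys.schemas s ON s.schema_id = t.schema_id
ORDER BY s.name, t.name;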
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
December 29, 2015 at 12:12 pm
Thanks Eric. It seems you have done this before 🙂 I like the ID-padding tip, since I can match records that way instead of using another field. Also, the ordering issue should go away if I remove the FKs.
As for a tool, I understand this could be a starting point, but I'm surprised there is no tool to do it, because all the steps are programmable.
December 29, 2015 at 12:26 pm
Eric M Russell (12/28/2015)
I'm assuming the scenario is that you have two clients in separate SQL Server instances, each having overlapping sequential integer identifiers, so you need to assign new identifiers as the data is consolidated into the new database while simultaneously maintaining referential integrity between primary key and foreign key tables. OK, I've actually done something similar in the past, using a process like the one I'm describing below.
Rather than letting SQL Server assign new identity values during the initial load process, SET IDENTITY_INSERT ON for each table with an identity and have the SELECT statements for the load compute new IDs by adding a fixed amount. The goal here is to bump up the primary and foreign key values in each table in such a way that they stay in sync referentially but don't overlap between clients.
For example:
When loading data for client #1, select (ID + 10,000,000) as the ID.
When loading data for client #2, select (ID + 20,000,000) as the ID.
So, CustomerID 3,782,312 for client #1 now becomes 13,782,312, and CustomerID 3,782,312 for client #2 now becomes 23,782,312. An identifier of type INT has a maximum value of 2,147,483,647, so you have plenty of room to pad.
After the load process has completed, SET IDENTITY_INSERT OFF for each table so that SQL Server will resume incrementing the identity columns going forward.
Also, because the identifiers in the new database have values that are similar to the old database(s), it will simplify the process of post-load data validation.
Another thing that will simplify this process would be to re-create foreign key constraints in the new database only after all the data has been loaded, so you don't have to be concerned about the order in which the tables are loaded.
Another possibility is to set the identity value in the merged database to (OriginalIdentity * -1). This makes it easy to cross-reference back to the source (should you ever need to), and it does not use up any of the headroom left for future identities. But the idea of identity columns containing negative values is too alien for some.
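A quick sketch of that variant, reusing the same hypothetical Customers table from the earlier example:

SET IDENTITY_INSERT dbtarget.dbo.Customers ON;

INSERT INTO dbtarget.dbo.Customers (CustomerID, CustomerName)
SELECT CustomerID * -1, CustomerName   -- original ID is preserved, just negated
FROM dbsource.dbo.Customers;

SET IDENTITY_INSERT dbtarget.dbo.Customers OFF;

-- FK columns in child tables are negated the same way so they keep pointing at the right rows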
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
December 30, 2015 at 8:23 am
Hi Phil, thanks for the -1 tip; that would work also. Do you know of any tool that would do this instead of scripting it manually? I looked into Redgate and ApexSQL, but I'm not convinced they can maintain referential integrity after the inserts.
December 30, 2015 at 8:35 am
titsiros (12/30/2015)
Hi Phil, thanks for the -1 tip; that would work also. Do you know of any tool that would do this instead of scripting it manually? I looked into Redgate and ApexSQL, but I'm not convinced they can maintain referential integrity after the inserts.
Sorry, but I know of no such tool.
I'm pretty sure that there's some clever SQL out there that will generate a list of tables in dependency order. Armed with this, at least you could code your conversion in the knowledge that it should not break FK dependencies.
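One common approach (a sketch only, not bulletproof) is a recursive CTE over sys.foreign_keys that assigns each table a load level, parents before children; tables caught in FK cycles won't resolve cleanly here and need manual handling:

WITH fk AS (
    SELECT DISTINCT parent_object_id, referenced_object_id
    FROM sys.foreign_keys
    WHERE parent_object_id <> referenced_object_id      -- ignore self-references
),
ordered AS (
    -- level 0: tables that reference no other table
    SELECT t.object_id, 0 AS lvl
    FROM sys.tables t
    WHERE NOT EXISTS (SELECT 1 FROM fk WHERE fk.parent_object_id = t.object_id)
    UNION ALL
    -- a child sits one level below each parent it references
    SELECT fk.parent_object_id, o.lvl + 1
    FROM fk
    JOIN ordered o ON o.object_id = fk.referenced_object_id
)
SELECT OBJECT_SCHEMA_NAME(object_id) AS schema_name,
       OBJECT_NAME(object_id)        AS table_name,
       MAX(lvl)                      AS load_order
FROM ordered
GROUP BY object_id
ORDER BY load_order, schema_name, table_name;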
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin