Best way to merge two large client databases

  • Assuming the identity seeds were both 1, loading with IDENTITY_INSERT ON using the original identity *-1 (what Phil said) is where I would start the conversation. If the client strongly objects I would then move to padding (what Eric said). Padding is plan B only because it requires more engineering.

    Pseudo-steps:

    1. Script out then drop all FKs from the target database.

    2. Create an SSIS package that copies all tables from source to target. In the Data Flow Task, the OLE DB Source component is where you specify the SELECT that changes the identity value to negative. Make sure the OLE DB Destination component allows identity insert.

    3. Add FKs back to target.

    SSIS would fly because you get parallelism for free, i.e. your Control Flow would have one Data Flow Task for every table and they would all fire in parallel, governed only by the max threads of your package.

    If SSIS is not your thing, BCP to native format would be my plan B, but it will be much slower unless you engineer the parallel loading of tables on your own.
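
    As a rough sketch of the source query from step 2 (table and column names here are invented for illustration, not taken from the OP's schema), the OLE DB Source just negates the surrogate key and any FK columns that reference other negated keys:

    -- hypothetical source query for one table in the Data Flow
    SELECT
        OrderID    * -1 AS OrderID,    -- surrogate key made negative so it cannot collide with target IDs
        CustomerID * -1 AS CustomerID, -- FK to a table being migrated the same way
        OrderDate,
        TotalAmount
    FROM dbo.Orders;
    -- the OLE DB Destination uses "Keep identity" (the fast-load equivalent of SET IDENTITY_INSERT ON)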

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Orlando Colamatteo (12/30/2015)


    Assuming the identity seeds were both 1, loading with IDENTITY_INSERT ON using the original identity *-1 (what Phil said) is where I would start the conversation. If the client strongly objects I would then move to padding (what Eric said). Padding is plan B only because it requires more engineering.

    It seems to me that both the (ID + X) and the (ID * -1) technique would require the same level of effort in terms of modifying the scripts, and the remaining steps would be the same. If there are more than 2 separate client databases that need consolidating, then (ID + X) method would make even more sense.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (12/30/2015)


    Orlando Colamatteo (12/30/2015)


    Assuming the identity seeds were both 1, loading with IDENTITY_INSERT ON using the original identity *-1 (what Phil said) is where I would start the conversation. If the client strongly objects I would then move to padding (what Eric said). Padding is plan B only because it requires more engineering.

    It seems to me that both the (ID + X) and the (ID * -1) technique would require the same level of effort in terms of modifying the scripts, and the remaining steps would be the same. If there are more than 2 separate client databases that need consolidating, then (ID + X) method would make even more sense.

    Sure, but this thread is specifically about merging two databases.

    The fact that you get to keep your original Ids, albeit in negative form (and as a consequence do not lose any available positive Ids), along with the fact that there is no need to worry about calculating good padding values (different for each table containing a discrete IDENTITY?), makes my suggestion faster to code, IMO.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Eric, I agree it's not a ton more, but Phil expressed it in more detail. This is basically my thinking on it, and where my comment about requiring less engineering (i.e. less code and not having to account for different padding levels) came from:

    Phil Parkin (12/30/2015)


    ...along with the fact that there is no need to worry about calculating good padding values (different for each table containing a discrete IDENTITY?), makes my suggestion faster to code, IMO.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Orlando Colamatteo (12/30/2015)


    Eric, I agree it's not a ton more, but Phil expressed it in more detail. This is basically my thinking on it, and where my comment about requiring less engineering (i.e. less code and not having to account for different padding levels) came from:

    Phil Parkin (12/30/2015)


    ...along with the fact that there is no need to worry about calculating good padding values (different for each table containing a discrete IDENTITY?), makes my suggestion faster to code, IMO.

    OK, but for what it's worth, the (ID + X) solution scales to more than 2 clients, and if the insert scripts are parameterized, they can be re-used for each client with little or no modification other than specifying the client's starting offset. So it's really only one set of scripts.
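
    As a minimal sketch of what I mean (table and parameter names are invented), the per-client offset is the only thing that changes from one run to the next:

    -- hypothetical parameterized load for one table; a single offset per client is assumed,
    -- so FK columns stay consistent with their parents
    DECLARE @ClientOffset bigint = 2000000;

    SET IDENTITY_INSERT dbo.Orders ON;

    INSERT INTO dbo.Orders (OrderID, CustomerID, OrderDate, TotalAmount)
    SELECT OrderID    + @ClientOffset,
           CustomerID + @ClientOffset,  -- FK shifted by the same per-client offset
           OrderDate,
           TotalAmount
    FROM SourceClientDB.dbo.Orders;

    SET IDENTITY_INSERT dbo.Orders OFF;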

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (12/30/2015)


    Orlando Colamatteo (12/30/2015)


    Eric, I agree it's not a ton more, but Phil expressed it in more detail. This is basically my thinking on it, and where my comment about requiring less engineering (i.e. less code and not having to account for different padding levels) came from:

    Phil Parkin (12/30/2015)


    ...along with the fact that there is no need to worry about calculating good padding values (different for each table containing a discrete IDENTITY?), makes my suggestion faster to code, IMO.

    OK, but for what it's worth, the (ID + X) solution scales to more than 2 clients, and if the insert scripts are parameterized, they can be re-used for each client with little or no modification other than specifying the client's starting offset. So it's really only one set of scripts.

    No matter the number of client sources, I think I like the idea of keeping the original values as well; it still works. Plus there may be old scripts that reference client IDs in code. Yes, I know it is not recommended that people code this way, but it is nonetheless a reality that I've had to deal with myself in maintaining others' code.

    If there is a base table that contains, say, over 80% of the clients, then why not keep those IDs and reinvent the others?

    Below is an illustration of what others have mentioned here, just quick and dirty:

    create table #testIDinsert(
        IdCol tinyint identity(1,1),
        someString varchar(20)
    )

    insert into #testIDinsert
    select 'bb8 and r2d2' union all
    select 'once upon' union all
    select 'a midnight' union all
    select 'dreary, ' union all
    select 'as i wandered' union all
    select 'weak and ' union all
    select 'weary'

    select * from #testIDinsert

    create table #testIDinsert2(
        IdCol tinyint identity(1,2), /* note that the increment is 2 */
        someString varchar(20)
    )

    set identity_insert #testIDinsert2 on
    GO

    insert into #testIDinsert2(IdCol, someString)
    SELECT * from #testIDinsert

    set identity_insert #testIDinsert2 off;
    go

    select * from #testIDinsert2

    insert into #testIDinsert2
    SELECT 'anything'

    select * from #testIDinsert2 /* with identity insert off again, the next ID is 9, not 8: the current identity value is 7 and the increment is 2 */

    /* ------- clean up ------- */
    drop table #testIDinsert
    drop table #testIDinsert2

    ----------------------------------------------------

  • Eric M Russell (12/30/2015)


    Orlando Colamatteo (12/30/2015)


    Eric, I agree it's not a ton more, but Phil expressed it in more detail. This is basically my thinking on it, and where my comment about requiring less engineering (i.e. less code and not having to account for different padding levels) came from:

    Phil Parkin (12/30/2015)


    ...along with the fact that there is no need to worry about calculating good padding values (different for each table containing a discrete IDENTITY?), makes my suggestion faster to code, IMO.

    OK, but for what it's worth, the (ID + X) solution scales to more than 2 clients, and if the insert scripts are parameterized, they can be re-used for each client with little or no modification other than specifying the client's starting offset. So it's really only one set of scripts.

    Granted, it would scale to more than one client. For padding I wasn't thinking scripts. Rather than just multiplying the ID by -1 in the source select statement, to pad I would first look at the target table to get the max ID and use that to determine the form of the source select statement. So there are no scripts to maintain per se, but I need to know information about the target before I can run the migration process, which is where the additional engineering comes in from my perspective. If merging multiple clients into one database were a requirement, padding would probably be where I would start.
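
    Roughly what I have in mind (table names invented), the extra step being that the padding has to be read from the target at run time and remembered for every child table:

    -- hypothetical sketch: derive the padding for one table from the target's current max ID
    DECLARE @Pad bigint;

    SELECT @Pad = ISNULL(MAX(OrderID), 0)
    FROM TargetDB.dbo.Orders;

    -- the source select for this table is then built as
    --   SELECT OrderID + @Pad, ... FROM SourceDB.dbo.Orders
    -- and @Pad must be carried forward to every child table that references OrderID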

    As an aside, and I have not used it for production purposes yet, it seems like BIML could have a role to play in this solution to cut down on some of the manual development for the 500+ tables mentioned. With either the padding or the -1 solution, the n tables are all going to follow the same migration pattern.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Yes, it's true that the -1 solution would do for this thread, although I am a bit hesitant to use it in case the application or reports on this database have any hardcoded logic using a greater-than comparison on these IDs. Negative IDs would not pass such criteria. I understand hardcoding is never a good idea, but it's always a possibility.

    Secondly, I would like to touch on the suggestion to disable FKs. Don't you think it's safer to leave the FKs in place for checks and balances, as a way to ensure I don't insert anything without the proper reference? I believe I already have the order in which the tables should be merged.

    Lastly, if no one can suggest a tool to do this, maybe it's a good idea to build one after this project 😀

  • titsiros (12/30/2015)


    Yes, it's true that the -1 solution would do for this thread, although I am a bit hesitant to use it in case the application or reports on this database have any hardcoded logic using a greater-than comparison on these IDs. Negative IDs would not pass such criteria. I understand hardcoding is never a good idea, but it's always a possibility.

    That is certainly a valid concern.

    Secondly, I would like to touch on the suggestion to disable FKs. Don't you think it's safer to leave the FKs in place for checks and balances, as a way to ensure I don't insert anything without the proper reference? I believe I already have the order in which the tables should be merged.

    In my thinking this step was to avoid having to determine the proper order in which the tables would have to migrate. If you have that in hand or are willing to tackle that task then by all means leave the RI in place.

    To answer the query, though: the recommendation is definitely to reinstate the FKs once the data migration is complete, to ensure all relationships are maintained. Having the keys on or off does not affect the probability of maintaining RI through the process, i.e. a flaw has an equal chance of being caught or going unnoticed with either method. In thinking about this, you may also need to drop some CHECK constraints in case they call functions that refer to other tables. If you have triggers trying to enforce RI, then it gets even fuzzier how to move forward, in which case I would fall back to putting in the effort to determine the proper order in which to migrate the tables.
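
    For what it's worth, a lighter-weight variant than dropping and re-creating (sketched here with an invented table name) is to disable the constraints for the load and re-validate them afterwards; the WITH CHECK at the end forces SQL Server to verify every relationship and leaves the constraints trusted:

    -- disable FK and CHECK constraints on a table for the duration of the load
    ALTER TABLE dbo.Orders NOCHECK CONSTRAINT ALL;

    -- ... run the migration inserts ...

    -- re-enable and re-validate; this fails if any orphaned rows slipped in
    ALTER TABLE dbo.Orders WITH CHECK CHECK CONSTRAINT ALL;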

    Lastly, if no one can suggest a tool to do this, maybe it's a good idea to build one after this project 😀

    Using padding does introduce additional concerns beyond just padding the identity on each discrete table. The child tables must be padded accordingly, and unlike the -1 method, where the new identity is easily predictable, you will need to engineer a way to keep track of how much padding each table requires and then push that number down into all the child tables you migrate as well. I said earlier that this may not be a "ton more," but it could turn into a lot of work as you get into sub-models with 3, 4, 5 or more child tables that need the treatment.
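
    One way to keep that bookkeeping manageable (a sketch with invented names, not a prescribed design) is a small offset table captured from the target up front, which every table's migration query then reads from:

    -- hypothetical bookkeeping table: one padding value per migrated table
    CREATE TABLE dbo.MigrationOffset (
        TableName sysname PRIMARY KEY,
        Offset    bigint  NOT NULL
    );

    INSERT INTO dbo.MigrationOffset (TableName, Offset)
    SELECT 'Customers', ISNULL(MAX(CustomerID), 0) FROM dbo.Customers;

    INSERT INTO dbo.MigrationOffset (TableName, Offset)
    SELECT 'Orders', ISNULL(MAX(OrderID), 0) FROM dbo.Orders;

    -- each child table then applies its own offset to its key and the parent table's
    -- offset to the FK column (e.g. Orders.CustomerID + the Customers offset)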

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • MMartin1 (12/30/2015)


    Eric M Russell (12/30/2015)


    Orlando Colamatteo (12/30/2015)


    Eric, I agree it's not a ton more, but Phil expressed it in more detail. This is basically my thinking on it, and where my comment about requiring less engineering (i.e. less code and not having to account for different padding levels) came from:

    Phil Parkin (12/30/2015)


    ...along with the fact that there is no need to worry about calculating good padding values (different for each table containing a discrete IDENTITY?), makes my suggestion faster to code, IMO.

    OK, but for what it's worth, the (ID + X) solution scales to more than 2 clients, and if the insert scripts are parameterized, they can be re-used for each client with little or no modification other than specifying the client's starting offset. So it's really only one set of scripts.

    No matter the number of client sources, I think I like the idea of keeping the original values as well; it still works. Plus there may be old scripts that reference client IDs in code. Yes, I know it is not recommended that people code this way, but it is nonetheless a reality that I've had to deal with myself in maintaining others' code.

    ...

    As for the desire to keep the original IDs: there can be overlapping IDs between the clients, so when consolidating records into one database, only one of the clients would keep their original IDs, while the others are assigned different ones. Since each client currently has their own database, there probably are no pre-existing ClientIDs, so here we're talking about how to handle IDs on transactional records (OrderID, CustomerID, etc.).

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Aside from the task of potentially accounting for the same client across the two databases, a thought I have is that in theory you can load the IDs for the bigger set into the new database as-is. Then for the next system... here I take Eric's suggestion, except to offset by the difference between the two sets.

    Example:

    Set 1 (bigger) max clientID = 1,500,000 and Set 2 (smaller) min clientID = 1. Here, insert Set 1 as-is, and Set 2 will get new clientID = clientID + 1,500,000. You may still want to store the old key in a legacyID column, along with the source system, just for easy reference.
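
    For the legacy-key idea, something along these lines (column and database names are just illustrative, and ClientID is assumed to be an identity column) keeps the audit trail without affecting the new keys:

    -- hypothetical: add columns to hold the original key and its source
    ALTER TABLE dbo.Client ADD
        LegacyClientID int         NULL,
        SourceSystem   varchar(20) NULL;
    GO

    SET IDENTITY_INSERT dbo.Client ON;

    INSERT INTO dbo.Client (ClientID, ClientName, LegacyClientID, SourceSystem)
    SELECT ClientID + 1500000,   -- offset by Set 1's max
           ClientName,
           ClientID,             -- preserve the original key for reference
           'ClientDB_B'
    FROM ClientDB_B.dbo.Client;

    SET IDENTITY_INSERT dbo.Client OFF;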

    ----------------------------------------------------

  • If you want an even more difficult task, try explaining to management types why this is not a simple and easy thing to do.

    Adding a fixed amount to identity keys can convert them to non-overlapping ranges. This works pretty well for tables that are truly just data.

    Many databases also contain "type tables" that are used to populate drop-downs. In a configurable system, users may be able to add new types. Some of these values will be original and occur in both databases; those should not be duplicated when merging the data. Others will be user-created. Merging this kind of data gets even more hairy. You probably need to build a source-to-destination cross-reference mapping. Just figuring out all the references in an old database that did not always use RI is a huge effort.
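
    A sketch of the kind of cross-reference that ends up being needed for one of those type tables (all names invented): match on the description where possible, add the rest to the target, and keep the map so every child-table FK can be translated through it:

    -- hypothetical source-to-destination map for a lookup/type table
    CREATE TABLE dbo.StatusTypeMap (
        SourceStatusTypeID int NOT NULL,
        TargetStatusTypeID int NOT NULL
    );

    -- types that already exist in the target (matched on description) map to the existing ID
    INSERT INTO dbo.StatusTypeMap (SourceStatusTypeID, TargetStatusTypeID)
    SELECT s.StatusTypeID, t.StatusTypeID
    FROM SourceDB.dbo.StatusType AS s
    JOIN TargetDB.dbo.StatusType AS t
        ON t.Description = s.Description;

    -- user-created types with no match are inserted into the target and mapped to their new IDs
    -- (not shown); child tables then translate StatusTypeID through this map during the merge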

  • michael.welcome (1/11/2016)


    If you want an even more difficult task, try explaining to management types why this is not a simple and easy thing to do.

    Adding a fixed amount to identity keys can convert them to non-overlapping ranges. This works pretty well for tables that are truly just data.

    Many databases also contain "type tables" that are used to populate drop-downs. In a configurable system, users may be able to add new types. Some of these values will be original and occur in both databases; those should not be duplicated when merging the data. Others will be user-created. Merging this kind of data gets even more hairy. You probably need to build a source-to-destination cross-reference mapping. Just figuring out all the references in an old database that did not always use RI is a huge effort.

    Take 2 different pens and put them on the table.

    They may be completely different in design, supplier, and price.

    But each database has them with the same description.

    Or they could be the same, but the descriptions are different in each database.

    I worked for a global company and pulled together information from sites around the globe, some with the same core ERP system.

    In each system, we did at least get a common commodity class to be used across all systems.

    This was put into an SSAS cube, with a surrogate key for the originating company.

    So the descriptions could be the same, although you could still see where a pen might not be the same pen.

    It is indeed a difficult task, even with the same schema.

    You have the same issue with Suppliers, Customers, etc.

    Which sometimes can be solved by address, but sometimes not.

    You can also have parent / child relationships between these.

    In the end, it is rather easy to show examples of the pitfalls that make it easy to understand why this is not as simple as it sounds.

    Exactly why the sources are being brought together, and the intended business use, need to be questioned.

    The importance of the business decisions that the resulting data will support might be a factor in how much cleansing needs to take place.

  • Greg Edwards-268690 (1/12/2016)


    michael.welcome (1/11/2016)


    If you want an even more difficult task, try explaining to management types why this is not a simple and easy thing to do.

    Adding a fixed amount to identity keys can convert them to non-overlapping ranges. This works pretty well for tables that are truly just data.

    Many databases also contain "type tables" that are used to populate drop-downs. In a configurable system, users may be able to add new types. Some of these values will be original and occur in both databases; those should not be duplicated when merging the data. Others will be user-created. Merging this kind of data gets even more hairy. You probably need to build a source-to-destination cross-reference mapping. Just figuring out all the references in an old database that did not always use RI is a huge effort.

    Take 2 different pens and put them on the table.

    They may be completely different in design, supplier, and price.

    But each database has them with the same description.

    Or they could be the same, but the descriptions are different in each database.

    I worked for a global company and pulled together information from sites around the globe, some with the same core ERP system.

    In each system, we did at least get a common commodity class to be used across all systems.

    This was put into an SSAS cube, with a surrogate key for the originating company.

    So the descriptions could be the same, although you could still see where a pen might not be the same pen.

    It is indeed a difficult task, even with the same schema.

    You have the same issue with Suppliers, Customers, etc.

    Which sometimes can be solved by address, but sometimes not.

    You can also have parent / child relationships between these.

    In the end, it is rather easy to show examples of the pitfalls that make it easy to understand why this is not as simple as it sounds.

    Exactly why the sources are being brought together, and the intended business use, need to be questioned.

    The importance of the business decisions that the resulting data will support might be a factor in how much cleansing needs to take place.

    Trying to rein this in a bit because, and I could be wrong, I think this is a tangential scenario to the one raised by the OP. It can get murky in a hurry if you need to consider blending the data, but when the two databases you need to merge contain mutually exclusive datasets, it becomes a simple matter of maintaining RI.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • The reason for the merge is that one client acquired the other, and they don't want to maintain separate systems.

    I know it's a grand task because the DB schema is fully normalized, with plenty of FKs and Parent_ids. Having said that, I know I'm not reinventing the wheel here, so I was hoping there is a tool that can do this. The steps I need to take all seem programmable to me, so in my opinion a tool should be able to do this, based on some configuration.
