Creating a subset of data

  • Thanks for reading and hopefully helping me find a solution...
    I have a database that is about 750GB and i would like to create the same database that is 50GB(or smaller)... I need all aspects(tables, views, schemas,etc...etc) of the larger one in the smaller but not sure how to go about it.
    Any ideas are appreciated.

    Thanks for your time

    DHeath

    DHeath

  • DHeath - Friday, January 18, 2019 8:26 AM

    Thanks for reading and hopefully helping me find a solution...
    I have a database that is about 750GB and i would like to create the same database that is 50GB(or smaller)... I need all aspects(tables, views, schemas,etc...etc) of the larger one in the smaller but not sure how to go about it.
    Any ideas are appreciated.

    Thanks for your time

    DHeath

    You want to create a copy of the database, but without the data?

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Thanks for the reply Phil,
    I want to create a copy of the database with only partial data...50GB or less of the 700+GB original dataset

    DHeath

    DHeath

  • your problem is going to be referential integrity.  where you have foreign keys you will need to identify all of the dependent records.
    If you are building a test or demo system you are also going to have GDPR issues.

    Lets say that it is a patient management system and you want to take all of the patients that were treated in 2017.

    You are going to need the treatment plan records for 2017, their Patient records and their medical history which probably means that you need more than just the 2017 appointments, their contact details, the doctors and nurses, their team leaders, shift patterns, work locations, medicines prescribed, batch numbers, other patients treated with those batches, other drugs used by those patients in the batches, their patient details and trearment plans  

    If you have a system on any complexity, then it is almost impossible to give a generic answer.  Some of the folk on this board get big bucks to solve these problems on a case by case basis.

    Redgate do have some good tools for generating fake data.

  • Aaron,

    Thanks for the reply and i am thinking i will have to most likely get a third party tool for this.  Thanks for your comment and you are 100% correct on all areas touched.

    Thanks again

    DHeath

    DHeath

  • I have had the need to do this as well.  As Aaron points out, referential integrity will need to be maintained in order for your subset to make any sense.

    So, let's say you have a contact management system for the U.S. and you want to create a subset of data based on region.  Hopefully you have well defined foreign keys.

    You will need to begin with that table (dbo.Region) and then look at all of the other tables which have a dependency on this table. 

    SELECT OBJECT_NAME(c.object_id) FROM sys.columns c WHERE c.[name] = 'RegionID'

    Something like that.  Then, you will need to do the same thing for each table you discover.  Each table.  And then do the same thing for those tables ... and so on.  That is why I am hoping you have well defined foreign keys as it makes it quite nice.

    THEN, after copying the DDL for each and EVERY object you have discovered (check computed columns, defaults ...) you begin copying data from the top down.

    I have written a process to do this for my company to create data subsets.  It took me about 3 months, but I use it constantly.  Well worth the effort.

    BTW, any 3rd party tool will navigate the foreign key chain, so, again, I hope they are well defined.

  • The approaches already suggested are the way to go, but you need to consider the criteria you use when defining the subset (I assume you're intending to use the smaller copy for testing, demos, training, etc.). In my situation (an investment management company), we had to work with users in the business to ensure we picked a representative cross-section of clients, but we also had to cover edge cases so that the data would be valuable for regression testing.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply