January 18, 2019 at 8:26 am
Thanks for reading and hopefully helping me find a solution...
I have a database that is about 750GB and I would like to create the same database at 50GB (or smaller)... I need all aspects (tables, views, schemas, etc.) of the larger one in the smaller, but I'm not sure how to go about it.
Any ideas are appreciated.
Thanks for your time
DHeath
January 18, 2019 at 9:11 am
DHeath - Friday, January 18, 2019 8:26 AM
Thanks for reading and hopefully helping me find a solution...
I have a database that is about 750GB and I would like to create the same database at 50GB (or smaller)... I need all aspects (tables, views, schemas, etc.) of the larger one in the smaller, but I'm not sure how to go about it.
Any ideas are appreciated. Thanks for your time
DHeath
You want to create a copy of the database, but without the data?
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
January 18, 2019 at 9:33 am
Thanks for the reply Phil,
I want to create a copy of the database with only partial data... 50GB or less of the 700+GB original dataset.
DHeath
January 22, 2019 at 10:13 am
Your problem is going to be referential integrity: where you have foreign keys, you will need to identify all of the dependent records.
If you are building a test or demo system, you are also going to have GDPR issues.
Let's say that it is a patient management system and you want to take all of the patients that were treated in 2017.
You are going to need the treatment plan records for 2017, their patient records, and their medical history, which probably means that you need more than just the 2017 appointments: their contact details, the doctors and nurses, their team leaders, shift patterns, work locations, medicines prescribed, batch numbers, other patients treated with those batches, other drugs used by those patients, and their patient details and treatment plans.
If you have a system of any complexity, then it is almost impossible to give a generic answer. Some of the folk on this board get big bucks to solve these problems on a case-by-case basis.
Redgate do have some good tools for generating fake data.
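To illustrate the fake-data idea without naming any specific tool's API, here is a minimal sketch: keep the shape of a record but replace every identifying field with generated values. All field names here are invented for illustration; real tools do this schema-aware and at scale.

```python
import random
import string

random.seed(0)  # deterministic so the demo is repeatable

def fake_patient(patient_id):
    # Generate a plausible-looking but entirely synthetic record.
    surname = random.choice(string.ascii_uppercase) + \
              "".join(random.choices(string.ascii_lowercase, k=7))
    dob = f"19{random.randint(50, 99)}-{random.randint(1, 12):02d}-{random.randint(1, 28):02d}"
    return {"PatientID": patient_id, "Surname": surname, "DOB": dob}

patients = [fake_patient(i) for i in range(1, 4)]
print(len(patients))  # 3 synthetic rows, no real personal data involved
```

The key point for GDPR is that the synthetic rows keep only the structure of the real data, never the values.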
January 22, 2019 at 10:27 am
Aaron,
Thanks for the reply, and I am thinking I will most likely need a third-party tool for this. Thanks for your comment; you are 100% correct on all areas touched.
Thanks again
DHeath
January 22, 2019 at 4:14 pm
I have had the need to do this as well. As Aaron points out, referential integrity will need to be maintained in order for your subset to make any sense.
So, let's say you have a contact management system for the U.S. and you want to create a subset of data based on region. Hopefully you have well defined foreign keys.
You will need to begin with that table (dbo.Region) and then look at all of the other tables which have a dependency on this table.
-- Find every table with a RegionID column; a starting point for discovery
-- (sys.foreign_keys gives the definitive dependency list)
SELECT OBJECT_NAME(c.object_id) AS TableName
FROM sys.columns AS c
WHERE c.[name] = 'RegionID';
Something like that. Then you will need to do the same thing for each table you discover. Each table. And then do the same thing for those tables ... and so on. That is why I am hoping you have well-defined foreign keys, as that makes the process much more manageable.
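That "do the same thing for each table you discover" step is a transitive walk of the foreign key graph. Here is a hedged sketch of the idea using Python's built-in sqlite3 so it can run anywhere; the Region/Customer/Orders schema is invented, and a real SQL Server version would read sys.foreign_keys instead of SQLite's PRAGMA.

```python
import sqlite3

# Hypothetical three-table chain for illustration: Orders -> Customer -> Region.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Region   (RegionID INTEGER PRIMARY KEY);
    CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY,
                           RegionID INTEGER REFERENCES Region(RegionID));
    CREATE TABLE Orders   (OrderID INTEGER PRIMARY KEY,
                           CustomerID INTEGER REFERENCES Customer(CustomerID));
""")

def dependent_tables(conn, root):
    """Return every table that references root, directly or transitively."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    found, queue = set(), [root]
    while queue:
        target = queue.pop()
        for t in tables:
            if t in found or t == root:
                continue
            # PRAGMA foreign_key_list lists the tables t references (column 2).
            referenced = {row[2] for row in
                          conn.execute(f"PRAGMA foreign_key_list({t})")}
            if target in referenced:
                found.add(t)       # t depends on something already in the subset,
                queue.append(t)    # so anything referencing t must be checked too
    return found

print(sorted(dependent_tables(conn, "Region")))  # ['Customer', 'Orders']
```

Orders never references Region directly, yet it still shows up, which is exactly why the one-level column search above has to be repeated for every table you find.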
THEN, after copying the DDL for each and EVERY object you have discovered (check computed columns, defaults ...) you begin copying data from the top down.
I have written a process to do this for my company to create data subsets. It took me about 3 months, but I use it constantly. Well worth the effort.
BTW, any 3rd party tool will navigate the foreign key chain, so, again, I hope they are well defined.
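For the "copy data from the top down" step, the shape of the work looks something like this. Again a minimal sketch with an invented Region/Customer schema in sqlite3: recreate the DDL in an empty target, copy the parent table filtered to the subset you want, then copy each child restricted to the parents that survived, so every foreign key still resolves.

```python
import sqlite3

# Invented source schema and data for illustration.
src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE Region   (RegionID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY,
                           RegionID INTEGER REFERENCES Region(RegionID),
                           Name TEXT);
    INSERT INTO Region   VALUES (1, 'East'), (2, 'West');
    INSERT INTO Customer VALUES (10, 1, 'Acme'), (20, 2, 'Globex');
""")

# Attach an empty database to receive the subset and recreate the DDL there.
src.execute("ATTACH ':memory:' AS subset")
src.executescript("""
    CREATE TABLE subset.Region   (RegionID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE subset.Customer (CustomerID INTEGER PRIMARY KEY,
                                  RegionID INTEGER REFERENCES Region(RegionID),
                                  Name TEXT);
""")

# Parents first, filtered to the chosen subset ...
src.execute("INSERT INTO subset.Region SELECT * FROM main.Region WHERE RegionID = 1")
# ... then children, restricted to the parent rows that were kept.
src.execute("""
    INSERT INTO subset.Customer
    SELECT * FROM main.Customer
    WHERE RegionID IN (SELECT RegionID FROM subset.Region)
""")

print(src.execute("SELECT Name FROM subset.Customer").fetchall())  # [('Acme',)]
```

With a deep dependency chain, each discovered table gets its own filtered INSERT ... SELECT in dependency order, which is the part a third-party tool automates for you.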
January 23, 2019 at 7:04 am
The approaches already suggested are the way to go, but you need to consider the criteria you use when defining the subset (I assume you're intending to use the smaller copy for testing, demos, training, etc.). In my situation (an investment management company), we had to work with users in the business to ensure we picked a representative cross-section of clients, but we also had to cover edge cases so that the data would be valuable for regression testing.