Why is it a good idea to not restore a production database to test?

  • (I looked over all the forums here on SSC and this one seemed the best for my question. If you think I'm wrong, I'm sorry I posted my question here.)

    For years I've read here and elsewhere that it is not a good idea to restore a production database to its equivalent test database. I've heard and read this so often that I take it as a given. However, I've never understood why, because I've never read a reason why.

    Now, it has become SOMETHING WE MUST DO NOW AT ALL COSTS AND EVERYTHING ELSE, NO MATTER HOW IMPORTANT IT IS, MUST BE STOPPED SO THAT WE CAN DO THIS NOW!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

    OK, I guess. But I still don't know why. So, why is it a bad idea to have production data in a test database?

    Kindest Regards, Rod Connect with me on LinkedIn.

  • There I was, with a written, thank the gods, command, "Whenever a developer asks for a copy of production, give it to them."

    A developer asked and I hooked him up.

    Said developer was, at the time, working on some kind of automated email system. Also, said developer entertained themselves, at work, by sending themselves adult themed images in the email as they tested. So yeah, we ended up sending porn to a chunk of the customer list because we didn't have any rules around cleansing production data prior to giving it to developers. I didn't get fired because of the written instructions. And, let's remember, this was decades ago. I was far too inexperienced to know I should have been cleaning the data, with or without instructions.

    Any kind of personally identifying information in non-production environments is a violation of a bunch of laws, depending on the location and type of data. The General Data Protection Regulation (GDPR) makes it a crime to release individual peoples information without their knowledge. If individuals is an EU citizen and the data was collected while they were in the EU. The very first fine levied under the law was for a Canadian company, not even an EU company. There are fines for the same thing with California citizens data. Several other countries, regions, and states have those kinds of laws. In the US, federally, HIPAA protects health information. In fact, HIPAA gets really mean about it. Most laws fine the organization. HIPAA can jail the individual who knowingly violates the law (and due to mens rea, you now know, you're welcome). Hasn't happened to my knowledge, but hoo boy. As I say when presenting on this topic, I don't want a new roommate named Tiny. PCI data, parts of SOX. The list goes on.

    Could be a bogus study, but 20-25 of all data breaches are from non-production environments. Here are several examples of this happening.

    So, are you in a situation where your production data doesn't meet any laws, rules, or regulations? Then, sure, give production data to people.... but I'd still clean it and make it not look like production. According to Redgate's own surveys, 60% of people are using production data in non-production environments. Scares the crap out of me because I've been there.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thank you, Grant. I really appreciate a reason being given! Some of these things don't apply to my or my organization (I work for a large state government department in southwestern USA). But I think that other parts do. And I kinda feel like it would be better to be doing what either GDPR or California's law stipulates, under the premise that at some point it will apply here.

    As an aside, I am concerned as to how we'll do that. Some of the apps and databases we have are purchased. I would hope that the vendors have encountered this request before and have a solution. But the homegrown apps and databases we've written are a different matter. For reasons I've never understood, all of our homegrown apps have never had a DBA involved in the database design. I don't get that at all, but that's the way it is. I would think the people who really know how to design databases, would be the DBAs. But the culture here is for the developers to design the database, normally in test (we don't typically have a Developer environment), then hand the SQL scripts over to the DBAs to run on the production server. Consequently, we have no tools to generate test data. And I KNOW that my employer won't buy any, because that entails spending money, which they don't want to do. So, if anyone knows of any free test code generating tools, please point them to me.

    Kindest Regards, Rod Connect with me on LinkedIn.

  • Your state may already have a law about it. I was just talking to the IT staff of another state and they had one. I think most states do. But yeah, good luck on it all. It's a true struggle and some people can only learn by being the poor schmuck who sent out the porn.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • with a few exceptions Production data can easily be manipulated to remove/scramble any PII or similar data at which time none of it can be used if it is leaked - care still required with financial data (this should be subject to a different type of scramble) as that info if leaked can have a financial impact on the company as well.

    so as a first step see what you need from prod - then see how you can scramble it (or remove it in some cases e.g. customer service notes about clients would not normally be required at all to be scrambled and should just be removed/replaced with a dummy string), then setup a process where prod db is copied to another db (in a prod server or highly restricted server), scrambled, and the backup of this one is then made available for non prod environments where required.

    And where possible also look at tables that are big, but never really needed for non prod testing, and exclude those from the scrambled database (less space required downstream).

  • Most requests to restore Prod to Dev are due to the dev folk not wanting to build adequate test data. Doing this is costly, it takes time to think through what is needed, time to build it, and time to reiterate the whole process as the code or the business changes.

    So many organisations think they can shortcut this process by restoring Prod to Dev. After all, if the code works on a copy of Prod then it should work on the real Prod. This is fine if you treat the copy of Prod as the only happy cases the code has to deal with, and do not need to worry too much about the sad cases or testing new types of user journey.

    If you want to expand you viewpoint to include legal compliance then all the data scambling already talked about is needed. You should also not assume that all valid user journeys are present in Prod data. In practice it is very rare that a copy of Prod, even when scrambled, will be adequate for the needs of Dev.

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

  • EdVassie wrote:

    . In practice it is very rare that a copy of Prod, even when scrambled, will be adequate for the needs of Dev.

     

    I find that prod data is a lot more useful for finding weird edge cases than test data.  End users are amazing at managing to do the oddest things with software.

    • This reply was modified 4 months, 3 weeks ago by  ZZartin.
  • Thank you, @edvassie, for your response. As a developer, you are right I have no idea of how to create test data for a test database, except for doing data scrambling. I've been in this field for many years. Every place where I've worked has taken a backup of production data, then restored it to a new place for a test database. All restrictions and permissions that are applied to the production database, are also applied to the test database and application(s) that access it.

    I'm not defending this approach as a best practice. I bring it up to say that we did it that way, because we didn't know any better. I've gotten great feedback here for why simply using a recent backup of a production database for a test database isn't a good idea. Although I do still have the problem I mentioned back on August 12th, and that is we need a free tool to generate test data, because our employer will not buy one.

    Kindest Regards, Rod Connect with me on LinkedIn.

  • as I said above you don't necessarily need a tool to generate data - it all depends on what your production data is and what can be copied down without the need to scramble it (which is easyish to do in SQL) - my shop copies several db's from prod down - majority of them will go through a cleaning exercise built in-house to scramble/remove any prod data that is not needed AS IS - and then it can be used for any dev/testing.

    in one of the systems the devs even have a home grown tool that allows them to create smaller db's as a subset of the main copy, where they specify which "contracts" they require to test, and the tool will then copy all required data for those contracts to a new db alongside with all other config tables.

    this same tool is also used for CI/CD testing which runs every night on that system.

  • ZZartin wrote:

    EdVassie wrote:

    . In practice it is very rare that a copy of Prod, even when scrambled, will be adequate for the needs of Dev.

    I find that prod data is a lot more useful for finding weird edge cases than test data.  End users are amazing at managing to do the oddest things with software.

    That's been my experience as well. I can build out meaningful test data well enough. But for true crazy, I rely on the people consuming the app.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • frederico_fonseca wrote:

    as I said above you don't necessarily need a tool to generate data - it all depends on what your production data is and what can be copied down without the need to scramble it (which is easyish to do in SQL) - my shop copies several db's from prod down - majority of them will go through a cleaning exercise built in-house to scramble/remove any prod data that is not needed AS IS - and then it can be used for any dev/testing.

    in one of the systems the devs even have a home grown tool that allows them to create smaller db's as a subset of the main copy, where they specify which "contracts" they require to test, and the tool will then copy all required data for those contracts to a new db alongside with all other config tables.

    this same tool is also used for CI/CD testing which runs every night on that system.

    OK, I am not being clear enough. I'll try to make it painfully obvious. One of the databases is a third-party app and database. The database is much larger and more complex than anything I've ever written, even when working on a team with other developers. Under these circumstances I expect that a software package that obfuscates data will recognize a column's name as containing sensitive data and needing to have its values obfuscated. Like it would recognize a column with a name of FirstName as one whose values must be obfuscated.

    Now, that's the easy case. When it comes to trying to obfuscate all possible columns which have PII or PHI, such as what are identified by GDPR or the California's law equivalent, then I can guarantee you with 100% certainty that I will not know all possible names for such columns. So, if it is left up to me to obfuscate all columns with data that would need to be obfuscated, then I will fail, due to ignorance. Therefore, it would help us a lot if the obfuscating software could recognize all of the names used for such columns, because odds are very high that we won't know them all.

    On the other hand, if none of the obfuscating software have the capability of recognizing columns with names that convey the data contained within is either PII or PHI, then we're doomed to fail. And I'd like to know if that is the state of affairs with such obfuscating software.

    Kindest Regards, Rod Connect with me on LinkedIn.

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply