August 21, 2014 at 8:17 pm
Comments posted to this topic are about the item Production Subsets
August 21, 2014 at 11:51 pm
Unless there is a compelling reason not to, developing on full production data is best.
Consider:
[p]Simple things like a NULL where you didn't expect one, a non-numeric value, or a duplicate can cause unexpected problems when you move to production unless you developed and tested against full production data.[/p]
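[p]For instance, one stray NULL can quietly change a query's meaning. A minimal sketch (table and column names are made up for illustration):[/p]
[code="sql"]
-- One NULL in the subquery's column makes NOT IN return no rows at all.
CREATE TABLE #Orders  (CustomerID int);
CREATE TABLE #Blocked (CustomerID int);
INSERT INTO #Orders  VALUES (1), (2), (3);
INSERT INTO #Blocked VALUES (2), (NULL);

SELECT CustomerID
FROM   #Orders
WHERE  CustomerID NOT IN (SELECT CustomerID FROM #Blocked);
-- Returns zero rows, not customers 1 and 3, because
-- "CustomerID <> NULL" evaluates to UNKNOWN for every row.
[/code]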
[p]That CTE or correlated subquery might work fine with modest numbers of rows, but it may need to be refactored to hold up under the demands of full production data. It is better to test and tune as you go.[/p]
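[p]As a sketch of the kind of refactoring I mean (tables are hypothetical; always compare the actual execution plans):[/p]
[code="sql"]
-- The correlated form asks the question once per outer row:
SELECT o.OrderID,
       (SELECT MAX(d.UnitPrice)
        FROM   dbo.OrderDetails d
        WHERE  d.OrderID = o.OrderID) AS MaxPrice
FROM   dbo.Orders o;

-- Joining to a pre-aggregated derived table often holds up better at
-- production volumes (the optimizer sometimes does this rewrite itself,
-- so verify against the plan):
SELECT o.OrderID, agg.MaxPrice
FROM   dbo.Orders o
JOIN   (SELECT OrderID, MAX(UnitPrice) AS MaxPrice
        FROM   dbo.OrderDetails
        GROUP  BY OrderID) AS agg
       ON agg.OrderID = o.OrderID;
[/code]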
[p]Sure, in a perfect world there is documentation about the data structures, and there is good referential integrity. But in the real world in which I live, sometimes there is neither, and reverse-engineering and research are required. I have often come across situations where there are thousands of rows in a table with NULLs in a column...and then rows that have a value. If I had only a subset of production data, I could easily conclude that the column was not being used when in fact it was.[/p]
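[p]A quick profile settles the question. Something like this (names are illustrative):[/p]
[code="sql"]
-- COUNT(column) skips NULLs, so comparing it to COUNT(*) shows
-- whether a column is truly unused or just sparsely populated.
SELECT COUNT(*)                   AS TotalRows,
       COUNT(SomeColumn)          AS NonNullRows,
       COUNT(DISTINCT SomeColumn) AS DistinctValues
FROM   dbo.SomeTable;
[/code]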
[p]Imaginary data and simple "Test 1" data introduce an additional layer of abstraction for both developers and stakeholders: "imagine that is a real order detail record" vs. looking at an actual order detail. Besides the unexpected data that may affect behavior, looking at dummy data tends to make us gloss over problems because we aren't really interpreting what we see in real-world terms.[/p]
[p]Some of development is about improving data and the way it is organized. For example, you might notice that long address values are being truncated and an address2 column is needed. Or that you are in danger of overflowing an int identity column. Or that rows imported from another source need cleansing. Or any number of other opportunities for improvement. These things might not be on a requirements document or even on an agile story, but they are things that should be corrected, and there is a better chance of finding them when you are seeing more data.[/p]
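[p]Full-size data makes these checks worth running. A couple of hedged examples (object names are assumptions):[/p]
[code="sql"]
-- How much headroom is left in an int identity column?
SELECT IDENT_CURRENT('dbo.Orders')              AS CurrentIdentity,
       2147483647 - IDENT_CURRENT('dbo.Orders') AS RemainingValues;

-- Are address values bumping against the declared column length?
SELECT COUNT(*) AS AtMaxLength
FROM   dbo.Customers
WHERE  LEN(Address1) >= 60;   -- assumes Address1 is varchar(60)
[/code]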
[p]Just today I was developing a complex stored procedure. I was testing against full production data, and it worked fine. I handed the code off to another developer who was testing against dev data...and my code did not return the expected results. It turned out that the problem was with the dev data. No big deal, but there was confusion and some wasted time that would have been avoided if we had both been using full production data.[/p]
August 22, 2014 at 1:54 am
As always there is a balance to be struck, and it depends very much on both the task one is doing and the conditions one is doing it under.
My preference is for developers to work against a local database with a subset of data (maintained through source control in some way, i.e. it can be completely ditched and replaced with a couple of clicks or commands), with the Continuous Integration server running tests on check-in against a large set of data (the full set, unless it is absolutely massive), plus a regular CI build with fully instrumented performance tests.
My reasoning: it is often too expensive, time-wise, to run the full suite of performance tests on every check-in (and doing so encourages larger, less frequent check-ins); developers will not want to return to work that is non-performant, so they will consider performance up front if they know they have to pick up their own failures; developers will often defer performance issues indefinitely unless they break the build (and only performance tests do that); and in some cases stored procedure performance is outside the developer's scope.
Some people may immediately shout at me for that last point, but I have worked on projects where the 3GL developers write stored procedures as best they can, and Database Developers (DBAs) accept the stored procedures and treat each of them as a specification. Sometimes this means rewriting the stored procedure while keeping the signature and output identical; occasionally there is an advised or required change to the signature; and sometimes the stored procedure needs no alteration. Under those conditions I have always believed that the 3GL developers should aim for the last outcome but accept that either of the first two may be the case.
Gaz
-- Stop your grinnin' and drop your linen...they're everywhere!!!
August 22, 2014 at 2:59 am
I'm with David and Gary. I'm not in a shop where we have subsets of devs like 3GL and DB-specific types. We have a database guy, but getting him to look at a project is very unlikely as he will be sorting out other stuff like DTS/SSIS packages and data cleansing exercises. The point about working to get adequate performance with realistic data is the most important, but all of what David said was relevant.
August 22, 2014 at 3:04 am
As long as the data is scrubbed so that it doesn't break the data protection laws, then a full dataset can be easier. But that always takes time, and most developers want the unaltered data just in case the changes break their code.
Then again, a subset is better because it can be clean. Why write more complex logic when the base data needs cleaning in the first place?
August 22, 2014 at 3:18 am
Yet Another DBA (8/22/2014)
As long as the data is scrubbed so that it doesn't break the data protection laws, then a full dataset can be easier. But that always takes time, and most developers want the unaltered data just in case the changes break their code...
There are data-masking tools for this. I have not used them myself, but I have used their output and been very satisfied: the data was realistic (not literally, but in terms of varying lengths, NULLs, etc.) and ensured that I was compliant.
Gaz
-- Stop your grinnin' and drop your linen...they're everywhere!!!
August 22, 2014 at 3:39 am
I used to favour production data for development and testing, but with the introduction of the Data Protection Act in the UK, doing so became something of a headache.
It is still worth testing on a database of representative size*, but these days I would advocate using synthetic data. Production data can be of rather variable quality, and may not be available at all during the development phase. Synthetic data, if you know the algorithms used to generate it, will have predictable characteristics, so you can run a query and, whilst you may not know what you will get back, you should know exactly how big the result set should be.
* So often, I have seen systems go live only to be beset by severe performance problems which needed to be addressed urgently. Using a full size database lets you find these problems before go-live.
Many years ago, I worked on a CRM system. I was tasked with populating the database with enough test data to make it about the same size as the live system would be. I wrote a program to generate it, populating all the tables. It took a couple of hours to run, but it gave data of predictable characteristics, and you could throw the database away and start again without too much pain.
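The same idea works in plain T-SQL these days. A minimal sketch (names and proportions are purely illustrative): a tally-style generator whose NULL rate and value distribution are fixed by design, so you know in advance how big any result set should be.
[code="sql"]
-- Generate a deterministic million-row test table.
;WITH N AS (SELECT TOP (1000000)
                   ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
            FROM   sys.all_objects a CROSS JOIN sys.all_objects b)
SELECT n                                    AS CustomerID,
       'Customer ' + CAST(n AS varchar(10)) AS CustomerName,
       CASE WHEN n % 10 = 0 THEN NULL       -- exactly 10% NULL, by design
            ELSE 'Region ' + CAST(n % 50 AS varchar(2)) END AS Region
INTO   dbo.TestCustomers
FROM   N;
[/code]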
Usually, the data migration will happen close to go-live, so testing on production data can happen at the User Acceptance Testing (UAT) phase. Sure, there will be things in the production data that trip up the application. But letting the users test on the production data means they will handle it the way they know how, and developers don't have to bear the responsibility for it.
August 22, 2014 at 3:43 am
john.riley-1111039 (8/22/2014)
........* So often, I have seen systems go live only to be beset by severe performance problems which needed to be addressed urgently. Using a full size database lets you find these problems before go-live.
......
Totally agree; it's very difficult for a developer not to tune, or not to ask for help tuning, when their work is too slow for production.
With today's hard drives there is no reason why even a 100GB database can't be restored quickly for developers to test and develop their code against.
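To show how little ceremony that restore needs, a minimal sketch (the paths, logical file names, and database name are assumptions):
[code="sql"]
RESTORE DATABASE DevCopy
FROM  DISK = N'D:\Backups\Production.bak'
WITH  MOVE N'ProdData' TO N'D:\Data\DevCopy.mdf',
      MOVE N'ProdLog'  TO N'D:\Logs\DevCopy.ldf',
      STATS = 10;   -- report progress every 10%
[/code]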
August 22, 2014 at 6:06 am
I am for having the right data in the right environment.
Dev databases should have a data set that is representative of Production but smaller. All known scenarios should be covered, including some bad data in Dev if the Prod system can allow it in. You won't always know what these scenarios are until you run into them in Prod, but when you do, just add the scenario to the test data set. The point of the Dev environment is to give developers enough data to test their work without hindering productivity. Security hinders productivity, so there should be no sensitive data in Dev.
So, how do we test properly against the full data set to verify performance and quality? I argue that it's not the developers' job to do this. It belongs in the hands of the Quality Assurance (QA) team. The QA team should be testing on their own set of servers with their own data. Since that's a smaller group of people, testing against Prod data shouldn't be as big an issue. If it is an issue, then you need test data here too, but it should be comparable to Prod in size as well as data distribution. The QA environment is also a good place to do regression testing: set up all your nightly jobs and unit tests to run against QA as well as Prod. That should help you discover those unknown scenarios in your new code before it gets to Prod.
[font="Tahoma"]Bryant E. Byrd, BSSE MCDBA MCAD[/font]
Business Intelligence Administrator
MSBI Administration Blog
August 22, 2014 at 6:17 am
Tatsu (8/22/2014)
I am for having the right data in the right environment.
I don't see the point. If production data can be obfuscated while maintaining similar statistical characteristics, and if there's space, then why bother pruning the data? You have the extra cost of pruning, and all you achieve is to delay finding problems until the developer's mind is well and truly elsewhere.
August 22, 2014 at 6:20 am
A separate QA Team? Wow, that's a luxury where I work.
Development needs to spend some time on databases that reflect real data and large datasets. There's nothing like actual data and huge amounts of it to bring designs back to the drawing board. Also users/clients with the largest databases tend to be the ones that bring the most revenue or have a high profile.
As a developer/analyst/ETL/part time admin guy, there's nothing like dealing with the actual data. Especially when dealing with table design and ETL processes.
August 22, 2014 at 6:31 am
In most cases, I like to develop against a full data set for the reasons already outlined. Yes, I use the statistics and check the plans, but they only take you so far. I haven't found a real substitute for performance testing against production-sized data sets.
That being said, it is a pain to build them. The simplest approach is to just copy the production database, but then you have the privacy problems that so many of us (hopefully all of us) deal with. Yes, you can downsize everything, but then you're testing with subsets.
I think a good approach is to write a script that anonymizes the data. If the database is structured properly, you should only have to clear out the PII, but then you get into the definition of PII. If it's only the data that directly identifies an individual, the task is relatively simple. If it includes data that directly or indirectly identifies an individual, the scope becomes larger and the script takes a lot more time to write. I know "structured properly" is a big assumption in many cases, but that's another discussion about bad design having broad-reaching implications. The complexity of the database is a big factor in how big the anonymizing script is going to be. And really, should we let the cost of "doing it right" determine whether or not we do it right?
Of course, the anonymizer approach does come with a price tag...that being the time it takes to write and maintain the script. But given the performance testing we should all be doing, is there any other alternative? I suppose it would be to test in production.
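To make the idea concrete, a fragment of what such a script might look like (table and column names are assumptions, not a recommendation of any particular scheme):
[code="sql"]
-- Overwrite the direct identifiers with synthetic values while
-- keeping lengths and formats roughly realistic.
UPDATE dbo.Customers
SET    FirstName = 'First' + CAST(CustomerID AS varchar(10)),
       LastName  = 'Last'  + CAST(CustomerID AS varchar(10)),
       Email     = 'user'  + CAST(CustomerID AS varchar(10)) + '@example.com';
[/code]
The real script would have to cover every PII column in every table, which is exactly where the writing and maintenance cost comes from.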
August 22, 2014 at 6:33 am
david.wright-948385 (8/22/2014)
Tatsu (8/22/2014)
I am for having the right data in the right environment.
I don't see the point. If production data can be obfuscated while maintaining similar statistical characteristics, and if there's space, then why bother pruning the data? You have the extra cost of pruning, and all you achieve is to delay finding problems until the developer's mind is well and truly elsewhere.
I hate to use clichés, but time is money. I don't want my developers sitting around waiting for an SSIS package to load millions of rows when they only need a hundred thousand or so to satisfy the tests. I want them spending their time solving problems, not waiting on mundane tasks.
We actually use Prod data in Dev and QA where I work, but I am pushing to stop that practice, at least on the Business Intelligence team. We have all sorts of sensitive information in our data warehouse. As a result, we have some convoluted security practices to prevent everyone on the team (about 15 people plus contractors) from having access to all the data. These practices make things harder on the developers and make the solutions harder to deploy.
If we could eliminate both of those problems, the developers' productivity gains would far outweigh the cost of setting up test data. The investment in building the initial data set is my biggest hurdle; maintaining it would be minimal work, probably even less than refreshing the data from Prod on a regular basis.
[font="Tahoma"]Bryant E. Byrd, BSSE MCDBA MCAD[/font]
Business Intelligence Administrator
MSBI Administration Blog
August 22, 2014 at 6:42 am
chrisn-585491 (8/22/2014)
A separate QA Team? Wow, that's a luxury where I work.
Development needs to spend some time on databases that reflect real data and large datasets. There's nothing like actual data and huge amounts of it to bring designs back to the drawing board. Also users/clients with the largest databases tend to be the ones that bring the most revenue or have a high profile.
As a developer/analyst/ETL/part time admin guy, there's nothing like dealing with the actual data. Especially when dealing with table design and ETL processes.
We don't actually have a QA team in my group either; however, at some point the developers need to break out of "build it" mode and get into the "break it" mindset. You don't want the "break it" guys working on the same servers as the "build it" folks though.
I am not saying that you shouldn't test or troubleshoot against Prod data at some point, but developers shouldn't have to trudge through all that data to get their work done. We have a multi-terabyte data warehouse. The BI developers who work on it usually use a subset of Prod data for their initial development, and I would say that less than 5% of their work requires performance troubleshooting. If that's the case, why would I want them working against the full data set all the time? It's just going to slow them down.
[font="Tahoma"]Bryant E. Byrd, BSSE MCDBA MCAD[/font]
Business Intelligence Administrator
MSBI Administration Blog
August 22, 2014 at 6:57 am
Tatsu (8/22/2014)
...at some point the developers need to break out of "build it" mode and get into the "break it" mindset.
I'm sorry, I think that's a fundamental error. Developers need to write code with testing in mind and test it as they go. If they're in 'build it' mode without a 'break it' mentality, you're going to get lots and lots of errors just when you don't need them.
And working with production data doesn't have to mean waiting for an SSIS package to run against all the data. If designed with test in mind, it's quite easy to limit a package's scope, and hence its run time, without limiting the data it operates on. Then maybe run a full-scope test when the devs go home at night, to make sure it will complete in the live context within the time available. Otherwise, when you finally let things loose on the live data, there's no time or money to fix it.
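One hypothetical way to do that is to drive the package's source query with a row cap that is lifted for the overnight full-scope run (object and parameter names are illustrative):
[code="sql"]
DECLARE @RowCap int = 100000;   -- set to NULL for the nightly full-scope run

SELECT TOP (ISNULL(@RowCap, 2147483647)) *
FROM   dbo.SourceTable
ORDER  BY LoadDate DESC;        -- recent rows exercise the current code paths
[/code]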