Production Subsets

  • The BI developers who work on it usually use a subset of production data for their initial development, and I would say that less than 5% of their work requires performance troubleshooting. If that's the case, why do I want them to work against the full data set all the time? It's just going to slow them down.

    No one has stated that developers need to work against a full data set all the time. They need to run specific tests and consider the cost of certain processes against a large data set.

    Example:

    Right now I'm working on an ETL job that will load a database estimated to be in the terabyte range. I'm testing specific packages on a development server, but I'll do integration and full testing on a server that is similar to the production server.

    I don't need full production data yet, but I guarantee that it will be tested with it. Right now I'm working with a subset of the data, which, like most data from other sources, has its own unique challenges.

  • I agree with using the full production set, obfuscated. It's not hard to write a script to do this, and there are also testing tools that will build pretty decent fake data (a rough sketch of the scripted approach follows at the end of this post). This way developers will naturally gravitate to an optimized solution. Why wait on a develop/build/test/fix cycle when you can catch most of it right in the development step? Of course you should have performance tests, etc., but it's all about keeping the cycle as short as possible.

    Also, in my experience, it would be nice if everything were always known up front, but that's typically not possible, especially when you develop solutions for sale: no matter what kind of fence you put around a solution, the sales team will drive over it with a bulldozer.
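
    A minimal sketch of the kind of masking script I mean, assuming a hypothetical dbo.Customer table restored from a production backup into a non-production database (the real schema, and which columns count as sensitive, will obviously differ):

        USE TestCopy;  -- run only against the restored, non-production copy
        GO

        -- Hypothetical table and column names; keeps row counts and keys intact
        -- so data volumes and distributions still resemble production.
        UPDATE dbo.Customer
        SET FirstName   = 'First' + CAST(CustomerID AS varchar(10)),
            LastName    = 'Last'  + CAST(CustomerID AS varchar(10)),
            Email       = 'user'  + CAST(CustomerID AS varchar(10)) + '@example.com',
            DateOfBirth = DATEADD(DAY, ABS(CHECKSUM(NEWID())) % 365, '19700101');

    Deterministic replacements like these keep joins and uniqueness intact; genuinely sensitive free-text columns usually need more thought.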

  • What about new projects where the production data is really small but will grow over time?

    One issue I've seen is where the developers create something and it works great while they only have a few thousand records, but then after a couple of years the application slows down dramatically. This is when I am asked to fix the database, but the issues are due to some really bad design decisions that were made before they brought the DBA team in. So we look at the system and find that the way it was designed will not scale. Often these systems are built by consultants who are long gone when the system develops scalability issues.

    What I am saying is that using production data may not be the best option early in an application's life cycle. Sometimes you need to think ahead and generate test data appropriate to the size the solution will reach after it's been used for a couple of years.

  • My preference is for developers to work on a subset (a defined set of development data) locally for development and unit testing, then push changes to a shared database that has automated testing against a more production-like set of data.

  • Developing and (some) testing against a subset doesn't mean that you never use production data, or even greater-than-production sized sets. It means that you do lots of your development against small sets so things run fast.

    You can still add edge cases, NULLs, and other unexpected data to a small set to test things (a small example follows at the end of this post). As you do performance tests against larger sets, you should learn what code works and what doesn't, and then bring those techniques into your toolset for the next development.

    There's no need to do most development and testing against large data sets. Just as there's no need to ignore tests against production sized/skewed data sets.
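
    As a small illustration of what I mean by seeding edge cases, here is a sketch against a hypothetical dbo.Orders table in a dev database (the table and columns are made up for the example):

        -- Hypothetical dev table; each row exercises a different "unexpected" case.
        INSERT INTO dbo.Orders (OrderID, CustomerID, OrderDate, Amount, Notes)
        VALUES (9001, 1,    '2012-02-29', 0.00,        NULL),                   -- leap-day date, zero amount, NULL note
               (9002, NULL, '1900-01-01', -5.00,       ''),                     -- missing customer, minimum date, negative amount, empty string
               (9003, 2,    GETDATE(),    99999999.99, REPLICATE(N'x', 4000));  -- current date, near-max amount, very long text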

  • We have a small set of data in our DEV environment, but we only use this area to verify that the code does not fail. Then, to really test performance and logic, we move it to an INT environment that contains full production data. I agree that you need full production data to truly test your code before moving to production.

    -------------------------------------------------------------
    we travel not to escape life but for life not to escape us
    Don't fear failure, fear regret.

  • I still haven't seen a methodology that is more thorough and efficient than the one I followed as a noob mainframe programmer 20 years ago at a large insurance company. Basically, the routine is: unit and system/integration testing uses a sampled subset of your production data, while pre-production testing uses all of it (a sketch of the sampling step follows at the end of this post). Usually that ideal can be achieved.

    Today, though, on the most massive systems I would sample and make a respectable effort to maintain a decent regression test plan. Management has to be convinced of its value, though. Maybe a few more flash trading crashes on Wall Street will get the message through to a new generation...
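
    For the sampled subset, something along these lines is usually enough on SQL Server (the database and table names here are hypothetical, and TABLESAMPLE is approximate rather than row-exact):

        -- Pull roughly 5 percent of production rows into a development copy.
        SELECT *
        INTO   DevCopy.dbo.Claims
        FROM   Prod.dbo.Claims TABLESAMPLE (5 PERCENT);

    TABLESAMPLE picks whole pages rather than individual rows, so heavily skewed data may call for a smarter sampling query.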

  • My current environment is a medium-sized database for a lightly-customized commercial application. We use full production data for development and testing.

    In my days at an ISV, I dealt with obfuscation and pared-down datasets. We tried to replicate customer environments as closely as possible, but this can be tricky when your customers have deeper pockets than you. Sometimes the most important part of a rollout was to remind the install team to verify their customers had current backups.

    I'll echo many opinions here. The key is to run final tests in an environment as close to production as possible.

  • All my systems are medical records, which also contain personal data like SSNs and birth dates: basically everything you need to commit identity theft.

    So I can't use production data for testing for more than 90 days.

    What I've been looking for is something similar to a ROT-13 SQL function to scramble the data, but I haven't been able to find one (a rough hand-rolled sketch follows below).

    So the problem is that the test data set is always small and nowhere near representative of production data, and tuning the new sprocs and production software has always ended up being an emergency.



    ----------------
    Jim P.

    A little bit of this and a little byte of that can cause bloatware.
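
    A rough hand-rolled sketch of the ROT-13 idea mentioned above, with the usual caveat that ROT-13 is trivially reversible, so it hides data from casual eyes but is not real anonymisation (the patient table in the usage line is hypothetical):

        CREATE FUNCTION dbo.ufn_Rot13 (@input varchar(8000))
        RETURNS varchar(8000)
        AS
        BEGIN
            DECLARE @out varchar(8000) = '', @i int = 1, @c int;
            WHILE @i <= LEN(@input)
            BEGIN
                SET @c = ASCII(SUBSTRING(@input, @i, 1));
                SET @out = @out +
                    CASE WHEN @c BETWEEN 65 AND 90  THEN CHAR(((@c - 65 + 13) % 26) + 65)  -- A-Z
                         WHEN @c BETWEEN 97 AND 122 THEN CHAR(((@c - 97 + 13) % 26) + 97)  -- a-z
                         ELSE CHAR(@c)                                                     -- leave digits/punctuation alone
                    END;
                SET @i += 1;
            END;
            RETURN @out;
        END;
        GO

        -- Example usage against a hypothetical table in the test copy:
        UPDATE dbo.Patient SET LastName = dbo.ufn_Rot13(LastName);

    Being a scalar function, it will be slow over millions of rows, but for a periodic scramble of a test build that is usually tolerable.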

  • For very sensitive data I will test against subsets; otherwise I tend to test against the main data.

    cloudydatablog.net

  • Currently, I work in a shop with full access to production data - and an environment where queries exceeding 5 seconds are deliberate and exceptional (because I need a "quick" answer and figure it'll take me longer to fiddle with the code than to just run the lazy guy's out-of-the-box code with bad performance).

    And I love it! Bugs are rare - as in: bugs occur when I didn't understand the data correctly, or was misinformed as to what can actually "come through the gates". (We have some specific cases that happen only once a year. Somehow, front-line people tend to ignore these when interviewed about use and needs.)

    However, in the past I worked in another shop where the data (and the servers) were literally in another country. And I had NOT qualified for a premium communication line to that server park!!

    As a result, a query on production data could easily take 50 minutes... :blink:

    That was some years ago, and hardware keeps getting faster. But we are entering the Big Data era now.

    And yes, writing against full data is preferable, but I learned back then to love scripts and tools to generate data suitable for the different tasks: small, but diverse sets for the first prototypes; big, but controlled sets for the performance tests; big and corrupted sets for the error handling tests.

    The crucial thing about using artificial data in development work is the ability to generate a data set right where you need it, right as you need it - with a couple of clicks (a rough example follows below).
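
    As an illustration of the kind of generation script I mean (the target table and row counts are made up for the example), a numbers-table approach on SQL Server can produce a "big, but controlled" set in one statement:

        -- Hypothetical target table; generates one million rows with a controlled
        -- spread of customers, dates, and amounts for performance testing.
        ;WITH N AS (
            SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
            FROM sys.all_objects a CROSS JOIN sys.all_objects b
        )
        INSERT INTO dbo.SalesTest (SaleID, CustomerID, SaleDate, Amount)
        SELECT n,
               n % 5000 + 1,                                            -- ~5,000 distinct customers
               DATEADD(DAY, n % 730, '20120101'),                       -- spread over two years
               CAST(ABS(CHECKSUM(NEWID())) % 100000 / 100.0 AS decimal(10,2))
        FROM N;

    Corrupting a copy of that set afterwards (dropping rows, nulling columns, duplicating keys) gives the "big and corrupted" variant for the error-handling tests.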

  • crussell-931424 (8/25/2014)


    What's the compelling reason to use a subset of data? If it is no big deal to copy a recent backup into a test area, why not just do it? This works well for us. We have four test areas that we do this in, used for various purposes.

    If you are running thousands or even tens of thousands of automated tests as part of Continuous Integration, you want the tests to complete as quickly as possible. Obviously, performance tests take longer and, therefore, often form part of a different CI build.

    Furthermore, if you are suggesting a recent backup of production data, then in the EU that is a huge no-no. It is illegal, as the use of non-obfuscated data is not justifiable in standard use. Only diagnostics for an explicit, identified issue can allow the temporary and limited use of an individual's data. Of course, your data might not even be of that type 😉

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • crussell-931424 (8/25/2014)


    What's the compelling reason to use a subset of data? If it is no big deal to copy a recent backup into a test area, why not just do it? This works well for us. We have four test areas that we do this in, used for various purposes.

    The compelling reason is that it is a big deal for many places. If it's not, then sure, copy a backup of production. If it is, then that doesn't work.

  • I have gotten to the point that I am insisting on using production data for development.

    Where I have worked in the past few years (health care EMR and governmental work), the input applications (either legacy or vendor) do no real data validation.

    This means that I can find all sorts of weird data in my tables.

    Oh, the 30 week baby weighs 2400 lbs? Don't you mean 2400 grams?

    No, I have no control over the input programs. I am just writing queries and ETL packages in a DW.

    Our analysts added 'test' data to our test system, but I found that they entered the data correctly and without the 'artistic license' that our users employed.

    It is the incorrect use of a column that I need to be prepared for so my queries don't blow up on my users.

    Using production data allowed me to develop my scripts faster with fewer errors and with better performance.

    When your database has 1500 tables and you are having to join 10-20 tables, you really need to be using data that is currently in the system so you can make sure your code works for the data you have.

    No, this is not ideal. But, it is the only way we were able to function with the restrictions placed on us.

    Basically, if you are working with a very complicated data set (Electronic Medical Record systems are a perfect example) with minimal input validation, or a test environment that has only a minimal amount of 'best case' data, then developing off of production data is more than likely your best choice (see the defensive-check sketch after this post for the kind of thing I mean).
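
    For what it's worth, the kind of defensive check I end up writing looks like this sketch; the table and columns are hypothetical stand-ins for the baby-weight example above:

        -- Hypothetical EMR table; surfaces rows where the weight column has clearly
        -- been used incorrectly (e.g. pounds entered where grams were expected).
        SELECT PatientID, GestationalAgeWeeks, BirthWeight, WeightUnit
        FROM   dbo.BirthRecord
        WHERE  BirthWeight NOT BETWEEN 200 AND 7000   -- plausible range in grams
           OR  WeightUnit <> 'g';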

  • crussell-931424 (8/25/2014)


    What's the compelling reason to use a subset of data? If it is no big deal to copy a recent backup into a test area, why not just do it? This works well for us. We have four test areas that we do this in, used for various purposes.

    The first compelling reason may be that it's illegal to do that.

    If the data contains anything that counts as personal data - which has a fairly simple legal definition and, contrary to common myth, includes IP addresses in most circumstances - the compelling reason in some jurisdictions is that copying a backup makes the data visible to development and test people. Throughout the European Union that is contrary to law, and anywhere that processes data obtained from the European Union it is contrary to international treaty (e.g. the EU-USA Safe Harbor Agreement), which in the US makes it contrary to Federal law. There are exceptions when the people to whom that data is made available need it (all of it, not just bits of it) for the purpose of preventing serious crime or for another allowed purpose, but testing isn't an allowed purpose. Most so-called anonymisation techniques are in practice not really anonymising, so anonymised data is often still legally personal data. The USA is already on the verge of being told that its financial and banking businesses and its federal authorities will no longer be given access to any EU financial and banking information, and that the main SWIFT processors will be removed from US hosts and relocated in the EU, because Safe Harbor is being breached far too often (by the Feds as well as by private industry); it seems likely that enforcement will become serious, since that outcome would be highly undesirable for the USA.

    An equally compelling reason is that if one does all testing on a complete set of production data, the testing may be very slow - a lot of testing can be done on a small dataset. Another is that if one uses only production data, one can do none of (i) adequate stress testing, (ii) adequate performance testing, and (iii) testing new features that need additional tables or attributes, because production data is not suitable for any of those.

    If you can find a decent anonymisation function for your data so that it ceases to be personal data (many have tried and failed), you can use an anonymised copy of production data - but it must NOT be possible to de-anonymise it by any method, including by looking for patterns that match other data you have access to or are likely to come into possession of, if it's covered by EU law or by the Safe Harbor agreement or similar regulation. But testing only against that will not be adequate testing, since it fails performance testing, stress testing, and new feature validation. Or, if your data doesn't include anything covered by Safe Harbor, you can do what you like with it (subject of course to USA law that might make trouble for you if you provide it to an employee and he or she misuses it).

    Tom

