If your organisation holds personal data about real people or commerce, it is wrong, and in many cases illegal, to do database development work or testing using your production data. This probably applies to only a minority of database systems, but data breaches caused by attacks on backups or copies of production data within development are increasing.
Within the industry, two solutions have been suggested to meet the obvious need to do aspects of development and testing on data that is as near to production data as possible in quantity, distribution, nature and appearance. These solutions are masking and simulation. Both techniques are useful for other purposes in development, but neither really fits the bill for performance testing as part of deployment.
The problem with masking, also referred to as obfuscation, is that you have two choices: mask lightly so that the data stays close to its current distribution, or mask effectively and lose your distribution. A ‘light’ masking is crackable, in much the same way that encryption is cracked, and there have been successful breaches in consequence. Effective masking moves the data so far from its original distribution that it becomes pointless for this purpose. After all, the reason for wanting production data is to test queries and processes against data with the production distribution. Dynamic data masking, by the way, serves a very different purpose and remains an effective way of doing aspects of application UAT.
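To make the trade-off concrete, here is a minimal Python sketch on invented data. The names `light_mask`, `heavy_mask` and `frequency_attack` are hypothetical and do not come from any masking tool, and the city column is made up. It assumes that ‘light’ masking means a consistent substitution of each value, and that the attacker knows the rough ranking of the real values from public sources.

```python
import random
from collections import Counter

def light_mask(values):
    """Consistent substitution: each distinct value always maps to the same token."""
    tokens = {}
    masked = []
    for v in values:
        if v not in tokens:
            tokens[v] = f"city_{len(tokens)}"
        masked.append(tokens[v])
    return masked

def heavy_mask(values, rng):
    """Replace every value with a uniform random draw from the distinct values."""
    domain = sorted(set(values))
    return [rng.choice(domain) for _ in values]

def frequency_attack(masked, public_ranking):
    """Guess the mapping by pairing tokens and real values by frequency rank."""
    ranked_tokens = [token for token, _ in Counter(masked).most_common()]
    return dict(zip(ranked_tokens, public_ranking))

rng = random.Random(42)

# Invented, heavily skewed column: most customers are in one city.
cities = ["London"] * 900 + ["Leeds"] * 90 + ["Ely"] * 10
rng.shuffle(cities)

light = light_mask(cities)
heavy = heavy_mask(cities, rng)

# Light masking keeps the skew, so the optimizer still sees a realistic column...
print(sorted(Counter(light).values()))  # [10, 90, 900]

# ...but the same skew lets an attacker undo the mask by rank alone.
guess = frequency_attack(light, ["London", "Leeds", "Ely"])
recovered = [guess[token] for token in light]
print(recovered == cities)  # True

# Heavy masking resists that attack, but the skew has gone.
print(sorted(Counter(heavy).values()))
```

The skew that makes the lightly masked column useful for testing is the same property the attack relies on, and the uniform replacement removes both at once.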
The problem with simulating database data is that it is currently done for a different purpose. It can fill a bunch of related tables with great aplomb, but can’t simulate the way the world works, and so can’t get close to the distribution and nature of production data. Take, for example, a company hierarchy. If this includes dates and times, the simulation would need to reproduce typical careers and patterns of promotion and recruitment, and to copy accurately the churn of employees, which may differ at different levels of the hierarchy. To take another example, the pattern of website purchases varies widely through weekly, monthly and yearly cycles. You’d need to simulate national holidays and sports events. From my own experience I can tell you that the convincing simulation of website traffic or phone-call information is as difficult as predicting the future. Even advertising breaks within popular television programs can cause a big blip.
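As a rough illustration of what simulating those purchase cycles involves, the following Python sketch generates daily order counts for one year. Every figure in it is invented: the base volume, the weekday factors, the yearly swing and the two holiday effects are assumptions for the sake of the example, and `expected_orders` and `simulate_year` are hypothetical names.

```python
import math
import random
from datetime import date, timedelta

# All figures are invented for illustration; real ones would have to be measured.
BASE_ORDERS = 1000
WEEKDAY_FACTOR = [0.9, 0.95, 1.0, 1.0, 1.1, 1.4, 1.3]  # Monday..Sunday
SPECIAL_DAYS = {
    date(2023, 11, 24): 2.5,  # hypothetical sales-event spike
    date(2023, 12, 25): 0.3,  # hypothetical holiday slump
}

def expected_orders(day):
    """Combine a yearly cycle, a weekly cycle and one-off calendar effects."""
    day_of_year = day.timetuple().tm_yday - 1
    yearly = 1.0 + 0.3 * math.sin(2 * math.pi * day_of_year / 365)
    weekly = WEEKDAY_FACTOR[day.weekday()]
    special = SPECIAL_DAYS.get(day, 1.0)
    return BASE_ORDERS * yearly * weekly * special

def simulate_year(year, rng):
    """Return a dict of date -> simulated order count, with 5% Gaussian noise."""
    orders = {}
    day = date(year, 1, 1)
    while day.year == year:
        noise = rng.gauss(1.0, 0.05)
        orders[day] = max(0, round(expected_orders(day) * noise))
        day += timedelta(days=1)
    return orders

orders = simulate_year(2023, random.Random(1))
print(len(orders), "days simulated")
print("busiest day:", max(orders, key=orders.get))
```

Each effect here had to be known in advance and coded by hand. A sports final, a television advertising break or a change in the weather would each need its own term, which is why a simulation like this tends to look plausible without matching production.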
We should, I believe, grasp the nettle. The facts seem to be that neither masking nor simulation will help much. Why do we need production data? If it is to check query performance as part of a unit test, then use the statistics from the production system, disable their refresh, and run your query on simulated data. It works well enough for me. If we need it to learn how to handle lots of data, use a real open-data database. If you really, really need to do a batch of performance tests as part of deployment, then run these tests in staging under conditions agreed with your auditor as being compliant and secure. We need to think laterally.