September 27, 2006 at 3:19 pm
I am looking for a query that will give me a 'sample' of data from each of several large tables.
Each table has 3-7 columns that I want to be distinct (so I get all the combinations of values in those columns that exist in the actual data), and I also want the associated 3-15 columns that don't have to be distinct; I just want whatever is available in the rows picked by the first query.
For example:
ID  CO#*  REASON*  GENDER*  PROD TYPE*  NAME  AMT  CITY    ST*
 1  2     14       M        BLUE        JOHN  10   CINCY   OH
 2  3     23       F        BLUE        JANE  12   DAYTON  OH
 3  2     14       M        BLUE        DAVE  14   CINCY   OH
I would like to get all the distinct values on the asterisked (*) columns, plus the other columns that don't need to be distinct. If I picked just one sample of each distinct group, I would get records 1 and 2; record 3 would be skipped because, on the asterisked columns, it isn't distinct from record 1. I would also like to be able to pick the number of instances I want to see; i.e., if I wanted 2 samples of each distinct group, then I would get all three rows in my example.
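Something along these lines is what I'm picturing (just a sketch, not working code I have; the table and column names are made up to match the example above, and it assumes SQL Server 2005's ROW_NUMBER() is available):

-- SAMPLE_SRC and its column names are hypothetical, matching the example.
-- ROW_NUMBER() restarts at 1 within each distinct combination of the
-- asterisked columns; keep the first N rows of each group.
SELECT ID, CO_NO, REASON, GENDER, PROD_TYPE, NAME, AMT, CITY, ST
FROM (
    SELECT ID, CO_NO, REASON, GENDER, PROD_TYPE, NAME, AMT, CITY, ST,
           ROW_NUMBER() OVER (
               PARTITION BY CO_NO, REASON, GENDER, PROD_TYPE, ST  -- the asterisked columns
               ORDER BY ID
           ) AS rn
    FROM SAMPLE_SRC
) AS numbered
WHERE rn <= 2;  -- 2 = number of samples I want from each distinct group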
Hope that is clear.
Thanks!
Dave
October 2, 2006 at 8:00 am
This was removed by the editor as SPAM
October 5, 2006 at 3:43 pm
Probably because this doesn't have anything to do with business intelligence... and it is a problem/issue/situation that everyone has, and there is usually no solution short of just putting it all together to fit your needs. I worked for Boeing for five years and for a long time was on NASA's Space Station Freedom project, where I led a "test data" team of five software engineers. There are some automated tools on the market, and typically they aren't worth the trouble. There is no easy way to write such a tool that will fit everyone's needs. You just have to do the analysis and make your data as required.
Thank-you,
David Russell
Any Cloud, Any Database, Oracle since 1982
October 5, 2006 at 5:20 pm
Looks like an opportunity.
If everyone has the problem, then everyone would be interested in a generic solution.
I'm not a SQL guru, but I understand that you can get a list of tables and a list of columns from the system tables with a 'generic' query. Something like:
SELECT
    o.name        AS TABLE_NAME,
    c.[object_id],
    c.name        AS COLUMN_NAME,
    c.[column_id],
    t.name        AS DATA_TYPE,
    c.max_length  AS CHARACTER_MAXIMUM_LENGTH,
    m.[text]      AS COLUMN_DEFAULT,
    c.is_nullable AS [Is_Nullable]
FROM sys.all_columns AS c
INNER JOIN sys.systypes AS t
    ON c.system_type_id = t.xtype
LEFT JOIN sys.all_objects AS o
    ON c.[object_id] = o.[object_id]
LEFT JOIN sys.syscomments AS m
    ON c.default_object_id = m.id
WHERE c.[object_id] IN
    (SELECT [object_id] FROM sys.all_objects WHERE type = 'U')
  AND t.name <> 'sysname'
ORDER BY o.[object_id];
Use that to create a query that will pick unique values in columns (one instance of each) and create a testing table that is much smaller than the actual table but still has the variation of the actual table. Now you have a testing table... Perhaps someone could put together a query like that (a rough sketch of the idea follows); I, and many others, would be interested, since this is something all of us are doing.
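Here is roughly what the generated statement might look like for one table. This is only a sketch, not a finished tool: @TableName and @KeyCols are hypothetical inputs that, in a generic version, would come from the catalog query above plus some rule for deciding which columns are the "distinct" keys.

-- Build a small test table holding one row per distinct combination
-- of the key columns. The test table also picks up the helper rn column.
DECLARE @TableName sysname,
        @KeyCols   nvarchar(4000),
        @Sql       nvarchar(4000);

SET @TableName = N'SAMPLE_SRC';                      -- hypothetical source table
SET @KeyCols   = N'CO_NO, REASON, GENDER, PROD_TYPE, ST';  -- hypothetical keys

SET @Sql = N'SELECT * INTO ' + QUOTENAME(@TableName + N'_TEST') + N'
FROM (SELECT *,
             ROW_NUMBER() OVER (PARTITION BY ' + @KeyCols + N'
                                ORDER BY (SELECT NULL)) AS rn
      FROM ' + QUOTENAME(@TableName) + N') AS t
WHERE rn = 1;';  -- 1 instance of each distinct combination

EXEC sp_executesql @Sql;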
I guess I look at the source data as the key to understanding just what BI value it can bring. Not knowing the amount and type of work needed to get the data conformed bites all BI projects and causes some of them to fail. Not being able to test the data rules for cleansing, transformation, and standardization is a sure way to keep making excuses for why the data isn't 'good enough', while each run causes users to point out one more error.
October 5, 2006 at 6:49 pm
So, that is a solution for you, but it is not generic... and not all databases have data in them to query; at some point in the life-cycle the database is created from data models, and an application is written to access the structures. In a complex database there is this thing called referential integrity, and it might stretch across 7, 10, 20 or more tables, requiring data to be created with the referencing keys in place and in the proper order.
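To give a feel for it: before generating a single row you would need the dependency graph, so parent tables can be loaded before the children that reference them. A minimal sketch against the SQL Server 2005+ catalog views (the view names are standard; the query itself is just illustrative):

-- List every foreign-key edge: child table -> parent table.
-- Generated data must be inserted parents-first along this graph.
SELECT
    OBJECT_NAME(fk.parent_object_id)     AS referencing_table,  -- child
    OBJECT_NAME(fk.referenced_object_id) AS referenced_table,   -- parent: load first
    fk.name                              AS fk_name
FROM sys.foreign_keys AS fk
ORDER BY referenced_table, referencing_table;

Walking that graph in dependency order is exactly what makes seeding a large schema so painful by hand.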
Yes, it's an opportunity. There was a company that made a product called "Test Data Generator", sort of a generic name for a real product. I don't think they survived. We used the tool for quite a while, but the database we were dealing with had over 1,700 tables and over 10,000 attributes. Just writing the specs for the generator was a big task.
That's why Boeing had a team of five software engineers working on nothing but test data... and when I took over the team, they were close to 60 days behind. At least we eliminated the backlog and got it down to two days within two weeks... but no, it is not an easy task.
If you actually had a solution, it would pay you to become "a SQL guru".
Cheers!
Thank-you,
David Russell
Any Cloud, Any Database, Oracle since 1982
October 5, 2006 at 10:55 pm
Just throwing in a couple of cents... there seems to be a slew of products available (ranging from roughly $150 USD to $4,000 USD); try here, here, here and here.
Even our friends at M$ft have introduced 'data generation plans' in the VS2K5 for DB Professionals Edition (see DataGenerator, here).
I couldn't see an open source (i.e., preferably free) product out there, but then again I only ogled the first Google page returned.
Cheers,
Steve.
October 6, 2006 at 7:36 am
Thanks!