December 13, 2006 at 2:21 pm
i am compiling a data dictionary with the usual data in it. since new variable are added all the time so i may not have data for many variables say 5 years ago. or may be certain variables were not used for a give time period and have all nulls etc..
how would i go about to fetch that information? there is no other input source, just the data itself. luckily there is a variable called questid which is part of the key and contain numeric data representing a timeframe.
for example, here is one variable in question ... and i have 5000 more to look at
select rash,questid, count(*) from tableName
group by rash,questid
which give me this (abridged) result
NULL1013003.01
21013003.02
01013003.012
11013003.01
01013004.04
11013004.03
01013005.01
NULL19980201.03814
NULL19990101.07723
NULL19990201.012001
NULL20000101.011494
this variable was not used in any of the bottom 4 results i.e. all nulls , it was used for others because there is at least one non-null value
is there a better way of doing this? also i want to break down and only know the questid for a giver var if it was used or not used ..... just questid would be enough
timeframe moves forward in time and i guess it would be possible to find the first timeframe with at least one non-null value so i can assume the first time any give variable was introduced/used ?
thanks and sorry for sounding complicated.
any help is welcome!
December 13, 2006 at 7:45 pm
we need more info here. what is the schema of your table. what do you mean when you say variable?
December 13, 2006 at 9:20 pm
sorry if i was not clear enough.
there is more than one table , about 20 with average number of colums in each tabel about 300
each table have same PK called PtKey(varchar 12) and Questid (float)
variable means column in a given table, sorry i am conventional programmer.
i have a working data dictionary which has usual information about each column i.e.
columnName, TableName, DataType, MinValue , MaxValue, Constraint etc..
the columns in this database numbered about 1000 10 years ago, now much more than that.
there is no paper trail to tell when exactly any given column was added to the database.
data comes from a research survey taken every six months, so each Questid represent that timeframe
e.g. 19990101 roughly means first quarter of year 1999
there are generic questid's representing one time data collection e.g. 1011
some questions are not asked in a given timeframe i.e. questid so all the data would be null for that timeframe for those columns.
my task is to look at the data for each column and 'decide' which phase/timeframe/questid it was used -- True or False for each questid, also the first occurance of any data that would dictate the first time use of that column e.g. column HMO was first used in 19990101 and after that it was used in (19990201,20010101,20050201 ...)
there is one more step after that which makes it little more complex but i should not go there yet i guess.
my thought process is to group by each column with questid and look at the data, if there is not a single non null value for a given questid i can say it was not used in that timeframe/questid ?
but i am not adept at tweaking it further to get where i want to be.
thanks for any help, it seems most people dont have any answer for me
December 14, 2006 at 9:09 am
Hello,
i think it would really help if you can post the table DDL (CREATE TABLE .... ) for at least one table as an example.. with a few rows of sample data (INSERT INTO ....). Then we would know what are the data types and what are the values you have to work with.
So far it seems that the database design is not good - as I understand it, there is one column for each question that was ever asked (to store what who replied to this question). It would be a lot easier if questions are stored in another table and in this table you'd have only 2 columns - questionID and response - instead of separate column for each question. (I'm only talking about "question columns", but I suppose there will be some other as well).
Generally speaking, you have to name each column separately to find out when it was used. Something like this could help:
SELECT LEFT(questid,8), COUNT(col1), COUNT(col2),COUNT(col3).....
FROM yourtable
GROUP BY LEFT(questid,8)
COUNT(<column name> ) counts the number of non-null values in the column, contrary to COUNT(*) which counts all rows, inclusive those with NULLs. If there are only NULL values in a certain period, respective column will contain 0.
You can add some sorting and other nice things, or insert the result into a table and then play with it as required... hope this will help you!
December 14, 2006 at 11:42 am
thanks for your input, it did help of course.
U are right in regards to database design, it was not designed but built on need basis long time ago. there are some strings as to how much i can change it but i sure will pick some tips from you.
what u suggesting is, correct me if i am wrong, to build two (at least) tables. one table should only have questions listed , something like this
Label/Question assingedName/var DatedAdded DataType Values .....
your gender? sex 12/1/2006 number 1 Male 0 Female
etc .....
The other table would have this information
Questid assignedName/var Answer DateRecorded
1011 sex 1 12/12/2006
Another table can have def. of all the questid's
hmmmm, first thing that strikes me is the sheer length of this table e.g. 10,000 people fill out a survey that have about 1500 answers , so that would be 15000000 rows ever six months or so, does that not sound too big to you?
i like the design but it seems i would end up with millions of rows in a single table in just couple of years.
Let me know of any input. Thanks
December 14, 2006 at 11:49 am
I don't see why this would be a problem. This site alone sends over 500 000 e-mails / day (ok the table is often purged but still). And it's nothing compared to those terabytes databases.
Do you mind telling us what you programming / sql background is?
December 14, 2006 at 12:14 pm
you are right that its not HUGE, but as you know its a relative term. I can not purge the database ever or even take the 'old' data offline/archive.
database is accessed daily in working hours by up to 50 users. Most if not all need to look at the data, whole data at times, for research,QA,Validation etc. , there are front ends to the database to control the flow of the data but some people need just raw access to it. so it may generate lot of network traffic if i were to have a table with 100 million rows?
Oh i just have masters in computer science, started programming in Fortran a while back now i use VB,C,C# , i can read SQL and write it too but not at the higher level, may be i should spend some time on it. But i spend more time on other issues , more into administration than coding.
December 14, 2006 at 12:45 pm
Well the thread title is very fitting then. You are in way over your head. Probabely much more than you can imagine. You need an experienced DBA on this project. Not just help on a few questions in and there. This forum can help you limp along on this project, but I strongly feel you need much more than that!
December 15, 2006 at 1:26 am
I absolutely agree with that... it is the same as if I would start to design a commercial application in VB based on my knowledge of Basic from ZX-81 and Atari. I can understand great part of the VB code I read, but that's by far not enough to write any useful code.
Similarly, you need a database specialist with SQL Server knowledge if you want to think about redesigning a database. 100M rows is a lot, but it isn't anything to be afraid of, and there are many ways to handle it. If done properly, it will be much better than your current situation - both in terms of performance and maintenance.
As to your original question, I think the code I posted before should do the job - at least if it is something you need to do once. If you need to keep this summary info (which question was used in what period) up-to-date, you can store the history in a permanent table and have stored procedure started by SQL Agent every so often to refresh data for the actual period. But even for that you need someone who knows how to write SQL properly - we can give you hints and show best practices etc., but we can't write the code. That has to be done by someone on-site, who knows all requirements, data structure and who knows SQL.
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply