April 21, 2016 at 4:37 pm
I have several tables in a database that are not even close to being normalized but that's the way an application we have stores the data. I'm trying to get it into a normalized form so that I can join it to other tables.
Here's an example of its structure:
PersonID, Degree, Ethnicity, Hobby
100, BS, Caucasian, Basketball
100, MS, Hispanic, Boxing
100, PhD, NULL, Baseball
100, NULL, NULL, Tennis
100, NULL, NULL, Soccer
On a record by record level, the information isn't related. It's like each PersonID + Column pair individually is related and should be its own table (where the column is not null). For example:
SELECT PersonID, Degree
FROM table
WHERE Degree IS NOT NULL -- would return three records (as it should)
SELECT PersonID, Ethnicity
FROM table
WHERE Ethnicity IS NOT NULL -- two records
SELECT PersonID, Hobby
FROM table
WHERE Hobby IS NOT NULL -- five records
I want to be able to write queries against any one of these derived tables (in a normalized form) and join them to, as I said before, other tables from different data sources. Since I can't change the actual structure of the table, you might think views are the way to go. I could create views, but if I have 10 fields in this table and 20 tables in this form, that's 10 fields x 20 tables = 200 views. And I can tell you I definitely have more than 20 tables like this.
I thought a generic table-valued function might be the ticket here, but I haven't yet been able to get anything to work. Does anyone have any thoughts on how I might accomplish this?
Thank you,
Mike
Here's T-SQL to create my current situation:
CREATE TABLE [dbo].
(
[PersonID] [int] NOT NULL,
[Degree] [varchar](50) NULL,
[Ethnicity] [varchar](50) NULL,
[Hobby] [varchar](50) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
INSERT [dbo].
([PersonID], [Degree], [Ethnicity], [Hobby]) VALUES (100, N'BS', N'Caucasian', N'Basketball')
GO
INSERT [dbo].
([PersonID], [Degree], [Ethnicity], [Hobby]) VALUES (100, N'MS', N'Hispanic', N'Boxing')
GO
INSERT [dbo].
([PersonID], [Degree], [Ethnicity], [Hobby]) VALUES (100, N'PhD', NULL, N'Baseball')
GO
INSERT [dbo].
([PersonID], [Degree], [Ethnicity], [Hobby]) VALUES (100, NULL, NULL, N'Tennis')
GO
INSERT [dbo].
([PersonID], [Degree], [Ethnicity], [Hobby]) VALUES (100, NULL, NULL, N'Soccer')
GO
Mike Scalise, PMP
https://www.michaelscalise.com
April 21, 2016 at 11:49 pm
Thanks for posting the script.
You may not need many views from the same table.
Will a view like this work for you?
CREATE VIEW dbo.PersonProfile
AS
SELECT t.PersonID, 'Degree', t.Degree
FROM dbo.
t
WHERE t.Degree > ''
UNION
SELECT t.PersonID, 'Ethnicity', t.Ethnicity
FROM dbo.
t
WHERE t.Ethnicity > ''
UNION
SELECT t.PersonID, 'Hobby', t.Hobby
FROM dbo.
t
WHERE t.Hobby > ''
_____________
Code for TallyGenerator
April 22, 2016 at 6:03 am
Thanks, Sergiy.
That would reduce the number of views I'd have to create, but I think it'd be a similar amount of typing in anything that joins to the view. For example, as it stands, I would probably query like this
table1 = normalized table from another data source
table2 = the table I'm trying to exclude nulls from
view1 = the view with several tables UNIONed, NULLs excluded, and item names in their own column
SELECT *
FROM table1 a
INNER JOIN (SELECT PersonID, Degree FROM table2 WHERE Degree > '') b
ON a.PersonID = b.PersonID
Even if I created the views in the way that you're suggesting, I'd end up with a query that looks like this:
SELECT *
FROM table1 a
INNER JOIN (SELECT PersonID, Degree FROM view1 WHERE item = 'Degree') b
ON a.PersonID = b.PersonID
It's like I'm just substituting the removal of NULLs in the WHERE clause for the specifying the name of the "item" in the WHERE clause.
I have tried something like this table-valued function, where I pass the table that's not in the right form, the name of the ID column I plan to join on, and the field whose non-NULL values I want to return:
SELECT *
FROM table1 a
INNER JOIN tbl_val_fn('table2', 'PersonID', 'Degree') b
ON a.PersonID = b.PersonID
The function would look like:
SELECT @IDField, @ItemField
FROM @TableName
WHERE @ItemField IS NOT NULL
Any thoughts on this and if I'm even heading down a feasible path?
Thanks,
Mike
Mike Scalise, PMP
https://www.michaelscalise.com
April 22, 2016 at 2:01 pm
will the following satisfy your need for simplicity?
create view [table_normalized]
as
select t1.PersonID
, t1.col_name
, t1.col_value
from ( select t1.PersonID
, t.colnumber
, case
when t.colnumber = 1 then 'Degree'
when t.colnumber = 2 then 'ethnicity'
when t.colnumber = 3 then 'Hobby'
end as col_name
, case
when t.colnumber = 1 then t1.Degree
when t.colnumber = 2 then t1.Ethnicity
when t.colnumber = 3 then t1.Hobby
end as col_value
from dbo.
t1
cross apply (select top 3 row_number() over (order by num) as colnumber
from (values (1)) t (num)
group by cube (num, num, num) -- expand and change the top 3 to allow for the number or columns you have
-- from (values (1),(1),(1),(1),(1),(1)) t (num) -- use this instead of the from and group by if they fail on your version of Sql
) t
) t1
where t1.col_value is not null
and then use it as
SELECT *
FROM table1 a
INNER JOIN table_normalized b
ON b.col_name = 'Degree'
and a.PersonID = b.PersonID
April 23, 2016 at 5:45 am
Frederico,
That's an interesting approach. Am I correct in assuming that I'd need to create a case statement containing every possibility for field names across all of the tables I have that are in the funky format? There are probably hundreds.
Mike
Mike Scalise, PMP
https://www.michaelscalise.com
April 23, 2016 at 1:58 pm
yes that would be the case - but easy enough to build the case statements in Excel for example - this is something I had to do on the last projects I worked on the last 3 years.
One view per table.
Note that you may eventually consider "materializing" the views for performance reasons - but probably only feasible if they aren't updated frequently.
May 19, 2016 at 8:45 am
frederico_fonseca (4/22/2016)
will the following satisfy your need for simplicity?
create view [table_normalized]
as
select t1.PersonID
, t1.col_name
, t1.col_value
from ( select t1.PersonID
, t.colnumber
, case
when t.colnumber = 1 then 'Degree'
when t.colnumber = 2 then 'ethnicity'
when t.colnumber = 3 then 'Hobby'
end as col_name
, case
when t.colnumber = 1 then t1.Degree
when t.colnumber = 2 then t1.Ethnicity
when t.colnumber = 3 then t1.Hobby
end as col_value
from dbo.
t1
cross apply (select top 3 row_number() over (order by num) as colnumber
from (values (1)) t (num)
group by cube (num, num, num) -- expand and change the top 3 to allow for the number or columns you have
-- from (values (1),(1),(1),(1),(1),(1)) t (num) -- use this instead of the from and group by if they fail on your version of Sql
) t
) t1
where t1.col_value is not null
and then use it as
SELECT *
FROM table1 a
INNER JOIN table_normalized b
ON b.col_name = 'Degree'
and a.PersonID = b.PersonID
Hi Frederico,
I'm just curious--why did you use GROUP BY cube instead of GROUP BY rollup. Also, is it necessary to use "num" three times in "group by cube (num, num, num)". Couldn't you just list it once to get the same results? "group by cube (num)"
Thanks,
Mike
Mike Scalise, PMP
https://www.michaelscalise.com
May 19, 2016 at 12:16 pm
Hi Mike,
group by cube and group by rollup do not give same results. so cube is the correct one for this purpose.
as for why the 3 num and not just 1 - again this is how cube works.
following code should show you what I'm trying to do here - and this is just one of the many ways of generating a sequence of numbers on the fly. not necessarily the fastest but for small numbers difference with other methods is probably negligible. I did try and find a post comparing several methods but could not find one that also uses this method - maybe someone here on the forums has a link to such a compare post.
select '2 to the power of 1', count(*)
from (
select row_number() over (order by num) as colnumber
from (values (1)) t (num)
group by cube (num)
) t1
union all
select '2 to the power of 2', count(*)
from (
select row_number() over (order by num) as colnumber
from (values (1)) t (num)
group by cube (num, num)
) t2
union all
select '2 to the power of 3', count(*)
from (
select row_number() over (order by num) as colnumber
from (values (1)) t (num)
group by cube (num, num, num)
) t3
union all
select '2 to the power of 4', count(*)
from (
select row_number() over (order by num) as colnumber
from (values (1)) t (num)
group by cube (num, num, num, num)
) t4
union all
select '2 to the power of 5', count(*)
from (
select row_number() over (order by num) as colnumber
from (values (1)) t (num)
group by cube (num, num, num, num, num)
) t5
union all
select '2 to the power of 6', count(*)
from (
select row_number() over (order by num) as colnumber
from (values (1)) t (num)
group by cube (num, num, num, num, num, num)
) t5
and you can also try the inner selects on their own to see the numbers being generated
May 20, 2016 at 12:53 pm
Got it. Thanks so much!
Mike Scalise, PMP
https://www.michaelscalise.com
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply