Eliminating NULLs when selecting from potentially several tables

Question

Post reply

Eliminating NULLs when selecting from potentially several tables

Mike Scalise

SSCrazy Eights

Points: 8145
More actions
April 21, 2016 at 4:37 pm

#306662

I have several tables in a database that are not even close to being normalized but that's the way an application we have stores the data. I'm trying to get it into a normalized form so that I can join it to other tables.
Here's an example of its structure:
PersonID, Degree, Ethnicity, Hobby
100, BS, Caucasian, Basketball
100, MS, Hispanic, Boxing
100, PhD, NULL, Baseball
100, NULL, NULL, Tennis
100, NULL, NULL, Soccer
On a record by record level, the information isn't related. It's like each PersonID + Column pair individually is related and should be its own table (where the column is not null). For example:
SELECT PersonID, Degree
FROM table
WHERE Degree IS NOT NULL -- would return three records (as it should)
SELECT PersonID, Ethnicity
FROM table
WHERE Ethnicity IS NOT NULL -- two records
SELECT PersonID, Hobby
FROM table
WHERE Hobby IS NOT NULL -- five records
I want to be able to write queries against any one of these derived tables (in a normalized form) and join them to, as I said before, other tables from different data sources. Since I can't change the actual structure of the table, you might think views are the way to go. I could create views, but if I have 10 fields in this table and 20 tables in this form, that's 10 fields x 20 tables = 200 views. And I can tell you I definitely have more than 20 tables like this.
I thought a generic table-valued function might be the ticket here, but I haven't yet been able to get anything to work. Does anyone have any thoughts on how I might accomplish this?
Thank you,
Mike
Here's T-SQL to create my current situation:
CREATE TABLE [dbo].
(
[PersonID] [int] NOT NULL,
[Degree] [varchar](50) NULL,
[Ethnicity] [varchar](50) NULL,
[Hobby] [varchar](50) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
INSERT [dbo].
([PersonID], [Degree], [Ethnicity], [Hobby]) VALUES (100, N'BS', N'Caucasian', N'Basketball')
GO
INSERT [dbo].
([PersonID], [Degree], [Ethnicity], [Hobby]) VALUES (100, N'MS', N'Hispanic', N'Boxing')
GO
INSERT [dbo].
([PersonID], [Degree], [Ethnicity], [Hobby]) VALUES (100, N'PhD', NULL, N'Baseball')
GO
INSERT [dbo].
([PersonID], [Degree], [Ethnicity], [Hobby]) VALUES (100, NULL, NULL, N'Tennis')
GO
INSERT [dbo].
([PersonID], [Degree], [Ethnicity], [Hobby]) VALUES (100, NULL, NULL, N'Soccer')
GO
Mike Scalise, PMP
https://www.michaelscalise.com

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

Sergiy SSC Guru Points: 110208 More actions · Answer 1

Thanks for posting the script.

You may not need many views from the same table.

Will a view like this work for you?

CREATE VIEW dbo.PersonProfile

AS

SELECT t.PersonID, 'Degree', t.Degree

FROM dbo.

t

WHERE t.Degree > ''

UNION

SELECT t.PersonID, 'Ethnicity', t.Ethnicity

FROM dbo.

t

WHERE t.Ethnicity > ''

UNION

SELECT t.PersonID, 'Hobby', t.Hobby

FROM dbo.

t

WHERE t.Hobby > ''

_____________
Code for TallyGenerator

Mike Scalise SSCrazy Eights Points: 8145 More actions · Answer 2

Thanks, Sergiy.

That would reduce the number of views I'd have to create, but I think it'd be a similar amount of typing in anything that joins to the view. For example, as it stands, I would probably query like this

table1 = normalized table from another data source

table2 = the table I'm trying to exclude nulls from

view1 = the view with several tables UNIONed, NULLs excluded, and item names in their own column

SELECT *

FROM table1 a

INNER JOIN (SELECT PersonID, Degree FROM table2 WHERE Degree > '') b

ON a.PersonID = b.PersonID

Even if I created the views in the way that you're suggesting, I'd end up with a query that looks like this:

SELECT *

FROM table1 a

INNER JOIN (SELECT PersonID, Degree FROM view1 WHERE item = 'Degree') b

ON a.PersonID = b.PersonID

It's like I'm just substituting the removal of NULLs in the WHERE clause for the specifying the name of the "item" in the WHERE clause.

I have tried something like this table-valued function, where I pass the table that's not in the right form, the name of the ID column I plan to join on, and the field whose non-NULL values I want to return:

SELECT *

FROM table1 a

INNER JOIN tbl_val_fn('table2', 'PersonID', 'Degree') b

ON a.PersonID = b.PersonID

The function would look like:

SELECT @IDField, @ItemField

FROM @TableName

WHERE @ItemField IS NOT NULL

Any thoughts on this and if I'm even heading down a feasible path?

Thanks,

Mike

Mike Scalise, PMP
https://www.michaelscalise.com

frederico_fonseca SSCoach Points: 16167 More actions · Answer 3

will the following satisfy your need for simplicity?

create view [table_normalized]

as

select t1.PersonID

, t1.col_name

, t1.col_value

from ( select t1.PersonID

, t.colnumber

, case

when t.colnumber = 1 then 'Degree'

when t.colnumber = 2 then 'ethnicity'

when t.colnumber = 3 then 'Hobby'

end as col_name

, case

when t.colnumber = 1 then t1.Degree

when t.colnumber = 2 then t1.Ethnicity

when t.colnumber = 3 then t1.Hobby

end as col_value

from dbo.

t1

cross apply (select top 3 row_number() over (order by num) as colnumber

from (values (1)) t (num)

group by cube (num, num, num) -- expand and change the top 3 to allow for the number or columns you have

-- from (values (1),(1),(1),(1),(1),(1)) t (num) -- use this instead of the from and group by if they fail on your version of Sql

) t

) t1

where t1.col_value is not null

and then use it as

SELECT *

FROM table1 a

INNER JOIN table_normalized b

ON b.col_name = 'Degree'

and a.PersonID = b.PersonID

Mike Scalise SSCrazy Eights Points: 8145 More actions · Answer 4

Frederico,

That's an interesting approach. Am I correct in assuming that I'd need to create a case statement containing every possibility for field names across all of the tables I have that are in the funky format? There are probably hundreds.

Mike

Mike Scalise, PMP
https://www.michaelscalise.com

frederico_fonseca SSCoach Points: 16167 More actions · Answer 5

yes that would be the case - but easy enough to build the case statements in Excel for example - this is something I had to do on the last projects I worked on the last 3 years.

One view per table.

Note that you may eventually consider "materializing" the views for performance reasons - but probably only feasible if they aren't updated frequently.

Mike Scalise SSCrazy Eights Points: 8145 More actions · Answer 6

frederico_fonseca (4/22/2016)
will the following satisfy your need for simplicity?
create view [table_normalized]
as
select t1.PersonID
, t1.col_name
, t1.col_value
from ( select t1.PersonID
, t.colnumber
, case
when t.colnumber = 1 then 'Degree'
when t.colnumber = 2 then 'ethnicity'
when t.colnumber = 3 then 'Hobby'
end as col_name
, case
when t.colnumber = 1 then t1.Degree
when t.colnumber = 2 then t1.Ethnicity
when t.colnumber = 3 then t1.Hobby
end as col_value
from dbo.
t1
cross apply (select top 3 row_number() over (order by num) as colnumber
from (values (1)) t (num)
group by cube (num, num, num) -- expand and change the top 3 to allow for the number or columns you have
-- from (values (1),(1),(1),(1),(1),(1)) t (num) -- use this instead of the from and group by if they fail on your version of Sql
) t
) t1
where t1.col_value is not null
and then use it as
SELECT *
FROM table1 a
INNER JOIN table_normalized b
ON b.col_name = 'Degree'
and a.PersonID = b.PersonID

Hi Frederico,

I'm just curious--why did you use GROUP BY cube instead of GROUP BY rollup. Also, is it necessary to use "num" three times in "group by cube (num, num, num)". Couldn't you just list it once to get the same results? "group by cube (num)"

Thanks,

Mike

Mike Scalise, PMP
https://www.michaelscalise.com

frederico_fonseca SSCoach Points: 16167 More actions · Answer 7

Hi Mike,

group by cube and group by rollup do not give same results. so cube is the correct one for this purpose.

as for why the 3 num and not just 1 - again this is how cube works.

following code should show you what I'm trying to do here - and this is just one of the many ways of generating a sequence of numbers on the fly. not necessarily the fastest but for small numbers difference with other methods is probably negligible. I did try and find a post comparing several methods but could not find one that also uses this method - maybe someone here on the forums has a link to such a compare post.

select '2 to the power of 1', count(*)

from (

select row_number() over (order by num) as colnumber

from (values (1)) t (num)

group by cube (num)

) t1

union all

select '2 to the power of 2', count(*)

from (

select row_number() over (order by num) as colnumber

from (values (1)) t (num)

group by cube (num, num)

) t2

union all

select '2 to the power of 3', count(*)

from (

select row_number() over (order by num) as colnumber

from (values (1)) t (num)

group by cube (num, num, num)

) t3

union all

select '2 to the power of 4', count(*)

from (

select row_number() over (order by num) as colnumber

from (values (1)) t (num)

group by cube (num, num, num, num)

) t4

union all

select '2 to the power of 5', count(*)

from (

select row_number() over (order by num) as colnumber

from (values (1)) t (num)

group by cube (num, num, num, num, num)

) t5

union all

select '2 to the power of 6', count(*)

from (

select row_number() over (order by num) as colnumber

from (values (1)) t (num)

group by cube (num, num, num, num, num, num)

) t5

and you can also try the inner selects on their own to see the numbers being generated

Mike Scalise SSCrazy Eights Points: 8145 More actions · Answer 8

Got it. Thanks so much!

Mike Scalise, PMP
https://www.michaelscalise.com