May 15, 2014 at 8:17 am
Apologies for the vague topic; I'm not sure how else to say it. I have data coming in from source A. It has to be "cleaned up" then sent off to destination B (and also stored here in an ODS).
There is a specific issue with some of the data that comes in. Users enter their own information on a website over which I have no control. The interface permits them to do the following (apologies, I stink at formatting tables here):
first_name last_name degree1 degree2 degree3 degree4
---------- ---------- ------ ------ ------ -------
John Smith BA MBA
They can enter up to four degrees/certifications or none at all. They are not required by the interface to start with cert1, so you may see someone with two qualifications fill in fields 1 and 3 or 4, or 2 and 3, or 1 and 4.
The data must arrive at the destination looking like this:
first_name last_name degree1 degree2 degree3 degree4
---------- ---------- ------ ------ ------ -------
John Smith BA MBA
So basically all of these values have to be squished and condensed to the left so there are no blank fields between certifications, if that makes sense.
First, I did this (I know, I know...temp tables):
SELECTTRANSACTION_ID,Degree,Value
INTO #tmp_degreeunpivot
FROM(SELECT TRANSACTION_ID,DEGREES1,DEGREES2,DEGREES3,DEGREES4 FROM DBO.mysourcetable
WHERE(DEGREES1 <> '' OR DEGREES2 <> '' OR DEGREES3 <> '' OR DEGREES4 <> '')) AS P
UNPIVOT (Degree FOR Value IN (DEGREES1,DEGREES2,DEGREES3,DEGREES4)) unpvt
ORDER BY TRANSACTION_ID;
Then I deleted the DEGREE1-DEGREE4 values that existed in the source heap for the TRANSACTION_ID values in #tmp_degreeunpivot.
Then I did this:
WITH DEGREE_CTE (TRANSACTION_ID,Degree,DEGREERANK)
AS
(
SELECTTDP.TRANSACTION_ID,
TDP.Degree,
'DEGREES' + CAST(TDP.RNK AS char(1)) AS DEGREERANK
FROM(SELECTTRANSACTION_ID,
DEGREE,
RANK() OVER (PARTITION BY TRANSACTION_ID ORDER BY VALUE ASC) AS RNK
FROM#tmp_degreeunpivot) AS TDP
)
SELECTTRANSACTION_ID,
DEGREES1,
DEGREES2,
DEGREES3,
DEGREES4
INTO#tmp_degreepivot
FROM(SELECT TRANSACTION_ID,Degree,DEGREERANK FROM DEGREE_CTE) AS up
PIVOT (MAX(DEGREE) for DEGREERANK in (DEGREES1,DEGREES2,DEGREES3,DEGREES4)) AS PVT;
Then I updated the source with the values in the correct columns:
UPDATEME
SETME.DEGREES1 = COALESCE(PVT.DEGREES1,''),
ME.DEGREES2 = COALESCE(PVT.DEGREES2,''),
ME.DEGREES3 = COALESCE(PVT.DEGREES3,''),
ME.DEGREES4 = COALESCE(PVT.DEGREES4,'')
FROM#tmp_degreepivot AS PVT
INNER JOINdbo.mysourcetable AS ME
ONPVT.TRANSACTION_ID = ME.TRANSACTION_ID;
Does it work? Yes. Does it perform well? Reasonably, and the files we'll be working with never have more than 100 records per day in them, so if it takes an extra few seconds to run it's not a big deal.
Am I happy with it? Yes and no. I'm glad that I solved the problem, but I keep wondering if there was a better way to do it. A more elegant way.
May 15, 2014 at 10:27 am
I wish you could post table schema and sample data.
SET NOCOUNT ON;
USE tempdb;
GO
WITH C1 AS (
SELECT
T.first_name,
T.last_name,
R.degree,
R.pos,
ROW_NUMBER()OVER(PARTITION BY T.first_name, T.last_name ORDER BY R.pos) AS rn
FROM
(
VALUES
('John', 'Smith', '', 'BA', '', 'MBA'),
('Some', 'Name', 'BA', 'MBA', '', 'PhD')
) AS T(first_name, last_name, degree1, degree2, degree3, degree4)
CROSS APPLY
(
VALUES
(degree1, 1),
(degree2, 2),
(degree3, 3),
(degree4, 4)
) AS R(degree, pos)
WHERE
degree > ''
)
SELECT
first_name,
last_name,
MAX(CASE WHEN rn = 1 THEN degree ELSE '' END) AS degree1,
MAX(CASE WHEN rn = 2 THEN degree ELSE '' END) AS degree2,
MAX(CASE WHEN rn = 3 THEN degree ELSE '' END) AS degree3,
MAX(CASE WHEN rn = 4 THEN degree ELSE '' END) AS degree4
FROM
C1
GROUP BY
first_name,
last_name;
GO
I am using VALUES as the row constructor to simulate your table, but the idea here is to use APPLY operator to unpivot the data, exclude the rows where there is no degree, enumerate the rest and do pivot again.
May 15, 2014 at 10:46 am
hunchback (5/15/2014)
I wish you could post table schema and sample data.
SET NOCOUNT ON;
USE tempdb;
GO
WITH C1 AS (
SELECT
T.first_name,
T.last_name,
R.degree,
R.pos,
ROW_NUMBER()OVER(PARTITION BY T.first_name, T.last_name ORDER BY R.pos) AS rn
FROM
(
VALUES
('John', 'Smith', '', 'BA', '', 'MBA'),
('Some', 'Name', 'BA', 'MBA', '', 'PhD')
) AS T(first_name, last_name, degree1, degree2, degree3, degree4)
CROSS APPLY
(
VALUES
(degree1, 1),
(degree2, 2),
(degree3, 3),
(degree4, 4)
) AS R(degree, pos)
WHERE
degree > ''
)
SELECT
first_name,
last_name,
MAX(CASE WHEN rn = 1 THEN degree ELSE '' END) AS degree1,
MAX(CASE WHEN rn = 2 THEN degree ELSE '' END) AS degree2,
MAX(CASE WHEN rn = 3 THEN degree ELSE '' END) AS degree3,
MAX(CASE WHEN rn = 4 THEN degree ELSE '' END) AS degree4
FROM
C1
GROUP BY
first_name,
last_name;
GO
I am using VALUES as the row constructor to simulate your table, but the idea here is to use APPLY operator to unpivot the data, exclude the rows where there is no degree, enumerate the rest and do pivot again.
Awesome, this works! 🙂 Thank you!
Viewing 3 posts - 1 through 2 (of 2 total)
You must be logged in to reply to this topic. Login to reply