Issue with column parsing

  • Hello,

    I have data from two different sources, Source1 and Source2. We have a .dtsx package that loads data from Source1 into Source2 (a temp table), and the last step in the same package executes a stored procedure that updates the columns:

    UPDATE Emp
    SET Emp.emptyp = temp.studcode
    FROM dbo.Employee Emp
    INNER JOIN student temp
        ON Emp.ID = temp.ID;

    Now I have to add some logic to this stored procedure.

    After the UPDATE statement, add another step: write a SELECT query that returns all the data from Employee where the ID does not start with 'ASJ' (basically, it will exclude the rows that come from SAP). Then I have to open a cursor over that result and, within the cursor, check each row and parse it; I have to get the individual columns and update them.

    10001:4|xxx (the value to parse)

    Split it on the pipe to get two values, then split the first value on the colon to get the emp number and the type number; we use these to do joins.

    The same columns, emp number and type number, exist in the temp table too.

    xxx is the description field; there is no need to check it.

  • mcfarlandparkway (12/27/2016)


    Hello,

    I have data from two different sources, Source1 and Source2. We have a .dtsx package that loads data from Source1 into Source2 (a temp table), and the last step in the same package executes a stored procedure that updates the columns:

    UPDATE Emp
    SET Emp.emptyp = temp.studcode
    FROM dbo.Employee Emp
    INNER JOIN student temp
        ON Emp.ID = temp.ID;

    Now I have to add some logic to this stored procedure.

    After the UPDATE statement, add another step: write a SELECT query that returns all the data from Employee where the ID does not start with 'ASJ' (basically, it will exclude the rows that come from SAP). Then I have to open a cursor over that result and, within the cursor, check each row and parse it; I have to get the individual columns and update them.

    10001:4|xxx (the value to parse)

    Split it on the pipe to get two values, then split the first value on the colon to get the emp number and the type number; we use these to do joins.

    The same columns, emp number and type number, exist in the temp table too.

    xxx is the description field; there is no need to check it.

    I can't see a question here ...

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

    In the stored procedure, I need to implement loop-based logic after the UPDATE statement.

    Now I have to add some logic to this stored procedure.

    After the UPDATE statement, add another step: write a SELECT query that returns all the data from Employee where the ID does not start with 'ASJ' (basically, it will exclude the rows that come from SAP). Then I have to open a cursor over that result and, within the cursor, check each row and parse it; I have to get the individual columns and update them.

    Parsing logic for the fields: from a coding perspective, I need to parse on the colon and the pipe.

    10001:4|xxx (the value to parse)

    Split it on the pipe to get two values, then split the first value on the colon to get the emp number and the type number; we use these to do joins.

    The same columns, emp number and type number, exist in the temp table too.

    xxx is the description field; there is no need to check it.

    How do I implement a loop-based cursor after the UPDATE statement?

  • mcfarlandparkway (12/27/2016)


    In the stored procedure, I need to implement loop-based logic after the UPDATE statement.

    Now I have to add some logic to this stored procedure.

    After the UPDATE statement, add another step: write a SELECT query that returns all the data from Employee where the ID does not start with 'ASJ' (basically, it will exclude the rows that come from SAP). Then I have to open a cursor over that result and, within the cursor, check each row and parse it; I have to get the individual columns and update them.

    Parsing logic for the fields: from a coding perspective, I need to parse on the colon and the pipe.

    10001:4|xxx (the value to parse)

    Split it on the pipe to get two values, then split the first value on the colon to get the emp number and the type number; we use these to do joins.

    The same columns, emp number and type number, exist in the temp table too.

    xxx is the description field; there is no need to check it.

    How do I implement a loop-based cursor after the UPDATE statement?

    What makes you think you need a cursor? Why not just split the data as a set-based operation?

    Based on the information you've provided, it's impossible to give you a code-based solution. If you want that, please take the time to write out sample DDL, INSERT statements to populate the tables whose DDL you have provided and details of the results which you expect, based on the sample data provided.


    A set-based approach? How do I do this?

    How do I parse (split) this: 9000430944:7|Information

    -> 9000430944 - emp number
    -> 7 - stud number
    -> Information - description

    How do I split on the pipe, and how do I split on the colon?

  • mcfarlandparkway (12/27/2016)


    A set-based approach? How do I do this?

    How do I parse (split) this: 9000430944:7|Information

    -> 9000430944 - emp number
    -> 7 - stud number
    -> Information - description

    How do I split on the pipe, and how do I split on the colon?

    Totally possible. Here is an example; it relies on you having created DelimitedSplit8K somewhere in one of your DBs.

    IF OBJECT_ID('tempdb..#tmp', 'U') IS NOT NULL
        DROP TABLE #tmp;

    CREATE TABLE #tmp
    (
        TempId INT IDENTITY(1, 1) PRIMARY KEY CLUSTERED
      , Jumble VARCHAR(50)
    );

    INSERT #tmp (Jumble)
    VALUES ('10001:4|xxx')
         , ('10002:5|yyy')
         , ('100933:222|asdml');

    WITH UnJumbled AS
    (
        SELECT t.TempId
             , x.ItemNumber
             , x.Item
        FROM #tmp t
        CROSS APPLY dbo.DelimitedSplit8K(REPLACE(t.Jumble, ':', '|'), '|') x
    )
    SELECT UnJumbled.TempId
         , EmpNo     = MAX(IIF(UnJumbled.ItemNumber = 1, UnJumbled.Item, NULL))
         , StudentNo = MAX(IIF(UnJumbled.ItemNumber = 2, UnJumbled.Item, NULL))
         , Dsc       = MAX(IIF(UnJumbled.ItemNumber = 3, UnJumbled.Item, NULL))
    FROM UnJumbled
    GROUP BY UnJumbled.TempId;


  • mcfarlandparkway (12/27/2016)


    A set-based approach? How do I do this?

    How do I parse (split) this: 9000430944:7|Information

    -> 9000430944 - emp number
    -> 7 - stud number
    -> Information - description

    How do I split on the pipe, and how do I split on the colon?

    Another approach that does not require a splitter:

    WITH pos(TempId, Jumble, d1, d2) AS
    (
        SELECT TempId, Jumble, CHARINDEX(':', Jumble), CHARINDEX('|', Jumble)
        FROM #tmp
    )
    SELECT TempId
         , EmpNo     = SUBSTRING(Jumble, 1, d1 - 1)
         , StudentNo = SUBSTRING(Jumble, d1 + 1, d2 - d1 - 1)
         , Dsc       = SUBSTRING(Jumble, d2 + 1, 50)
    FROM pos;
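    To tie this back to the original requirement (updating Employee from the temp table using the parsed values), the same CHARINDEX/SUBSTRING parse can drive a set-based UPDATE directly, with no cursor. This is only a sketch: the column holding the delimited string (Jumble here) and the temp-table columns EmpNum and TypeNum are assumed names, not taken from the thread, so adjust them to the real schema.

    ```sql
    -- Sketch only: Jumble, EmpNum and TypeNum are assumed column names.
    WITH pos AS
    (
        SELECT ID
             , Jumble
             , d1 = CHARINDEX(':', Jumble)   -- position of the colon
             , d2 = CHARINDEX('|', Jumble)   -- position of the pipe
        FROM dbo.Employee
        WHERE ID NOT LIKE 'ASJ%'             -- exclude the SAP-sourced rows
    )
    UPDATE Emp
    SET Emp.emptyp = temp.studcode
    FROM dbo.Employee Emp
    INNER JOIN pos
        ON pos.ID = Emp.ID
    INNER JOIN student temp
        ON  temp.EmpNum  = SUBSTRING(pos.Jumble, 1, pos.d1 - 1)                     -- emp number
        AND temp.TypeNum = SUBSTRING(pos.Jumble, pos.d1 + 1, pos.d2 - pos.d1 - 1);  -- type number
    ```

    As with the SELECT above, this assumes every row contains a colon followed by a pipe; rows that don't match that pattern would need guarding.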

    "I can't stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Nice, Alan. I saw the word 'split' and I was off like one of Pavlov's four-legged friends. Did not even consider alternatives.


  • Phil Parkin (12/27/2016)


    Nice, Alan. I saw the word 'split' and I was off like one of Pavlov's four-legged friends. Did not even consider alternatives.

    😛

    I did make the assumption that there would always be a colon, then a pipe. If only it were really like that in the real world.


    IIRC, PARSENAME limits each part it returns to NCHAR(128) and is heavily dependent on the items being in consistent positions, but the following is another way to skin the same cat.

    WITH cteFixDelimiter AS
    (
        SELECT Jumbled = REPLACE(REPLACE(Jumble, ':', '.'), '|', '.')
        FROM #tmp
    )
    SELECT EmpNo     = PARSENAME(Jumbled, 3)
         , StudentNo = PARSENAME(Jumbled, 2)
         , Dsc       = PARSENAME(Jumbled, 1)
    FROM cteFixDelimiter;

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (12/28/2016)


    IIRC, PARSENAME limits each part it returns to NCHAR(128) and is heavily dependent on the items being in consistent positions, but the following is another way to skin the same cat.

    WITH cteFixDelimiter AS
    (
        SELECT Jumbled = REPLACE(REPLACE(Jumble, ':', '.'), '|', '.')
        FROM #tmp
    )
    SELECT EmpNo     = PARSENAME(Jumbled, 3)
         , StudentNo = PARSENAME(Jumbled, 2)
         , Dsc       = PARSENAME(Jumbled, 1)
    FROM cteFixDelimiter;

    Just an FYI, but the use of PARSENAME would require that none of the data elements ever contain a period/decimal point. I'd prefer to keep the existing delimiters and just use the DelimitedSplit8K function to do the splitting. Having to use the REPLACE function is also surprisingly costly in terms of runtime, so if the data quantities are at all sizable, the split function is probably going to beat the pants off most other methods. That said, chances are PARSENAME is still going to be faster than REPLACE.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • sgmunson (1/2/2017)


    Jeff Moden (12/28/2016)


    IIRC, PARSENAME limits each part it returns to NCHAR(128) and is heavily dependent on the items being in consistent positions, but the following is another way to skin the same cat.

    WITH cteFixDelimiter AS
    (
        SELECT Jumbled = REPLACE(REPLACE(Jumble, ':', '.'), '|', '.')
        FROM #tmp
    )
    SELECT EmpNo     = PARSENAME(Jumbled, 3)
         , StudentNo = PARSENAME(Jumbled, 2)
         , Dsc       = PARSENAME(Jumbled, 1)
    FROM cteFixDelimiter;

    Just an FYI, but the use of PARSENAME would require that none of the data elements ever contain a period/decimal point. I'd prefer to keep the existing delimiters and just use the DelimitedSplit8K function to do the splitting. Having to use the REPLACE function is also surprisingly costly in terms of runtime, so if the data quantities are at all sizable, the split function is probably going to beat the pants off most other methods. That said, chances are PARSENAME is still going to be faster than REPLACE.

    I have seen the REPLACE performance issue and agree 100%.

    Why any splitter? You can just use CHARINDEX/SUBSTRING (note my earlier example) which will beat the pants off any splitter based solution.

    edit: meant "CHARINDEX" but typed "APPLY"


  • Alan.B (1/2/2017)


    I have seen the REPLACE performance issue and agree 100%.

    Why any splitter? You can just use CHARINDEX/SUBSTRING (note my earlier example) which will beat the pants off any splitter based solution.

    edit: meant "CHARINDEX" but typed "APPLY"

    The last time I saw consistent data in a string that might need splitting, I fell off my dinosaur... Not that it doesn't happen... it just doesn't happen often enough in my world for me to see it...


  • Alan.B (1/2/2017)


    sgmunson (1/2/2017)


    Jeff Moden (12/28/2016)


    IIRC, PARSENAME limits each part it returns to NCHAR(128) and is heavily dependent on the items being in consistent positions, but the following is another way to skin the same cat.

    WITH cteFixDelimiter AS
    (
        SELECT Jumbled = REPLACE(REPLACE(Jumble, ':', '.'), '|', '.')
        FROM #tmp
    )
    SELECT EmpNo     = PARSENAME(Jumbled, 3)
         , StudentNo = PARSENAME(Jumbled, 2)
         , Dsc       = PARSENAME(Jumbled, 1)
    FROM cteFixDelimiter;

    Just an FYI, but the use of PARSENAME would require that none of the data elements ever contain a period/decimal point. I'd prefer to keep the existing delimiters and just use the DelimitedSplit8K function to do the splitting. Having to use the REPLACE function is also surprisingly costly in terms of runtime, so if the data quantities are at all sizable, the split function is probably going to beat the pants off most other methods. That said, chances are PARSENAME is still going to be faster than REPLACE.

    I have seen the REPLACE performance issue and agree 100%.

    Why any splitter? You can just use CHARINDEX/SUBSTRING (note my earlier example) which will beat the pants off any splitter based solution.

    edit: meant "CHARINDEX" but typed "APPLY"

    It's kind of a given that your elements shouldn't have periods in them if you use this method. I thought everyone would understand that and that's why I didn't bring it up. It would be kind of like stating that any elements that contain a colon or pipe could be a problem in this case.

    REPLACE isn't actually expensive at all, especially for 1:1 replacements... unless you're using a collation where it's slow. You can easily overcome that by using the COLLATE clause with a binary collation. And remember that CHARINDEX, although to a lesser extent, is affected by the same collation problems.
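    As a sketch of the binary-collation trick mentioned above (the collation name Latin1_General_BIN2 is one common choice, not something specified in the thread):

    ```sql
    -- Forcing a binary collation for the REPLACE calls; 1:1 delimiter swaps
    -- then avoid the overhead of linguistic comparison rules.
    SELECT Jumbled = REPLACE(REPLACE(Jumble COLLATE Latin1_General_BIN2, ':', '.'), '|', '.')
    FROM #tmp;
    ```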

    As for why to use any kind of a splitter at all, consider that's exactly what was done using the two CHARINDEX functions and a calculated substring for each element 😉 and, as I stated in my post, it's just another way to skin the same cat.

    Shifting gears a bit and since no one has yet brought it up... let's talk about the real problems with all of this...

    1. Why is this highly denormalized data being stored in a table to begin with? If it's only stored in a staging table used to parse such bad data before it gets to a permanent table, that's OK. If it's stored in a permanent table this way, that's a really bad idea and needs to be fixed permanently; it's a flagrant violation of what a column is supposed to contain, because the column will need to be split every time you want to query an element from it.

    2. Why would anyone use two different delimiters for this when there's only ever going to be a single set of elements in the data?

    3. Why would anyone use printable characters (two in this case) as delimiters?

    Using REPLACE or a splitter are minor problems compared to the larger problems posed by this type of data. If it were all done properly to begin with, this thread wouldn't be necessary and no one would have to worry about any of the unnecessary performance problems caused by this kind of data. 😉


  • Jeff Moden (1/2/2017)


    Alan.B (1/2/2017)


    sgmunson (1/2/2017)


    Jeff Moden (12/28/2016)


    IIRC, PARSENAME limits each part it returns to NCHAR(128) and is heavily dependent on the items being in consistent positions, but the following is another way to skin the same cat.

    WITH cteFixDelimiter AS
    (
        SELECT Jumbled = REPLACE(REPLACE(Jumble, ':', '.'), '|', '.')
        FROM #tmp
    )
    SELECT EmpNo     = PARSENAME(Jumbled, 3)
         , StudentNo = PARSENAME(Jumbled, 2)
         , Dsc       = PARSENAME(Jumbled, 1)
    FROM cteFixDelimiter;

    Just an FYI, but the use of PARSENAME would require that none of the data elements ever contain a period/decimal point. I'd prefer to keep the existing delimiters and just use the DelimitedSplit8K function to do the splitting. Having to use the REPLACE function is also surprisingly costly in terms of runtime, so if the data quantities are at all sizable, the split function is probably going to beat the pants off most other methods. That said, chances are PARSENAME is still going to be faster than REPLACE.

    I have seen the REPLACE performance issue and agree 100%.

    Why any splitter? You can just use CHARINDEX/SUBSTRING (note my earlier example) which will beat the pants off any splitter based solution.

    edit: meant "CHARINDEX" but typed "APPLY"

    It's kind of a given that your elements shouldn't have periods in them if you use this method. I thought everyone would understand that and that's why I didn't bring it up. It would be kind of like stating that any elements that contain a colon or pipe could be a problem in this case.

    REPLACE isn't actually expensive at all, especially for 1:1 replacements... unless you're using a collation where it's slow. You can easily overcome that by using the COLLATE clause with a binary collation. And remember that CHARINDEX, although to a lesser extent, is affected by the same collation problems.

    As for why to use any kind of a splitter at all, consider that's exactly what was done using the two CHARINDEX functions and a calculated substring for each element 😉 and, as I stated in my post, it's just another way to skin the same cat.

    Shifting gears a bit and since no one has yet brought it up... let's talk about the real problems with all of this...

    1. Why is this highly denormalized data being stored in a table to begin with? If it's only stored in a staging table used to parse such bad data before it gets to a permanent table, that's OK. If it's stored in a permanent table this way, that's a really bad idea and needs to be fixed permanently; it's a flagrant violation of what a column is supposed to contain, because the column will need to be split every time you want to query an element from it.

    2. Why would anyone use two different delimiters for this when there's only ever going to be a single set of elements in the data?

    3. Why would anyone use printable characters (two in this case) as delimiters?

    Using REPLACE or a splitter are minor problems compared to the larger problems posed by this type of data. If it were all done properly to begin with, this thread wouldn't be necessary and no one would have to worry about any of the unnecessary performance problems caused by this kind of data. 😉

    I agree... all those things would be wonderful to have as the norm. Unfortunately, folks like you and me have to deal with the "real world", where ideals often don't count for much, bad is normal, and ugly isn't sufficiently uncommon. I sometimes wonder if I would have work if these kinds of "problems" didn't exist.


Viewing 14 posts - 1 through 14 (of 14 total)
