Issue with column parsing

  • Hello,

    I have data from two different sources, Source1 and Source2. We have a .dtsx package that loads data from Source1 into Source2 (a temp table), and the last step in the same package executes a stored procedure that updates the columns:

    UPDATE Emp
    SET Emp.emptyp = temp.studcode
    FROM dbo.Employee Emp
    INNER JOIN student temp
        ON Emp.ID = temp.ID;

    Now I have to add some logic to this stored procedure.

    After the UPDATE statement, add another step: write a SELECT query that returns all the data from Employee where the ID does not start with 'ASJ' (basically, it will exclude the rows that come from SAP). Then I have to open a cursor over that result and, within the cursor, check each row and parse it; I have to get the individual columns and update them.

    10001:4|xxx (the value to parse)

    Split it on the pipe to get two values, then split the first value on the colon to get the emp number and the type number; we use these to do joins.

    The same columns, emp number and type number, exist in the temp table too.

    xxx is the description field; there is no need to check it.

  • mcfarlandparkway (12/27/2016)


    Hello,

    I have data from two different sources, Source1 and Source2. We have a .dtsx package that loads data from Source1 into Source2 (a temp table), and the last step in the same package executes a stored procedure that updates the columns:

    UPDATE Emp
    SET Emp.emptyp = temp.studcode
    FROM dbo.Employee Emp
    INNER JOIN student temp
        ON Emp.ID = temp.ID;

    Now I have to add some logic to this stored procedure.

    After the UPDATE statement, add another step: write a SELECT query that returns all the data from Employee where the ID does not start with 'ASJ' (basically, it will exclude the rows that come from SAP). Then I have to open a cursor over that result and, within the cursor, check each row and parse it; I have to get the individual columns and update them.

    10001:4|xxx (the value to parse)

    Split it on the pipe to get two values, then split the first value on the colon to get the emp number and the type number; we use these to do joins.

    The same columns, emp number and type number, exist in the temp table too.

    xxx is the description field; there is no need to check it.

    I can't see a question here ...

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

    In the stored procedure, I need to implement loop-based logic after the UPDATE statement.

    Now I have to add some logic to this stored procedure.

    After the UPDATE statement, add another step: write a SELECT query that returns all the data from Employee where the ID does not start with 'ASJ' (basically, it will exclude the rows that come from SAP). Then I have to open a cursor over that result and, within the cursor, check each row and parse it; I have to get the individual columns and update them.

    Parsing logic for the fields: from a coding perspective, I need to parse on the colon and the pipe.

    10001:4|xxx (the value to parse)

    Split it on the pipe to get two values, then split the first value on the colon to get the emp number and the type number; we use these to do joins.

    The same columns, emp number and type number, exist in the temp table too.

    xxx is the description field; there is no need to check it.

    How do I implement a loop-based cursor after the UPDATE statement?

  • mcfarlandparkway (12/27/2016)


    In the stored procedure, I need to implement loop-based logic after the UPDATE statement.

    Now I have to add some logic to this stored procedure.

    After the UPDATE statement, add another step: write a SELECT query that returns all the data from Employee where the ID does not start with 'ASJ' (basically, it will exclude the rows that come from SAP). Then I have to open a cursor over that result and, within the cursor, check each row and parse it; I have to get the individual columns and update them.

    Parsing logic for the fields: from a coding perspective, I need to parse on the colon and the pipe.

    10001:4|xxx (the value to parse)

    Split it on the pipe to get two values, then split the first value on the colon to get the emp number and the type number; we use these to do joins.

    The same columns, emp number and type number, exist in the temp table too.

    xxx is the description field; there is no need to check it.

    How do I implement a loop-based cursor after the UPDATE statement?

    What makes you think you need a cursor? Why not just split the data as a set-based operation?

    Based on the information you've provided, it's impossible to give you a code-based solution. If you want that, please take the time to write out sample DDL, INSERT statements to populate the tables whose DDL you have provided and details of the results which you expect, based on the sample data provided.


    A set-based approach? How do I do this?

    How do I parse (split) this: 9000430944:7|Information

    -> 9000430944 - emp number
    -> 7 - stud number
    -> Information - description

    How do I split on the pipe, and how do I split on the colon?

  • mcfarlandparkway (12/27/2016)


    A set-based approach? How do I do this?

    How do I parse (split) this: 9000430944:7|Information

    -> 9000430944 - emp number
    -> 7 - stud number
    -> Information - description

    How do I split on the pipe, and how do I split on the colon?

    Totally possible. Here is an example; it relies on you having created DelimitedSplit8K somewhere in one of your DBs.

    IF OBJECT_ID('tempdb..#tmp', 'U') IS NOT NULL
        DROP TABLE #tmp;

    CREATE TABLE #tmp
    (
        TempId INT IDENTITY(1, 1) PRIMARY KEY CLUSTERED
      , Jumble VARCHAR(50)
    );

    INSERT #tmp (Jumble)
    VALUES ('10001:4|xxx')
         , ('10002:5|yyy')
         , ('100933:222|asdml');

    WITH UnJumbled AS
    (
        SELECT t.TempId
             , x.ItemNumber
             , x.Item
        FROM #tmp t
        CROSS APPLY dbo.DelimitedSplit8K(REPLACE(t.Jumble, ':', '|'), '|') x
    )
    SELECT UnJumbled.TempId
         , EmpNo     = MAX(IIF(UnJumbled.ItemNumber = 1, UnJumbled.Item, NULL))
         , StudentNo = MAX(IIF(UnJumbled.ItemNumber = 2, UnJumbled.Item, NULL))
         , Dsc       = MAX(IIF(UnJumbled.ItemNumber = 3, UnJumbled.Item, NULL))
    FROM UnJumbled
    GROUP BY UnJumbled.TempId;


  • mcfarlandparkway (12/27/2016)


    A set-based approach? How do I do this?

    How do I parse (split) this: 9000430944:7|Information

    -> 9000430944 - emp number
    -> 7 - stud number
    -> Information - description

    How do I split on the pipe, and how do I split on the colon?

    Another approach that does not require a splitter:

    WITH pos(TempId, Jumble, d1, d2) AS
    (
        SELECT TempId, Jumble, CHARINDEX(':', Jumble), CHARINDEX('|', Jumble)
        FROM #tmp
    )
    SELECT TempId
         , EmpNo     = SUBSTRING(Jumble, 1, d1 - 1)
         , StudentNo = SUBSTRING(Jumble, d1 + 1, d2 - d1 - 1)
         , Dsc       = SUBSTRING(Jumble, d2 + 1, 50)
    FROM pos;
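    To tie this back to the original requirement (updating Employee from the temp table using the parsed values), the same CHARINDEX/SUBSTRING parse can drive a set-based UPDATE directly, with no cursor. This is only a sketch: the column holding the delimited string (Jumble here) and the temp-table columns EmpNum and TypeNum are assumed names, not taken from the thread, so adjust them to the real schema.

    ```sql
    -- Sketch only: Jumble, EmpNum and TypeNum are assumed column names.
    WITH pos AS
    (
        SELECT ID
             , Jumble
             , d1 = CHARINDEX(':', Jumble)   -- position of the colon
             , d2 = CHARINDEX('|', Jumble)   -- position of the pipe
        FROM dbo.Employee
        WHERE ID NOT LIKE 'ASJ%'             -- exclude the SAP-sourced rows
    )
    UPDATE Emp
    SET Emp.emptyp = temp.studcode
    FROM dbo.Employee Emp
    INNER JOIN pos
        ON pos.ID = Emp.ID
    INNER JOIN student temp
        ON  temp.EmpNum  = SUBSTRING(pos.Jumble, 1, pos.d1 - 1)                     -- emp number
        AND temp.TypeNum = SUBSTRING(pos.Jumble, pos.d1 + 1, pos.d2 - pos.d1 - 1);  -- type number
    ```

    As with the SELECT above, this assumes every row contains a colon followed by a pipe; rows that don't match that pattern would need guarding.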

    "I can't stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Nice, Alan. I saw the word 'split' and I was off like one of Pavlov's four-legged friends. Did not even consider alternatives.


  • Phil Parkin (12/27/2016)


    Nice, Alan. I saw the word 'split' and I was off like one of Pavlov's four-legged friends. Did not even consider alternatives.

    😛

    I did make the assumption that there would always be a colon, then a pipe. If only it were really like that in the real world.


    IIRC, PARSENAME limits each part it returns to NCHAR(128) and is heavily dependent on the items being in consistent positions, but the following is another way to skin the same cat.

    WITH cteFixDelimiter AS
    (
        SELECT Jumbled = REPLACE(REPLACE(Jumble, ':', '.'), '|', '.')
        FROM #tmp
    )
    SELECT EmpNo     = PARSENAME(Jumbled, 3)
         , StudentNo = PARSENAME(Jumbled, 2)
         , Dsc       = PARSENAME(Jumbled, 1)
    FROM cteFixDelimiter;

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (12/28/2016)


    IIRC, PARSENAME limits each part it returns to NCHAR(128) and is heavily dependent on the items being in consistent positions, but the following is another way to skin the same cat.

    WITH cteFixDelimiter AS
    (
        SELECT Jumbled = REPLACE(REPLACE(Jumble, ':', '.'), '|', '.')
        FROM #tmp
    )
    SELECT EmpNo     = PARSENAME(Jumbled, 3)
         , StudentNo = PARSENAME(Jumbled, 2)
         , Dsc       = PARSENAME(Jumbled, 1)
    FROM cteFixDelimiter;

    Just an FYI, but the use of PARSENAME would require that none of the data elements ever contain a period/decimal point. I'd prefer to keep the existing delimiters and just use the DelimitedSplit8K function to do the splitting. Having to use the REPLACE function is also surprisingly costly in terms of runtime, so if the data quantities are at all sizable, the split function is probably going to beat the pants off most other methods. That said, chances are PARSENAME is still going to be faster than REPLACE.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • sgmunson (1/2/2017)


    Jeff Moden (12/28/2016)


    IIRC, PARSENAME limits each part it returns to NCHAR(128) and is heavily dependent on the items being in consistent positions, but the following is another way to skin the same cat.

    WITH cteFixDelimiter AS
    (
        SELECT Jumbled = REPLACE(REPLACE(Jumble, ':', '.'), '|', '.')
        FROM #tmp
    )
    SELECT EmpNo     = PARSENAME(Jumbled, 3)
         , StudentNo = PARSENAME(Jumbled, 2)
         , Dsc       = PARSENAME(Jumbled, 1)
    FROM cteFixDelimiter;

    Just an FYI, but the use of PARSENAME would require that none of the data elements ever contain a period/decimal point. I'd prefer to keep the existing delimiters and just use the DelimitedSplit8K function to do the splitting. Having to use the REPLACE function is also surprisingly costly in terms of runtime, so if the data quantities are at all sizable, the split function is probably going to beat the pants off most other methods. That said, chances are PARSENAME is still going to be faster than REPLACE.

    I have seen the REPLACE performance issue and agree 100%.

    Why any splitter? You can just use CHARINDEX/SUBSTRING (note my earlier example) which will beat the pants off any splitter based solution.

    edit: meant "CHARINDEX" but typed "APPLY"


  • Alan.B (1/2/2017)


    I have seen the REPLACE performance issue and agree 100%.

    Why any splitter? You can just use CHARINDEX/SUBSTRING (note my earlier example) which will beat the pants off any splitter based solution.

    edit: meant "CHARINDEX" but typed "APPLY"

    The last time I saw consistent data in a string that might need splitting, I fell off my dinosaur... Not that it doesn't happen... it just doesn't happen often enough in my world for me to see it...


  • Alan.B (1/2/2017)


    sgmunson (1/2/2017)


    Jeff Moden (12/28/2016)


    IIRC, PARSENAME limits each part it returns to NCHAR(128) and is heavily dependent on the items being in consistent positions, but the following is another way to skin the same cat.

    WITH cteFixDelimiter AS
    (
        SELECT Jumbled = REPLACE(REPLACE(Jumble, ':', '.'), '|', '.')
        FROM #tmp
    )
    SELECT EmpNo     = PARSENAME(Jumbled, 3)
         , StudentNo = PARSENAME(Jumbled, 2)
         , Dsc       = PARSENAME(Jumbled, 1)
    FROM cteFixDelimiter;

    Just an FYI, but the use of PARSENAME would require that none of the data elements ever contain a period/decimal point. I'd prefer to keep the existing delimiters and just use the DelimitedSplit8K function to do the splitting. Having to use the REPLACE function is also surprisingly costly in terms of runtime, so if the data quantities are at all sizable, the split function is probably going to beat the pants off most other methods. That said, chances are PARSENAME is still going to be faster than REPLACE.

    I have seen the REPLACE performance issue and agree 100%.

    Why any splitter? You can just use CHARINDEX/SUBSTRING (note my earlier example) which will beat the pants off any splitter based solution.

    edit: meant "CHARINDEX" but typed "APPLY"

    It's kind of a given that your elements shouldn't have periods in them if you use this method. I thought everyone would understand that and that's why I didn't bring it up. It would be kind of like stating that any elements that contain a colon or pipe could be a problem in this case.

    REPLACE isn't actually expensive at all, especially for 1:1 replacements... unless you're using a collation where it's slow. You can easily overcome that by using the COLLATE clause with a binary collation. And remember that CHARINDEX, although to a lesser extent, is affected by the same collation problems.
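    As a sketch of the binary-collation trick mentioned above (the collation name Latin1_General_BIN2 is one common choice, not something specified in the thread):

    ```sql
    -- Forcing a binary collation for the REPLACE calls; 1:1 delimiter swaps
    -- then avoid the overhead of linguistic comparison rules.
    SELECT Jumbled = REPLACE(REPLACE(Jumble COLLATE Latin1_General_BIN2, ':', '.'), '|', '.')
    FROM #tmp;
    ```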

    As for why to use any kind of a splitter at all, consider that's exactly what was done using the two CHARINDEX functions and a calculated substring for each element 😉 and, as I stated in my post, it's just another way to skin the same cat.

    Shifting gears a bit and since no one has yet brought it up... let's talk about the real problems with all of this...

    1. Why is this highly denormalized data being stored in a table to begin with? If it's only stored in a staging table used to parse such bad data before it gets to a permanent table, that's OK. If it's stored in a permanent table this way, that's a really bad idea and needs to be fixed permanently; it's a flagrant violation of what a column is supposed to contain, because the column will need to be split every time you want to query an element from it.

    2. Why would anyone use two different delimiters for this when there's only ever going to be a single set of elements in the data?

    3. Why would anyone use printable characters (two in this case) as delimiters?

    Using REPLACE or a splitter are minor problems compared to the larger problems posed by this type of data. If it were all done properly to begin with, this thread wouldn't be necessary and no one would have to worry about any of the unnecessary performance problems caused by this kind of data. 😉


  • Jeff Moden (1/2/2017)


    Alan.B (1/2/2017)


    sgmunson (1/2/2017)


    Jeff Moden (12/28/2016)


    IIRC, PARSENAME limits each part it returns to NCHAR(128) and is heavily dependent on the items being in consistent positions, but the following is another way to skin the same cat.

    WITH cteFixDelimiter AS
    (
        SELECT Jumbled = REPLACE(REPLACE(Jumble, ':', '.'), '|', '.')
        FROM #tmp
    )
    SELECT EmpNo     = PARSENAME(Jumbled, 3)
         , StudentNo = PARSENAME(Jumbled, 2)
         , Dsc       = PARSENAME(Jumbled, 1)
    FROM cteFixDelimiter;

    Just an FYI, but the use of PARSENAME would require that none of the data elements ever contain a period/decimal point. I'd prefer to keep the existing delimiters and just use the DelimitedSplit8K function to do the splitting. Having to use the REPLACE function is also surprisingly costly in terms of runtime, so if the data quantities are at all sizable, the split function is probably going to beat the pants off most other methods. That said, chances are PARSENAME is still going to be faster than REPLACE.

    I have seen the REPLACE performance issue and agree 100%.

    Why any splitter? You can just use CHARINDEX/SUBSTRING (note my earlier example) which will beat the pants off any splitter based solution.

    edit: meant "CHARINDEX" but typed "APPLY"

    It's kind of a given that your elements shouldn't have periods in them if you use this method. I thought everyone would understand that and that's why I didn't bring it up. It would be kind of like stating that any elements that contain a colon or pipe could be a problem in this case.

    REPLACE isn't actually expensive at all, especially for 1:1 replacements... unless you're using a collation where it's slow. You can easily overcome that by using the COLLATE clause with a binary collation. And remember that CHARINDEX, although to a lesser extent, is affected by the same collation problems.

    As for why to use any kind of a splitter at all, consider that's exactly what was done using the two CHARINDEX functions and a calculated substring for each element 😉 and, as I stated in my post, it's just another way to skin the same cat.

    Shifting gears a bit and since no one has yet brought it up... let's talk about the real problems with all of this...

    1. Why is this highly denormalized data being stored in a table to begin with? If it's only stored in a staging table used to parse such bad data before it gets to a permanent table, that's OK. If it's stored in a permanent table this way, that's a really bad idea and needs to be fixed permanently; it's a flagrant violation of what a column is supposed to contain, because the column will need to be split every time you want to query an element from it.

    2. Why would anyone use two different delimiters for this when there's only ever going to be a single set of elements in the data?

    3. Why would anyone use printable characters (two in this case) as delimiters?

    Using REPLACE or a splitter are minor problems compared to the larger problems posed by this type of data. If it were all done properly to begin with, this thread wouldn't be necessary and no one would have to worry about any of the unnecessary performance problems caused by this kind of data. 😉

    I agree... all those things would be wonderful to have as the norm. Unfortunately, folks like you and me have to deal with the "real world", where ideals often don't count for much, bad is normal, and ugly isn't sufficiently uncommon. I sometimes wonder if I would have work if these kinds of "problems" didn't exist.


Viewing 14 posts - 1 through 14 (of 14 total)
