Splitting a crazy column into two new columns

  • rkelly58 wrote:

    thanks Des,

    asked third party aboyut the indexed views, they said absolutely not, will void contract, lol

    8<

    >8

    Wow! that takes their their shonky datatyping and design to a whole new level! I do feel for you, I have to look after two DBs in the estate which do not allow data modification in the DB (they have put too many data integrity checks in the application, not the DB mainly) but even in those, we can add custom objects to make reporting etc easier. They annoy me enough.

    I cannot begin to understand the frustration you must be feeling when trying to work around the awfulness of their system when you get responses from them like that!

    "Knowledge is of two kinds. We know a subject ourselves, or we know where we can find information upon it. When we enquire into any subject, the first thing we have to do is to know what books have treated of it. This leads us to look at catalogues, and at the backs of books in libraries."
    — Samuel Johnson
    I wonder, would the great Samuel Johnson have replaced that with "GIYF" now?

  • Hi Des,

    When running the SQL against the live data source, just to test it, i'm getting this error:

    Msg 241, Level 16, State 1, Line 1

    Conversion failed when converting date and/or time from character string.

    I stripped back the code and it doesn't even like the WITH ctebooking AS ( line

    OriginalColumn in the source is varchar(4000) same as the test table i used....

    Any ideas?

    • This reply was modified 4 years, 1 month ago by  rkelly58.
  • That speaks to a record whose data does not match the pattern of data that was used for creating the solution at hand.

    You will need to find the record(s) that are causing a problem, and either change the data to match what the code can handle, or extend the code to handle the new data pattern;

  • I strongly agree with your decision to clean the data before it goes to the database. However, we never designed SQL to be a string handling language. If you can clean it up before, using a language with features that support much better string editing, and then inserted into the database then life will be much better. Trust me; I have been there and done that. In my case, my attempt in SQL to scrub data was just a little wrong and was put in schema. We found out that it was wrong, the hard way.

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • Just wanted to reiterate my initial question to the OP, is the format fixed?

    😎

    The reason for asking is that if the format is fixed, one can easily half the work required.

  • Even if the format is NOT fixed, you can use the Tally Table-based string splitter function from this very site (authored by Jeff Moden in an Article with a title beginning with "Tally Oh".  That string splitter function will return a row for each "token" within the string, based on a single-character delimiter.   Using the space character as the delimiter, tokens 2, 3, and 5 are of interest - ItemNumber is the output column that has the token number.   Then you can use TRY_CONVERT function and get a NULL value when the character string representing a date and time doesn't convert.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • sgmunson wrote:

    Even if the format is NOT fixed, you can use the Tally Table-based string splitter function from this very site (authored by Jeff Moden in an Article with a title beginning with "Tally Oh".  That string splitter function will return a row for each "token" within the string, based on a single-character delimiter.   Using the space character as the delimiter, tokens 2, 3, and 5 are of interest - ItemNumber is the output column that has the token number.   Then you can use TRY_CONVERT function and get a NULL value when the character string representing a date and time doesn't convert.

    If the format is fixed, the effort is much less than using a conventional string splitting method, think of 3 operations vs. number of characters three times!

    😎

    Done some work on this in the past 😉

  • Eirikur Eiriksson wrote:

    sgmunson wrote:

    Even if the format is NOT fixed, you can use the Tally Table-based string splitter function from this very site (authored by Jeff Moden in an Article with a title beginning with "Tally Oh".  That string splitter function will return a row for each "token" within the string, based on a single-character delimiter.   Using the space character as the delimiter, tokens 2, 3, and 5 are of interest - ItemNumber is the output column that has the token number.   Then you can use TRY_CONVERT function and get a NULL value when the character string representing a date and time doesn't convert.

    If the format is fixed, the effort is much less than using a conventional string splitting method, think of 3 operations vs. number of characters three times!

    😎

    Done some work on this in the past 😉

    I was thinking along the lines of an APPLY with a CROSS TAB query that uses the splitting function and only selects ItemNumber IN (2,3,5), and the CROSS TAB methodology groups it back to a single row, so you still only process that string once.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • rkelly58 wrote:

    Hi Des,

    When running the SQL against the live data source, just to test it, i'm getting this error:

    Msg 241, Level 16, State 1, Line 1

    Conversion failed when converting date and/or time from character string.

    I stripped back the code and it doesn't even like the WITH ctebooking AS ( line

    OriginalColumn in the source is varchar(4000) same as the test table i used....

    Any ideas?

    In your query take the part which you use to get BookingDate:

    BookingDate AS CONVERT(datetime, SUBSTRING(...))

    and put it into a WHERE clause:

    SELECT /*no conversion, just the string*/ SUBSTRING(...
    FROM ....
    WHERE ISDATE(SUBSTRING(...)) = 0

    _____________
    Code for TallyGenerator

  • Hello everyone,

    Thanks for your replies, again I really appreciate your input.

    Des, you were spot on, when I looked at the Live Data for that field, i was horrified to find that the field is not fixed at all....

    Here are the varying entries (while i wipe a tear)

    • Booked on: 15/08/2020 12:28 by: X1234567
    • Booked & paid on 08/01/2020 14:39:10 by: SELF SERVICE
    • Booked & paid on 09/11/2019 00:45:47 by: SELF SERVICE Attended on 09/12/2020 00:45:47 (SELF SERVICE)
    • Booked on: 11/01/2018 11:52 by: X1234567 Amended on: 19/01/2018 12:29 by: Y7654321 - Note: <-- THIS IS A DIFFERENT ID
    • Booked on: 14/02/2018 12:57 by: X1234567 Amended on: 14/02/2018 17:17 by: X1234567 Paid on: 17 Jun 2018 17:00 by: Y7654321 (again a different ID)
    • Booked on: 02/01/2019 12:17 by: X1234567 Booked on: 05/01/2019 12:06 by: Y1234567 Booked on: 05/01/2019 12:07 by: Y1234567 Booked on: 05/01/2019 12:08 by: Y1234567

    Basically, it's a complete mess. I've confronted the company and told it is what it is..... charming eh? I'll carry on fighting this in the background.

    In answer to some of the questions:

    a) No, it's obviously not a fixed length

    b) Sometimes the user isn't even shown, it just has "self service"

    Can the script be altered so it takes the First Booked on date and the UserID?

    Where there isn't a  Booked on By UserID, it puts in Self Service as the User?

    Is this even possible?

    Thanks again

     

  • Your downfall here is going to be trying to do it all at once.

    You cannot touch the vendor DB, but there is nothing stopping you from creating your own staging DB, and doing all the processing that you need there, then copying the cleaned data to the final destination.

  • Based on the latest sample data that you provided, this code works.

    However, I fully expect that you will get other non-compliant records.

    You can keep tweaking the code to the point where it is a complete mess, or you can do the right thing and set up a staging DB where you do your manipulations.

    DECLARE @TestTable table (
    OriginalColumn varchar(200)
    );

    INSERT INTO @TestTable ( OriginalColumn )
    VALUES ('Booked on: 15/08/2020 12:28 by: X1234567')
    , ('Booked & paid on 08/01/2020 14:39:10 by: SELF SERVICE')
    , ('Booked & paid on 09/11/2019 00:45:47 by: SELF SERVICE Attended on 09/12/2020 00:45:47 (SELF SERVICE)')
    , ('Booked on: 11/01/2018 11:52 by: X1234567 Amended on: 19/01/2018 12:29 by: Y7654321 - Note: <-- THIS IS A DIFFERENT ID')
    , ('Booked on: 14/02/2018 12:57 by: X1234567 Amended on: 14/02/2018 17:17 by: X1234567 Paid on: 17 Jun 2018 17:00 by: Y7654321 (again a different ID)')
    , ('Booked on: 02/01/2019 12:17 by: X1234567 Booked on: 05/01/2019 12:06 by: Y1234567 Booked on: 05/01/2019 12:07 by: Y1234567 Booked on: 05/01/2019 12:08 by: Y1234567');


    WITH cteMassagedData AS (
    SELECT OriginalColumn
    , CleanedForFirstValues = REPLACE(REPLACE(REPLACE(REPLACE(OriginalColumn, ' & paid', ''), ' on: ', ' on '), ' by: ', ' by '), ' ', ' ')
    FROM @TestTable
    WHERE REPLACE(REPLACE(REPLACE(REPLACE(OriginalColumn, ' & paid', ''), ' on: ', ' on '), ' by: ', ' by '), ' ', ' ') LIKE 'Booked on [0-3][0-9]/[0-1][0-9]/[1-2][0-9][0-9][0-9]%'
    )
    , cteFirstUser AS (
    SELECT OriginalColumn
    , CleanedForFirstValues
    , FirstUserStart = PATINDEX('% by %', CleanedForFirstValues)
    FROM cteMassagedData
    )
    SELECT OriginalColumn
    , BookingDate = TRY_CONVERT(datetime, SUBSTRING(CleanedForFirstValues, 11, FirstUserStart-11), 103)
    , ByWho = LTRIM(RTRIM(SUBSTRING(CleanedForFirstValues, FirstUserStart+4, ISNULL(SecondPos.Pos, LEN(CleanedForFirstValues)) )))
    FROM cteFirstUser
    OUTER APPLY (SELECT MIN(x.Pos)
    FROM (VALUES ( PATINDEX('% Booked %' , LTRIM(RTRIM(SUBSTRING(CleanedForFirstValues, FirstUserStart+4, LEN(CleanedForFirstValues))))) )
    , ( PATINDEX('% Amended %' , LTRIM(RTRIM(SUBSTRING(CleanedForFirstValues, FirstUserStart+4, LEN(CleanedForFirstValues))))) )
    , ( PATINDEX('% Attended %', LTRIM(RTRIM(SUBSTRING(CleanedForFirstValues, FirstUserStart+4, LEN(CleanedForFirstValues))))) )
    ) AS x(Pos)
    WHERE x.Pos > 0
    ) AS SecondPos(Pos);

  • Can you check if the event records in your string are separated not just by space but some special character, like CHAR(13)?

    you may use output to text in SSMS, or convert the string to varbinary(1000)

    _____________
    Code for TallyGenerator

  • DesNorton wrote:

    Your downfall here is going to be trying to do it all at once.

    You cannot touch the vendor DB, but there is nothing stopping you from creating your own staging DB, and doing all the processing that you need there, then copying the cleaned data to the final destination.

    +1 for this. I often stage stuff coming in for this reason. I can run multiple cleaning steps on the data.

  • Des,

    I ran the script, still getting some added stuff appearing in the ByWho column:

    SELF SERVICE Attended on 15/02/2018 09:43:48 (SELF SERVICE)

    X1234567 Paid on 20 Feb 2018 12:04 by Y1234567

    X1234567 Amended on 14/02/2018 17:17 by X1234567 Paid on 17 Jun 2018 17:00 by Y1234567

     

    Can we get it so, it just pulls the left most UserID only and not the extra text afterwards

    The same for SELF Service, can we just get it to say SELF SERVICE and ignore the text of the characters after?

     

    I want to bring back the last 4 hours worth of data, so tried this at the end:

    SELECT *

    FROM cteFirstUser

    WHERE BookingDate >= DATEADD(hh, -4, GETDATE())

    But it says invalid object name 'cteFirstUser'

    Many thanks

Viewing 15 posts - 16 through 30 (of 30 total)

You must be logged in to reply to this topic. Login to reply