get a substring value from string

  • Just as a comparison of the methods given I ran the following script for 130k rows:

    -- Finding second last
    DROP TABLE IF EXISTS #Tmp ;
    create table #Tmp (SerialNumber VARCHAR (500));


    INSERT INTO #Tmp
    SELECT top(130000) T.* FROM (VALUES ('1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67|68|69|70|71|72|73|74|75|76|77|78|79|80|81|82|83|84|85|86|87|88|89|90|91|92|93|94|95|96|97|2ndLast1|3344562534'),
    ('1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67|68|69|70|71|72|73|74|75|76|77|78|79|80|81|82|83|84|85|86|87|88|89|90|91|92|93|94|23423432432|2ndLast2|3344562534'),
    ('1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67|68|69|70|71|72|73|74|75|76|77|78|79|80|81|82|83|84|85|2432|2ndLast3|3344562534'),
    ('1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67|68|69|70|71|72|73|74|75|76|77|30|2342|3432432|2ndLast4|3344562534'))t(c)
    ,SYS.all_columns A, SYS.all_columns B;

    GO
    set statistics time on
    GO
    -- carsten.saastamoinen method
    DECLARE @X VARCHAR(500);
    WITH Data AS
    (
    SELECT NEWID() AS ID,-- ID for each row
    CONCAT (SerialNumber, '|') AS SerialNumber-- For stop
    FROM #Tmp
    ),
    Split AS
    (
    SELECT ID,
    LEFT (SerialNumber, CHARINDEX ('|', SerialNumber) - 1) AS No,
    1 AS Position,
    RIGHT (SerialNumber, LEN (SerialNumber) - CHARINDEX ('|', SerialNumber)) AS Rest
    FROM Data
    UNION ALL
    SELECT ID,
    LEFT (Rest, CHARINDEX ('|', Rest) - 1) AS No,
    Position + 1,
    RIGHT(Rest, LEN (Rest) - CHARINDEX ('|', Rest))
    FROM Split
    WHERE LEN (Rest) > 1
    ),
    SplitRes AS
    (
    SELECT *,
    ROW_NUMBER() OVER(PARTITION BY ID ORDER BY Position DESC) AS RowNum
    FROM Split
    )
    SELECT No, COUNT(*) Count
    FROM SplitRes
    WHERE RowNum = 2
    GROUP BY No -- 2 could be a parameter
    OPTION(MAXRECURSION 0);
    GO
    -- jonathan roberts method
    DECLARE @X VARCHAR(500)
    SELECT REVERSE(X.value) ValueBeforeLastPipe, COUNT(*) Count
    FROM #TMP t
    CROSS APPLY dbo.STRING_SPLIT(REVERSE(t.SerialNumber),'|') X
    WHERE X.ordinal = 2
    GROUP BY X.value
    GO
    -- ScottPletcher method
    select ValueBeforeLastPipe, COUNT(*) Count
    from #tmp
    cross apply (select charindex('|', reverse(serialnumber)) as last_bar) as ca1
    cross apply (select charindex('|', reverse(left(serialnumber, len(serialnumber)-last_bar))) as next_to_last_bar) as ca2
    cross apply (select case when last_bar = 0 or next_to_last_bar = 0 then ''
    else SUBSTRING(serialnumber, LEN(serialnumber) - last_bar - next_to_last_bar + 2, next_to_last_bar - 1) end as ValueBeforeLastPipe) as ca3
    GROUP BY ValueBeforeLastPipe;
    GO
    set statistics time off

    Time taken for each query on my machine:

    carsten.saastamoinen 176 seconds

    Jonathan Roberts: 8 seconds

    ScottPletcher: 3 seconds

    So Scott's solution is about 60 times faster than carsten.saastamoinen and about 3 times as fast as Jonathan Roberts solution.

  • I will stop here! The focus on which solutions to use is the big difference. My focus will always be a correct and a general solutions. Unless performance is important, because there are a lot of rows and executions happen many times each hour, I will create a solution with CROSS APPLY. But then the overall choice of solution is wrong. The solution should be changed and the use of CROSS APPLY is removed. A table should be created into which the values are inserted when inserting the strings with the many values.

    If there are few rows, the focus should be on maintenance. The examples have shown that incorrect solutions are proposed because focus is performance. But what about using of tempdb. In order for this to be included in the considerations, there must be knowledge of the maximum length of VARCHAR columns in relation to actual content. The calculated need for memory is half of the max length of variable length columns. If the length of the value exceeds this estimated length, tempdb is used. If the length is significantly less, it puts too much strain on the system. This is all about  memory grant. Etc. etc.  etc...

    For me, the following priority is important and my priority - so we will never agree!

    1. The statement returns the correct result with the current data.

    2. The statement gives the correct result tomorrow. Evaluate with data that the definition allows inserted into the tables but does not currently exist.

    3. The statement is maintenance friendly.

    4. The performance of the statement is acceptable.

    Happy Christmas! I will travel to Paris!!!

  • carsten.saastamoinen wrote:

    For me, the following priority is important and my priority - so we will never agree! 1. The statement returns the correct result with the current data. 2. The statement gives the correct result tomorrow. Evaluate with data that the definition allows inserted into the tables but does not currently exist. 3. The statement is maintenance friendly. 4. The performance of the statement is acceptable.

    Happy Christmas! I will travel to Paris!!!

    But Scott's solution is only 6 lines of code, yours is about 25 lines, so Scott's is arguably a lot more maintainable than yours. It also runs 60 times faster and to my knowledge they all return the correct result.

  • And if you want the seven last?

  • carsten.saastamoinen wrote:

    And if you want the seven last?

    And if your server is performing like a dog because you have implemented the most inefficient code you can think of?

  • No, a solution like this is hopefully only on few rows - as mentioned earlier, else change the way to solve the problem. And if my solution runs in 10 seconds - what is the problem? And hopefully only one or few times a day! The CROSS APPLY have a statement for the last, another for the second last, another for the third last and one for the first. Hopefully not any changes in the data!!!! If the data is not selected at insert time, it must be because the other values are used in other solution!

  • carsten.saastamoinen wrote:

    No, a solution like this is hopefully only on few rows - as mentioned earlier, else change the way to solve the problem. And if my solution runs in 10 seconds - what is the problem? And hopefully only one or few times a day! The CROSS APPLY have a statement for the last, another for the second last, another for the third last and one for the first. Hopefully not any changes in the data!!!! If the data is not selected at insert time, it must be because the other values are used in other solution!

    Actually my code was written explicitly to make it handle data changes well while also being efficient.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • carsten.saastamoinen wrote:

    No, a solution like this is hopefully only on few rows - as mentioned earlier, else change the way to solve the problem. And if my solution runs in 10 seconds - what is the problem? And hopefully only one or few times a day! The CROSS APPLY have a statement for the last, another for the second last, another for the third last and one for the first. Hopefully not any changes in the data!!!! If the data is not selected at insert time, it must be because the other values are used in other solution!

    I don't know how you know that this is only for a few rows, will only need to be run a few times a day and might later need to be changed to extract from a different position within the string?

  • Hopefully 10 seconds few times a day is a problem on you server!!!!

  • Else the whole solution is wrong!

  • carsten.saastamoinen wrote:

    Hopefully 10 seconds few times a day is a problem on you server!!!!

    It's not... but the attitude that will cause everything to be written in such a fashion, or worse,  is. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I have seen - and being removed - because I told

    • that DISTINCT do not order data - basic understanding.
    • that it is not better to write COUNT (primary key) or  COUNT (1) instead of COUNT (*). All ends up with the same execution plan.
    • that FOR triggers is not executed before AFTER trigger - it's is just syntax
    • that the right way of counting rows in a table is not to select this information from the system tables
    • that the right position of the Cluster Key is not necessary last
    • that it is not faster to use a filtered index, because it is smaller - yes smaller, but the same information use the same number of pages - filtered or not.
    • that it is not the cardinality that decide the order of the columns in a multicolumn index, but the use and which other index exists.
    •  that LEFT JOIN is not always the right join type
    • that all rules should not be evaluated in a program - the optimizer did not know and are able to use the rule. Data is selected many times for each insert or update. It's again performance that is the reason - but a wrong reason, because data is inserted ones and used many times. And all use the same rules - quality.
    • that floating point data type is not the right data type even it is faster to calculate - and not precisely
    • that it is not necessary to consider the length of a VARCHAR - wrong, important for data quality and performance
    • that it is not better using a value instead of NULL Even version 2022 have new functions where NULL is important for a right solution.
    • that we should not use a function to wash data when inserting, because use of a function do not execute in parallel. One customer, one .. in parallel. Quality before performance or ....? Even that I could use a staging area for washing the data and do it in parallel - one row! Because there is always an alternative for functions - yes but slower and ... All programs is not testing for 2 blanks in the middle, convert an empty string to NULL being sure that there is a value and not just an empty string, ...
    • ...

    Attitude? It is OK for me to be removed from different threads because I not agree - yes, a positive attitude but a bad attitude only having one solution and call others many different things  - it has happened for me many times - because only one solution can be used - never.   And I do not imply that using tally as the only solution is bad, just showing other solutions. All possibilities in the language can be  used and all problems are not solved the same way!!!!! And performance is not always the only goal, but quality is for me always vary important!!!!

  • This was removed by the editor as SPAM

  • Fascinating discussions, but what was "wrong" with my solution? I heard stuff like it is not usable except on SQL 2022, but this IS the SQL 2022 forum, is it not? So why not use the 3rd parameter, and keep the other solutions down in the pre-2022 forums, where hopefully people not using 2022 are looking anyway. I'd think we'd want to learn and use 2022's new features/advantages and share them in this forum.


    Have Fun!
    Ronzo

  • Ronzo wrote:

    Fascinating discussions, but what was "wrong" with my solution? I heard stuff like it is not usable except on SQL 2022, but this IS the SQL 2022 forum, is it not? So why not use the 3rd parameter, and keep the other solutions down in the pre-2022 forums, where hopefully people not using 2022 are looking anyway. I'd think we'd want to learn and use 2022's new features/advantages and share them in this forum.

    I couldn't test it as I don't have SQL 2022 installed. Your solution looks fine though.

Viewing 15 posts - 31 through 45 (of 46 total)

You must be logged in to reply to this topic. Login to reply