Finding the longest string within a string field

  • Hi,

    This might sounds a bit odd but here goes, thanks for your help in advance. We have some URLs within a bulk block of text some of which are very long. I need to identify rows where such urls exceed say 100 characters in length in amongst other text.

    So the rule would be return a record if within the string there is a string (without spaces) longer than 100 characters.

    Hopefully this is not too confusing and thank you for your help in advance.

    Oliver

  • You may have to use combination of charindex() and len() function to determine this. If you could please provide the sample set then we can give it a go.

  • Hi,

    Please see attached as an example.

    Many Thanks,

    Oliver

    CREATE TABLE [dbo].[A_TABLE](

    [ID] [int] NOT NULL,

    [column1] [nvarchar](max) NULL,

    CONSTRAINT [PK_A_TABLE] PRIMARY KEY CLUSTERED

    (

    [ID] ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    ) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

    GO

    INSERT [dbo].[A_TABLE] ([ID], [column1]) VALUES (1, N'asb unliunliuniluniuniuiluniuniuneiwudbeuwybduweybduweybduywebdukybwedybweuybdweyubedukywbed nnnn unuuuu jjjjj ')

    GO

    INSERT [dbo].[A_TABLE] ([ID], [column1]) VALUES (2, N'eirfreuu erfiuneriufnerufn iurnfliuenfliuwenfiu unliufrenwuifnewuinfliuewnrfiuwruiefniuernfilurneliunwflriunwliufnrwliunfrliuwneflirunwef rufnwerun')

    GO

    INSERT [dbo].[A_TABLE] ([ID], [column1]) VALUES (3, N'ireunfiluwernfiulrenwliufrnweiurnluewniuwnliueniuenfrliunewuifniuewnrieureuinflriunwliuerfn nn urr urrrrrr')

    GO

    INSERT [dbo].[A_TABLE] ([ID], [column1]) VALUES (4, N'frewunfilurewn erifunliuwernfliruwenf')

    GO

    INSERT [dbo].[A_TABLE] ([ID], [column1]) VALUES (5, N' reufniruenfrluienw oeriunfirewifrweionfriweo;ifernfoiwrneoifnweo;inrfioweno;irnfweiornioen;oiewnoifrn;irenniefvjhdbscvuybsd')

    GO

  • Based on that sample data. What should be the output?

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Thanks for the reply. The output I would like is to return the ID for the rows which have a continous character string longer than say 60 within the string, so for example it would return rows:

    1

    2

    3

    5

    Row 4 does not have and continous characters strings longer that 60 characters so therefore is excluded.

    Many Thanks,

    Oliver

  • If I understand correctly, you want to check if there's a continuous character string in the column, it doesn't matter the position of it.

    You could do this with a string delimiter and checking each item. The best T-SQL option available is explained in here: http://www.sqlservercentral.com/articles/Tally+Table/72993/

    And I'm using it in the following query.

    DECLARE @Length int = 60

    SELECT DISTINCT ID

    FROM dbo.A_TABLE

    CROSS APPLY dbo.DelimitedSplit8K( column1, ' ')

    WHERE LEN(Item) > @Length

    If you have any questions, feel free to ask.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Many Thanks this works perfectly. I really appreciate the help that splitter is going to be so useful going forward for other tasks. It seems that cross apply seems to be the answer to all my questions at the moment!

    Thanks again

  • Just a quick thought;

    Splitting the string is somewhat like crossing the stream for a sip of water, when all that is needed is to get the position of the spaces/delimiters.

    Here is a suggestion:

    USE tempdb;

    GO

    DECLARE @LENGTH_LIMIT INT = 50;

    DECLARE @DELIM SMALLINT = UNICODE(N' ');

    SELECT

    XX.ID

    FROM

    (

    SELECT

    ROW_NUMBER() OVER

    (

    PARTITION BY X.ID

    ORDER BY (SELECT NULL)

    ) AS PRT_RID

    ,X.ID

    FROM

    (

    SELECT

    PS.ID

    ,LEAD(PS.N,1) OVER

    (

    PARTITION BY PS.ID

    ORDER BY PS.ID, PS.N

    ) - PS.N AS ST_LEN

    FROM

    (

    SELECT TA.ID,0 AS N,@DELIM AS CH_CODE

    FROM dbo.A_TABLE TA

    UNION ALL

    SELECT

    TA.ID

    ,NN.N

    ,UNICODE(SUBSTRING(TA.column1,NN.N,1)) AS CH_CODE

    FROM dbo.A_TABLE TA

    CROSS APPLY

    (

    SELECT

    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

    FROM

    (SELECT NM.N FROM ( VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1) ) AS NM(N)) AS N1

    ,(SELECT NM.N FROM ( VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1) ) AS NM(N)) AS N2

    ,(SELECT NM.N FROM ( VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1) ) AS NM(N)) AS N3

    ,(SELECT NM.N FROM ( VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1) ) AS NM(N)) AS N4

    ,(SELECT NM.N FROM ( VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1) ) AS NM(N)) AS N5

    ) AS NN

    WHERE NN.N <= LEN(TA.column1)

    UNION ALL

    SELECT TA.ID,LEN(TA.column1),@DELIM

    FROM dbo.A_TABLE TA

    ) AS PS

    WHERE PS.CH_CODE = @DELIM

    ) AS X

    WHERE X.ST_LEN > @LENGTH_LIMIT

    ) AS XX

    WHERE XX.PRT_RID = 1;

  • Many Thanks will give this a go.

    Thanks for the help.

  • Eirikur Eiriksson (3/21/2014)


    Just a quick thought;

    Splitting the string is somewhat like crossing the stream for a sip of water, when all that is needed is to get the position of the spaces/delimiters.

    That seems valid, but you're scanning the table trice. Would that perform better than the splitter?

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • I can't test this right now because I don't have a 2012 instance available, but it uses Eirikur's idea.

    WITH e1(N) AS(

    SELECT N FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0))e(N)

    ),

    e2(N) AS(

    SELECT a.N FROM e1 a, e1 b

    ),

    e4(N) AS(

    SELECT a.N FROM e2 a, e2 b

    ),

    Spaces AS(

    SELECT ID, N

    FROM #A_TABLE a

    CROSS APPLY(

    SELECT TOP(LEN(column1) + 2) ROW_NUMBER() OVER(ORDER BY(SELECT NULL)) - 1 N

    FROM e4

    ) t

    WHERE SUBSTRING( column1 + ' ', N, 1) = ' '

    ),

    Lengths AS(

    SELECT ID, LEAD(N,1) OVER(PARTITION BY ID ORDER BY N) - N AS StringLength

    FROM Spaces

    )

    SELECT ID

    FROM Lengths

    WHERE StringLength > @Length

    GROUP BY ID

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (3/21/2014)


    Eirikur Eiriksson (3/21/2014)


    Just a quick thought;

    Splitting the string is somewhat like crossing the stream for a sip of water, when all that is needed is to get the position of the spaces/delimiters.

    That seems valid, but you're scanning the table trice. Would that perform better than the splitter?

    The reason is that this is the same work as the splitter does, minus the string manipulation. The three scans are a result of the union clauses, the first and the last scans only bring back the identity value.

    I made a mistake in my earlier code as I forgot to limit the number sequence generation to the length of the string. Something that is more than 99 percent of the cost. A correct code is:

    DECLARE @LENGTH_LIMIT INT = 60;

    DECLARE @DELIM SMALLINT = UNICODE(N' ');

    SELECT

    XX.ID

    FROM

    (

    SELECT

    ROW_NUMBER() OVER

    (

    PARTITION BY X.ID

    ORDER BY (SELECT NULL)

    ) AS PRT_RID

    ,X.ID

    FROM

    (

    SELECT

    PS.ID

    ,LEAD(PS.N,1) OVER

    (

    PARTITION BY PS.ID

    ORDER BY PS.ID, PS.N

    ) - PS.N AS ST_LEN

    FROM

    (

    SELECT TA.ID,0 AS N,@DELIM AS CH_CODE

    FROM dbo.A_TABLE TA

    UNION ALL

    SELECT

    TA.ID

    ,NN.N

    ,UNICODE(SUBSTRING(TA.column1,NN.N,1)) AS CH_CODE

    FROM dbo.A_TABLE TA

    CROSS APPLY

    (

    SELECT

    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

    FROM

    (SELECT NM.N FROM ( VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1) ) AS NM(N)) AS N1

    ,(SELECT NM.N FROM ( VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1) ) AS NM(N)) AS N2

    ,(SELECT NM.N FROM ( VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1) ) AS NM(N)) AS N3

    ,(SELECT NM.N FROM ( VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1) ) AS NM(N)) AS N4

    ,(SELECT NM.N FROM ( VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1) ) AS NM(N)) AS N5

    ORDER BY (SELECT NULL) OFFSET 0 ROWS FETCH FIRST (LEN(TA.column1)) ROWS ONLY

    ) AS NN

    WHERE NN.N <= LEN(TA.column1)

    UNION ALL

    SELECT TA.ID,LEN(TA.column1),@DELIM

    FROM dbo.A_TABLE TA

    ) AS PS

    WHERE PS.CH_CODE = @DELIM

    ) AS X

    WHERE X.ST_LEN > @LENGTH_LIMIT

    ) AS XX

    WHERE XX.PRT_RID = 1;

  • how about a recursive idea

    WITH cte AS (

    SELECT a.id, a.column1, CAST(0 AS BIGINT) Base, CHARINDEX(' ', a.column1, 0) AS MID

    FROM a_table a

    UNION ALL

    SELECT c.id, c.column1, CHARINDEX(' ', a.column1, c.Mid) AS Base,

    CASE ---- To pick up the last segment

    WHEN CHARINDEX(' ', a.column1, c.Mid+1)>0 THEN CHARINDEX(' ', a.column1, c.Mid+1)

    ELSE LEN(A.column1)

    END AS MID

    FROM a_table a

    INNER JOIN cte c

    ON c.id = a.ID

    WHERE CHARINDEX(' ', a.column1, c.Mid+1)>=0 AND mid<LEN(a.column1) -- Tell the recursive part that it reach the end of the string

    )

    SELECT *

    FROM cte

    WHERE Mid - base >60

    ORDER BY id

    if you want it you can even get the word which length is greater than 60

    Every rule in a world of bits and bytes, can be bend or eventually be broken
    MyBlog About Common dialog control
    A Visualizer for viewing SqlCommand object script [/url]

  • thava (3/24/2014)


    how about a recursive idea

    It's nice, but the DelimitedSplit8K version will work in half the time :-). You might not notice it with a few rows but it becomes clear with larger datasets.

    I still need to test it against the 2012 options. But I don't have that version available.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • And after some testing on SQL Server 2012, the DelimitedSplit8K still is the fastest solution posted in this thread. However, I haven't tested the new version presented by Eirikur in his article: Reaping the benefits of the Window functions in T-SQL[/url]

    I haven't read it in detail but when I do it I might come back 🙂

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply