February 10, 2021 at 5:27 pm
Hi all,
I could use some assistance with this one as I can't get anything working correctly for it at the moment.
I'm having to move some HTML formatted data between two applications and they have a few differing quirks regarding how they handle HTML formatting.
The specific quirk that is causing me issues is how the applications handle the bold html tags <b> and </b> differently.
The source application allows a line break (<br/>) between <b> and </b> tags and will treat everything until to closing </b> tag as being bold
The new application requires that the bold tag is closed off before the line break and opened again afterwards
So the existing data might look like this:
<b>This text is bold</b> This text is not <b> This text is bold<br/>as is this text</b>
Where the new application will require it to look like this to display correctly:
<b>This text is bold</b> This text is not <b> This text is bold</b><br/><b>as is this text</b>
I've attempted to solve this by finding out the position of the first opening <b> tag, then getting the position of the next closing </b> tag and using these variables to run a replace on the text contained within them:
SET @HTMLText = REPLACE(@HTMLText, SUBSTRING(@HTMLText, @Start, @Chars), REPLACE(SUBSTRING(@HTMLText, @Start, @Chars), '<br/>', '</b><br/><b>'))
I then update the start and end variables to find the next opening <b> tag and loop through the rest of the string, which can be a fairly large chunk of text. However, what I've come up with isn't working properly.
Has anyone got an idea for a good solution to this one?
Thanks for taking the time to read this.
Paul
February 10, 2021 at 6:24 pm
Could do this with recursion, from right to left, but for now I'll just do a simple cursor loop. Hopefully that will perform well enough for what you need. Naturally you'll need to additional data samples to fully test out its logic.
IF OBJECT_ID('tempdb.dbo.#data') IS NOT NULL
DROP TABLE #data;
CREATE TABLE #data (
id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
string nvarchar(max) NULL
);
INSERT INTO #data VALUES
('<b>This text is bold</b> This text is not <b> This text is bold<br/>as is this text</b>')
DECLARE @byte_b_start int
DECLARE @byte_b_end int
DECLARE @byte_br_start int
DECLARE @id int
DECLARE @string nvarchar(max)
DECLARE @string_was_changed bit
DECLARE @stuff_length int
DECLARE cursor_strings CURSOR LOCAL FAST_FORWARD FOR
SELECT id, string
FROM #data
WHERE string LIKE '%<b>%<br%</b%'
SELECT 'before', id, string
FROM #data
WHERE string LIKE '%<b>%<br%</b%'
OPEN cursor_strings
WHILE 1 = 1
BEGIN
FETCH NEXT FROM cursor_strings INTO @id, @string
IF @@FETCH_STATUS <> 0
IF @@FETCH_STATUS = -2
CONTINUE
ELSE
BREAK;
SET @byte_b_start = 1
SET @string_was_changed = 0
WHILE 2 = 2
BEGIN
--SELECT 'before', @string
SET @byte_b_start = CHARINDEX('<b>', @string, @byte_b_start)
IF @byte_b_start = 0
BREAK;
SET @byte_b_end = CHARINDEX('</b>', @string, @byte_b_start + 4)
SET @byte_br_start = CHARINDEX('<br', @string, @byte_b_start + 4)
IF @byte_br_start = 0
BREAK;
IF @byte_b_start > 0 AND @byte_br_start < @byte_b_end
BEGIN
IF SUBSTRING(@string, @byte_br_start + 3, 1) = '/'
SET @stuff_length = 5
ELSE
SET @stuff_length = 4
SET @string = STUFF(@string, @byte_br_start, @stuff_length,
'</b>' + SUBSTRING(@string, @byte_br_start, @stuff_length) + '<b>')
SET @string_was_changed = 1
SET @byte_b_start = @byte_b_end + 12
END /*IF*/
ELSE
BEGIN
SET @byte_b_start = @byte_b_end + 3
END /*ELSE*/
--SELECT 'after_', @string, @byte_b_start, @byte_b_end, @byte_br_start
END /*WHILE 2*/
IF @string_was_changed > 0
UPDATE #data
SET string = @string
WHERE id = @id
END /*WHILE 1*/
DEALLOCATE cursor_strings
SELECT 'after_', id, string
FROM #data
WHERE string LIKE '%<b>%<br%</b%'
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
February 10, 2021 at 6:43 pm
Hi Scott,
Thanks for the quick reply - I'm just finishing for the day and I'll pick this up in the morning, at first glance it looks promising.
Paul
February 10, 2021 at 7:27 pm
Could do this with recursion, from right to left, but for now I'll just do a simple cursor loop. Hopefully that will perform well enough for what you need. Naturally you'll need to additional data samples to fully test out its logic.
IF OBJECT_ID('tempdb.dbo.#data') IS NOT NULL
DROP TABLE #data;
CREATE TABLE #data (
id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
string nvarchar(max) NULL
);
INSERT INTO #data VALUES
('<b>This text is bold</b> This text is not <b> This text is bold<br/>as is this text</b>')
DECLARE @byte_b_start int
DECLARE @byte_b_end int
DECLARE @byte_br_start int
DECLARE @id int
DECLARE @string nvarchar(max)
DECLARE @string_was_changed bit
DECLARE @stuff_length int
DECLARE cursor_strings CURSOR LOCAL FAST_FORWARD FOR
SELECT id, string
FROM #data
WHERE string LIKE '%<b>%<br%</b%'
SELECT 'before', id, string
FROM #data
WHERE string LIKE '%<b>%<br%</b%'
OPEN cursor_strings
WHILE 1 = 1
BEGIN
FETCH NEXT FROM cursor_strings INTO @id, @string
IF @@FETCH_STATUS <> 0
IF @@FETCH_STATUS = -2
CONTINUE
ELSE
BREAK;
SET @byte_b_start = 1
SET @string_was_changed = 0
WHILE 2 = 2
BEGIN
--SELECT 'before', @string
SET @byte_b_start = CHARINDEX('<b>', @string, @byte_b_start)
IF @byte_b_start = 0
BREAK;
SET @byte_b_end = CHARINDEX('</b>', @string, @byte_b_start + 4)
SET @byte_br_start = CHARINDEX('<br', @string, @byte_b_start + 4)
IF @byte_br_start = 0
BREAK;
IF @byte_b_start > 0 AND @byte_br_start < @byte_b_end
BEGIN
IF SUBSTRING(@string, @byte_br_start + 3, 1) = '/'
SET @stuff_length = 5
ELSE
SET @stuff_length = 4
SET @string = STUFF(@string, @byte_br_start, @stuff_length,
'</b>' + SUBSTRING(@string, @byte_br_start, @stuff_length) + '<b>')
SET @string_was_changed = 1
SET @byte_b_start = @byte_b_end + 12
END /*IF*/
ELSE
BEGIN
SET @byte_b_start = @byte_b_end + 3
END /*ELSE*/
--SELECT 'after_', @string, @byte_b_start, @byte_b_end, @byte_br_start
END /*WHILE 2*/
IF @string_was_changed > 0
UPDATE #data
SET string = @string
WHERE id = @id
END /*WHILE 1*/
DEALLOCATE cursor_strings
SELECT 'after_', id, string
FROM #data
WHERE string LIKE '%<b>%<br%</b%'
Hi Scott,
I couldn't resist trying it - The only thing it doesn't look to cater for is if there are more than a single <br/> tag in-between a single set of <b> </b> tags
So for row 2:
INSERT INTO #data VALUES
('<b>This text is bold</b> This text is not <b> This text is bold<br/>as is this text</b>'),
('<b>This text is bold<br/>this text is also bold as is this text<br/>and this text is still bold as well</b>This text is not bold.<b>This text is bold<br/> as is this text</b>')
you end up with the following:
<b>This text is bold</b><br/><b> this text is also bold as is this text<br/>and this text is still bold as well</b>This text is not bold.<b>This text is bold</b><br/><b>as is this text</b>
Obviously you can just give the cursor another pass over the data, but I'm sure there is a more elegant solution.
I'm really signing off now.
Thanks again
Paul
February 10, 2021 at 8:24 pm
I think this is right. Naturally you'll need to confirm for yourself, as I don't have any additional test data.
IF OBJECT_ID('tempdb.dbo.#data') IS NOT NULL
DROP TABLE #data;
CREATE TABLE #data (
id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
string nvarchar(max) NULL
);
INSERT INTO #data VALUES
('<b>This text is bold</b> This text is not <b> This text is bold<br/>as is this text</b>'),
('<b>This text is bold<br/>this text is also bold as is this text<br/>and this text is still bold as well</b>This text is not bold.<b>This text is bold<br/> as is this text</b>')
DECLARE @byte_b_start int
DECLARE @byte_b_end int
DECLARE @byte_b_offset int
DECLARE @byte_br_start int
DECLARE @id int
DECLARE @string nvarchar(max)
DECLARE @string_was_changed bit
DECLARE @stuff_length int
DECLARE cursor_strings CURSOR LOCAL FAST_FORWARD FOR
SELECT id, string
FROM #data
WHERE string LIKE '%<b>%<br%</b%'
SELECT 'before', id, string
FROM #data
WHERE string LIKE '%<b>%<br%</b%'
OPEN cursor_strings
WHILE 1 = 1
BEGIN
FETCH NEXT FROM cursor_strings INTO @id, @string
IF @@FETCH_STATUS <> 0
IF @@FETCH_STATUS = -2
CONTINUE
ELSE
BREAK;
SET @byte_b_start = 1
SET @byte_b_offset = 4
SET @string_was_changed = 0
WHILE 2 = 2
BEGIN
--SELECT 'before', @string
SET @byte_b_start = CHARINDEX('<b>', @string, @byte_b_start)
IF @byte_b_start = 0
BREAK;
SET @byte_b_end = CHARINDEX('</b>', @string, @byte_b_start + @byte_b_offset)
WHILE 3 = 3
BEGIN
SET @byte_br_start = CHARINDEX('<br', @string, @byte_b_start + @byte_b_offset)
IF @byte_br_start = 0
BREAK;
IF @byte_br_start < @byte_b_end
BEGIN
IF SUBSTRING(@string, @byte_br_start + 3, 1) = '/'
SET @stuff_length = 5
ELSE
SET @stuff_length = 4
SET @string = STUFF(@string, @byte_br_start, @stuff_length,
'</b>' + SUBSTRING(@string, @byte_br_start, @stuff_length) + '<b>')
SET @byte_b_offset = @byte_b_offset + 7
SET @byte_b_start = @byte_br_start + 12
SET @string_was_changed = 1
END /*IF*/
ELSE
BREAK;
END /*WHILE 3*/
SET @byte_b_start = @byte_b_end + @byte_b_offset
--SELECT 'after_', @string, @byte_b_start, @byte_b_end, @byte_br_start
END /*WHILE 2*/
IF @string_was_changed > 0
UPDATE #data
SET string = @string
WHERE id = @id
END /*WHILE 1*/
DEALLOCATE cursor_strings
SELECT 'after_', id, string
FROM #data
WHERE string LIKE '%<b>%<br%</b%'
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
February 11, 2021 at 1:00 pm
I think this is right. Naturally you'll need to confirm for yourself, as I don't have any additional test data.
IF OBJECT_ID('tempdb.dbo.#data') IS NOT NULL
DROP TABLE #data;
CREATE TABLE #data (
id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
string nvarchar(max) NULL
);
INSERT INTO #data VALUES
('<b>This text is bold</b> This text is not <b> This text is bold<br/>as is this text</b>'),
('<b>This text is bold<br/>this text is also bold as is this text<br/>and this text is still bold as well</b>This text is not bold.<b>This text is bold<br/> as is this text</b>')
DECLARE @byte_b_start int
DECLARE @byte_b_end int
DECLARE @byte_b_offset int
DECLARE @byte_br_start int
DECLARE @id int
DECLARE @string nvarchar(max)
DECLARE @string_was_changed bit
DECLARE @stuff_length int
DECLARE cursor_strings CURSOR LOCAL FAST_FORWARD FOR
SELECT id, string
FROM #data
WHERE string LIKE '%<b>%<br%</b%'
SELECT 'before', id, string
FROM #data
WHERE string LIKE '%<b>%<br%</b%'
OPEN cursor_strings
WHILE 1 = 1
BEGIN
FETCH NEXT FROM cursor_strings INTO @id, @string
IF @@FETCH_STATUS <> 0
IF @@FETCH_STATUS = -2
CONTINUE
ELSE
BREAK;
SET @byte_b_start = 1
SET @byte_b_offset = 4
SET @string_was_changed = 0
WHILE 2 = 2
BEGIN
--SELECT 'before', @string
SET @byte_b_start = CHARINDEX('<b>', @string, @byte_b_start)
IF @byte_b_start = 0
BREAK;
SET @byte_b_end = CHARINDEX('</b>', @string, @byte_b_start + @byte_b_offset)
WHILE 3 = 3
BEGIN
SET @byte_br_start = CHARINDEX('<br', @string, @byte_b_start + @byte_b_offset)
IF @byte_br_start = 0
BREAK;
IF @byte_br_start < @byte_b_end
BEGIN
IF SUBSTRING(@string, @byte_br_start + 3, 1) = '/'
SET @stuff_length = 5
ELSE
SET @stuff_length = 4
SET @string = STUFF(@string, @byte_br_start, @stuff_length,
'</b>' + SUBSTRING(@string, @byte_br_start, @stuff_length) + '<b>')
SET @byte_b_offset = @byte_b_offset + 7
SET @byte_b_start = @byte_br_start + 12
SET @string_was_changed = 1
END /*IF*/
ELSE
BREAK;
END /*WHILE 3*/
SET @byte_b_start = @byte_b_end + @byte_b_offset
--SELECT 'after_', @string, @byte_b_start, @byte_b_end, @byte_br_start
END /*WHILE 2*/
IF @string_was_changed > 0
UPDATE #data
SET string = @string
WHERE id = @id
END /*WHILE 1*/
DEALLOCATE cursor_strings
SELECT 'after_', id, string
FROM #data
WHERE string LIKE '%<b>%<br%</b%'
Hi Scott,
The updated cursor does work as expected with the very limited test data I gave you, however when running it against some real data it ends up falling into an endless loop on certain records (the original version does not suffer from this).
I've tracked down one of the offending records and anonymised the data, so the following will never return a result from the second version of the cursor.
IF OBJECT_ID('tempdb.dbo.#data') IS NOT NULL
DROP TABLE #data;
CREATE TABLE #data (
id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
string nvarchar(max) NULL
);
INSERT INTO #data VALUES
('<b>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/> <br/>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text''s<br/> <br/>Test text Test text Test text Test text Test text Test text Test text Test text <br/>--------------------------------------------------------------<br/>Test text Test text <br/> <br/>Test text Test text - Test text Test text </b>- Test text Test text <br/><br/><b>Test text Test text </b>Test text Test text Test text Test text Test text Test text Test text Test text .<br/><br/><b>Test text Test text </b> Test text Test text <br/><br/><b>Test text Test text </b>. <br/><br/><b>Test text Test text </b><br/><br/><b>Test text Test text </b>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/><br/><b>Test text Test text </b>) - Test text Test text <br/><br/><b>10/01/14 Test text Test text </b> - Test text Test text <br/>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/><br/>18/12Test text Test text Test text Test text <br/>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/>Test text Test text Test text Test text <br/> <br/>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/> <br/>Test text Test text <br/><br/>12/02Test text Test text <br/>Test text Test text <br/>Test text Test text Test text Test text <br/>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/><br/>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/><br/>16/03/2016 Test text Test text Test text Test text Test text Test text Test text Test text <br/><br/>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/><br/>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/><br/>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/> <br/><br/>Test text Test text Test text Test text Test text Test text Test text Test text <br/> <br/>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/> <br/><br/> <br/><b><u>Test text Test text </u></b><br/>Test text Test text Test text Test text Test text Test text <br/>Test text Test text Test text Test text Test text Test text Test text Test text <br/>Test text Test text Test text Test text Test text Test text Test text Test text <br/>Test text Test text Test text Test text Test text Test text Test text Test text <br/>Test text Test text Test text Test text Test text Test text Test text Test text <br/>Test text Test text Test text Test text <br/>Test text Test text Test text Test text Test text Test text <br/>Test text Test text Test text Test text <br/><br/>')
I've not worked out what part is causing the issue yet...
Thanks
Paul
February 11, 2021 at 4:06 pm
...hope you can work it out.
You really need to provide truly representative sample data if you want a full solution.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
February 11, 2021 at 10:33 pm
I did this exercise a few years ago. There were many, many combinations that caused problems, plus we had cases where the html markups were quoted in the text and had to be excluded. It did seem easy up front, but applying solutions to real data just kept throwing issues.
Good luck!
February 12, 2021 at 11:12 am
...hope you can work it out.
You really need to provide truly representative sample data if you want a full solution.
With the sample data thing, with this one it's incredibly difficult to do - We are talking about nearly half a million user input rows, with an average of 25,500 characters each and there is no consistent rhyme or reason for how they have decided to apply formatting.
Thanks again for your help, it's put me on the path to getting this last issue sorted.
Paul
February 12, 2021 at 11:13 am
I did this exercise a few years ago. There were many, many combinations that caused problems, plus we had cases where the html markups were quoted in the text and had to be excluded. It did seem easy up front, but applying solutions to real data just kept throwing issues.
Good luck!
I feel your pain 🙂
It's been an interesting one for sure but I'm just about there now I think...
February 12, 2021 at 4:35 pm
ScottPletcher wrote:...hope you can work it out.
You really need to provide truly representative sample data if you want a full solution.
With the sample data thing, with this one it's incredibly difficult to do - We are talking about nearly half a million user input rows, with an average of 25,500 characters each and there is no consistent rhyme or reason for how they have decided to apply formatting.
Thanks again for your help, it's put me on the path to getting this last issue sorted.
Paul
Given that it's hard for you to provide sample data, when you already have half a million rows to choose from, imagine how difficult it is for me, when I have no sample data rows available.
I would think finding the rows with the most <br/> tags in them would have had a good chance of finding the more difficult rows to do, viz.: SELECT TOP (5) * FROM .... ORDER BY LEN(string) - LEN(REPLACE(string, '<br/>', '')) DESC
Again, good luck with this, I hope you work it all out successfully.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
Viewing 11 posts - 1 through 10 (of 10 total)
You must be logged in to reply to this topic. Login to reply