March 14, 2017 at 10:49 am
Hello all,
We have a field in a table that's a string-valued field. In this field are (mainly) what I'll call 'tags'. These tags are pre-defined by our application, and there are quite a few. They can be arranged in any order in the field, they can use one or all or anywhere in between. [example: you have 3 tags called <remarks>,<boolresult>,<name>. this field can have '<boolresult><remarks>', '<name><boolresult>', '<remarks>', etc] The field can also include other characters such as colons, commas, etc.
We've found that some of these tags are misspelled, thus breaking the application. The problem is that it's not feasible to have people sift through them all manually as there are thousands upon thousands of records.
Is there a way to 'spell check' this field against what is known to be correct? I personally can't think of any, except maybe using a LIKE in a where clause with just about every single possible incorrect spelling, order of tags, and combination (which isn't feasible, either). Am I crazy to think there may be a way to do this?
Thanks!
March 14, 2017 at 10:55 am
scarr030 - Tuesday, March 14, 2017 10:49 AMHello all,We have a field in a table that's a string-valued field. In this field are (mainly) what I'll call 'tags'. These tags are pre-defined by our application, and there are quite a few. They can be arranged in any order in the field, they can use one or all or anywhere in between. [example: you have 3 tags called <remarks>,<boolresult>,<name>. this field can have '<boolresult><remarks>', '<name><boolresult>', '<remarks>', etc] The field can also include other characters such as colons, commas, etc.
We've found that some of these tags are misspelled, thus breaking the application. The problem is that it's not feasible to have people sift through them all manually as there are thousands upon thousands of records.
Is there a way to 'spell check' this field against what is known to be correct? I personally can't think of any, except maybe using a LIKE in a where clause with just about every single possible incorrect spelling, order of tags, and combination (which isn't feasible, either). Am I crazy to think there may be a way to do this?
Thanks!
This might be possible, but I would need DDL, sample data and expected results to give an answer. Check my signature on guidance on how to post this correctly.
March 14, 2017 at 11:02 am
Heh... spell checking tags. There's another good reason to avoid XML if I ever heard one. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
March 14, 2017 at 11:27 am
There is, however, a way to do this. I forget the exact code but if you open up XML as an XML Document, you can easily derive all of the tag names from the XML as a table and compare it to a list of authorized tag names.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 14, 2017 at 11:44 am
Luis Cazares - Tuesday, March 14, 2017 10:55 AMscarr030 - Tuesday, March 14, 2017 10:49 AMHello all,We have a field in a table that's a string-valued field. In this field are (mainly) what I'll call 'tags'. These tags are pre-defined by our application, and there are quite a few. They can be arranged in any order in the field, they can use one or all or anywhere in between. [example: you have 3 tags called <remarks>,<boolresult>,<name>. this field can have '<boolresult><remarks>', '<name><boolresult>', '<remarks>', etc] The field can also include other characters such as colons, commas, etc.
We've found that some of these tags are misspelled, thus breaking the application. The problem is that it's not feasible to have people sift through them all manually as there are thousands upon thousands of records.
Is there a way to 'spell check' this field against what is known to be correct? I personally can't think of any, except maybe using a LIKE in a where clause with just about every single possible incorrect spelling, order of tags, and combination (which isn't feasible, either). Am I crazy to think there may be a way to do this?
Thanks!
This might be possible, but I would need DDL, sample data and expected results to give an answer. Check my signature on guidance on how to post this correctly.
I hope I've provided enough data/info to help. I also hope I'm formatting this properly to help you help me. New to this, so apologies if I've made a mistake. The below is just a small sample of some of the data, which can be much more varied than what I've provided. Note that there are spelling errors such as '<REMAKR>' or '<BOOOLRESULT>'.
IF OBJECT_ID('TempDB..#TESTS','U') IS NOT NULL
DROP TABLE #TESTSCREATE TABLE #TESTS (
SENTENCE VARCHAR(255) NULL
)INSERT INTO #TESTS (SENTENCE)
SELECT ' <BOOLRESULT>. <REMARK>' UNION ALL
SELECT ' <BOOLRESULT> <REMARK>' UNION ALL
SELECT ' <RESULT> <REMARK>' UNION ALL
SELECT ' via <RESULT> route. <REMARK>' UNION ALL
SELECT ' <BOOLRESULT> <REMARK>' UNION ALL
SELECT ' <BOOLRESULT> <REMARK>' UNION ALL
SELECT ' <BOOLRESULT> <REMARK>' UNION ALL
SELECT ' <BOOLRESULT> <REMARK>' UNION ALL
SELECT ' <BOOOLRESULT> <REMARK>' UNION ALL
SELECT ' <BOOLRESULT> <REMARK>' UNION ALL
SELECT ' <BOLRESULT> <REMARK>' UNION ALL
SELECT ' <BOOLRESULT>. <REMARK>.' UNION ALL
SELECT ' <GENDER> <RESULT>. <REMARK>' UNION ALL
SELECT ' <GENDER> has had <RESULT> in the past. <REMARK>.' UNION ALL
SELECT ' <RESULT> <REMAKR>' UNION ALL
SELECT ' <RESULT> <REMARK>' UNION ALL
SELECT ' <RESULT> <REMARK>' UNION ALL
SELECT ' <RESULT> <REARK>' UNION ALL
SELECT ' <RESULT> <REMARK>' UNION ALL
SELECT ' <RESULT> <REMARK>'
March 14, 2017 at 12:17 pm
Jeff Moden - Tuesday, March 14, 2017 11:27 AMThere is, however, a way to do this. I forget the exact code but if you open up XML as an XML Document, you can easily derive all of the tag names from the XML as a table and compare it to a list of authorized tag names.
I'm going to show my ignorance, here....
How is XML involved? Admittedly, I really don't like working with XML in SQL because I'm terrible at it! Really, I just need to read about it and practice it. I've had basically no exposure to it.
March 14, 2017 at 1:56 pm
scarr030 - Tuesday, March 14, 2017 12:17 PMJeff Moden - Tuesday, March 14, 2017 11:27 AMThere is, however, a way to do this. I forget the exact code but if you open up XML as an XML Document, you can easily derive all of the tag names from the XML as a table and compare it to a list of authorized tag names.I'm going to show my ignorance, here....
How is XML involved? Admittedly, I really don't like working with XML in SQL because I'm terrible at it! Really, I just need to read about it and practice it. I've had basically no exposure to it.
I suspect because you referred to them as "tags" and from the formatting of the information (<tag name> is a standard way of representing tags in an XML file)
March 14, 2017 at 2:24 pm
Ah! That makes sense. We can call them labels, if that's less confusing. There is no XML involved, as far as I'm aware. The labels/tags are used to input a value into a form template.
March 14, 2017 at 10:05 pm
First, apologies for thinking it was XML. Jason is correct. I got the word "tags" and the <somestring> format stuck in my head. Shifting gears to your problem...
Try this please...
First, here's a repeat of your test data. I just squared it up for personal use and I added an extra row to demonstrate what happens if more than one label per "sentence" is bad. I also added a "RowNum" column to the table just to make that final example more obvious and to ensure each row had something unique on it.
--===== If the test table already exists, drop it to make reruns in SSMS easier.
IF OBJECT_ID('tempdb..#Tests','U') IS NOT NULL
DROP TABLE #Tests
;
CREATE TABLE #Tests
(
RowNum INT IDENTITY(1,1) --Added this column
,Sentence VARCHAR(255) NULL
)
;
INSERT INTO #Tests
(Sentence)
SELECT ' <BOOLRESULT>. <REMARK>' UNION ALL
SELECT ' <BOOLRESULT> <REMARK>' UNION ALL
SELECT ' <RESULT> <REMARK>' UNION ALL
SELECT ' via <RESULT> route. <REMARK>' UNION ALL
SELECT ' <BOOLRESULT> <REMARK>' UNION ALL
SELECT ' <BOOLRESULT> <REMARK>' UNION ALL
SELECT ' <BOOLRESULT> <REMARK>' UNION ALL
SELECT ' <BOOLRESULT> <REMARK>' UNION ALL
SELECT ' <BOOOLRESULT> <REMARK>' UNION ALL
SELECT ' <BOOLRESULT> <REMARK>' UNION ALL
SELECT ' <BOLRESULT> <REMARK>' UNION ALL
SELECT ' <BOOLRESULT>. <REMARK>.' UNION ALL
SELECT ' <GENDER> <RESULT>. <REMARK>' UNION ALL
SELECT ' <GENDER> has had <RESULT> in the past. <REMARK>.' UNION ALL
SELECT ' <RESULT> <REMAKR>' UNION ALL
SELECT ' <RESULT> <REMARK>' UNION ALL
SELECT ' <RESULT> <REMARK>' UNION ALL
SELECT ' <RESULT> <REARK>' UNION ALL
SELECT ' <RESULT> <REMARK>' UNION ALL
SELECT ' <RESULT> <REMARK>' UNION ALL
SELECT ' <Wup>! Der it is! <Not Allowed> but <Result> is.' --Added this
;
--===== Let's see what the test table looks like.
SELECT *
FROM #Tests
;
The next thing you need is the "Swiss Army Knife" for T-SQL in the form of an iTVF (Inline Table Valued Function) known as a "Tally Table". You can find out more about how it replaces a loop at the following URL. http://www.sqlservercentral.com/articles/T-SQL/62867/
Here's the code for it (and I'm working on a new one that will be a little bit faster and more flexible).
CREATE FUNCTION [dbo].[fnTally]
/**********************************************************************************************************************
Purpose:
Return a column of BIGINTs from @ZeroOrOne up to and including @MaxN with a max value of 1 Trillion.
As a performance note, it takes about 00:02:10 (hh:mm:ss) to generate 1 Billion numbers to a throw-away variable.
Usage:
--===== Syntax example (Returns BIGINT)
SELECT t.N
FROM dbo.fnTally(@ZeroOrOne,@MaxN) t
;
Notes:
1. Based on Itzik Ben-Gan's cascading CTE (cCTE) method for creating a "readless" Tally Table source of BIGINTs.
Refer to the following URLs for how it works and introduction for how it replaces certain loops.
http://www.sqlservercentral.com/articles/T-SQL/62867/
http://sqlmag.com/sql-server/virtual-auxiliary-table-numbers
2. To start a sequence at 0, @ZeroOrOne must be 0 or NULL. Any other value that's convertable to the BIT data-type
will cause the sequence to start at 1.
3. If @ZeroOrOne = 1 and @MaxN = 0, no rows will be returned.
5. If @MaxN is negative or NULL, a "TOP" error will be returned.
6. @MaxN must be a positive number from >= the value of @ZeroOrOne up to and including 1 Billion. If a larger
number is used, the function will silently truncate after 1 Billion. If you actually need a sequence with
that many values, you should consider using a different tool. ;-)
7. There will be a substantial reduction in performance if "N" is sorted in descending order. If a descending
sort is required, use code similar to the following. Performance will decrease by about 27% but it's still
very fast especially compared with just doing a simple descending sort on "N", which is about 20 times slower.
If @ZeroOrOne is a 0, in this case, remove the "+1" from the code.
DECLARE @MaxN BIGINT;
SELECT @MaxN = 1000;
SELECT DescendingN = @MaxN-N+1
FROM dbo.fnTally(1,@MaxN);
8. There is no performance penalty for sorting "N" in ascending order because the output is explicity sorted by
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
Revision History:
Rev 00 - Unknown - Jeff Moden
- Initial creation with error handling for @MaxN.
Rev 01 - 09 Feb 2013 - Jeff Moden
- Modified to start at 0 or 1.
Rev 02 - 16 May 2013 - Jeff Moden
- Removed error handling for @MaxN because of exceptional cases.
Rev 03 - 22 Apr 2015 - Jeff Moden
- Modify to handle 1 Trillion rows for experimental purposes.
**********************************************************************************************************************/
(@ZeroOrOne BIT, @MaxN BIGINT)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN WITH
E1(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1) --10^1 or 10 rows
, E4(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c, E1 d) --10^4 or 10 Thousand rows
,E12(N) AS (SELECT 1 FROM E4 a, E4 b, E4 c) --10^12 or 1 Trillion rows
SELECT N = 0 WHERE ISNULL(@ZeroOrOne,0)= 0 --Conditionally start at 0.
UNION ALL
SELECT TOP(@MaxN) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E12 -- Values from 1 to @MaxN
;
Next, we need a reference table of allowed labels...
CREATE TABLE dbo.AllowedLabel
(
Label varchar(100) NOT NULL PRIMARY KEY CLUSTERED
)
;
GO
INSERT INTO dbo.AllowedLabel
(Label)
SELECT '<BOOLRESULT>' UNION ALL
SELECT '<GENDER>' UNION ALL
SELECT '<REMARK>' UNION ALL
SELECT '<RESULT>'
;
... and another iTVF, which uses the fnTally iTFV, to take care of the dirty work...
CREATE FUNCTION dbo.SpellCheckLabels
/**************************************************************************************************
Purpose:
Given a string with "labels" encapsulated as <string>, split the labels out of the string and check
them for existance in the dbo.AllowedLabel table. This inherently does a "spell check" on the
labels found. Any labels not found in the table will be returned along with their ordinal position
(LabelNumber) and their substring position and length.
Revision History:
Rev 00 - 14 Mar 2017 - Jeff Moden
**************************************************************************************************/
(@pString VARCHAR(8000))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN WITH
cteSplit AS
(--===== This splits the labels out and returns the positional information
SELECT LabelNumber = ROW_NUMBER() OVER (ORDER BY t.N)
,Label = SUBSTRING(@pString,t.N,CHARINDEX('>',@pString,t.N)-t.N+1)
,LabelPosition = t.N
,LabelLength = CHARINDEX('>',@pString,t.N)-t.N+1
FROM dbo.fnTally(1,LEN(@pString)) t
WHERE SUBSTRING(@pString,t.N,1) = '<'
)--==== This checks the split-out labels against the table containing allowed labels
-- and only returns only dis-allowed labels.
SELECT LabelNumber, DisallowedLabel = split.Label, LabelPosition, LabelLength
FROM cteSplit split
WHERE NOT EXISTS (SELECT 1 FROM dbo.AllowedLabel allowed WHERE allowed.Label = Split.Label)
;
And then the rest is easy...
SELECT t.*, chk.*
FROM #Tests t
CROSS APPLY dbo.SpellCheckLabels(t.Sentence) chk
;
Here's the result set, which is a list of the misspelled/illegal labels...
[
RowNum Sentence LabelNumber DisallowedLabel LabelPosition LabelLength
------ ------------------------------------------------- ----------- --------------- ------------- -----------
9 <BOOOLRESULT> <REMARK> 1 <BOOOLRESULT> 2 13
11 <BOLRESULT> <REMARK> 1 <BOLRESULT> 2 11
15 <RESULT> <REMAKR> 2 <REMAKR> 11 8
18 <RESULT> <REARK> 2 <REARK> 11 7
21 <Wup>! Der it is! <Not Allowed> but <Result> is. 1 <Wup> 2 5
21 <Wup>! Der it is! <Not Allowed> but <Result> is. 2 <Not Allowed> 20 13
(6 row(s) affected)
--Jeff Moden
Change is inevitable... Change for the better is not.
March 17, 2017 at 8:28 pm
Just curious... is that what you were looking for?
--Jeff Moden
Change is inevitable... Change for the better is not.
March 28, 2017 at 11:33 am
Jeff Moden - Friday, March 17, 2017 8:28 PMJust curious... is that what you were looking for?
Hi Jeff,
At first glance, yes. It looks like what we need. I'm reading through it all (including your tally table article - very interesting, btw) and will try to implement it soon.
Apologies for the delay. I was thrown into the midst of some real client-initiated BS that's consumed my life fore a couple of weeks.
Thank you very much, kind sir!
March 30, 2017 at 7:32 pm
You're welcome. It might sound a bit strange but (I have no life 😉 ) thank you for the interesting problem.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 30, 2017 at 8:49 pm
I don't know if what Jeff posted was helpful (another vanishing OP) but this is certainly doable. You need to do two things:
(1) Extract the element names from the XML into a table
(2) Write or obtain a CLR that does spellchecking
The first part is extremely simple.
USE tempdb
GO
-- A table to store the names
IF OBJECT_ID('dbo.XML_Tag') IS NOT NULL DROP TABLE dbo.XML_Tag;
CREATE TABLE dbo.XML_Tag(eName varchar(100)) -- adjust size as needed
GO
-- Some XML
DECLARE @someXML XML = '
<animal>
<cat>fluffy</cat>
<dog>
<nickname>spot</nickname>
</dog>
<fish>Billy</fish>
</animal>';
-- how to get the element names
INSERT dbo.XML_Tag
SELECT Element_Name = xx1.value('local-name(.)', 'varchar(100)')
FROM (VALUES (@someXML)) x(xx)
CROSS APPLY xx.nodes('//*') x1(xx1);
-- Note that, with multiple namespaces, this logic will need to be slightly tweeked
What we now have in dbo.XML_tag:
eName
----------------------------------------------------------------------------------------------------
animal
cat
dog
nickname
fish
The second part looks pretty easy but I'm not a CLR guy so who knows. I know you can leverage MS Offices's spell-checker functionality from a CLR. I did a quick search and found stuff like this. Either way - extracting the element names is simple. If you get stuck on the second part you can create a new thread in the CLR forum. There's some solid CLR ninjas on SSC.
-- Itzik Ben-Gan 2001
March 31, 2017 at 7:35 am
Alan.B - Thursday, March 30, 2017 8:49 PMI don't know if what Jeff posted was helpful (another vanishing OP) but this is certainly doable. You need to do two things:
(1) Extract the element names from the XML into a table
(2) Write or obtain a CLR that does spellchecking
You've made the same error that I did. It's not, in any way, shape, or form, XML. They're simply using <somestring> as labels within in non-XML string.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 31, 2017 at 3:06 pm
Jeff Moden - Friday, March 31, 2017 7:35 AMAlan.B - Thursday, March 30, 2017 8:49 PMI don't know if what Jeff posted was helpful (another vanishing OP) but this is certainly doable. You need to do two things:
(1) Extract the element names from the XML into a table
(2) Write or obtain a CLR that does spellcheckingYou've made the same error that I did. It's not, in any way, shape, or form, XML. They're simply using <somestring> as labels within in non-XML string.
Doh!
-- Itzik Ben-Gan 2001
Viewing 15 posts - 1 through 14 (of 14 total)
You must be logged in to reply to this topic. Login to reply