July 4, 2016 at 9:15 am
quentin.harris (7/4/2016)
Awesome.It works. Now have to get my head round it what it actually all does.
Thank you!
well to start with, I suggest that you run this
SELECT i.rowid, i.TextString, s.ItemNumber, s.Item
FROM #input i
CROSS APPLY dbo.DelimitedSplit8K(i.TextString,' ') s
then you can start to see how the rest builds up.
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
July 4, 2016 at 9:19 am
Thank you so much for your time.
July 5, 2016 at 8:56 am
You're welcome, Quentin.
I apologize for not posting a link to delimitedsplit8k. It is a tool which should be in everyone's toolkit. I should have posted more of an explanation as well. I was just ready to start my weekend. Also sorry for using LAG and TRY_CONVERT. I'm running on 2012, obviously.
Here is the basic logic of my original solution.
1. CTE: Read each row and use DelimitedSplit8k to parse it into individual strings.
2. CTE2: Use the CASE statement to identify when a string is like 'Admission%' or 'Discharge%, and if so, store the preceding element (LAG) in the CountX column. If a string is a valid date, store it in the DateX column.
3: The final query crosstabs the individual rows into columns on one row. The Group By Rowid keeps data for each original line together. The where clause eliminates strings that weren't significant.
J Livingston: Sorry about the white space issue. It lines up fine in my editor. Remember that perfect speed is being there. š
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
July 5, 2016 at 9:15 am
Just realized what you meant by the white space issue. The unformatted lines might have more than one space between the number and the word "Admission(s)" or "Discharge(s)". Adding a WHERE clause to cte2 fixes this.
,cte2 as(select rowID, Item
, casewhen Item like 'Admission%' or Item Like 'Discharge%'
then LAG(item,1,null) over(order by rowid,itemnumber) else null end as CountX
, TRY_CONVERT(date, replace(item,'.','')) as DateX
from cte
where item > '')
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
July 5, 2016 at 9:20 am
Thank you. I will see if i can place the new code in the right place and then run.
Q
July 5, 2016 at 10:11 am
Hello
Not sure if Iām missing something but could you not simplify it to :
select * into #input
from (values (1,'There have been 5 admissions and 3 discharges since 04/14/2016.')
,(2,'There has been 1 admission and 2 discharges since 04/22/2016.')
,(3,'There have been 2 admissions and 1 discharge since 04/29/2016.')
,(4,'There have been 2 discharges and 2 admissions since 05/15/2016.')
,(5,'There have been 2 discharges and 1 admission since 05/24/2016.')) dt (Rowid,TextString);
select * from #input;
SELECT ip.Rowid
, SUBSTRING(ip.TextString, PATINDEX('%[0-9][0-9]/[0-9][0-9]/20%',ip.TextString), 10) AS [Date]
-- With the assumption you may have up to 9 of each
, SUBSTRING(REPLACE(ip.TextString, ' ', ''), CHARINDEX('admiss',REPLACE(ip.TextString, ' ', ''))-3, 3) AS Admissions
, SUBSTRING(REPLACE(ip.TextString, ' ', ''), CHARINDEX('dischar',REPLACE(ip.TextString, ' ', ''))-3, 3) AS Discharges
-- With the assumption you may have up to 999 of each
, REPLACE(REPLACE(REPLACE(REPLACE(SUBSTRING(REPLACE(ip.TextString, ' ', ''), CHARINDEX('admiss',REPLACE(ip.TextString, ' ', ''))-3, 3), 'n', ''), 'e', ''), 'd', ''), 's', '') AS Admissions
, REPLACE(REPLACE(REPLACE(REPLACE(SUBSTRING(REPLACE(ip.TextString, ' ', ''), CHARINDEX('dischar',REPLACE(ip.TextString, ' ', ''))-3, 3), 'n', ''), 'e', ''), 'd', ''), 's', '') AS Discharges
FROM #input AS ip
July 5, 2016 at 10:34 am
Black Robin (7/5/2016)
HelloNot sure if Iām missing something but could you not simplify it to :
select * into #input
from (values (1,'There have been 5 admissions and 3 discharges since 04/14/2016.')
,(2,'There has been 1 admission and 2 discharges since 04/22/2016.')
,(3,'There have been 2 admissions and 1 discharge since 04/29/2016.')
,(4,'There have been 2 discharges and 2 admissions since 05/15/2016.')
,(5,'There have been 2 discharges and 1 admission since 05/24/2016.')) dt (Rowid,TextString);
select * from #input;
SELECT ip.Rowid
, SUBSTRING(ip.TextString, PATINDEX('%[0-9][0-9]/[0-9][0-9]/20%',ip.TextString), 10) AS [Date]
-- With the assumption you may have up to 9 of each
, SUBSTRING(REPLACE(ip.TextString, ' ', ''), CHARINDEX('admiss',REPLACE(ip.TextString, ' ', ''))-3, 3) AS Admissions
, SUBSTRING(REPLACE(ip.TextString, ' ', ''), CHARINDEX('dischar',REPLACE(ip.TextString, ' ', ''))-3, 3) AS Discharges
-- With the assumption you may have up to 999 of each
, REPLACE(REPLACE(REPLACE(REPLACE(SUBSTRING(REPLACE(ip.TextString, ' ', ''), CHARINDEX('admiss',REPLACE(ip.TextString, ' ', ''))-3, 3), 'n', ''), 'e', ''), 'd', ''), 's', '') AS Admissions
, REPLACE(REPLACE(REPLACE(REPLACE(SUBSTRING(REPLACE(ip.TextString, ' ', ''), CHARINDEX('dischar',REPLACE(ip.TextString, ' ', ''))-3, 3), 'n', ''), 'e', ''), 'd', ''), 's', '') AS Discharges
FROM #input AS ip
there are no doubt numerous ways to solve the OP question, particlulary when there appears to be "ALWAYS" a definitive way the string is constructed....(pls see earlier posts in this thread)......however, I was interested to see what I could come up with that deliverd the requried results, when the data in the string was a little "unstructred"....was only for my own benefit and a little bit of fun <grin>
here is the data I played around with...and yes, there are permutations that my code will fall over on....but I have already said that .
SELECT *
INTO #input
FROM (values
( 1, 'There were 5 admissions and 3 discharge since reporting 22/04/2016'),
( 2, 'today here were 6 discharges and 3 admission since reporting 21/04/2016'),
( 3, 'since reporting 30/04/2016 the discharges are 8 and the admissions are 10' ),
( 4, ' the discharges are 20 and the admissions are 30 on 10/05/2016 '),
( 5, ' the admission was 100 on 15/05/2016 and the discharge was 30 '),
( 6, ' on 15/08/2016 5 admissions and discharge 10' ),
( 7, ' 100 admission 200 discharges 25/12/2018' ),
( 8, ' discharge 500 on 01/01/2020 and 100 admission '),
( 9, ' discharge 1 2 admission 01/01/2020' ),
( 10, ' 20180101 100 admission 200 discharges' ),
( 11, '20160101 admission 1 2 discharges '),
( 12, 'here are the results up to 12/12/12 Admission 300 400 Discharge ' ) ,
( 13, ' Discharges 9000 20/May/2015 8000 Admissions ' ) ,
( 14, ' Discharges 99999 88888 Admissions 161216' ),
( 15, ' Discharges I think were 99 and maybe 88 possibly less Admissions on 2016-05-15' )
)
dt (rowID,TextString)
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
July 5, 2016 at 2:44 pm
Robin, you can absolutely do it your way. As JLS said, there are many ways to solve a problem. There are always tradeoffs though. For example, your code assumes that the date will always use slashes as separators (M/D/Y). The Try_Convert function will catch any valid (single-string) date format, but will not trap invalid dates.
Your solution would in fact run faster than mine by about a second for every 10,000 rows, so for the problem as stated I would adopt it for production, because, hey, faster is faster.
JLS: I want to try to solve your new input set when time allows.
Fun problem, but again GET THE INPUTS STANDARDIZED.... this whole exercise should be unnecessary.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
July 5, 2016 at 2:59 pm
The Dixie Flatline (7/5/2016)
JLS: I want to try to solve your new input set when time allows.
look fwd to the "challenge" Dixie....good fun me thinks
just remember as per OP previous post .....
I will try to adapt what has been posted to suit our sandpit environment where i have discovered that some functions are not available (LAG and TRY_CONVERT).
Seems 2012 is not used for the sandpit which limits things.
game on <grin>
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
July 5, 2016 at 3:10 pm
Another approach would be to split the text into word-level bi-grams using NGrams8K[/url]. See the article for the function and a description of what a word-level bi-gram is.
Using Ngrams8K you can create the following functions:
CREATE FUNCTION dbo.WBIGrams(@string varchar(8000))
RETURNS TABLE WITH SCHEMABINDING AS RETURN
/*
SELECT TokenNumber, Token
FROM dbo.WBIGrams('There have been 5 admissions and 3 discharges since 04/14/2016.')
*/
WITH
delim(RN,N) AS -- locate all of the spaces in the string
(
SELECT 0,0 UNION ALL
SELECT ROW_NUMBER() OVER (ORDER BY position), position
FROM dbo.NGrams8k(RTRIM(LTRIM(@string)),1)
WHERE token = CHAR(32)
),
tokens(tokenNumber, tokenStartPosition, token, tokenCount) AS -- Create the tokens
(
SELECT
N1.RN+1,
N1.N+1,
SUBSTRING(@string, N1.N+1, ISNULL(N2.N-(N1.N+1), 8000)),
(SELECT COUNT(*)-1 FROM delim)
FROM delim N1
LEFT JOIN delim N2 ON N2.RN = N1.RN+2
)
SELECT t.tokenNumber, tokenStartPosition, t.token
FROM tokens t
WHERE tokenNumber <= tokenCount;
GO
CREATE FUNCTION dbo.WhiteSpaceCleanup(@String varchar(max))
RETURNS TABLE WITH SCHEMABINDING AS RETURN
/*
DECLARE
@string varchar(8000) = 'ABC 123'+CHAR(13)+CHAR(10)+'xxx xxx xx xxx xx';
SELECT * FROM dbo.WhiteSpaceCleanup(@string);
--Results: ABC 123 xxx xxx xx xxx xx
*/
SELECT NewString =
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
LTRIM(RTRIM(REPLACE(REPLACE(REPLACE(REPLACE(
@String,CHAR(9),' '),CHAR(10),' '),CHAR(13),' '),CHAR(160),' '))),
' ',' '),
' ',' '),
' ',' '),
' ',' '),
' ',' '),
' ',' '),
' ',' ');
First for updated sample data:
select * into #input
from (values
(1,'There''s been 5 admissions then another another 200 admissions, and 3 discharges since 04/14/2016.')
,(2,'There has been 11 admission and 2 discharges since 04/22/2016.')
,(3,'There have been 2 admissions and 1 discharge since 04/29/2016.')
,(4,'There have been 2 discharges 50 admissions admissions since 05/15/2016.')
,(5,'There have been 16 discharges and 1 admission since 05/24/2016.')) dt (Rowid,TextString);
ALTER TABLE #input ALTER COLUMN Rowid int NOT NULL;
ALTER TABLE #input ADD CONSTRAINT pk_input PRIMARY KEY CLUSTERED (Rowid);
Here's a few queries to understand my solution. This first query turns tabs, line feeds, carriage returns into spaces then removes duplicate spaces.
-- (1) Remove whitespace
SELECT RowID, NewString
FROM #input i
CROSS APPLY dbo.WhiteSpaceCleanup(i.TextString);
RowID NewString
------ -------------------------------------------------------------------------------------------------
1 There's been 5 admissions then another another 200 admissions, and 3 discharges since 04/14/2016.
2 There has been 11 admission and 2 discharges since 04/22/2016.
3 There have been 2 admissions and 1 discharge since 04/29/2016.
...
This second query builds off the first to "extract" the admissions and discharges
-- (2) Get the date and the bigrams that contain the admissions or discharges
WITH cleanedStrings AS
(
SELECT RowID, NewString
FROM #input i
CROSS APPLY dbo.WhiteSpaceCleanup(i.TextString)
)
SELECT
RowID,
token,
ActivityDate = SUBSTRING(NewString,
PATINDEX('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%', NewString), 10)
from cleanedStrings i
CROSS APPLY dbo.WBIGrams(i.NewString)
WHERE token LIKE '[0-9]%'
AND (token LIKE '%[0-9] admission%' OR token LIKE '%[0-9] discharge%');
RowID token ActivityDate
----------- --------------- ---------------
1 5 admissions 04/14/2016
1 200 admissions, 04/14/2016
1 3 discharges 04/14/2016
2 11 admission 04/22/2016
2 2 discharges 04/22/2016
3 2 admissions 04/29/2016
3 1 discharge 04/29/2016
...
And here's the final solution:
WITH
cleanedStrings AS -- Let's first remove duplicate spaces, line feeds, tabs and carriage returns
(
SELECT RowID, NewString
FROM #input i
CROSS APPLY dbo.WhiteSpaceCleanup(i.TextString)
)
select
i.RowID,
Admission = SUM(CASE WHEN token LIKE '%admission%'
THEN CAST(LEFT(token,CHARINDEX(' ', token)) AS int) END),
Discharge = SUM(CASE WHEN token LIKE '%discharge%'
THEN CAST(LEFT(token,CHARINDEX(' ', token)) AS int) END),
ActivityDate = MAX
(
SUBSTRING(NewString,
PATINDEX('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%', NewString), 10)
)
FROM cleanedStrings i
CROSS APPLY dbo.WBIGrams(i.NewString)
WHERE token LIKE '[0-9]%'
AND (token LIKE '%[0-9] admission%' OR token LIKE '%[0-9] discharge%')
GROUP BY i.Rowid;
-- Itzik Ben-Gan 2001
July 5, 2016 at 3:34 pm
Hi Alan..have been following your recent scripts and am suitably impressed.
its now
bed o'clock here and busy tomorrow...but hoping I can work through and understand your code later tomorrow.
JLS
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
July 5, 2016 at 6:47 pm
J Livingston SQL (7/5/2016)
Hi Alan..have been following your recent scripts and am suitably impressed.
Thanks a lot J. That means a lot!
-- Itzik Ben-Gan 2001
July 6, 2016 at 1:18 am
Hi,
Thank you all. I am a little overwhelmed, in a very nice way, with all the help and new SQL to pick apart.
You are an awesome crowd.
š
Q
July 6, 2016 at 10:24 am
J Livingston SQL (7/5/2016)
Hi Alan..have been following your recent scripts and am suitably impressed.its now
bed o'clock here and busy tomorrow...but hoping I can work through and understand your code later tomorrow.
JLS
Hi Alan
have had a play with your code (I see its been marked as answer)
as I have previously said....I was looking at expanding the solution to cover other permutations.
I maybe be mistaken.....but please take a look at additional sample data in your code below.....doesnt deliver for me.
thoughts?
select * into #input
from (values
(1,'There''s been 5 admissions then another another 200 admissions, and 3 discharges since 04/14/2016.')
,(2,'There has been 11 admission and 2 discharges since 04/22/2016.')
,(3,'There have been 2 admissions and 1 discharge since 04/29/2016.')
,(4,'There have been 2 discharges 50 admissions admissions since 05/15/2016.')
,(5,'There have been 16 discharges and 1 admission since 05/24/2016.')
-- additional rows
,( 10, ' 20180101 100 admission 200 discharges' )
,( 11, '20160101 admission 1 and 2 discharges ')
) dt (Rowid,TextString);
ALTER TABLE #input ALTER COLUMN Rowid int NOT NULL;
ALTER TABLE #input ADD CONSTRAINT pk_input PRIMARY KEY CLUSTERED (Rowid);
WITH
cleanedStrings AS -- Let's first remove duplicate spaces, line feeds, tabs and carriage returns
(
SELECT RowID, NewString
FROM #input i
CROSS APPLY dbo.WhiteSpaceCleanup(i.TextString)
)
select
i.RowID,
Admission = SUM(CASE WHEN token LIKE '%admission%'
THEN CAST(LEFT(token,CHARINDEX(' ', token)) AS int) END),
Discharge = SUM(CASE WHEN token LIKE '%discharge%'
THEN CAST(LEFT(token,CHARINDEX(' ', token)) AS int) END),
ActivityDate = MAX
(
SUBSTRING(NewString,
PATINDEX('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%', NewString), 10)
)
FROM cleanedStrings i
CROSS APPLY dbo.WBIGrams(i.NewString)
WHERE token LIKE '[0-9]%'
AND (token LIKE '%[0-9] admission%' OR token LIKE '%[0-9] discharge%')
GROUP BY i.Rowid;
drop table #input
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
July 6, 2016 at 1:00 pm
Hey JLS, I think I have a solution, but for some strange reason the website won't let me post or upload the code if I spell out TRY_CONVERT more than once.
Do a search_replace on the code below and let me know.
Key points:
1. Expanded rules for what can be considered a date string.
2. Used PATINDEX to determine if admissions came before discharges in CTE ([AdmissionFirst])
3. Used ROW_NUMBER() in CTE2 to make sure that DateX was always the last row for a given RowID and that the order of integers in [textstring] was preserved.
4. Used CASE statements in the final query to decide which integers were used for admissions and which for discharges, based on the value of the AdmissionFirst flag
---------------------------------------------------------------------------------------------------------------------------------------------------------
-- Search/replace "try_C(" to the correct function
---------------------------------------------------------------------------------------------------------------------------------------------------------
with cte as (select rowID, itemnumber,convert(varchar(50),item) as item
, case when PATINDEX('%Admission%Discharge%',TextString) > 0 then 'Y' else '' end as AdmissionFirst
, case when item like '%/%/%' or (try_C(date,Item) is not null and len(item) > 5) then item end as DateX
from #input i
cross apply DelimitedSplit8K(TextString, ' ')
)
,cte2 as (select *, ROW_NUMBER() over(Partition by RowID order by rowID,datex, ItemNumber) as RowNum
from cte
where item > ''
and (try_C(int,Item) is not null
or Item like '%/%/%'
or try_C(date,Item) is not null ))
select RowID
,max(case when AdmissionFirst = 'Y' and RowNum = 1 then Item
when AdmissionFirst <> 'Y' and RowNum = 2 then Item
end) as Admissions
,max(case when AdmissionFirst = 'Y' and RowNum = 2 then Item
when AdmissionFirst <> 'Y' and RowNum = 1 then Item
end) as Discharges
,max(DateX) as DateX
from cte2
group by RowID
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
Viewing 15 posts - 16 through 30 (of 37 total)
You must be logged in to reply to this topic. Login to reply