May 18, 2009 at 4:25 pm
I have a text field containing from one to x number of dates along with comments such as:
'8/9/08 - out for upgrade; 8/16/08 returned to inventory. 8/31/2008 - deployed to field.'
I need to extract all dates, as well as all associated status notes from the table into separate DATE & TEXT fields into a new target table.
Any suggestions as to how to accomplish this would be MUCH appreciated!
Thank you!
May 18, 2009 at 5:38 pm
By "text" field, do you mean varchar? If so, is it in a SQL Server table? Is there a common delimiter among date/status groupings, like a semicolon or comma? Is there one status field per date?
May 18, 2009 at 7:01 pm
first get everything into a temp table, and make sure you convert to varchar(max)...it's a lot of trouble fiddling with TEXT datatypes.
something simply like SELECT convert(varchar(max),YourTextField) as BetterTextField,* into #tmp From YourTable
then you can do something with charindex and pattern matching, like '%[0-9]/[0-9]/[0-9][0-9][0-9][0-9]%' for1/1/xxxx type dates, with multiple passes for two digit months and year combinations.
That should help to find the start of your date and comment, but you've got to be able to figure out where one item ends and the next begins.
is it always {date} {space}some text ?
if we can establish a pattern, we can use a tally table to pull out all the strings, then parse the strings for date/comment separately. I've done that parsing and split for web pages and dictionary definitions and similar stuff.
can you give us some sample data for the text fields?
Lowell
May 19, 2009 at 5:12 am
If you have access to visual studio .NET it may be worth looking at a CLR function. You will have access to the regular expression class which will make pattern maching easier.
http://blogs.msdn.com/sqlclr/archive/2005/06/29/regex.aspx
I found that one useful.
May 19, 2009 at 9:18 am
Thank you Dave, Lowell & Tom for your replies.
Here are the answers to your questions:
Dave:
The source table is SQL Server, and the field is TEXT.
Unfortunately there is no common delimeter in the field, so I will have to take multiple passes for each type of entry - it may be a space, a comma or a semicolon. And there is a status for each date entry.
Lowell:
Thank you for the info on the trouble with the TEXT datatype. I'll take your recommendation to load to a temp table and convert it to varchar.
I did already start down the path of the CHARINDEX with the '%[0-9]/[0-9]/[0-9][0-9][0-9][0-9]%' , so guess I was headed down the right path afterall.
Here are a couple of actual entries that show highlight the challenge, NO consistancy!:
08/04/2003 - REC'D JOB FROM KAYLA - REQUEST FOR PDC TO GET R/W; 9/7/04 JON CANCELLED PER CRAIG BULLOCK BECAUSE OWNER CANNOT BE REACHED (LISA MURPHY OF PDC)
08/21/2003 - RESEARCH GIGI. NEEDS EXHIBIT A & R/W TO BE OBTAINED BY PDC. 09/12/2003: REC'D NOTE FROM PDC THAT SURVEY WAS COMPLETED.09/16/2003 - FIELD NOTES TO MI FILE, COPY OF DWG SENT TO J. TODD
Tom:
I do have access to Visual Studio, just have minimal knowledge of using it. I'll check out the link and see if I can figure it out as well.
May 19, 2009 at 9:58 am
I'm fiddling with this, and if we have consistent delimiters, it's REALLY easy to pull out with a tally table.
my dilemma is that it looks like multiple passes of updates would have to occur to insert delimiters before and after each possible date pattern.
here's what I'm trying so far: I'm trying to exclude a charindex where the pipe character is part of the pattern...i thought [^\|] woudl work, but it is not...
see if this gets anyone's ideas working better than mine:
create table #example(exampleid int,
exampletext varchar(max)
)
insert into #example
SELECT 1,'08/04/2003 - REC''D JOB FROM KAYLA - REQUEST FOR PDC TO GET R/W; 9/7/04 JON CANCELLED PER CRAIG BULLOCK BECAUSE OWNER CANNOT BE REACHED (LISA MURPHY OF PDC)'
UNION ALL
SELECT 2,'08/21/2003 - RESEARCH GIGI. NEEDS EXHIBIT A & R/W TO BE OBTAINED BY PDC. 09/12/2003: REC''D NOTE FROM PDC THAT SURVEY WAS COMPLETED.09/16/2003 - FIELD NOTES TO MI FILE, COPY OF DWG SENT TO J. TODD'
UNION ALL
SELECT 1,'8/9/08 - out for upgrade; 8/16/08 returned to inventory. 8/31/2008 - deployed to field.'
--x/x/xx dates
UPDATE #example
SET exampletext = SUBSTRING(exampletext,1,CHARINDEX('[^\|][0-9]/[0-9]/[0-9][0-9][^0-9]',exampletext))
+ '|'
+ SUBSTRING(exampletext,CHARINDEX('[^\|][0-9]/[0-9]/[0-9][0-9][^0-9]',exampletext),8000)
Lowell
May 19, 2009 at 10:19 am
Tim,
If you can use something other than T-SQL to build a solution to your problem (C#, VB) it will be easier. If so, you should look at regular expressions that can find the date values and insert a delimiter. Then take whatever is left as the comments. I've done a few gigs like this, and always used C# or VB6 to read the data, perform the transformation, and write to a table. Good luck.
May 20, 2009 at 3:00 am
Hi Tim
I took Lowells advice to use a tally table based solution in a nested style (first seen from Bob).
Thanks for the test data!
This should work. I just don't know about the performance with very large data or a large count of rows.
[font="Courier New"]DECLARE @t TABLE (Id INT NOT NULL IDENTITY, Txt TEXT)
INSERT INTO @t
SELECT '8/9/08 8/31/08'
UNION ALL SELECT '08/04/2003 - REC''D JOB FROM KAYLA - REQUEST FOR PDC TO GET R/W; 9/7/04 JON CANCELLED PER CRAIG BULLOCK BECAUSE OWNER CANNOT BE REACHED (LISA MURPHY OF PDC)'
UNION ALL SELECT '08/21/2003 - RESEARCH GIGI. NEEDS EXHIBIT A & R/W TO BE OBTAINED BY PDC. 09/12/2003: REC''D NOTE FROM PDC THAT SURVEY WAS COMPLETED.09/16/2003 - FIELD NOTES TO MI FILE, COPY OF DWG SENT TO J. TODD'
UNION ALL SELECT '8/9/08 - out for upgrade; 8/16/08 returned to inventory. 8/31/2008 - deployed to field.'
; WITH
t (Id, Txt) AS
(
SELECT
Id,
',' + CONVERT(VARCHAR(MAX), Txt) + ','
FROM @t
)
SELECT
Id,
t2.Item
FROM t
CROSS APPLY
(
SELECT
CONVERT(DATETIME, SUBSTRING(t.Txt, tally1.N + 1, tally2.N - (tally1.N + 1))) Item
FROM Tally tally1
CROSS APPLY
(
SELECT TOP(1)
N
FROM Tally
WHERE
N > tally1.N
AND N <= LEN(t.Txt)
AND SUBSTRING(t.Txt, N, 1) IN (',', ';', ' ', ':')
ORDER BY N
) tally2
WHERE tally1.N < LEN(t.Txt)
AND SUBSTRING(t.Txt, tally1.N, 1) IN (',', ';', ' ', ':')
AND ISDATE(SUBSTRING(t.Txt, tally1.N + 1, tally2.N - (tally1.N + 1))) = 1
) t2
[/font]
Greets
Flo
Edit: Just noticed the colon as possible delimiter
May 20, 2009 at 2:32 pm
Thank you Lowell & Flo.....
I'll work with this code to see if I can't get it to work!
May 20, 2009 at 2:39 pm
Hi Tim
I didn't notice that this was the "Newbie" forum...
If you have any questions, feel free to come back here!
Greets
Flo
May 20, 2009 at 2:44 pm
Thanks again Flo.....
I may need to take you up on that! The code as is gives me some errors that I am trying to debug now.
May 20, 2009 at 2:53 pm
Errors? I just copy/pasted into an empty query window and executed. I don't get any errors.
Err, I think I know what you need...
You probably don't have a Tally table. Execute this script and the previous should work:
CREATE TABLE dbo.Tally (N INT NOT NULL)
INSERT INTO dbo.Tally
SELECT TOP(11000)
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master.sys.all_columns c1
CROSS JOIN master.sys.all_columns c2
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally
PRIMARY KEY CLUSTERED
(N)
WITH (FILLFACTOR = 100)
To know how a tally table works you should read this article by Jeff Moden:
http://www.sqlservercentral.com/articles/T-SQL/62867/
Greets
Flo
May 22, 2009 at 4:48 pm
Well after scratching my head for a bit as to why the code wasn't working, found out that the source server is SQL 2000. :blush:
Guessing some of these commands are new in SQL 2005 and don't work in 2000.:unsure:
So unless anyone has any more ideas on how to handle on 2000, just going to have to foregoe the parsing.:satisfied:
Thanks for everyone's help....in particular you Flo!
May 23, 2009 at 2:06 am
Hi Tim
If you are not sure if you are not sure about your SQL Server version you can execute this statement. It shows the detailed version information:
PRINT @@VERSION
If you use SSE2k you are not able to use CROSS APPLY. Instead of this you have to use a CROSS JOIN. In addition you have to ensure that your source data are enclosed with a comma (or another specified delimiter).
Here a SSE2k version of my previous script. I added some comments to explain how it works:
[font="Courier New"]---==========================
-- Create a tally table if not yet done
IF (OBJECT_ID('dbo.Tally') IS NULL)
BEGIN
SELECT TOP 11000
IDENTITY(INT, 1, 1) AS N
INTO dbo.Tally
FROM MASTER.dbo.syscolumns c1
CROSS JOIN MASTER.dbo.syscolumns c2
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally
PRIMARY KEY CLUSTERED
(N)
WITH FILLFACTOR = 100
END
GO
---==========================
-- Your source table
DECLARE @t TABLE (Id INT NOT NULL IDENTITY, Txt VARCHAR(8000))
INSERT INTO @t
SELECT '8/9/08 8/31/08'
UNION ALL SELECT '08/04/2003 - REC''D JOB FROM KAYLA - REQUEST FOR PDC TO GET R/W; 9/7/04 JON CANCELLED PER CRAIG BULLOCK BECAUSE OWNER CANNOT BE REACHED (LISA MURPHY OF PDC)'
UNION ALL SELECT '08/21/2003 - RESEARCH GIGI. NEEDS EXHIBIT A & R/W TO BE OBTAINED BY PDC. 09/12/2003: REC''D NOTE FROM PDC THAT SURVEY WAS COMPLETED.09/16/2003 - FIELD NOTES TO MI FILE, COPY OF DWG SENT TO J. TODD'
UNION ALL SELECT '8/9/08 - out for upgrade; 8/16/08 returned to inventory. 8/31/2008 - deployed to field.'
-- We need a leading and trailing delimiter
UPDATE @t SET Txt = ',' + Txt + ','
---===========================
-- Split the source data and return the dates
SELECT
t.Id,
-- Get the text part between startPos and endPos
CONVERT(DATETIME, SUBSTRING(t.Txt, startPos.N + 1, MIN(endPos.N) - (startPos.N + 1))) Item
FROM @t t
-- First tally table to scan for date start position
CROSS JOIN Tally startPos
-- Second tally table to scan for date end
CROSS JOIN Tally endPos
WHERE
-- The start position has to be less than complete length of text
startPos.N < LEN(t.Txt)
-- Determine if the current position is one of the specified delimiter
AND SUBSTRING(t.Txt, startPos.N, 1) IN (',', ';', ' ', ':')
-- For some reason I got duplicate values in case of ". " start position
-- You can try to remove this
AND SUBSTRING(t.Txt, startPos.N + 1, 1) NOT IN (',', ';', ' ', ':')
-- The start position has to be less than the end position
AND startPos.N < endPos.N
-- The end position has to be sess than or equal to len of text
AND endPos.N <= LEN(t.Txt)
-- End position has to be one of hte specified delimiters
AND SUBSTRING(t.Txt, endPos.N, 1) IN (',', ';', ' ', ':')
-- The text between start and end position has to be a date value
AND ISDATE(SUBSTRING(t.Txt, startPos.N + 1, endPos.N - (startPos.N + 1))) = 1
GROUP BY
t.Id,
t.Txt,
startPos.N
ORDER BY
t.Id,
startPos.N
[/font]
Greets
Flo
May 23, 2009 at 9:52 pm
The TEXT column in question is that of a "log" maintained by humans. Likely as not, it has some hidden delimiters in it. Please check the data for the presence of CHAR(10) and CHAR(13) separately. Might want to throw in the additional check for CHAR(9), as well. The presence of these characters would greatly simplify this problem.
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 15 posts - 1 through 15 (of 17 total)
You must be logged in to reply to this topic. Login to reply