April 27, 2012 at 12:43 pm
Hi all!
I have been studying up on Mr. Moden's tally (numbers) table theory and the wonderful explanations and examples and input provided by many other folks here. I have been able to get a reasonably good handle on it's use.
I immediately saw an application for it in our environments, but I am having trouble getting to a workable point. I am converting to SQL Server 2008 R2 from SQL Server 2000 and getting new servers to boot.
I am attempting to convert appended data entries that are stored in a SQL Server 2000 TEXT datatype field into individual rows. The data in the TEXT field is separated by pipes and there are 5 fields per record. The records are just concatenated on as the transactional application adds them. The records can be of any length up to just shy of 2 GB and the length is never known until I begin to process it.
The kicker is that depending on the company I am processing, there can be almost 40 million TEXT entries that need to be split and stored in a new table.
Here is the table structure the data is currently stored in the source:
CREATE TABLE [dbo].[TEXTDATA](
[DATA_ID] [int] NOT NULL,
[TEXTDATA] [text] NOT NULL,
[INSERT_DT] [datetime] NOT NULL,
[UPDATE_DT] [datetime] NOT NULL,
CONSTRAINT [PK_DATA] PRIMARY KEY CLUSTERED
(
[DATA_ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [ix1_DATA] ON [dbo].[DATA]
(
[UPDATE_DT] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
GO
I have code working that will take every single delimited field and turn it into a single record per field, assigning a new key value and a sequence number to make sure they stay in order as processed. However, what I really need to know is can I use the tally table method to separate and process five fields into a single record instead of one per. Sample data would look something like this (sanitized to protect my company's data of course) in a TEXT datatype field (or VARCHAR(MAX) as I move the data to SQL Server 2008 R2:
|Notetype|jthomas|Joe Thomas|2012-03-01 10:32:23.888|There can be whatever text right here with no rhyme or reason whatever (but no pipes are in it generally)|AnotherNotetype|ccates|Cathi Cates|2012-03-01 10:32:23.888|yet another note right here with no rhyme or reason whatever|AnotherNotetype|bbthrocton||2012-03-01 10:42:23.777|and another note right here with no rhyme or reason whatever
The third field does not always have data and is the only field allowed to be NULL. The end of the TEXT string will never have a pipe at the end, though I check for it anyway - just in case.
Here is the structure I am trying to get to after parsing (using the data sample above (mostly)):
[DATA_ID] [int] NOT NULL,
[SEQUENCE_NUMBER] [INT] NOT NULL,
[TEXTDATA] [VARCHAR(45)] NOT NULL,
[TEXTDATA] [VARCHAR(45)] NULL, -- THIS FIELD CAN BE NULL
[TEXTDATA] [VARCHAR(45)] NOT NULL,
[TEXTDATA] [DATETIME] NOT NULL,
[TEXTDATA] [VARCHAR(MAX)] NOT NULL, -- THIS CAN BE LONGER THAN 8K CHARS AS IT IS A FREE FORM CHARACTER FORM
[INSERT_DT] [datetime] NOT NULL,
[UPDATE_DT] [datetime] NOT NULL
DATA_ID SEQUENCE_NUMBER DATAFIELD1 DATAFIELD2 DATAFIELD3 DATAFIELD4 DATAFIELD5 INSERT_DT UPDATE_DT
123 1 Notetype jthomas Joe Thomas 2012-03-01 10:32:23.888 There can ... date val date val
123 2 AnotherNotetype ccates Cathi Cates 2012-03-01 10:32:23.888 yet anothe... date val date val
123 3 AnotherNotetype bbthrocton 2012-03-01 10:42:23.777 and anothe... date val date val
etc...
I am sure there is a way to do this without a cursor, but I am still training myself about CTE's and the like, so I am using one for now. I am not going to put it all in here as it is sizable, but here is the code that currently splits all data into individual fields, including the cursor spec:
SET @cur = CURSOR FOR
SELECT --TOP 10
t.DATA_ID
, CAST(t.[TEXTDATA] AS VARCHAR(MAX))
, DATALENGTH(t.[TEXTDATA])
FROM TEXTDATA t
INNER JOIN keys k ON k.DATA_ID = t.DATA_ID
WHERE 1=1
--AND t.DATA_ID = @DATA_ID -- if I want to pass a single text_id to work on
--AND UPDATE_DT BETWEEN @start_dt AND @end_dt -- if I want to limit to a date range
ORDER BY t.DATA_ID
WHILE (@@FETCH_STATUS = 0)
BEGIN
------ debug code
-- print 'processing ' + cast(@DATA_ID as varchar(10))
-- check to see if the last char is a pipe - it should not be - then add it
IF NOT (SUBSTRING(@txt, @totlen, (@totlen-1)) = '|')
BEGIN
-- print 'adding pipe to close the string'
SELECT @txt = @txt + '|'
SELECT @totlen = @totlen + 1
END
INSERT INTO elements_tab
SELECT @DATA_ID
, ROW_NUMBER() OVER(ORDER BY @DATA_ID ASC) AS seq
, CAST(SUBSTRING(@txt, t.n+1, CHARINDEX(@delimiter, @txt, t.N+1) - t.N-1) AS VARCHAR(MAX)) AS value
FROM ldr_maintenance.dbo.Tally t
WHERE 1=1
AND t.N < @totlen
AND SUBSTRING(@txt,t.N,1) = @delimiter
FETCH NEXT
FROM @cur
INTO @DATA_ID, @txt, @totlen
END -- WHILE SECTION
I would greatly appreciate any input you can offer. Hopefully, I have provided the correct things to get your opinions.
Bestest!
John
April 27, 2012 at 1:55 pm
take a look at Jeff's text splitter. i think it will be close enough to what you need it will get you a big push to the finish line http://www.sqlservercentral.com/articles/Tally+Table/72993/ its a good write up on his new string splitter.
For performance Issues see how we like them posted here: How to Post Performance Problems - Gail Shaw[/url]
Need to Split some strings? Jeff Moden's DelimitedSplit8K[/url]
Jeff Moden's Cross tab and Pivots Part 1[/url]
Jeff Moden's Cross tab and Pivots Part 2[/url]
May 3, 2012 at 12:33 pm
Thanks for your response, cap'n. I have that article bookmarked as one of my main tally table resources. Maybe I am just being dense, but I am not seeing how to get the grouping that I am looking for to keep the 5 related records that have been split out together under one sequence number.
I am sure this is a mechanics question that is compounded by my limited knowledge of SQL Server 2008 R2.
Bestest!
John
May 3, 2012 at 4:16 pm
ok here is the code. it gets a little nasty and relies on there always being 5 elements to each series and an id.
--SET up some sample data
DECLARE @String TABLE (ID INT ,col VARCHAR(8000))
INSERT INTO @String
SELECT 1, '|Notetype|jthomas|Joe Thomas|2012-03-01 10:32:23.888|There can be whatever text right here with no rhyme or reason whatever (but no pipes are in it generally)|AnotherNotetype|ccates|Cathi Cates|2012-03-01 10:32:23.888|yet another note right here with no rhyme or reason whatever|AnotherNotetype|bbthrocton||2012-03-01 10:42:23.777|and another note right here with no rhyme or reason whatever' UNION ALL
SELECT 2, '|Notetype|jthomas|Joe Thomas|2012-03-01 10:32:23.888|There can be whatever text right here with no rhyme or reason whatever (but no pipes are in it generally)|AnotherNotetype|ccates|Cathi Cates|2012-03-01 10:32:23.888|yet another note right here with no rhyme or reason whatever|AnotherNotetype|bbthrocton||2012-03-01 10:42:23.777|and another note right here with no rhyme or reason whatever'
--The first CTE uses jeff's DelimitedSplit8K
;WITH CTE AS (
SELECT v.ID, s.ItemNumber, (s.ItemNumber -1) / 5 as Series, s.ItemNumber % 5 AS Modulose, s.Item FROM @String v
CROSS APPLY dbo.DelimitedSplit8K(RIGHT(v.col,LEN(v.col) - 1),'|') s),
--The second CTE create our pivot with a lot of nulls
PivotSetup AS (SELECT ID,Series,
CASE WHEN Modulose = 1 THEN Item ELSE NULL END AS NoteType,
CASE WHEN Modulose = 2 THEN Item ELSE NULL END AS UserName,
CASE WHEN Modulose = 3 THEN Item ELSE NULL END AS Name,
CASE WHEN Modulose = 4 THEN Item ELSE NULL END AS Datecreated,
CASE WHEN Modulose = 0 THEN Item ELSE NULL END AS Comment
FROM cte)
--the final query gets rid of all those nasty nulls
SELECT ID, Series, MAX(NoteType), MAX(UserName), MAX(Name), MAX(Datecreated), MAX(Comment) FROM PivotSetup GROUP BY ID, Series
ok an explination of the first cte: for the cross apply of the DelimitedSplit8K i took out the first | to give us a good starting point. the "(s.ItemNumber -1) / 5" gets us a set of records that are 5 rows long based on the item numbers from DelimitedSplit8K. "s.ItemNumber % 5" gets us rows numbered 1-0 based on the item number returned from DelimitedSplit8K. the pivot setup makes our pivot (using cross tabs) but puts in a lot of nasty nulls so we have to get those out to return the nice pretty rows we want so we take the max based on id and series in our final query.
its nasty and some one may be able to do a better job and i have not tested for any sort of performance but it works on the test string you have and if all the strings are like that then it will work.
For performance Issues see how we like them posted here: How to Post Performance Problems - Gail Shaw[/url]
Need to Split some strings? Jeff Moden's DelimitedSplit8K[/url]
Jeff Moden's Cross tab and Pivots Part 1[/url]
Jeff Moden's Cross tab and Pivots Part 2[/url]
Viewing 4 posts - 1 through 3 (of 3 total)
You must be logged in to reply to this topic. Login to reply