March 31, 2008 at 9:35 am
I would swear I've done this before, but I'm drawing a blank today.
I have a table with lots of rows with a column of data that contains 6 elements delimited by the pipe character.
Instead of one field, I need the table split by the delimiter. I think the solution involves a sub-select, but I'll be darned if I remember how today.
Here's some example data, and what I was getting using a tally table.
I'm looking for results similar to this:
[font="Courier New"] col1 col2 ol3 col4 col5 col6
1000000 Timmerman Pond Dam Y NULL NULL U.S. Army Corps blah blah.....[/font]
[font="Courier New"]
--tally table if needed:
--===== Create and populate the Tally table on the fly
SELECT TOP 11000 --equates to more than 30 years of dates
IDENTITY(INT,1,1) AS N
INTO dbo.Tally
FROM MASTER.dbo.SysColumns sc1, MASTER.dbo.SysColumns sc2
--===== Add a Primary Key to maximize performance
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally_N PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
CREATE TABLE rawdata(rawinput VARCHAR(8000))
INSERT INTO rawdata
SELECT '1000000|Timmerman Pond Dam|Y|||U.S. Army Corps of Engineers. Dams ....' UNION
SELECT '1000001|Thomas Lake|Y|||U.S. Army Corps of Engineers. Dams and Reservoirs List, Washington, DC, 1981. ' UNION
SELECT '1000002|Thomas Lake Dam|Y|||U.S. Army Corps of Engineers. Dams and Reservoirs List, Washington, DC, 1981. ' UNION
SELECT '1000003|Walsh Fishing Lake|Y|||U.S. Army Corps of Engineers. Dams and Reservoirs List, Washington, DC, 1981. ' UNION
SELECT '1000004|Walsh Fishing Lake Dam|Y|||U.S. Army Corps of Engineers. Dams and Reservoirs List, Washington, DC, 1981. '
--assuming you have a tally table, this returns each element as a row, but I'm looking for a table with 6 columns
SELECT t.n,
SUBSTRING(yt.rawinput+'|', t.n,CHARINDEX('|', yt.rawinput+'|', t.n) - t.n)
FROM rawdata AS yt, dbo.Tally t
WHERE t.n <= LEN(yt.rawinput)
AND SUBSTRING('|' + yt.rawinput,t.n, 1) = '|'
AND
DATALENGTH(rawinput) - DATALENGTH(REPLACE(rawinput,'|','')) = 5
Lowell
March 31, 2008 at 11:15 am
Probably the simplest solution is to use PIVOT after you break it up into rows. Otherwise, it's a sub-query for each column, along the lines of the sketch below.
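Roughly this, assuming a prior split step has already produced one row per element with its ordinal position (the #Split table and its RawInput/ColNum/Item columns here are just placeholders, not anything from your script):
SELECT RawInput,
       [0] AS Col1, [1] AS Col2, [2] AS Col3,
       [3] AS Col4, [4] AS Col5, [5] AS Col6
FROM (SELECT RawInput, ColNum, Item FROM #Split) AS src
PIVOT (MAX(Item) FOR ColNum IN ([0],[1],[2],[3],[4],[5])) AS pvt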
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
March 31, 2008 at 11:28 am
If you can be sure your data has no < symbols (or can live with a character swap), the following approach will work. You can use any function that splits a string into a table set.
-- function to split list by delimiter
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
create function [dbo].[fListToText]( @list varchar(max), @delim varchar(max) = ',')
returns @returnTable table
( item varchar(255) not null )
as begin
declare @xml XML
set @xml = '<item>' + REPLACE(@list, @delim, '</item><item>') + '</item>'
insert into @returnTable
SELECT data.item.value('.','varchar(255)')
FROM @xml.nodes('//item') as data(item)
return
end
go
select rawinput,
min(case when seq % 6 = 0 then item else null end) as field1,
min(case when seq % 6 = 1 then item else null end) as field2,
min(case when seq % 6 = 2 then item else null end) as field3,
min(case when seq % 6 = 3 then item else null end) as field4,
min(case when seq % 6 = 4 then item else null end) as field5,
min(case when seq % 6 = 5 then item else null end) as field6
from
(
select R.rawinput, item, row_number() over ( partition by R.rawinput order by R.rawinput ) - 1 as seq
from rawdata R cross apply dbo.fListToText( R.rawinput, '|' ) ) as data
group by rawinput
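As a quick standalone sanity check of the splitter (the literal below is just sample text lifted from the earlier post):
select item
from dbo.fListToText('1000000|Timmerman Pond Dam|Y|||U.S. Army Corps of Engineers. Dams ....', '|')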
March 31, 2008 at 12:02 pm
That's what I was looking for, Antonio; I remember using a different kind of split function for the same type of results. It's the CROSS APPLY and the ROW_NUMBER() OVER that I forgot about using. Too much SQL 2000, not enough 2005.
Thank you!
Lowell
March 31, 2008 at 9:03 pm
CROSS APPLY = CORRELATED SUBQUERY
CORRELATED SUBQUERY = RBAR
RBAR + UDF + TABLE VARIABLE = S....L....O....W
Heh... Lowell!!! :blink: You of all people should know that the Tally Table method will blow the doors off every other split there is! Just on 5 rows, the XML UDF method takes about 7 times longer than the Tally Table... try this out, buddy... it also happens to give you the exact result set you asked for. No extra columns... no data out of order...
CREATE TABLE rawdata(rawinput VARCHAR(8000))
INSERT INTO rawdata
SELECT '1000000|Timmerman Pond Dam|Y|||U.S. Army Corps of Engineers. Dams ....' UNION
SELECT '1000001|Thomas Lake|Y|||U.S. Army Corps of Engineers. Dams and Reservoirs List, Washington, DC, 1981. ' UNION
SELECT '1000002|Thomas Lake Dam|Y|||U.S. Army Corps of Engineers. Dams and Reservoirs List, Washington, DC, 1981. ' UNION
SELECT '1000003|Walsh Fishing Lake|Y|||U.S. Army Corps of Engineers. Dams and Reservoirs List, Washington, DC, 1981. ' UNION
SELECT '1000004|Walsh Fishing Lake Dam|Y|||U.S. Army Corps of Engineers. Dams and Reservoirs List, Washington, DC, 1981. '
;WITH cteMySplit AS
(
SELECT DENSE_RANK() OVER (ORDER BY RawInput) AS RowNum,
ROW_NUMBER() OVER (PARTITION BY h.RawInput ORDER BY t.N)-1 AS ColNum,
NULLIF(SUBSTRING('|'+h.RawInput+'|', t.N+1, CHARINDEX('|', '|'+h.RawInput+'|', t.N+1)-t.N-1),'') AS String
FROM dbo.Tally t,
dbo.RawData h
WHERE SUBSTRING('|'+h.RawInput+'|', t.N, 1) = '|'
AND t.N < LEN('|'+h.RawInput)
)
SELECT MAX(CASE WHEN ColNum % 6 = 0 THEN String ELSE NULL END) AS Col1,
MAX(CASE WHEN ColNum % 6 = 1 THEN String ELSE NULL END) AS Col2,
MAX(CASE WHEN ColNum % 6 = 2 THEN String ELSE NULL END) AS Col3,
MAX(CASE WHEN ColNum % 6 = 3 THEN String ELSE NULL END) AS Col4,
MAX(CASE WHEN ColNum % 6 = 4 THEN String ELSE NULL END) AS Col5,
MAX(CASE WHEN ColNum % 6 = 5 THEN String ELSE NULL END) AS Col6
FROM cteMySplit
GROUP BY RowNum
--Jeff Moden
Change is inevitable... Change for the better is not.
April 1, 2008 at 8:44 am
I'm putting this through its paces now, Jeff.
The data I'm trying to parse is some FIPS55 data with every place name in the United States; 560 meg of data, so the SQL is taking a while to run (10+ minutes on a dev server). I couldn't get my mental arms around this one; thanks.
Lowell
April 1, 2008 at 9:18 am
Why not just create a 'real' table, bcp out the rawdata, and then bcp that back in with a pipe (|) delimiter into the 'real' table?
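Something along these lines from a command prompt, say (the server, database, path, and target table names are only placeholders, not your real ones):
bcp MyDb.dbo.rawdata out C:\temp\rawdata.txt -c -S MyServer -T
bcp MyDb.dbo.ParsedData in C:\temp\rawdata.txt -c -t "|" -S MyServer -T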
April 1, 2008 at 9:28 am
Spot on, Antonio... the real key is that the data should never have come into the database in unsplit form to begin with. They should use BULK INSERT or BCP to load the raw data.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 1, 2008 at 9:30 am
Lowell (4/1/2008)
I'm putting this through its paces now, Jeff. The data I'm trying to parse is some FIPS55 data with every place name in the United States; 560 meg of data, so the SQL is taking a while to run (10+ minutes on a dev server). I couldn't get my mental arms around this one; thanks.
Then, I agree with Antonio... the data should not be in this condition in the database to begin with... the data should be loaded from a file using Bulk Insert or BCP with "|" as the delimiter. Bulk Insert will load 5.1 million 20-column rows in 60 seconds.
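Something like this, for example (the path and target table below are just placeholders):
BULK INSERT dbo.PlaceNames
FROM 'C:\Import\fips55.txt'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', TABLOCK)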
--Jeff Moden
Change is inevitable... Change for the better is not.
April 1, 2008 at 9:51 am
Boy, I really agree; it's a stupid project requirement that all "raw data" is imported from text into the SQL Server database, and then the data is massaged from there.
The theory is that if I got struck by lightning, anyone could re-run any step in the scripts to duplicate the work I'm doing without needing my machine and the specific paths to the text files... in case something wasn't right or needed adaptation.
Which really makes no sense, since the script to bulk insert the raw data is path-specific anyway; I had to bulk import it on a tilde (~) separator that I knew didn't exist in the data, instead of the pipe character.
Lowell
April 1, 2008 at 10:03 am
Then you should be able to import the same file into both the "rawdata" table and the "real/parsed" table in the same load script. (You'll need a separate fmt file to skip fields and handle the different field terminators; see the rough sketch below.)
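For example, a non-XML format file for the six-column target might look roughly like this (the column names and widths are only placeholders; to skip a field, you set its server column order, the number right after the terminator, to 0):
9.0
6
1   SQLCHAR   0   12     "|"      1   Col1   SQL_Latin1_General_CP1_CI_AS
2   SQLCHAR   0   255    "|"      2   Col2   SQL_Latin1_General_CP1_CI_AS
3   SQLCHAR   0   10     "|"      3   Col3   SQL_Latin1_General_CP1_CI_AS
4   SQLCHAR   0   255    "|"      4   Col4   SQL_Latin1_General_CP1_CI_AS
5   SQLCHAR   0   255    "|"      5   Col5   SQL_Latin1_General_CP1_CI_AS
6   SQLCHAR   0   8000   "\r\n"   6   Col6   SQL_Latin1_General_CP1_CI_AS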
If the load scripts can't be modified, you can write an extended stored proc that determines where the temp path is, exports "rawdata" to it, then calls bcp to import that file, then deletes it.
Convoluted? Yes. But that's the problem with stupid "requirements". Don't get me started on that subject again.
April 1, 2008 at 10:25 am
Lowell (4/1/2008)
Boy, I really agree; it's a stupid project requirement that all "raw data" is imported from text into the SQL Server database, and then the data is massaged from there. The theory is that if I got struck by lightning, anyone could re-run any step in the scripts to duplicate the work I'm doing without needing my machine and the specific paths to the text files... in case something wasn't right or needed adaptation.
Which really makes no sense, since the script to bulk insert the raw data is path-specific anyway; I had to bulk import it on a tilde (~) separator that I knew didn't exist in the data, instead of the pipe character.
Lowell... Can you attach the first, say, 100 rows or so as a file to your next post? Dunno about Antonio, but I wouldn't mind taking a crack at it.
Also, can you outline the (macro) steps you take to do the import... I think we might be able to make a couple of suggestions there, as well.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 1, 2008 at 10:26 am
p.s. The steps you take to get the raw data file from wherever it lives should also be included. Of course, no usernames/passwords, please.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 1, 2008 at 10:38 am
Lowell (4/1/2008)
Boy, I really agree; it's a stupid project requirement that all "raw data" is imported from text into the SQL Server database, and then the data is massaged from there. The theory is that if I got struck by lightning, anyone could re-run any step in the scripts to duplicate the work I'm doing without needing my machine and the specific paths to the text files... in case something wasn't right or needed adaptation.
Which really makes no sense, since the script to bulk insert the raw data is path-specific anyway; I had to bulk import it on a tilde (~) separator that I knew didn't exist in the data, instead of the pipe character.
I guess I don't have the same definition of "massage" (I know - that sounds a bit like Bill Clinton... :hehe: ) as your project folks. I wouldn't consider parsing things into separate fields "massaging". Now - data cleanup, reformatting so that you can cast it - THAT's "massaging" in my book.
Where are you firing the script from? How are you firing it? There are rather straightforward ways to fire off such scripts with the UNC path as a variable. If you did it in DTS/SSIS, you could also be doing it under credentials that have access to that path, which would also help with the external security.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 1, 2008 at 11:25 am
To add to what Matt just said... if the server logs in as a domain super-user that can see all the machines and at least some "shares", using a little Dynamic SQL in conjunction with BULK INSERT will blow the doors off of any other type of import. If you have the luxury of being able to use BCP, BCP will find bad rows and store them in a "bad file" so you can repair them later without having the main import fail. It's not quite as fast as BULK INSERT but is still way faster than DTS. Rumor has it that SSIS also has a "Bulk Insert" method but I don't know if it'll check for bad rows "auto-magically".
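A rough sketch of the dynamic SQL idea (the share, file name, and target table below are made up):
DECLARE @File VARCHAR(256),
        @SQL  VARCHAR(8000)
SET @File = '\\SomeServer\SomeShare\fips55.txt'
SET @SQL  = 'BULK INSERT dbo.PlaceNames FROM ''' + @File + ''''
          + ' WITH (FIELDTERMINATOR = ''|'', ROWTERMINATOR = ''\n'', TABLOCK)'
EXEC (@SQL)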
--Jeff Moden
Change is inevitable... Change for the better is not.