March 31, 2008 at 9:35 am
I would swear I've done this before, but I'm drawing a blank today.
I have a table with lots of rows with a column of data that contains 6 elements delimited by the pipe character.
Instead of one field, I need the table split by the delimiter. I think the solution involves a sub-select, but I'll be darned if I remember how today.
Here's some example data, and what I was getting using a tally table.
I'm looking for results similar to this:
[font="Courier New"] col1 col2 ol3 col4 col5 col6
1000000 Timmerman Pond Dam Y NULL NULL U.S. Army Corps blah blah.....[/font]
[font="Courier New"]
--tally table if needed:
--===== Create and populate the Tally table on the fly
SELECT TOP 11000 --equates to more than 30 years of dates
IDENTITY(INT,1,1) AS N
INTO dbo.Tally
FROM MASTER.dbo.SysColumns sc1, MASTER.dbo.SysColumns sc2
--===== Add a Primary Key to maximize performance
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally_N PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
CREATE TABLE rawdata(rawinput VARCHAR(8000))
INSERT INTO rawdata
SELECT '1000000|Timmerman Pond Dam|Y|||U.S. Army Corps of Engineers. Dams ....' UNION
SELECT '1000001|Thomas Lake|Y|||U.S. Army Corps of Engineers. Dams and Reservoirs List, Washington, DC, 1981. ' UNION
SELECT '1000002|Thomas Lake Dam|Y|||U.S. Army Corps of Engineers. Dams and Reservoirs List, Washington, DC, 1981. ' UNION
SELECT '1000003|Walsh Fishing Lake|Y|||U.S. Army Corps of Engineers. Dams and Reservoirs List, Washington, DC, 1981. ' UNION
SELECT '1000004|Walsh Fishing Lake Dam|Y|||U.S. Army Corps of Engineers. Dams and Reservoirs List, Washington, DC, 1981. '
--assuming you have a tally table, this returns each element as a row, but I'm looking for a table with 6 columns
SELECT t.n,
SUBSTRING(yt.rawinput+'|', t.n,CHARINDEX('|', yt.rawinput+'|', t.n) - t.n)
FROM rawdata AS yt, dbo.Tally t
WHERE t.n <= LEN(yt.rawinput)
AND SUBSTRING('|' + yt.rawinput,t.n, 1) = '|'
AND
DATALENGTH(rawinput) - DATALENGTH(REPLACE(rawinput,'|','')) = 5
Lowell
March 31, 2008 at 11:15 am
Probably the simplest solution is to use PIVOT after you break it up into rows. Otherwise, it's a sub-query for each column, along the lines of the sketch below.
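Roughly this, assuming a prior split step has already produced one row per element with its ordinal position (the #Split table and its RawInput/ColNum/Item columns here are just placeholders, not anything from your script):
SELECT RawInput,
       [0] AS Col1, [1] AS Col2, [2] AS Col3,
       [3] AS Col4, [4] AS Col5, [5] AS Col6
FROM (SELECT RawInput, ColNum, Item FROM #Split) AS src
PIVOT (MAX(Item) FOR ColNum IN ([0],[1],[2],[3],[4],[5])) AS pvt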
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
March 31, 2008 at 11:28 am
If you can be sure your data has no < symbols (or can live with a character swap), the following approach will work. You can use any function that splits a string into a table set.
-- function to split list by delimiter
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
create function [dbo].[fListToText]( @list varchar(max), @delim varchar(max) = ',')
returns @returnTable table
( item varchar(255) not null )
as begin
declare @xml XML
set @xml = '<item>' + REPLACE(@list, @delim, '</item><item>') + '</item>'
insert into @returnTable
SELECT data.item.value('.','varchar(255)')
FROM @xml.nodes('//item') as data(item)
return
end
go
select rawinput,
min(case when seq % 6 = 0 then item else null end) as field1,
min(case when seq % 6 = 1 then item else null end) as field2,
min(case when seq % 6 = 2 then item else null end) as field3,
min(case when seq % 6 = 3 then item else null end) as field4,
min(case when seq % 6 = 4 then item else null end) as field5,
min(case when seq % 6 = 5 then item else null end) as field6
from
(
select R.rawinput, item, row_number() over ( partition by R.rawinput order by R.rawinput ) - 1 as seq
from rawdata R cross apply dbo.fListToText( R.rawinput, '|' ) ) as data
group by rawinput
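As a quick standalone sanity check of the splitter (the literal below is just sample text lifted from the earlier post):
select item
from dbo.fListToText('1000000|Timmerman Pond Dam|Y|||U.S. Army Corps of Engineers. Dams ....', '|')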
March 31, 2008 at 12:02 pm
That's what I was looking for, Antonio; I remember using a different kind of split function for the same type of results. It's the CROSS APPLY and the ROW_NUMBER() OVER that I forgot about using. Too much SQL 2000, not enough 2005.
Thank you!
Lowell
March 31, 2008 at 9:03 pm
CROSS APPLY = CORRELATED SUBQUERY
CORRELATED SUBQUERY = RBAR
RBAR + UDF + TABLE VARIABLE = S....L....O....W
Heh... Lowell!!! :blink: You of all people should know that the Tally Table method will blow the doors off every other split there is! Just on 5 rows, the XML UDF method takes about 7 times longer than the Tally Table... try this out, buddy... it also happens to give you the exact result set you asked for. No extra columns... no data out of order...
CREATE TABLE rawdata(rawinput VARCHAR(8000))
INSERT INTO rawdata
SELECT '1000000|Timmerman Pond Dam|Y|||U.S. Army Corps of Engineers. Dams ....' UNION
SELECT '1000001|Thomas Lake|Y|||U.S. Army Corps of Engineers. Dams and Reservoirs List, Washington, DC, 1981. ' UNION
SELECT '1000002|Thomas Lake Dam|Y|||U.S. Army Corps of Engineers. Dams and Reservoirs List, Washington, DC, 1981. ' UNION
SELECT '1000003|Walsh Fishing Lake|Y|||U.S. Army Corps of Engineers. Dams and Reservoirs List, Washington, DC, 1981. ' UNION
SELECT '1000004|Walsh Fishing Lake Dam|Y|||U.S. Army Corps of Engineers. Dams and Reservoirs List, Washington, DC, 1981. '
;WITH cteMySplit AS
(
SELECT DENSE_RANK() OVER (ORDER BY RawInput) AS RowNum,
ROW_NUMBER() OVER (PARTITION BY h.RawInput ORDER BY t.N)-1 AS ColNum,
NULLIF(SUBSTRING('|'+h.RawInput+'|', t.N+1, CHARINDEX('|', '|'+h.RawInput+'|', t.N+1)-t.N-1),'') AS String
FROM dbo.Tally t,
dbo.RawData h
WHERE SUBSTRING('|'+h.RawInput+'|', t.N, 1) = '|'
AND t.N < LEN('|'+h.RawInput)
)
SELECT MAX(CASE WHEN ColNum % 6 = 0 THEN String ELSE NULL END) AS Col1,
MAX(CASE WHEN ColNum % 6 = 1 THEN String ELSE NULL END) AS Col2,
MAX(CASE WHEN ColNum % 6 = 2 THEN String ELSE NULL END) AS Col3,
MAX(CASE WHEN ColNum % 6 = 3 THEN String ELSE NULL END) AS Col4,
MAX(CASE WHEN ColNum % 6 = 4 THEN String ELSE NULL END) AS Col5,
MAX(CASE WHEN ColNum % 6 = 5 THEN String ELSE NULL END) AS Col6
FROM cteMySplit
GROUP BY RowNum
--Jeff Moden
Change is inevitable... Change for the better is not.
April 1, 2008 at 8:44 am
I'm putting this through its paces now, Jeff.
The data I'm trying to parse is some FIPS55 data with every place name in the United States; 560 meg of data, so the SQL is taking a while to run (10+ minutes on a dev server). I couldn't get my mental arms around this one; thanks.
Lowell
April 1, 2008 at 9:18 am
Why not just create a 'real' table, bcp out the rawdata, and then bcp that back in with a pipe (|) delimiter into the 'real' table?
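Something along these lines from a command prompt, say (the server, database, path, and target table names are only placeholders, not your real ones):
bcp MyDb.dbo.rawdata out C:\temp\rawdata.txt -c -S MyServer -T
bcp MyDb.dbo.ParsedData in C:\temp\rawdata.txt -c -t "|" -S MyServer -T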
April 1, 2008 at 9:28 am
Spot on, Antonio... the real key is that the data should never have come into the database in unsplit form to begin with. They should use BULK INSERT or BCP to load the raw data.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 1, 2008 at 9:30 am
Lowell (4/1/2008)
I'm putting this through its paces now, Jeff. The data I'm trying to parse is some FIPS55 data with every place name in the United States; 560 meg of data, so the SQL is taking a while to run (10+ minutes on a dev server). I couldn't get my mental arms around this one; thanks.
Then, I agree with Antonio... the data should not be in this condition in the database to begin with... the data should be loaded from a file using Bulk Insert or BCP with "|" as the delimiter. Bulk Insert will load 5.1 million 20-column rows in 60 seconds.
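Something like this, for example (the path and target table below are just placeholders):
BULK INSERT dbo.PlaceNames
FROM 'C:\Import\fips55.txt'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', TABLOCK)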
--Jeff Moden
Change is inevitable... Change for the better is not.
April 1, 2008 at 9:51 am
Boy, I really agree; it's a stupid project requirement that all "raw data" is imported from text into the SQL Server database, and then the data is massaged from there.
The theory is that if I got struck by lightning, anyone could re-run any step in the scripts to duplicate the work I'm doing without needing my machine and the specific paths to the text files... in case something wasn't right or needed adaptation.
Which really makes no sense, since the script to bulk insert the raw data is path-specific anyway; I had to bulk import it on a tilde (~) separator that I knew didn't exist in the data, instead of the pipe character.
Lowell
April 1, 2008 at 10:03 am
Then you should be able to import the same file into both the "rawdata" table and the "real/parsed" table in the same load script. (You'll need a separate fmt file to skip fields and handle the different field terminators; see the rough sketch below.)
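For example, a non-XML format file for the six-column target might look roughly like this (the column names and widths are only placeholders; to skip a field, you set its server column order, the number right after the terminator, to 0):
9.0
6
1   SQLCHAR   0   12     "|"      1   Col1   SQL_Latin1_General_CP1_CI_AS
2   SQLCHAR   0   255    "|"      2   Col2   SQL_Latin1_General_CP1_CI_AS
3   SQLCHAR   0   10     "|"      3   Col3   SQL_Latin1_General_CP1_CI_AS
4   SQLCHAR   0   255    "|"      4   Col4   SQL_Latin1_General_CP1_CI_AS
5   SQLCHAR   0   255    "|"      5   Col5   SQL_Latin1_General_CP1_CI_AS
6   SQLCHAR   0   8000   "\r\n"   6   Col6   SQL_Latin1_General_CP1_CI_AS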
If the load scripts can't be modified, you can write an extended stored proc that determines where the temp path is, exports "rawdata" to it, then calls bcp to import that file, then deletes it.
Convoluted? Yes. But that's the problem with stupid "requirements". Don't get me started on that subject again.
April 1, 2008 at 10:25 am
Lowell (4/1/2008)
Boy, I really agree; it's a stupid project requirement that all "raw data" is imported from text into the SQL Server database, and then the data is massaged from there. The theory is that if I got struck by lightning, anyone could re-run any step in the scripts to duplicate the work I'm doing without needing my machine and the specific paths to the text files... in case something wasn't right or needed adaptation.
Which really makes no sense, since the script to bulk insert the raw data is path-specific anyway; I had to bulk import it on a tilde (~) separator that I knew didn't exist in the data, instead of the pipe character.
Lowell... Can you attach the first, say, 100 rows or so as a file to your next post? Dunno about Antonio, but I wouldn't mind taking a crack at it.
Also, can you outline the (macro) steps you take to do the import... I think we might be able to make a couple of suggestions there, as well.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 1, 2008 at 10:26 am
p.s. The steps you take to get the raw data file from wherever it lives should also be included. Of course, no usernames/passwords, please.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 1, 2008 at 10:38 am
Lowell (4/1/2008)
Boy, I really agree; it's a stupid project requirement that all "raw data" is imported from text into the SQL Server database, and then the data is massaged from there. The theory is that if I got struck by lightning, anyone could re-run any step in the scripts to duplicate the work I'm doing without needing my machine and the specific paths to the text files... in case something wasn't right or needed adaptation.
Which really makes no sense, since the script to bulk insert the raw data is path-specific anyway; I had to bulk import it on a tilde (~) separator that I knew didn't exist in the data, instead of the pipe character.
I guess I don't have the same definition of "massage" (I know - that sounds a bit like Bill Clinton... :hehe: ) as your project folks. I wouldn't consider parsing things into separate fields "massaging". Now - data cleanup, reformatting so that you can cast it - THAT's "massaging" in my book.
Where are you firing the script from? How are you firing it? There are rather straightforward ways to fire off such scripts with the UNC path as a variable. If you did it in DTS/SSIS, you could also be doing it under credentials that have access to that path, which would also help with the external security.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
April 1, 2008 at 11:25 am
To add to what Matt just said... if the server logs in as a domain super-user that can see all the machines and at least some "shares", using a little Dynamic SQL in conjunction with BULK INSERT will blow the doors off of any other type of import. If you have the luxury of being able to use BCP, BCP will find bad rows and store them in a "bad file" so you can repair them later without having the main import fail. It's not quite as fast as BULK INSERT but is still way faster than DTS. Rumor has it that SSIS also has a "Bulk Insert" method but I don't know if it'll check for bad rows "auto-magically".
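A rough sketch of the dynamic SQL idea (the share, file name, and target table below are made up):
DECLARE @File VARCHAR(256),
        @SQL  VARCHAR(8000)
SET @File = '\\SomeServer\SomeShare\fips55.txt'
SET @SQL  = 'BULK INSERT dbo.PlaceNames FROM ''' + @File + ''''
          + ' WITH (FIELDTERMINATOR = ''|'', ROWTERMINATOR = ''\n'', TABLOCK)'
EXEC (@SQL)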
--Jeff Moden
Change is inevitable... Change for the better is not.