February 8, 2005 at 7:59 pm
I am converting some denormalized data to a normalized structure. In the process I've run into a situation where multiple values exist in a single column (separated by spaces) and I need to convert that single column into multiple rows of data. Now multiply that problem by 50 columns by 50k source rows.
The solution I came up with requires cursors... *waits for the boos to die down* ... I'd love to simplify this so it doesn't require cursors at all; many people in other threads have suggested that cursors are the devil, and similar sentiments.
Here's your chance! Below is a test script that illustrates what I'm trying to do, along with the user-function + cursor solution and the correct output. It works just fine with cursors, and since this is a one-time conversion the performance is plenty fast enough. Still, I'm always up for learning a better way, so feel free to suggest improvements, and thanks for any input!
CREATE FUNCTION split_multi_column(@id varchar(16), @list varchar(8000), @delimiter varchar(1) = ',')
RETURNS @tbl TABLE(
    my_id varchar(16),
    my_str varchar(80)) AS BEGIN
/*
split_multi_column
Author: JT Lovell
Based on functions written by Erland Sommarskog, SQL Server MVP from http://www.sommarskog.se/arrays-in-sql.html
*/
DECLARE
    @pos int,
    @textpos int,
    @chunklen smallint,
    @tmpstr varchar(8000),
    @leftover varchar(8000),
    @tmpval varchar(8000)
SET @textpos = 1
SET @leftover = ''
/* walk @list in chunks so the working string stays a manageable size */
WHILE @textpos <= datalength(@list) BEGIN
    SET @chunklen = 4000 - datalength(@leftover)
    SET @tmpstr = @leftover + substring(@list, @textpos, @chunklen)
    SET @textpos = @textpos + @chunklen
    /* emit one row per complete delimited value in the current chunk */
    SET @pos = charindex(@delimiter, @tmpstr)
    WHILE @pos > 0 BEGIN
        SET @tmpval = ltrim(rtrim(left(@tmpstr, @pos - 1)))
        INSERT @tbl(my_id, my_str) VALUES(@id, @tmpval)
        SET @tmpstr = substring(@tmpstr, @pos + 1, len(@tmpstr))
        SET @pos = charindex(@delimiter, @tmpstr)
    END
    /* carry any partial value over into the next chunk */
    SET @leftover = @tmpstr
END
/* the last value has no trailing delimiter */
INSERT @tbl(my_id, my_str) VALUES (@id, ltrim(rtrim(@leftover)))
RETURN
END
GO
CREATE TABLE #source(
    source_id varchar(16),
    source_text varchar(80)
)
INSERT INTO #source(source_id, source_text)
VALUES('1','Alpha Bravo Charlie Delta Echo')
INSERT INTO #source(source_id, source_text)
VALUES('2','Fox Golf Hotel India Juliet')
CREATE TABLE #target(
    target_id integer IDENTITY(1,1) NOT NULL,
    source_id varchar(16),
    target_type varchar(20),
    target_text varchar(80)
)
DECLARE @pk varchar(16)
DECLARE @txt varchar(80)
/* normalize source data */
DECLARE cur_row CURSOR FAST_FORWARD
FOR
    SELECT s.source_id, s.source_text
    FROM #source s
OPEN cur_row
FETCH NEXT FROM cur_row INTO @pk, @txt
/* loop through each source row */
WHILE (@@FETCH_STATUS = 0) BEGIN
    INSERT INTO #target(source_id, target_type, target_text)
    SELECT smc.my_id, 'Test', smc.my_str
    FROM split_multi_column(@pk, @txt, ' ') smc
    /* get next source row */
    FETCH NEXT FROM cur_row INTO @pk, @txt
END /* cursor loop */
/* close and deallocate cursor */
CLOSE cur_row
DEALLOCATE cur_row
/* this output is exactly what I am looking for */
SELECT * FROM #target
/*
output results:
target_id source_id target_type target_text
----------- ---------------- -------------------- --------------------------------------------------------------------------------
1 1 Test Alpha
2 1 Test Bravo
3 1 Test Charlie
4 1 Test Delta
5 1 Test Echo
6 2 Test Fox
7 2 Test Golf
8 2 Test Hotel
9 2 Test India
10 2 Test Juliet
*/
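For reference, the split function can also be sanity-checked on its own with constant arguments:
SELECT my_id, my_str
FROM split_multi_column('1', 'Alpha Bravo Charlie', ' ')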
February 9, 2005 at 2:41 am
Hi,
You can accomplish the loop without cursors:-
------------------------------------------------------
CREATE TABLE #source(
    source_id varchar(16),
    source_text varchar(80)
)
INSERT INTO #source(source_id, source_text)
VALUES('1','Alpha Bravo Charlie Delta Echo')
INSERT INTO #source(source_id, source_text)
VALUES('2','Fox Golf Hotel India Juliet')
CREATE TABLE #target(
    target_id integer IDENTITY(1,1) NOT NULL,
    source_id varchar(16),
    target_type varchar(20),
    target_text varchar(80)
)
DECLARE @txt varchar(80)
DECLARE @MinPK varchar(16)
DECLARE @MaxPK varchar(16)
SET @MinPK = (SELECT MIN(s.source_id) FROM #source s)
SET @MaxPK = (SELECT MAX(s.source_id) FROM #source s)
/* loop through each source row (assumes source_id values are unique) */
WHILE @MinPK <= @MaxPK
BEGIN
    SET @txt = (SELECT s.source_text FROM #source s WHERE s.source_id = @MinPK)
    INSERT INTO #target(source_id, target_type, target_text)
    SELECT smc.my_id, 'Test', smc.my_str
    FROM split_multi_column(@MinPK, @txt, ' ') smc
    /* get next source row */
    SET @MinPK = (SELECT MIN(s.source_id) FROM #source s WHERE s.source_id > @MinPK)
END /* loop */
/* this output is exactly what I am looking for */
SELECT * FROM #target
DROP TABLE #source
DROP TABLE #target
------------------------------------------------------
Gives exactly the same result.
The pros and cons of this approach would have to be explained by someone on a higher plane than me, though.
Have fun
Steve
We need men who can dream of things that never were.
February 9, 2005 at 2:53 am
If you use a user-defined function in this way, then I believe SQL Server will use an implicit cursor anyway, because the function has to work on a row-by-row basis.
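For what it's worth, here is a sketch of the set-based call one would like to write. It is not valid on SQL Server 2000, where a table-valued function cannot take another table's columns as arguments, but the upcoming SQL Server 2005 adds a CROSS APPLY operator for exactly this (untested here):
/* hypothetical set-based call; needs SQL Server 2005's CROSS APPLY */
INSERT INTO #target(source_id, target_type, target_text)
SELECT smc.my_id, 'Test', smc.my_str
FROM #source s
CROSS APPLY dbo.split_multi_column(s.source_id, s.source_text, ' ') smc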
February 9, 2005 at 8:38 am
Any other non-cursor method seems to be an academic exercise. Looping around a SELECT statement is a poor substitute for a cursor; it may not use as many resources, but I wouldn't expect it to perform very well.
Here's the general idea of how to do it with a set-based operation, if the values in the combined text field come from a known, limited set.
You could do INSERTs with values from a JOIN using LIKE (a fleshed-out sketch follows the pseudocode):
INSERT INTO newtable SELECT ...,'KNOWN',... JOIN ... ON a.combined_value like '%KNOWN%';
INSERT INTO newtable SELECT ...,'KNOWN2',... JOIN ... ON a.combined_value like '%KNOWN2%';
INSERT INTO newtable SELECT ...,'KNOWN3',... JOIN ... ON a.combined_value like '%KNOWN3%';
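To make that concrete against the sample tables from the first post, here is a sketch that assumes the possible values can be enumerated in a hypothetical #known table. Padding both sides with the delimiter keeps 'Fox' from matching inside a longer word such as 'Foxtrot':
CREATE TABLE #known(known_value varchar(80))
INSERT INTO #known(known_value) VALUES('Alpha')
INSERT INTO #known(known_value) VALUES('Bravo')
INSERT INTO #known(known_value) VALUES('Fox')
/* ...and so on, one row per possible value */
INSERT INTO #target(source_id, target_type, target_text)
SELECT s.source_id, 'Test', k.known_value
FROM #source s
JOIN #known k
    ON ' ' + s.source_text + ' ' LIKE '% ' + k.known_value + ' %'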
February 9, 2005 at 8:45 am
Looping around a SELECT statement works quite well because the locks are only held for a short period of time.
February 9, 2005 at 9:09 am
Thanks for the replies! I have a few questions, though:
Steve> The WHILE loop operates in nearly the same way as the cursor does. Is it really more efficient, or just an alternate approach?
Peter> I take a similar approach for the columns that have a known value set. The ones I'm doing via cursors could contain anything, though.
Thanks again!
February 9, 2005 at 10:06 am
Unfortunately, I didn't really answer the original question; it was just to give you an alternative approach.
I am fairly sure they are very similar in terms of efficiency in this case.
There are other cases where you can really improve efficiency by reducing the number of SELECTs / database calls within the loop. It depends on circumstances... and, as David says, the lock time is minimal.
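For example, the @txt lookup and the next-key lookup in the loop above can be collapsed into a single SELECT per pass (a sketch, untested; assumes source_id values are non-empty):
DECLARE @pk varchar(16), @txt varchar(80)
SET @pk = '' /* sorts lower than any real source_id */
WHILE 1 = 1
BEGIN
    /* one database call fetches both the next key and its text */
    SELECT TOP 1 @pk = s.source_id, @txt = s.source_text
    FROM #source s
    WHERE s.source_id > @pk
    ORDER BY s.source_id
    IF @@ROWCOUNT = 0 BREAK
    INSERT INTO #target(source_id, target_type, target_text)
    SELECT smc.my_id, 'Test', smc.my_str
    FROM split_multi_column(@pk, @txt, ' ') smc
END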
Peter is closest to answering the 'set-based approach' question; it's just a pity he didn't code the answer fully...
Have fun
Steve
We need men who can dream of things that never were.
February 9, 2005 at 10:12 am
Sorry. I would have, but it seemed it was not pertinent to the original post (see JT's last post). Also, I'm neck-deep in mapping a government housing inspection business process onto the finite-state machine of a particular CRM system. My brain is turning to mud.
February 9, 2005 at 3:50 pm
JT
If the loop is performing an expensive SELECT on each pass, cursors are a great alternative. Locking can exist while you walk through the cursor, but if it lets you make only one pass over the source table, things can get pretty interesting: those are the rare cases where a cursor can outperform the loop.
HTH
* Noel
February 9, 2005 at 5:52 pm
Noel,
That's refreshing to hear! I think too many people throw out the baby with the bath water when it comes to cursors. They have their place as long as they aren't being used where set-based approaches are better.
Do cursors still lock the source rows when using a FAST_FORWARD (read-only, forward-only) cursor? I was under the impression that they don't lock unless the FOR UPDATE clause is used. I may be mistaken, though.
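In the meantime, one way to be sure the source rows aren't touched during the fetch loop is a STATIC cursor, which populates a copy in tempdb when opened and reads from that (a sketch; it trades tempdb space for source-table locks):
DECLARE cur_row CURSOR FORWARD_ONLY STATIC READ_ONLY
FOR
    SELECT s.source_id, s.source_text
    FROM #source s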