December 10, 2015 at 8:33 am
Hi Guys,
Suffering from brain fade today.
This is a string-splitting problem with knobs on. There are plenty of string-split posts out there, but I couldn't find anything that fits this particular scenario.
Some bright spark thought it would be a good idea to hold parent-child relationships in a single table as delimited strings. :-D
I need to iterate through the Field1 and Field2 values, splitting on pipe (|), and pivot them with a SeqID to represent each item's position in the string.
Each record will have a different number of delimited items.
Each record will have the SAME number of delimited items in Field1 and Field2.
So what I want to see in #DEMO_TARGET is:
24764, 1, 312, 97
24764, 2, 303, 110
...
24764, 18, 313, 17
24765, 1, 312, 97
...
24765,15,313,17
My first thought was to run each field through a splitter and create an interim table (AnchorRef, SeqID, FieldName, FieldValue), effectively an EAV-style table, and then join it back up with a PIVOT on FieldName grouped by AnchorRef and SeqID. But I have 350K records to split, which will produce in the region of 3M EAV rows, and it just feels like a code smell.
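For what it's worth, the interim-table idea described above could be sketched roughly like this. This is illustrative only: it assumes a splitter function (here DelimitedSplitN4K from the Tally Table article) that returns ItemNumber and Item columns, and the #EAV table name is just a stand-in.

```sql
-- Build the interim EAV-style table: one row per (AnchorRef, SeqID, FieldName).
SELECT d.AnchorRef,
       s.ItemNumber AS SeqID,
       'Field1'     AS FieldName,
       s.Item       AS FieldValue
INTO #EAV
FROM #DEMO_SOURCE d
CROSS APPLY dbo.DelimitedSplitN4K(d.Field1, '|') s
UNION ALL
SELECT d.AnchorRef, s.ItemNumber, 'Field2', s.Item
FROM #DEMO_SOURCE d
CROSS APPLY dbo.DelimitedSplitN4K(d.Field2, '|') s;

-- Pivot back to one row per AnchorRef/SeqID pair.
INSERT INTO #DEMO_TARGET (AnchorRef, SeqID, Field1, Field2)
SELECT AnchorRef, SeqID, [Field1], [Field2]
FROM #EAV
PIVOT (MAX(FieldValue) FOR FieldName IN ([Field1], [Field2])) p;
```

As noted, this works but materialises the full 3M-row interim table, which is the part that smells.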
Does anyone have any other bright ideas?
DDL and sample data:
CREATE TABLE #DEMO_SOURCE
(
AnchorRef INT,
Field1 NVARCHAR(MAX),
Field2 NVARCHAR(MAX)
)
INSERT INTO #DEMO_SOURCE (AnchorRef, Field1, Field2) VALUES (24764,'312|303|323|305|308|535|311|327|329|553|557|558|324|580|320|322|321|313','97|110|268|364|18|658|242|257|42|290|443|128|3|166|256|18|20|17')
INSERT INTO #DEMO_SOURCE (AnchorRef, Field1, Field2) VALUES (24765,'312|303|323|535|327|329|553|557|558|324|580|320|322|321|313','97|110|268|658|257|42|290|443|128|3|166|256|18|20|17')
CREATE TABLE #DEMO_TARGET
(
AnchorRef INT,
SeqID INT,
Field1 NVARCHAR(100),
Field2 NVARCHAR(100)
)
December 10, 2015 at 8:52 am
I'm not sure why you want a pivot.
Note that this code will only work with strings of up to 4,000 characters. If you don't have Unicode values, you could use DelimitedSplit8K to double the length. You can find both functions here: http://www.sqlservercentral.com/articles/Tally+Table/72993/
SELECT AnchorRef,
s1.ItemNumber AS SeqID,
s1.Item AS Field1,
s2.Item AS Field2
FROM #DEMO_SOURCE d
CROSS APPLY dbo.DelimitedSplitN4K( Field1, '|') s1
CROSS APPLY dbo.DelimitedSplitN4K( Field2, '|') s2
WHERE s1.ItemNumber = s2.ItemNumber;
December 11, 2015 at 2:50 am
Thanks Luis,
That is a partial solution, but it misses the point of needing to match the positional elements of each split; without that, it produces a cartesian join. If I had 10 elements in each string, I would end up with 100 rows. In the real world I have four fields to split, each with up to 20 elements, so I could actually end up with 20^4 = 160,000 rows instead of 20.
I used a slightly modified version of a splitter I found on another site. By using ROW_NUMBER, I can then include the sequence in the WHERE clause.
Credit to here:
http://sqlperformance.com/2012/07/t-sql-queries/split-strings (this is a contentious post; Jeff Moden, Lynn Pettis and Brent Ozar all weigh in on the author in the comments section. Very OT, but it makes for a good read on netiquette for cross-posting work.)
ALTER FUNCTION dbo.SplitStrings_XML
(
@List NVARCHAR(MAX),
@Delimiter NVARCHAR(255)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
(
SELECT Item = y.i.value('(./text())[1]', 'nvarchar(4000)'),
sequence = ROW_NUMBER() OVER(ORDER BY y.i)
FROM
(
SELECT x = CONVERT(XML, '<i>'
+ REPLACE(@List, @Delimiter, '</i><i>')
+ '</i>').query('.')
) AS a CROSS APPLY x.nodes('i') AS y(i)
);
GO
And then the following query to actually run the data
SELECT DS.AnchorRef,
       field1.Item AS 'Field1',
       field1.sequence AS 'F1Seq',
       field2.Item AS 'Field2',
       field2.sequence AS 'F2Seq'
FROM #DEMO_SOURCE AS DS
CROSS APPLY dbo.SplitStrings_XML(DS.Field1, '|') AS field1
CROSS APPLY dbo.SplitStrings_XML(DS.Field2, '|') AS field2
WHERE field1.sequence = field2.sequence;
In the wild, my source table sits on an underpowered linked server with 340K records, and the target is being run on another underpowered dev server. I am matching on four columns and it returns 709K rows in under 30 seconds, which is acceptable for my purposes. When I exclude the WHERE clause, the cartesian joins result in over 20M rows (and this takes a bit longer 😀 ).
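Extending the two-field query above to the four-field case is mechanical; a sketch, where Field3 and Field4 are hypothetical names standing in for the real extra columns:

```sql
-- Four-way positional split: one CROSS APPLY per delimited field, with all
-- sequences tied back to the first split to avoid the cartesian blow-up.
-- Field3/Field4 are illustrative column names, not from the demo table.
SELECT DS.AnchorRef,
       f1.sequence AS SeqID,
       f1.Item AS Field1,
       f2.Item AS Field2,
       f3.Item AS Field3,
       f4.Item AS Field4
FROM #DEMO_SOURCE AS DS
CROSS APPLY dbo.SplitStrings_XML(DS.Field1, '|') AS f1
CROSS APPLY dbo.SplitStrings_XML(DS.Field2, '|') AS f2
CROSS APPLY dbo.SplitStrings_XML(DS.Field3, '|') AS f3
CROSS APPLY dbo.SplitStrings_XML(DS.Field4, '|') AS f4
WHERE f1.sequence = f2.sequence
  AND f1.sequence = f3.sequence
  AND f1.sequence = f4.sequence;
```

Without the three WHERE predicates, the four applies multiply out to up to 20^4 rows per record, which is the 20M-row cartesian result described above.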
December 11, 2015 at 8:11 am
Other than using a different function, the final code is basically the same. The only problem is that you're not using the fastest function available (even in the article, Aaron mentions he wouldn't use it in production environments). The only valid reason I can see to use that version is that it handles NVARCHAR(MAX), and you should be really sure that you need that.
That being said, if I understood your last paragraph correctly, the performance problems might come from other places.
December 12, 2015 at 6:14 pm
Luis Cazares (12/11/2015)
Other than using a different function, the final code is basically the same. The only problem is that you're not using the fastest function available (even in the article, Aaron mentions he wouldn't use it in production environments). The only valid reason I can see to use that version is that it handles NVARCHAR(MAX), and you should be really sure that you need that. That being said, if I understood your last paragraph correctly, the performance problems might come from other places.
Oddly enough, I'm actually writing an article on the serious testing problems in that particular article that make everyone think DelimitedSplitN4K isn't the fastest... and I'm not talking about just a little bit, either.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 12, 2015 at 6:23 pm
@Aaron.reese,
You have these 4 fields that contain hierarchical pipe delimited data and you've realized that's not the best structure for querying (it's called a "Hierarchical Path" or "Sort Path").
The question is, what do you want to do with the data once it's split up? How do you intend to use that data? I ask because I think I can help a whole lot but want to make sure I'm not barking up the wrong tree to start with.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 14, 2015 at 3:43 am
Thanks Jeff,
I will be comparing a subset of the output to another data set to carry out reconciliations: missing records from both sets and discrepancies in values.
The business scenario is that I have two separate processes that can each select a group of records for further processing, but the rules about how the records are selected are different in each process. I am trying to reconcile the two extracts and create a matrix report that shows how many records from each extract in system 1 appear in each extract in system 2.
The report will be produced in SSRS, but I want to do the heavy lifting in the database, as I also need to use the information to initiate the next stage in the process.
         S2.1   S2.2   S2.3   S2.4   missing
S1.1      108   1234     85      8        32
S1.2        0     89     15      0        18
S1.3       58      0     65      3         7
missing    35    457      0     89
The sample above tells me that I still need some additional extracts from both systems (i.e. not all records have been allocated to a group).
I can go into specifics if necessary, but I prefer to keep posts on the forum generic where possible so that others can apply the discussion to their situation.
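A minimal sketch of how that matrix could be produced in long form, assuming hypothetical tables #System1 and #System2 (AnchorRef, ExtractGroup) derived from the split output; the names and shapes here are illustrative, not from the thread:

```sql
-- Long-form reconciliation counts: FULL OUTER JOIN surfaces records that are
-- missing from either side, which COALESCE relabels as 'missing'.
-- #System1/#System2 and ExtractGroup are hypothetical stand-ins.
SELECT COALESCE(s1.ExtractGroup, 'missing') AS System1Group,
       COALESCE(s2.ExtractGroup, 'missing') AS System2Group,
       COUNT(*) AS RecordCount
FROM #System1 s1
FULL OUTER JOIN #System2 s2
  ON s1.AnchorRef = s2.AnchorRef
GROUP BY COALESCE(s1.ExtractGroup, 'missing'),
         COALESCE(s2.ExtractGroup, 'missing');
```

An SSRS matrix (tablix) can then pivot this long-form result into the grid shown above, with System1Group on rows and System2Group on columns.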