December 10, 2015 at 8:33 am
Hi Guys,
Suffering from brain fade today.
This is a string-splitting problem with knobs on. There are plenty of string-split posts out there, but I couldn't find anything that fits this particular scenario.
Some bright spark thought it would be a good idea to hold parent-child relationships in a single table as delimited strings. :-D
I need to iterate through the Field1 and Field2 values, splitting on pipe (|), and pivot them with a SeqID to represent each item's position in the string.
Each record will have a different number of delimited items.
Each record will have the SAME number of delimited items in Field1 and Field2.
So what I want to see in #DEMO_TARGET is:
24764, 1, 312, 97
24764, 2, 303, 110
...
24764, 18, 313, 17
24765, 1, 312, 97
...
24765,15,313,17
My first thought was to run each field through a splitter and create an interim table (AnchorRef, SeqID, FieldName, FieldValue), effectively an EAV-style table, and then join it back up with a PIVOT on FieldName grouped by AnchorRef and SeqID. But I have 350K records to split, which will produce in the region of 3M EAV rows, and it just feels like a code smell.
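For what it's worth, the interim-table idea described above could be sketched roughly like this. This is illustrative only: it assumes a splitter function (here DelimitedSplitN4K from the Tally Table article) that returns ItemNumber and Item columns, and the #EAV table name is just a stand-in.

```sql
-- Build the interim EAV-style table: one row per (AnchorRef, SeqID, FieldName).
SELECT d.AnchorRef,
       s.ItemNumber AS SeqID,
       'Field1'     AS FieldName,
       s.Item       AS FieldValue
INTO #EAV
FROM #DEMO_SOURCE d
CROSS APPLY dbo.DelimitedSplitN4K(d.Field1, '|') s
UNION ALL
SELECT d.AnchorRef, s.ItemNumber, 'Field2', s.Item
FROM #DEMO_SOURCE d
CROSS APPLY dbo.DelimitedSplitN4K(d.Field2, '|') s;

-- Pivot back to one row per AnchorRef/SeqID pair.
INSERT INTO #DEMO_TARGET (AnchorRef, SeqID, Field1, Field2)
SELECT AnchorRef, SeqID, [Field1], [Field2]
FROM #EAV
PIVOT (MAX(FieldValue) FOR FieldName IN ([Field1], [Field2])) p;
```

As noted, this works but materialises the full 3M-row interim table, which is the part that smells.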
Does anyone have any other bright ideas?
DDL and sample data:
CREATE TABLE #DEMO_SOURCE
(
AnchorRef INT,
Field1 NVARCHAR(MAX),
Field2 NVARCHAR(MAX)
)
INSERT INTO #DEMO_SOURCE (AnchorRef, Field1, Field2) VALUES (24764,'312|303|323|305|308|535|311|327|329|553|557|558|324|580|320|322|321|313','97|110|268|364|18|658|242|257|42|290|443|128|3|166|256|18|20|17')
INSERT INTO #DEMO_SOURCE (AnchorRef, Field1, Field2) VALUES (24765,'312|303|323|535|327|329|553|557|558|324|580|320|322|321|313','97|110|268|658|257|42|290|443|128|3|166|256|18|20|17')
CREATE TABLE #DEMO_TARGET
(
AnchorRef INT,
SeqID INT,
Field1 NVARCHAR(100),
Field2 NVARCHAR(100)
)
December 10, 2015 at 8:52 am
I'm not sure why you want a pivot.
Note that this code will only work with strings of up to 4,000 characters. If you don't have Unicode values, you could use DelimitedSplit8K to double the length. You can find both functions here: http://www.sqlservercentral.com/articles/Tally+Table/72993/
SELECT AnchorRef,
s1.ItemNumber AS SeqID,
s1.Item AS Field1,
s2.Item AS Field2
FROM #DEMO_SOURCE d
CROSS APPLY dbo.DelimitedSplitN4K( Field1, '|') s1
CROSS APPLY dbo.DelimitedSplitN4K( Field2, '|') s2
WHERE s1.ItemNumber = s2.ItemNumber;
December 11, 2015 at 2:50 am
Thanks Luis,
That is a partial solution, but it misses the point of needing to match the positional elements of each split; without that, it produces a cartesian join. If I had 10 elements in each string, I would end up with 100 rows. In the real world I have four fields to split, each with up to 20 elements, so I could actually end up with 20^4 = 160,000 rows instead of 20.
I used a slightly modified version of a splitter I found on another site. By using ROW_NUMBER, I can then include the sequence in the WHERE clause.
Credit to here:
http://sqlperformance.com/2012/07/t-sql-queries/split-strings (this is a contentious post; Jeff Moden, Lynn Pettis and Brent Ozar all weigh in on the author in the comments section. Very OT, but it makes for a good read on netiquette for cross-posting work.)
ALTER FUNCTION dbo.SplitStrings_XML
(
@List NVARCHAR(MAX),
@Delimiter NVARCHAR(255)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
(
SELECT Item = y.i.value('(./text())[1]', 'nvarchar(4000)'),
sequence = ROW_NUMBER() OVER(ORDER BY y.i)
FROM
(
SELECT x = CONVERT(XML, '<i>'
+ REPLACE(@List, @Delimiter, '</i><i>')
+ '</i>').query('.')
) AS a CROSS APPLY x.nodes('i') AS y(i)
);
GO
And then the following query to actually run the data
SELECT DS.AnchorRef,
       field1.Item AS 'Field1',
       field1.sequence AS 'F1Seq',
       field2.Item AS 'Field2',
       field2.sequence AS 'F2Seq'
FROM #DEMO_SOURCE AS DS
CROSS APPLY dbo.SplitStrings_XML(DS.Field1, '|') AS field1
CROSS APPLY dbo.SplitStrings_XML(DS.Field2, '|') AS field2
WHERE field1.sequence = field2.sequence;
In the wild, my source table sits on an underpowered linked server with 340K records, and the target is being run on another underpowered dev server. I am matching on four columns and it returns 709K rows in under 30 seconds, which is acceptable for my purposes. When I exclude the WHERE clause, the cartesian joins result in over 20M rows (and this takes a bit longer 😀 ).
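Extending the two-field query above to the four-field case is mechanical; a sketch, where Field3 and Field4 are hypothetical names standing in for the real extra columns:

```sql
-- Four-way positional split: one CROSS APPLY per delimited field, with all
-- sequences tied back to the first split to avoid the cartesian blow-up.
-- Field3/Field4 are illustrative column names, not from the demo table.
SELECT DS.AnchorRef,
       f1.sequence AS SeqID,
       f1.Item AS Field1,
       f2.Item AS Field2,
       f3.Item AS Field3,
       f4.Item AS Field4
FROM #DEMO_SOURCE AS DS
CROSS APPLY dbo.SplitStrings_XML(DS.Field1, '|') AS f1
CROSS APPLY dbo.SplitStrings_XML(DS.Field2, '|') AS f2
CROSS APPLY dbo.SplitStrings_XML(DS.Field3, '|') AS f3
CROSS APPLY dbo.SplitStrings_XML(DS.Field4, '|') AS f4
WHERE f1.sequence = f2.sequence
  AND f1.sequence = f3.sequence
  AND f1.sequence = f4.sequence;
```

Without the three WHERE predicates, the four applies multiply out to up to 20^4 rows per record, which is the 20M-row cartesian result described above.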
December 11, 2015 at 8:11 am
Other than using a different function, the final code is basically the same. The only problem is that you're not using the fastest function available (even in the article, Aaron mentions he wouldn't use it in production environments). The only valid reason I can see to use that version is that it handles NVARCHAR(MAX), and you should be really sure that you need that.
That being said, if I understood your last paragraph correctly, the performance problems might come from other places.
December 12, 2015 at 6:14 pm
Luis Cazares (12/11/2015)
Other than using a different function, the final code is basically the same. The only problem is that you're not using the fastest function available (even in the article, Aaron mentions he wouldn't use it in production environments). The only valid reason I can see to use that version is that it handles NVARCHAR(MAX), and you should be really sure that you need that. That being said, if I understood your last paragraph correctly, the performance problems might come from other places.
Oddly enough, I'm actually writing an article on the serious testing problems in that particular article that make everyone think DelimitedSplitN4K isn't the fastest... and I'm not talking about just a little bit, either.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 12, 2015 at 6:23 pm
@Aaron.reese,
You have these 4 fields that contain hierarchical pipe delimited data and you've realized that's not the best structure for querying (it's called a "Hierarchical Path" or "Sort Path").
The question is, what do you want to do with the data once it's split up? How do you intend to use that data? I ask because I think I can help a whole lot but want to make sure I'm not barking up the wrong tree to start with.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 14, 2015 at 3:43 am
Thanks Jeff,
I will be comparing a subset of the output to another data set to carry out reconciliations: missing records from both sets and discrepancies in values.
The business scenario is that I have two separate processes that can each select a group of records for further processing, but the rules about how the records are selected are different in each process. I am trying to reconcile the two extracts and create a matrix report that shows how many records from each extract in system 1 appear in each extract in system 2.
The report will be produced in SSRS, but I want to do the heavy lifting in the database, as I also need to use the information to initiate the next stage in the process.
         S2.1   S2.2   S2.3   S2.4   missing
S1.1      108   1234     85      8        32
S1.2        0     89     15      0        18
S1.3       58      0     65      3         7
missing    35    457      0     89
The sample above tells me that I still need some additional extracts from both systems (i.e. not all records have been allocated to a group).
I can go into specifics if necessary, but I prefer to keep posts on the forum generic where possible so that others can apply the discussion to their situation.
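A minimal sketch of how that matrix could be produced in long form, assuming hypothetical tables #System1 and #System2 (AnchorRef, ExtractGroup) derived from the split output; the names and shapes here are illustrative, not from the thread:

```sql
-- Long-form reconciliation counts: FULL OUTER JOIN surfaces records that are
-- missing from either side, which COALESCE relabels as 'missing'.
-- #System1/#System2 and ExtractGroup are hypothetical stand-ins.
SELECT COALESCE(s1.ExtractGroup, 'missing') AS System1Group,
       COALESCE(s2.ExtractGroup, 'missing') AS System2Group,
       COUNT(*) AS RecordCount
FROM #System1 s1
FULL OUTER JOIN #System2 s2
  ON s1.AnchorRef = s2.AnchorRef
GROUP BY COALESCE(s1.ExtractGroup, 'missing'),
         COALESCE(s2.ExtractGroup, 'missing');
```

An SSRS matrix (tablix) can then pivot this long-form result into the grid shown above, with System1Group on rows and System2Group on columns.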