March 13, 2013 at 5:33 am
Hello,
I am in need of some help with an SSIS package (SQL 2008). I am trying to extract data out of a progress database and provide a reporting database for our users.
The area I am having some trouble is with some tables which have columns where data is stored and split by a ‘;’ . This data in the system actually represents different fields in the application so the end game is to have these in their own columns for people to select from.
Example of table:
GroupNumber Group_id GroupName AddressLine1 AddressLine2 AddressLine3 Postcode DataColumn
Group1 0001 TestGroup Capital letter Ltd Test Street Test Town XX1 22A 00-00-00;112115;TestBank
GROUP2 0002 TestGroup2 Bad letter Ltd Test2 Street Test2 Town XX2 11B 11-00-55;522445;TestBank2
The way the data needs to end up is to have the data column split out into:
Sortcode AccountNumber BankName
00-00-00 112115 TestBank
11-00-55 522445 TestBank2
The issue I am having is performance of the methods I have tried so far: the DataColum has up to 200 ‘columns’ within it. This is fine over a few records <100 but when it performs across >40,000 then I am starting to notice some serious performance problems results are taking as long as 8 minutes. If I only split 10 out of 100 columns it performs much better over 40,000 rows than splitting them all.
Example data:
CREATE TABLE GroupTable
(
GroupNumber VARCHAR(6) NULL
,Group_ID INT
,GroupName VARCHAR(10) NULL
,AddressLine1 VARCHAR(50) NULL
,AddressLine2 VARCHAR(50) NULL
,AddressLine3 VARCHAR(50) NULL
,Postcode VARCHAR(8) NULL
,Datacolumn VARCHAR(300)
);
INSERT INTO GroupTable (GroupNumber,Group_ID,GroupName,AddressLine1,AddressLine2,AddressLine3,Postcode,datacolumn)
VALUES
('Group1',1,'testgroup','Capital letter Ltd','1 Test Street','Test Town','XX1 22A','00-00-00;112115;TestBank'),
('Group2',2,'testgroup2','Low letter Ltd','2 Test2 Street','Test2 Town','XX2 11B','11-00-55;522445;TestBank2'),
('Group1',1,'testgroup','Capital letter Ltd','1 Test Street','Test Town','XX1 22A','10-10-10;112115;TestBank'),
('Group2',2,'testgroup2','Low letter Ltd','2 Test2 Street','Test2 Town','XX2 11B','11-00-00;522445;TestBank2')
I originally started using the following function and it was working ok until I got to pull larger amounts of data (and adding in more columns to split): (this was in an old post on these forums)
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER FUNCTION [dbo].[FromProgressArray] (@array NVARCHAR(4000), @index INT)
RETURNS NVARCHAR(256)
AS
BEGIN
IF @index = 0
RETURN( LEFT(@array, CHARINDEX(';', @array) - 1) )
DECLARE @counter INT
SELECT @counter = 0
WHILE @counter +1 < @index
BEGIN
IF (CHARINDEX(';', @array) <> 0)
SELECT @array = SUBSTRING(@array, CHARINDEX(';', @array) + 1, LEN(@array))
ELSE
SELECT @array = ''
SELECT @counter = @counter + 1
END
IF CHARINDEX(';', @array) <> 0
SELECT @array = LEFT(@array, CHARINDEX(';', @array) - 1)
RETURN( @array )
END
Example query:
SELECT
GroupNumber
,Group_ID
,GroupName
,AddressLine1
,AddressLine2
,AddressLine3
,Postcode
,dbo.FromProgressArray(datacolumn,1) AS sortcode
,dbo.FromProgressArray(datacolumn,2) AS AccountNumber
,dbo.FromProgressArray(datacolumn,3) AS BankName
FROM
dbo.GroupTable
I have also tried to use the tally table method mentioned by Jeff Moden: http://www.sqlservercentral.com/articles/T-SQL/62867/
Although this performs 50% quicker I have a problem here transposing the rows into columns.
I was using the function DelimitedSplit8K
Example query:
SELECT
GroupNumber
,GroupName
,AddressLine1
,AddressLine2
,AddressLine3
,Postcode
,MAX(CASE WHEN dsk.ItemNumber = 1 THEN dsk.Item END) AS SortCode
,MAX(CASE WHEN dsk.ItemNumber = 2 THEN dsk.Item END) AS AccountNumber
,MAX(CASE WHEN dsk.ItemNumber = 3 THEN dsk.Item END) AS BankName
FROM
GroupTable
CROSS APPLY dbo.DelimitedSplit8K(datacolumn, ';') dsk
GROUP BY
GroupNumber
,GroupName
,AddressLine1
,AddressLine2
,AddressLine3
,Postcode
The problem here is that the group record can have multiple entries in the table and the only way to distinguish the differences is once the datacolumn has been split out. This causes problems when using the above method as the grouping messes it up.
As you can see only two results are shown not the 4 that should be. If you add in dsk.itemnumber to the grouping the case ceases to work. I will also point out that the datacolumn can have no data between the ; .In the event it finds one it should add in a blank or NULL entry.
I have also looked into using the derived column feature but am not sure about the best way of doing this and haven’t found that much when using search engines (I maybe using the wrong keywords though).
Could someone tell me which way should be more efficient? Using a function or derived column in SSIS and assist with getting me on the right road so to say?
Many Thanks
March 13, 2013 at 8:32 am
OK so I have actually got this to a reasonable preformance time.
Using the delimetedsplit8k works great. this issue was with my query:
If you use a derrived table to like below you can produce it for the number of columns you need and the correct number of rows:
SELECT
GroupNumber
,GroupName
,AddressLine1
,AddressLine2
,AddressLine3
,Postcode
,dsk.item AS sortcode
,dsk2.item AS AccountNumber
,dsk3.item AS BankName
FROM
GroupTable
CROSS APPLY (SELECT item FROM dbo.DelimitedSplit8K(datacolumn, ';' ) WHERE dbo.DelimitedSplit8K.ItemNumber = 1) dsk
CROSS APPLY (SELECT item FROM dbo.DelimitedSplit8K(datacolumn, ';' ) WHERE dbo.DelimitedSplit8K.ItemNumber = 2) dsk2
CROSS APPLY (SELECT item FROM dbo.DelimitedSplit8K(datacolumn, ';' ) WHERE dbo.DelimitedSplit8K.ItemNumber = 3) dsk3
Viewing 2 posts - 1 through 1 (of 1 total)
You must be logged in to reply to this topic. Login to reply