December 28, 2007 at 4:31 am
I have a file Data.txt . I would like to split the textfile into a tabular format as well as the its data.
Unfortunately the only tool available to me right now is sql server 2005. How can I achieve this ?
December 28, 2007 at 1:54 pm
SQL server 2005 offers CLR integration... so I have performed this kind of imports coding a stored procedure using C# that reads the text file and inserts the data to the correspondind table in SQL database.
Regards,
Felix
December 28, 2007 at 3:17 pm
That's nice... But the OP did say "Unfortunately the only tool available to me right now is sql server 2005. How can I achieve this ?" I'm pretty sure that means the C# based CLR's are excluded...
--Jeff Moden
Change is inevitable... Change for the better is not.
December 28, 2007 at 9:40 pm
Ray,
Just curious... obviously this information is produced by a database somewhere... is there no way to convince the data provider to provide the information either in a delimited format or an XML format?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 29, 2007 at 12:41 am
Actually... I've done it... I've got a few more pieces together, but it's all but done.
--Jeff Moden
Change is inevitable... Change for the better is not.
January 3, 2008 at 9:48 pm
Ok... sorry for the delay... couple of things came up and I couldn't get back to this... a shame, really, because this was kind'a fun.
In order to split the information like you wanted, we first have to reassemble the rows according to the numbering embedded in your document. Once we have all the related rows in a single row, then the split is fairly easy.
The splits could be controlled by a "configuration table" but I didn't take the time to do that.
In order to reassemble the rows, we must first import them in a given order. In order to do that, we need an IDENTITY column on the import table (Raw1). In order to import data to a table that has an IDENTITY column, we also need a BCP format file so we can "skip" the IDENTITY column. See the attachment called "Raw1Format.txt". That file must live in the same folder as the raw data file you want to split. For the casual observer, here's what that file looks like...
8.0
2
1 SQLCHAR 0 0 "" 0 RowNum ""
2 SQLCHAR 0 8000 "\r" 2 RawData SQL_Latin1_General_CP1_CI_AS
The rest is history... I'll leave it up to you to insert the output from the following code into a table...
--=====================================================================================================================
-- Setup
--=====================================================================================================================
--===== Create the Raw1 working table
-- Used to congeal the data and place process markers
IF OBJECT_ID('TempDB..Raw1','U') IS NOT NULL
DROP TABLE Raw1
CREATE TABLE Raw1
(
RowNum INT IDENTITY(1,1),
RawData VARCHAR(200), --The raw data being imported
SectionNumber CHAR(2), --The section number (single number on row)
ItemType CHAR(1), --A=Individual, B=Entities, C=List
ItemNumber CHAR(10), --The numeric item number under ItemType
LastUpdated DATETIME, --From the "Last Updated On" rows
ItemMark AS SectionNumber+ItemType+ItemNumber
)
ALTER TABLE Raw1
ADD CONSTRAINT PK_Raw1_RowNum
PRIMARY KEY CLUSTERED (RowNUM)
--===== Create the Raw2 working table
-- Used to split the congealed the data
IF OBJECT_ID('TempDB..Raw2','U') IS NOT NULL
DROP TABLE Raw2
CREATE TABLE Raw2
(
RowNum INT IDENTITY(1,1),
ItemMark CHAR(13),
RawData VARCHAR(MAX),
[*Good quality a.k.a.:]VARCHAR(255)
)
ALTER TABLE Raw2
ADD CONSTRAINT PK_Raw2_RowNum
PRIMARY KEY CLUSTERED (RowNUM)
GO
--=====================================================================================================================
-- Import the raw data file
--=====================================================================================================================
--===== Import the raw data. RowNum will keep the order of rows
BULK INSERT Raw1
FROM 'D:\SQL Examples\Worlds Nastiest Split\Data.txt' --TODO: CHANGE PATH AND FILE NAME
WITH (
DATAFILETYPE = 'CHAR',
FORMATFILE = 'D:\SQL Examples\Worlds Nastiest Split\Raw1Format.txt', --TODO: CHANGE PATH ONLY
TABLOCK
)
--=====================================================================================================================
-- Find and mark the various sections indicated in the file
--=====================================================================================================================
--===== Find and mark "SectionNumber" rows
UPDATE Raw1
SET SectionNumber = RawData
FROM Raw1
WHERE RawData NOT LIKE '%[^0-9]%'
--===== Find and mark "ItemType" rows
UPDATE Raw1
SET ItemType = SUBSTRING(RawData,1,CHARINDEX('.',RawData)-1)
FROM Raw1
WHERE RawData LIKE '[A-Z]. _%'
--===== Find and mark "ItemNumber" rows
UPDATE Raw1
SET ItemNumber = SUBSTRING(RawData,1,CHARINDEX('.',RawData)-1)
FROM Raw1
WHERE RawData LIKE '[0-9]. *Name:%'
OR RawData LIKE '[0-9][0-9]. *Name:%'
OR RawData LIKE '[0-9][0-9][0-9]. *Name:%'
OR RawData LIKE '[0-9][0-9][0-9][0-9]. *Name:%'
--===== Find and mark "LastUpdated" rows
UPDATE Raw1
SET LastUpdated = SUBSTRING(RawData,17,200)
FROM Raw1
WHERE RawData LIKE 'Last Updated On %'
--=====================================================================================================================
-- Smear the various section data down through the rows to identify them.
-- This whole section is just to make it so the ItemMark can be used as the key to reassembling the related rows
-- of raw data so that we can parse them all the same way.
--=====================================================================================================================
--===== Declare the variables that will remember the data to be "smeared" down
DECLARE @SectionNumber CHAR(2)
DECLARE @ItemType CHAR(1)
DECLARE @ItemNumber CHAR(10)
DECLARE @LastUpdated DATETIME
DECLARE @RowNum INT --Used as an "anchor"
--===== Preset the variables to a known condition
SET @SectionNumber = 0
SET @ItemType = NULL
SET @ItemNumber = NULL
SET @LastUpdated = '19000101'
SET @RowNum = 0
--===== Smear the known section and other data down through the other rows to identify them
UPDATE Raw1
SET @ItemNumber = ItemNumber = STR(
CASE WHEN ItemNumber IS NULL
AND ItemType IS NULL
THEN @ItemNumber
ELSE ItemNumber
END
,5),
@ItemType = ItemType = CASE WHEN ItemType IS NULL
AND SectionNumber IS NULL
THEN @ItemType
ELSE ItemType
END,
@SectionNumber = SectionNumber = STR(
CASE WHEN SectionNumber IS NULL
THEN @SectionNumber
ELSE SectionNumber
END
,2),
@LastUpdated = LastUpdated = CASE WHEN LastUpdated IS NULL
THEN @LastUpdated
ELSE LastUpdated
END,
@RowNum = RowNum --Just an "anchor"
FROM Raw1 WITH (INDEX(PK_Raw1_RowNum),TABLOCKX)
WHERE RawData > '' --Is not blank and is not null
--===== Delete all rows we no longer need
DELETE Raw1
WHERE ItemNumber IS NULL
OR RawData LIKE 'Last Updated On %'
--=====================================================================================================================
-- Reassemble the individual rows by SectionNumber, ItemType, and ItemNumber... (Derived table "s")
-- ... then split the data from the derived table into the elements we want
--=====================================================================================================================
SELECT SectionNumber, ItemType, ItemNumber,
Name1 =
CASE WHEN RawData LIKE '%*Name: 1:%'
THEN RTRIM(LTRIM(SUBSTRING(RawData,
CHARINDEX('*Name: 1:',RawData)+LEN('*Name: 1:'),
CHARINDEX('2: ',RawData)
-(CHARINDEX('*Name: 1:',RawData)+LEN('*Name: 1:')))))
WHEN RawData LIKE '%*Name: %*A.k.a.:%'
THEN RTRIM(LTRIM(SUBSTRING(RawData,
CHARINDEX('*Name:',RawData)+LEN('*Name:'),
CHARINDEX('*A.k.a.:',RawData)
-(CHARINDEX('*Name:',RawData)+LEN('*Name:')))))
ELSE 'na'
END,
Name2 =
CASE WHEN RawData LIKE '%*Name: 1:%'
THEN RTRIM(LTRIM(SUBSTRING(RawData,
CHARINDEX('2:',RawData)+LEN('2:'),
CHARINDEX('3: ',RawData)
-(CHARINDEX('2:',RawData)+LEN('2:')))))
ELSE 'na'
END,
Name3 =
CASE WHEN RawData LIKE '%*Name: 1:%'
THEN RTRIM(LTRIM(SUBSTRING(RawData,
CHARINDEX('3:',RawData)+LEN('3:'),
CHARINDEX('4: ',RawData)
-(CHARINDEX('3:',RawData)+LEN('3:')))))
ELSE 'na'
END,
Name4 =
CASE WHEN RawData LIKE '%*Name: 1:%'
THEN RTRIM(LTRIM(SUBSTRING(RawData,
CHARINDEX('4:',RawData)+LEN('4:'),
CHARINDEX('Title',RawData)
-(CHARINDEX('4:',RawData)+LEN('4:')))))
ELSE 'na'
END,
Title1 =
CASE WHEN RawData LIKE '%Title 1:%'
THEN RTRIM(LTRIM(SUBSTRING(RawData,
CHARINDEX('Title 1:',RawData)+LEN('Title 1:'),
CHARINDEX('Title 2:',RawData)
-(CHARINDEX('Title 1:',RawData)+LEN('Title 1:')))))
WHEN RawData LIKE '%*Name: %Title:%'
THEN RTRIM(LTRIM(SUBSTRING(RawData,
CHARINDEX('Title:',RawData)+LEN('Title:'),
CHARINDEX('Designation:',RawData)
-(CHARINDEX('Title:',RawData)+LEN('Title:')))))
ELSE 'na'
END,
Title2 =
CASE WHEN RawData LIKE '%Title 2:%'
THEN RTRIM(LTRIM(SUBSTRING(RawData,
CHARINDEX('Title 2:',RawData)+LEN('Title 2:'),
CHARINDEX('Designation:',RawData)
-(CHARINDEX('Title 2:',RawData)+LEN('Title 2:')))))
ELSE 'na'
END,
Designation =
CASE WHEN RawData LIKE '%Designation:%'
THEN RTRIM(LTRIM(SUBSTRING(RawData,
CHARINDEX('Designation:',RawData)+LEN('Designation:'),
CHARINDEX('DOB:',RawData)
-(CHARINDEX('Designation:',RawData)+LEN('Designation:')))))
ELSE 'na'
END,
DOB =
CASE WHEN RawData LIKE '%DOB:%'
THEN RTRIM(LTRIM(SUBSTRING(RawData,
CHARINDEX('DOB:',RawData)+LEN('DOB:'),
CHARINDEX('POB:',RawData)
-(CHARINDEX('DOB:',RawData)+LEN('DOB:')))))
ELSE 'na'
END,
POB =
CASE WHEN RawData LIKE '%POB:%'
THEN RTRIM(LTRIM(SUBSTRING(RawData,
CHARINDEX('POB:',RawData)+LEN('POB:'),
CHARINDEX('*Good quality a.k.a.:',RawData)
-(CHARINDEX('POB:',RawData)+LEN('POB:')))))
ELSE 'na'
END,
[*Good quality a.k.a.:] =
CASE WHEN RawData LIKE '%*Good quality a.k.a.:%'
THEN RTRIM(LTRIM(SUBSTRING(RawData,
CHARINDEX('*Good quality a.k.a.:',RawData)+LEN('*Good quality a.k.a.:'),
CHARINDEX('Low quality a.k.a.:',RawData)
-(CHARINDEX('*Good quality a.k.a.:',RawData)+LEN('*Good quality a.k.a.:')))))
ELSE 'na'
END,
[Low quality a.k.a.:] =
CASE WHEN RawData LIKE '%Low quality a.k.a.:%'
THEN RTRIM(LTRIM(SUBSTRING(RawData,
CHARINDEX('Low quality a.k.a.:',RawData)+LEN('Low quality a.k.a.:'),
CHARINDEX('*Nationality:',RawData)
-(CHARINDEX('Low quality a.k.a.:',RawData)+LEN('Low quality a.k.a.:')))))
ELSE 'na'
END,
[*A.k.a.:] =
CASE WHEN RawData LIKE '%*A.k.a.:%'
THEN RTRIM(LTRIM(SUBSTRING(RawData,
CHARINDEX('*A.k.a.:',RawData)+LEN('*A.k.a.:'),
CHARINDEX('F.k.a.:',RawData)
-(CHARINDEX('*A.k.a.:',RawData)+LEN('*A.k.a.:')))))
ELSE 'na'
END,
[F.k.a.:] =
CASE WHEN RawData LIKE '%F.k.a.:%'
THEN RTRIM(LTRIM(SUBSTRING(RawData,
CHARINDEX('F.k.a.:',RawData)+LEN('F.k.a.:'),
CHARINDEX('*Address:',RawData)
-(CHARINDEX('F.k.a.:',RawData)+LEN('F.k.a.:')))))
ELSE 'na'
END,
[*Nationality:] =
CASE WHEN RawData LIKE '%*Nationality:%'
THEN RTRIM(LTRIM(SUBSTRING(RawData,
CHARINDEX('*Nationality:',RawData)+LEN('*Nationality:'),
CHARINDEX('Passport no.:',RawData)
-(CHARINDEX('*Nationality:',RawData)+LEN('*Nationality:')))))
ELSE 'na'
END,
[Passport no.:] =
CASE WHEN RawData LIKE '%Passport no.:%'
THEN RTRIM(LTRIM(SUBSTRING(RawData,
CHARINDEX('Passport no.:',RawData)+LEN('Passport no.:'),
CHARINDEX('National identification no.:',RawData)
-(CHARINDEX('Passport no.:',RawData)+LEN('Passport no.:')))))
ELSE 'na'
END,
[National identification no.:] =
CASE WHEN RawData LIKE '%National identification no.:%'
THEN RTRIM(LTRIM(SUBSTRING(RawData,
CHARINDEX('National identification no.:',RawData)+LEN('National identification no.:'),
CHARINDEX('Address:',RawData)
-(CHARINDEX('National identification no.:',RawData)+LEN('National identification no.:')))))
ELSE 'na'
END,
[Address:] =
CASE WHEN RawData LIKE '%*Address:%*Listed on:%'
THEN RTRIM(LTRIM(SUBSTRING(RawData,
CHARINDEX('*Address:',RawData)+LEN('*Address:'),
CHARINDEX('*Listed on:',RawData)
-(CHARINDEX('*Address:',RawData)+LEN('*Address:')))))
WHEN RawData LIKE '%*Address:%Listed on:%'
THEN RTRIM(LTRIM(SUBSTRING(RawData,
CHARINDEX('*Address:',RawData)+LEN('*Address:'),
CHARINDEX('Listed on:',RawData)
-(CHARINDEX('*Address:',RawData)+LEN('*Address:')))))
ELSE 'na'
END,
[*Listed on:] =
CASE WHEN RawData LIKE '%Listed on:%'
THEN RTRIM(LTRIM(SUBSTRING(RawData,
CHARINDEX('*Listed on:',RawData)+LEN('*Listed on:'),
CHARINDEX('*Other information:',RawData)
-(CHARINDEX('*Listed on:',RawData)+LEN('*Listed on:')))))
ELSE 'na'
END,
[*Other information:] =
CASE WHEN RawData LIKE '%*Other information:%'
THEN RTRIM(LTRIM(SUBSTRING(RawData,
CHARINDEX('*Other information:',RawData)+LEN('*Other information:'),
1000)))
ELSE 'na'
END
FROM
(--==== Derived table "s" reassembles the data from the "Raw1" table and makes some minor data corrections
SELECT t1.ItemMark, SectionNumber, ItemType, ItemNumber,
REPLACE(REPLACE(STUFF((SELECT ' ' + t2.RawData FROM Raw1 t2 WHERE t1.ItemMark = t2.ItemMark FOR XML PATH('')),1,1,''),' :',':'),' ',' ') AS RawData
FROM Raw1 t1
GROUP BY t1.ItemMark, SectionNumber, ItemType, ItemNumber
)s
For your convenience, I've attached that bit of T-SQL in the file called "Work 01.txt" (not allowed to attach .SQL extensions on this forum).
Before you try to run the code, find the two places that have the word "TODO"... you must make a couple of path and file name changes to match your environment. Normally, I'd have these in a configuration table, as well, but simply didn't do it for this "freebie".
If you have any questions, please read the comments in the code first, then study the code before you ask... The comments aren't the greatest I've ever written but, like I said, I'm a bit pressed for time this week...
--Jeff Moden
Change is inevitable... Change for the better is not.
January 3, 2008 at 9:50 pm
p.s. I just noticed that I left the code in that creates the "Raw2" table... that code is just not necessary and may be removed. Sorry for the non-lethal mistake... 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 7 posts - 1 through 6 (of 6 total)
You must be logged in to reply to this topic. Login to reply