Importing Poorly Formatted Text File to SQL

Question

Importing Poorly Formatted Text File to SQL

mxy

SSCarpal Tunnel

Points: 4036
More actions
January 30, 2016 at 8:49 am

#311148

I have below formatted text file each record for a user is separated with space, userid is unique value.
i want to import to SQL either using SSIS or bulk insert, i need o/p as below format
Text file
dn: userid=test123,ou=company.com
objectClass: top
userid: test123
companyVal: 1234
preferredLanguage:: IA==
company: Yahoo
job-title: Admin
job-title: approve_perm
mail: a.b@test.com
createdby:
modifiedby:
createdTime: 20110917180228Z
modifiedTime: 20110917180228Z
dn: userid=test123,ou=company.com
objectClass: top
userid: test1235
companyVal: 12345
preferredLanguage:: IA==
pyramid-group: 5
company: Yahoo
mail: abc@test.com
createdby:
modifiedby:
createdTime: 20110917180228Z
modifiedTime: 20110917180228Z
dn: userid=test12345,ou=company.com
objectClass: top
userid: test12345
companyVal: 12345
preferredLanguage:: IA==
company: google
job-title: Admin
job-title: approve_perm
mail: a.b@test.com
createdby:
modifiedby:
createdTime: 20110917180228Z
modifiedTime: 20110917180228Z
Desired o/p format
Userid CompanyVal Company Job-title Mail Createdby Modifiedby Createdtime Modifiedtime
test123 1234 Yahoo admin,approve_perm a.b@test.com 20110917 20110917

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply

Hugo Kornelis SSC Guru Points: 64790 More actions · Answer 1

I am far from an SSIS expert, but with the little I know I think that you'll have to use an Execute Script task for this. Which means you'll need to have some C# or VB coding skills to pull this off. I do not have such skills (if I had to do this it would cost me a full day with lots of trial and error, lots of internet searches, lots of frustration and a great sense of achievement when it finally works). But the rough outline of the code would be as follows:

Initialise and set up

While (not end of data)

Initialise output row

While (not end of output row) /* I think you can recognise this from the empty line */

Read input line

Parse line and store in appropriate column

End

Create output row

End

Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
Visit my SQL Server blog: https://sqlserverfast.com/blog/
SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

Phil Parkin SSC Guru Points: 246985 More actions · Answer 2

Hugo is heading in the right direction. This is quite tricky. Here is an outline solution.

Create a flat file connection pointing to your source file. Set the connection to have only one column.

Add a dataflow task, with the flat file as the source.

Add a script component transformation and connect your source to it.

Make the script component asynchronous (number of rows in <> number of rows out).

Create an output from the SC which contains the columns you want.

Edit the code inside the SC to consume the rows from the flat file, assigning the results to output columns as each line is processed and then, when a new record is found, to send the row to the output you have defined.

This code will take a while to write, but it's not that difficult. You will find examples if you search for them.

The SC should be connected to your target table (using the output you defined) and columns mapped accordingly.

The absence of evidence is not evidence of absence.
Martin Rees

You can lead a horse to water, but a pencil must be lead.
Stan Laurel

Ed Wagner SSC Guru Points: 287026 More actions · Answer 3

I don't know SSIS, but I don't think you need it for this. That's not that horrible of a format to deal with. It looks like you have a key/value pair consistently delimited by a colon. You mentioned BULK INSERT in your original post and I think that's the right direction for this.

The first step will be to import the text file into a single column in a staging table. From there, you can use DelimitedSplit8K to split the variable-length strings (key and value) to their own columns. If you aren't familiar with DelimitedSplit8K yet, it's explained in Jeff Moden's article at http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Now that you have the key and value in separate columns, you can start working with it. You'll need to use several techniques to get to where you need to be. I have an article on detecting differences in files on servers at http://www.sqlservercentral.com/articles/T-SQL/127804/. The overall topic is completely unrelated, but if you scroll down to the heading "Parsing the Data", you'll find a walkthrough of the techniques you need. The end goal is to get the different values consolidated into a single row and different columns for each value.

If you're on SQL 2012 and the order of the key/value pairs is consistent in all blocks you need to import, I think you'll find the LEAD and LAG functions to be especially useful after you import and split your strings.

I'm sorry that I don't have time to write the code out for you. I enjoy this type of thing and wish I had the time to devote to it.

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions

mxy SSCarpal Tunnel Points: 4036 More actions · Answer 4

Thanks everyone for your response. I dont know much about c#, will try solutions provided by Phil and Wagner.

thanks

Orlando Colamatteo SSC Guru Points: 182279 More actions · Answer 5

I wanted to do this with only native SSIS objects but am not seeing a way. Pivot was promising but the source data does not have a grouping attribute, e.g. an ID on every line of the file that would tie each profile together. My second inclination after bailing on Pivot was to use a Script Component as a Source but ": " (notice we need to type in a custom 2-byte delimiter with a space after the colon because preferredLanguage has a double-colon) in the data is a natural delimiter so I went with a Flat File Source followed by a Script Component setup for Transformation, basically what Phil described while I was playing around with this. Attached is a proof-of-concept solution.

The zip contains:

1. ActiveDirectoryLog.sql - DDL to create destination table in tempdb

2. ActiveDirectoryLog.txt - sample input file (from OP)

3. ImportActiveDirectoryLog.dtsx - SSIS Package that loads the sample file into the table in tempdb

Note, the SSIS is in SQL 2016 format so you'll need to download the CTP and choose to install SSDT to be able to open this and try it out.

Some things I noticed about your input file that will add some wrinkles:

EDIT: update image

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 6

@orlando,

I've gotten away from SSIS so much that I don't even know how to open SSIS and load a package. I'm really interested in seeing what you've done. Can you give me a quick lesson on how to load the package. I think I can follow my nose after that. Thanks.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 7

mxy (1/30/2016)
I have below formatted text file each record for a user is separated with space, userid is unique value.
i want to import to SQL either using SSIS or bulk insert, i need o/p as below format
Text file
dn: userid=test123,ou=company.com
objectClass: top
userid: test123
companyVal: 1234
preferredLanguage:: IA==
company: Yahoo
job-title: Admin
job-title: approve_perm
mail: a.b@test.com
createdby:
modifiedby:
createdTime: 20110917180228Z
modifiedTime: 20110917180228Z
dn: userid=test123,ou=company.com
objectClass: top
userid: test1235
companyVal: 12345
preferredLanguage:: IA==
pyramid-group: 5
company: Yahoo
mail: abc@test.com
createdby:
modifiedby:
createdTime: 20110917180228Z
modifiedTime: 20110917180228Z
dn: userid=test12345,ou=company.com
objectClass: top
userid: test12345
companyVal: 12345
preferredLanguage:: IA==
company: google
job-title: Admin
job-title: approve_perm
mail: a.b@test.com
createdby:
modifiedby:
createdTime: 20110917180228Z
modifiedTime: 20110917180228Z
Desired o/p format
Userid CompanyVal Company Job-title Mail Createdby Modifiedby Createdtime Modifiedtime
test123 1234 Yahoo admin,approve_perm a.b@test.com 20110917 20110917

You still need to tell us what you want to do with duplicate "field names" as Orlando has suggested. I just figure that we can't correct a blank piece of paper so here's my go at it using BULK INSERT.

For a BULK INSERT solution, the first thing we need to do with code is to create a BCP format file because of the columns in the staging table that don't exist in the data. This is essential because absolutely need the IDENTITY column to preserve the order of input or this whole thing falls apart. Here's the format file that I used. Think of this as the "mapping" step.

Here's the BCP format file I used. I stored it in my C:\Temp directory on my box. You can certainly change that. I know nothing of your file names so I just called it JobFile.fmt.

[font="Arial Black"]{EDIT} NOTE THAT THIS FORUM ATE THE "\ n" (without the space) SO YOU NEED TO REMOVE THE SPACE BEFORE YOU SAVE THE BCP FORMAT FILE BELOW!!![/font]

9.0

6

1 SQLCHAR 0 0 "" 0 LineNumber ""

2 SQLCHAR 0 0 "" 0 RecordNumber ""

3 SQLCHAR 0 0 "" 0 ItemName ""

4 SQLCHAR 0 0 "" 0 ItemValue ""

5 SQLCHAR 0 0 "" 0 ItemInstance ""

6 SQLCHAR 0 1000 "\ n" 6 LineData ""

Next, I stored your example file in my C:\Temp directory, as well. Again, you can certainly change that.

Having done those two things and ensuring that your server can see them and that the files are not open by anything else, the following code accomplishes what you want. Details are in the comments, where they belong. I strongly recommend you lookup and study BCP FORMAT FILES and CROSSTABs.

--Reference: http://www.sqlservercentral.com/Forums/Topic1757028-3077-1.aspx

--=====================================================================================================================

-- Prepare the file staging table

--=====================================================================================================================

--===== If the file staging table exists, drop it to make reruns in SSMS easier.

IF OBJECT_ID('tempdb..#FileStaging','U') IS NOT NULL

DROP TABLE #FileStaging

;

--===== Create the initial file table.

-- Notice the PERSISTED computed columns that do some parsing as soon as a row appears in the table.

CREATE TABLE #FileStaging

(

LineNumber INT IDENTITY(1,1) PRIMARY KEY CLUSTERED

,RecordNumber INT

,ItemName AS RTRIM(LTRIM(REPLACE(SUBSTRING(LineData,1,NULLIF(CHARINDEX(':',LineData),0)-1),CHAR(13),''))) PERSISTED

,ItemValue AS RTRIM(LTRIM(REPLACE(SUBSTRING(LineData,DATALENGTH(LineData)-NULLIF(CHARINDEX(':',REVERSE(LineData)),0)+2,500),CHAR(13),''))) PERSISTED

,ItemInstance INT

,LineData VARCHAR(500)

)

;

--=====================================================================================================================

-- Load the data from the file

-- (Hard coded for this example. Convert to dynamic SQL if desired or necessary.)

--=====================================================================================================================

--===== Load the file staging table from the file.

-- Because of the BCP format file, the first 5 columns of the table are ignored.

BULK INSERT #FileStaging

FROM 'C:\Temp\Jobfile.txt'

WITH (

BATCHSIZE = 2000000000

,CODEPAGE = 'RAW'

,DATAFILEtYPE = 'CHAR'

,FORMATFILE = 'C:\Temp\JobFile.fmt'

,TABLOCK

)

;

--=====================================================================================================================

-- Add the record numbers to the staging table so that we can group on them when we pivot the data for

-- final result.

--=====================================================================================================================

--===== Local obviously named variables

DECLARE @SafetyCounter INT

,@RecordNumber INT;

SELECT @SafetyCounter = 1

,@RecordNumber = 1

;

--===== Use a Quirky Update (with a safety counter) to isolate the records and assign record numbers based on the

-- blank lines between the records. Hopefully, the provider of the data is consistent at least about that.

-- If you're using 2012, I think you could use PRECEEDING ROWS to do similar but this works in 2005 and up.

-- Note that a "record" consists of multiple "lines", in this case.

WITH cteRN AS

(

SELECT LineNumber

,RecordNumber

,LineData

FROM #FileStaging

)

UPDATE tgt

SET @RecordNumber

= RecordNumber --The two back-to-back equals signs are a key part of the Quirky Update.

= CASE --It's improperly documented in Books Online (they don't know how to use it) and definitely NOT supporte by MicroSoft.

WHEN LineNumber = @SafetyCounter

THEN CASE WHEN LineData > ' ' THEN @RecordNumber ELSE @RecordNumber+1 END

ELSE 'Sequence Error'

END

, @SafetyCounter = @SafetyCounter + 1

FROM cteRN tgt WITH (TABLOCKX, INDEX(1))

OPTION (MAXDOP 1)

;

--=====================================================================================================================

-- Enumerate the instances of the ItemName so we can handle multiple instances per record.

-- Note that we haven't yet been told how you want to handle such a problem.

--=====================================================================================================================

WITH cteEnumerate AS

(

SELECT InstanceNumber = ROW_NUMBER() OVER (PARTITION BY RecordNumber,ItemName ORDER BY LineNumber)

,ItemInstance

FROM #FileStaging

)

UPDATE cteEnumerate

SET ItemInstance = InstanceNumber

;

--=====================================================================================================================

-- Create the final output.

-- AGAIN, WE HAVEN'T YET BEEN TOLD HOW YOU WANT TO HANDLE DUPLICATED ITEMNAMES AND RELATED VALUES.

-- THIS ONLY USES THE FIRST VALUE WHEN SUCH DUPLICATION IS PRESENT.

-- I also recommend that you change the dates to a real date/time datatype but didn't do so here.

--=====================================================================================================================

SELECT UserID = MAX(CASE WHEN ItemName = 'userid' THEN ItemValue ELSE '' END)

,CompanyVal = MAX(CASE WHEN ItemName = 'companyVal' THEN ItemValue ELSE '' END)

,Company = MAX(CASE WHEN ItemName = 'company' THEN ItemValue ELSE '' END)

,JobTitle = MAX(CASE WHEN ItemName = 'job-title' THEN ItemValue ELSE '' END)

,eMail = MAX(CASE WHEN ItemName = 'mail' THEN ItemValue ELSE '' END)

,CreatedBy = MAX(CASE WHEN ItemName = 'createdby' THEN ItemValue ELSE '' END)

,ModifiedBy = MAX(CASE WHEN ItemName = 'modifiedby' THEN ItemValue ELSE '' END)

,CreatedDate = MAX(CASE WHEN ItemName = 'createdTime' THEN LEFT(ItemValue,8) ELSE '' END)

,ModifiedDate = MAX(CASE WHEN ItemName = 'modifiedTime' THEN LEFT(ItemValue,8) ELSE '' END)

FROM #FileStaging

WHERE ItemInstance = 1

GROUP BY RecordNumber

;

Using the file data you provided, here's what the above code returns...

UserID CompanyVal Company JobTitle eMail CreatedBy ModifiedBy CreatedDate ModifiedDate

----------- ----------- ------- ----------- --------------- ----------- ----------- ----------- ------------

test123 1234 Yahoo Admin a.b@test.com 20110917 20110917

test1235 12345 Yahoo abc@test.com 20110917 20110917

test12345 12345 google Admin a.b@test.com 20110917 20110917

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 8

@mxy,

BTW, what was the DOS command that you used to create the file? The reason I ask is because there may be a simpler method available but I need to know what the full syntax of the command that was used to produce the file. Of course, do not include any real computer or domain names. Replace those with something else.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Orlando Colamatteo SSC Guru Points: 182279 More actions · Answer 9

Jeff Moden (1/30/2016)
@Orlando,
I've gotten away from SSIS so much that I don't even know how to open SSIS and load a package. I'm really interested in seeing what you've done. Can you give me a quick lesson on how to load the package. I think I can follow my nose after that. Thanks.

1. Download latest SQL Server 2016 CTP and minimum install SSDT*.

2. Download the attachment from my previous post and extract the .dtsx file to a local location.

3. Open SSDT then select File>Open>File to find and open the .dtsx.

That should be all the steps you need to do although step 1 could take hours to accomplish due to the size of the download and installation.

Here are some snaps of the SSIS screens:

All the integration work is done in configuring the C# Script Component to setup the behavior (asynchronous, i.e. more rows going into the Component than being output), the Output Columns (i.e. the shape of the dataset being output) and the C# code that decides how to make sense of the rows going in to get the right rows going out.

* SSDT = SQL Server Data Tools was introduced with SQL 2012 and is the replacement for BIDS from SQL 2005 and 2008. SSDT, like BIDS, is just a brand name for "Visual Studio Shell plus Visual Studio Project Types for Database, SSRS, SSIS and SSAS Projects.

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

Orlando Colamatteo SSC Guru Points: 182279 More actions · Answer 10

Jeff Moden (1/30/2016)
[font="Arial Black"]{EDIT} NOTE THAT THIS FORUM ATE THE "\ n" (without the space) SO YOU NEED TO REMOVE THE SPACE BEFORE YOU SAVE THE BCP FORMAT FILE BELOW!!![/font]

I have had this issue too and you can get around it by inputing \n as the text which will yield \n when printed on this forum. Hit the "Quote" button on this post to see what I mean.

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

Orlando Colamatteo SSC Guru Points: 182279 More actions · Answer 11

Jeff Moden (1/30/2016)
@mxy,
BTW, what was the DOS command that you used to create the file? The reason I ask is because there may be a simpler method available but I need to know what the full syntax of the command that was used to produce the file. Of course, do not include any real computer or domain names. Replace those with something else.

I did recognize this as Active Directory output as evidenced by my choice of file names in my demo but it didn't occur to me to ask about reshaping the upstream input file. I think because I was so excited about trying to solve the tech challenge around loading such a strange file-format. I think you are onto something though. Now that you raised this it looks like dsquery output which has options for output format, although I do not know how a tool would rationalize multiple instances of properties like job-title and if they did if it would suit the requirement.

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 12

Thank you for the post on how to load the SSIS package. I appreciate the time you took. I'll have to pass, though. I was looking forward to seeing an SSIS solution that had no script component.

Shifting gears, I had a comment in my reply to the op about beating on the provider to get the data in a more normal format. I didn't realize this was an AD output until you mentioned it. As you say, it seems like DSQuery or WMIC for AD could do a much better job of producing more readily importable data.

Agreed on the problem of multiple instances of a property.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 13

Orlando Colamatteo (1/30/2016)
I think because I was so excited about trying to solve the tech challenge around loading such a strange file-format.

+1000 to that. It was a bit o' fun!

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

mxy SSCarpal Tunnel Points: 4036 More actions · Answer 14

I have received file from other team,i will check with them, how it was generated.

About duplicate columns names, even though column name is repeated value is unique and it can be n number of times per user.

all attributes are not in all profiles - we can ignore those, we dont need those attributes in o/p

Double colon attributes are not required we can ignore those fields