importing text file with bad lines

• I'm trying to import a 2-million-line file that fails on the odd bad line.

I've written a script to parse the content from a single-column table into a second 18-column table, but it is taking far too long to run.

    Is there a way to script DTS to ignore the (short) bad lines and import directly to the table, with all comma-delimited content in the right place?

I tried 'if len(DTSSource) > 200', but my knowledge of VB is too sketchy to write the error-handling code....

• I presume you are importing from a text file and delimiting by comma. If so, DTS should ignore any columns in excess of the mapping and set NULL for any that are missing (as long as the columns allow NULLs). One way for DTS to really get upset is when you have a CR/LF inside quoted text; I found this out when I tried to import the Windows event log.

Can you post the table DDL and some sample data where the problem is?

    Far away is close at hand in the images of elsewhere.
    Anon.

• I think you're on the right track there. Inside your Transform Data Task, you want to create an ActiveX Script transformation with no source or destination columns.

    Inside the Main() function, you would test certain columns to see if they contain data. If not, then the whole row can be skipped. Here's some code from one of my packages:

Function Main()
    ' test the last seven characters of the first source column
    Dummy1 = Right(DTSSource("Col001"), 7)
    If Dummy1 = "PLANNED" Then
        ' this will skip the entire row
        Main = DTSTransformStat_SkipRow
    Else
        ' this will continue processing the row
        Main = DTSTransformStat_OK
    End If
End Function

Note that there are other variations on the DTSTransformStat_??????? constant that will call error-handler routines or write the row to an exception file, if that is what you want to happen. Hope this helps...
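For example, to send the bad rows to the task's exception file instead of silently dropping them, the same test could return the exception status (a sketch; the exception file itself is set up on the Transform Data Task's Options tab):

Function Main()
    ' route rows ending in "PLANNED" to the exception file
    ' rather than skipping them outright
    If Right(DTSSource("Col001"), 7) = "PLANNED" Then
        Main = DTSTransformStat_ExceptionRow
    Else
        Main = DTSTransformStat_OK
    End If
End Function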

• Yes - it's a log file with CSV properties.

    I know DTS should do that, but it keeps hitting 'non-whitespace' characters in the broken lines and erroring out on them, rather than dropping a NULL in.

    No quoted text either.

    ''''''''''''''''''

-- clear any previous load
delete log20030805

declare @intidx1 int, @intidx0 int
declare @charcol1 char(25), @charcol2 char(25), @charcol3 char(25), @charcol4 char(25),
    @charcol5 datetime, @charcol6 datetime, @charcol7 char(25), @charcol8 char(25),
    @charcol9 char(25), @charcol10 char(255), @charcol11 char(25), @charcol12 int,
    @charcol13 int, @charcol14 int, @charcol15 char(10), @charcol16 char(25),
    @charcol17 char(25), @charcol18 varchar(2500)
declare @coll1 varchar(8000)
declare @rn int, @drn int

set dateformat mdy

declare curname cursor for select [col001] from webd20030805
open curname
fetch next from curname into @coll1

set @rn = 1   -- row counter
set @drn = 0  -- dropped-row counter

while @@fetch_status <> -1
begin
    if len(rtrim(@coll1)) > 200
    begin
        -- echo the row being processed
        select @rn
        select @coll1

        set @intidx0 = 1

        -- ip
        set @intidx1 = charindex(',', @coll1, 0)
        set @charcol1 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- user
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol2 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- agent (silently truncates to char(25))
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol3 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- auth
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol4 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- date (usa format)
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol5 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- time
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol6 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- ?
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol7 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- 'proxy1'
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol8 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- referrer
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol9 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- destination
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol10 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- destination ip
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol11 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- destination port (reuses @charcol11, overwriting the dest ip, which is always '-')
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol11 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- proc time
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol12 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- sent
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol13 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- rec
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol14 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- protocol name
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol15 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- ?
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol16 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- operation
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol17 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- object: the remainder of the line
        --set @intidx1 = charindex(', -,', @coll1, @intidx0)
        --set @charcol18 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @charcol18 = right(@coll1, len(@coll1) - @intidx0)

        insert log20030805 values (
            @charcol1, @charcol2, @charcol3, @charcol4, @charcol5, @charcol6,
            @charcol7, @charcol8, @charcol9, @charcol10, @charcol11, @charcol12,
            @charcol13, @charcol14, @charcol15, @charcol16, @charcol17, @charcol18)

        dump transaction isa with truncate_only
    end
    else
    begin
        -- short line: count it and drop it
        set @drn = @drn + 1
        print 'dropped row'
    end

    set @rn = @rn + 1
    fetch next from curname into @coll1
end

print 'dropped rows'
select @drn

close curname
deallocate curname

    '''''''''''''''''''''''''''

That is what I'm trying to do with the parsing, and I would like to be able to do the same via DTS.

    "172.16.240.75, anonymous, Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0), N, 8/5/2003, 0:00:01, -, PROXY1, -, http://www.google.com, -, 80, 0, 373, 2774, http, -, GET, http://www.google.com/images/res0.gif, -, -, 407, -, -, -" is a good line

    "-, -, Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0) Active Cache Request, N, 8/5/2003, 0:46:01, -, PROXY1,

    "

    is a bad line

• Check the data file. If you have copied an IIS log file (which is what your data looks like) while it was still in use by IIS, then the file will have approx 50K of 0x00 characters, which is something to do with IIS buffering, I think. That would cause the errors about lines being too long and non-whitespace characters.

    Far away is close at hand in the images of elsewhere.
    Anon.

• No, it's an old log file.

    Any way to remove the 0x00 content?

• You could use a hex editor if you have one; it would also tell you whether the file contains any 0x00 characters. Normally in these situations I'd write a one-off application to validate and process the file.
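A rough sketch of such a one-off in VBScript (the file names are placeholders; this assumes the log is plain ANSI text, so the FileSystemObject can walk it line by line - note that a long run of nulls with no CR/LF will come back as one very large 'line', which is harmless here since the nulls get stripped):

Const ForReading = 1, ForWriting = 2

Dim fso, fin, fout, Buffer
Set fso = CreateObject("Scripting.FileSystemObject")
Set fin = fso.OpenTextFile("C:\logs\dirty.log", ForReading)
Set fout = fso.OpenTextFile("C:\logs\clean.log", ForWriting, True)

Do While fin.AtEndOfStream <> True
    Buffer = fin.ReadLine
    ' strip any embedded 0x00 characters
    Buffer = Replace(Buffer, Chr(0), "")
    ' drop lines that were nothing but null padding
    If Len(Trim(Buffer)) > 0 Then fout.WriteLine Buffer
Loop

fin.Close
fout.Close
Set fso = Nothing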

    Far away is close at hand in the images of elsewhere.
    Anon.

  • It looks to me like your "bad" lines don't have enough commas, yes? I have written scrubbing routines before in an ActiveX script within a DTS package that use the File System Object to write the dirty file to a clean file first, prior to importing into the table. You can count the number of commas in a line and respond to the count, either by skipping the line or adding commas. Do any other cleanup you need as well while you're in there. Then use your clean text file for import.

    Just a thought.

    J. Moseley

    [font="Courier New"]ZenDada[/font]

• Thanks for the responses and suggestions.

    I did try to load the file into a hex-aware editor, but it's 500MB and the time taken to load, search, etc. would just be too much.

    TnJMoseley - commas are likely the root of it; once the file is imported into the single-column table, which works, I can easily detect 'short' lines in the script I posted above. For the sake of time, I'm going to cleanse once the data is in and then redo the import.

    Whilst I could count the incoming commas, it's the syntax that has me stuck - if you could give me an example, that would be much appreciated!

• It depends on what software you are going to use, but here is VBScript code that will count the number of commas in each line.

Const ForReading = 1, ForWriting = 2, ForAppending = 8

Dim fso
Dim fin
Dim fout
Dim Buffer
Dim Pos
Dim ct
Dim Recct

Set fso = CreateObject("Scripting.FileSystemObject")
Set fin = fso.OpenTextFile("inputfile", ForReading)
Set fout = fso.OpenTextFile("outputfile", ForWriting, True)

Recct = 0
Do While fin.AtEndOfStream <> True
    Buffer = fin.ReadLine
    Recct = Recct + 1
    ' count the commas in this line
    Pos = 0
    ct = 0
    Pos = Instr(Pos + 1, Buffer, ",")
    Do Until Pos = 0
        ct = ct + 1
        Pos = Instr(Pos + 1, Buffer, ",")
    Loop
    'Check ct (number of commas) here
    fout.WriteLine(Buffer)
Loop
fin.Close
fout.Close
Set fso = Nothing
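To answer the syntax question above, the 'Check ct' placeholder could become the skip-or-pad test TnJMoseley described. A sketch of what could replace that comment and the unconditional WriteLine (the figure of 24 commas per complete line is an assumption, inferred by counting the fields in the sample good line posted earlier):

    ' a complete line is assumed to have 24 commas (25 fields)
    If ct = 24 Then
        fout.WriteLine Buffer                         ' complete line: pass through
    ElseIf ct < 24 Then
        fout.WriteLine Buffer & String(24 - ct, ",")  ' short line: pad with empty fields
    Else
        fout.WriteLine Buffer                         ' long line: pass through; DTS drops surplus columns
    End If

(Or drop the short lines entirely by writing nothing in the ElseIf branch.)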

    Far away is close at hand in the images of elsewhere.
    Anon.

  • Thanks, David! SMB, let us know how it goes.

    J. Moseley

    [font="Courier New"]ZenDada[/font]

• I have a similar situation: a 3-million-row table with about 250,000 rows that have commas within the CSV fields. This causes the data in some cases to get pushed into different fields, which generates the error mentioned.

I simply process the file until it fails due to the error (fortunately, all the errors are near the EOF). Then I capture the bad rows by resetting the start row on the transform task's Options tab and feed them into a new, smaller badrows text file. Although there are only 22 real fields, I have to manually add enough empty fields to the first row in the file for DTS to be able to find the number of columns it is looking for - in this case 29 (ie ,"","","","" ... ). Then I transform the data by pulling the misplaced fields back to where they belong (ie update ... set field7=field7&field8, field8=field9, ... where ISNUMERIC(SUBSTRING((col011),1,1))=0 - ie the state is sitting in the zip column because there's a comma in an address field).

    I've been meaning to automate the process of adding the empty fields in the first row, but it is quick and simple to do manually. Perhaps I'll re-think that now, incorporating some of DavidBurrows' code - Thanks David.
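A minimal sketch of that automation, reusing David's FileSystemObject pattern (the file names are placeholders, and the figures - 22 real fields padded to 29 - are just the ones from this example):

Const ForReading = 1, ForWriting = 2

Dim fso, fin, fout, Buffer, EmptyField, i, FirstLine
Set fso = CreateObject("Scripting.FileSystemObject")
Set fin = fso.OpenTextFile("badrows.txt", ForReading)
Set fout = fso.OpenTextFile("badrows_padded.txt", ForWriting, True)

EmptyField = "," & Chr(34) & Chr(34)   ' the literal ,""
FirstLine = True

Do While fin.AtEndOfStream <> True
    Buffer = fin.ReadLine
    If FirstLine Then
        ' pad the first row from 22 fields out to the 29 DTS expects
        For i = 1 To 7
            Buffer = Buffer & EmptyField
        Next
        FirstLine = False
    End If
    fout.WriteLine Buffer
Loop

fin.Close
fout.Close
Set fso = Nothing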

By the way, TextPad (shareware) works great with large files. It takes a couple of minutes to open a 550MB file, but then I can use "find" to quickly locate what I'm looking for. Another nice feature is View|Line Numbers, which gives a count of the rows directly from the text file.

    Bill
