September 17, 2003 at 6:04 am
I'm trying to import a 2-million-line file that fails on the occasional bad line.
I've written a script to parse the content from a single-column table into a second, 18-column table, but it is taking far too long to run.
Is there a way to script DTS to ignore the (short) bad lines and import directly into the table, with all the comma-delimited content in the right place?
I tried 'if len(DTSSource) > 200', but my knowledge of VB is too sketchy to get the error handling right....
September 17, 2003 at 7:07 am
I presume you are importing from a text file delimited by commas. If so, DTS should ignore any columns in excess of the mapping and set NULL for any that are missing (as long as the columns allow NULLs). One way for DTS to really get upset is when you have CR/LF in quoted text; I found this out when I tried to import the Windows event log.
Can you post the table DDL and some sample data where the problem is?
Far away is close at hand in the images of elsewhere.
Anon.
September 17, 2003 at 8:46 am
I think you're on the right track there. Inside your Transform Data Task, you want to create an ActiveX Script transformation with no source or destination columns.
Inside the Main() function, you would test certain columns to see if they contain data. If not, the whole row can be skipped. Here's some code from one of my packages:
Function Main()
    Dummy1 = Right(DTSSource("Col001"), 7)
    If Dummy1 = "PLANNED" Then
        'this will skip the entire row
        Main = DTSTransformStat_SkipRow
    Else
        'this will continue processing the row
        Main = DTSTransformStat_OK
    End If
End Function
Note that there are other variations on the DTSTransformStat_??????? constant that will call error-handler routines or write the row to an exception file, if that is what you want to happen. Hope this helps...
September 17, 2003 at 8:49 am
Yes - it's a log file with CSV properties.
I know DTS should do that, but it keeps hitting 'non-whitespace' characters in the broken lines and erroring out on them, rather than dropping NULLs in.
No quoted text either.
''''''''''''''''''
delete log20030805
declare @intidx1 int,@intidx0 int
declare @charcol1 char(25), @charcol2 char(25),@charcol3 char(25),@charcol4 char(25),@charcol5 datetime,
@charcol6 datetime,@charcol7 char(25),@charcol8 char(25),@charcol9 char(25),@charcol10 char(255),
@charcol11 char(25),@charcol12 int ,@charcol13 int, @charcol14 int,@charcol15 char(10),
@charcol16 char(25), @charcol17 char(25),@charcol18 varchar(2500)
set dateformat mdy
declare @intoffs int
--set @intidx0 = 0
--set @int
declare @coll1 varchar(8000)
declare curname cursor for select [col001] from webd20030805
open curname
fetch next from curname into @coll1
declare @rn int, @drn int
set @rn = 1
set @drn = 0
while @@fetch_status <> -1
begin
if len(rtrim(@coll1)) > 200
begin
-- print'row'
select @rn
select @coll1
set @intidx0 =1
set @intoffs = 0
-- ip
set @intidx1 = (select charindex(',',@coll1,0))
set @charcol1 = (select substring(@coll1,@intidx0,@intidx1-@intidx0))
-- print'coll ip'
--select @charcol1
set @intidx0 = @intidx1+1
set @intoffs = @intoffs + @intidx0
-- user
set @intidx1 = (select charindex(',',@coll1,@intidx0))
set @charcol2 = (select substring(@coll1,@intidx0,@intidx1-@intidx0))
-- print'coll user'
--select @charcol2
set @intidx0 = @intidx1+1
set @intoffs = @intoffs + @intidx0
-- agent
set @intidx1 = (select charindex(',',@coll1,@intidx0))
set @charcol3 = (select substring(@coll1,@intidx0,@intidx1-@intidx0))
set @intidx0 = @intidx1+1
-- print'coll agent'
--select @charcol3
-- auth - skip
set @intidx1 = (select charindex(',',@coll1,@intidx0))
set @charcol4 = (select substring(@coll1,@intidx0,@intidx1-@intidx0))
set @intidx0 = @intidx1+1
-- print'coll auth'
--select @charcol4
-- date usa
set @intidx1 = (select charindex(',',@coll1,@intidx0))
set @charcol5 = (select substring(@coll1,@intidx0,@intidx1-@intidx0))
set @intidx0 = @intidx1+1
-- print'coll date'
--select @charcol5
-- time
set @intidx1 = (select charindex(',',@coll1,@intidx0))
set @charcol6 = (select substring(@coll1,@intidx0,@intidx1-@intidx0))
set @intidx0 = @intidx1+1
-- print'coll time'
--select @charcol6
-- ?
set @intidx1 = (select charindex(',',@coll1,@intidx0))
set @charcol7 = (select substring(@coll1,@intidx0,@intidx1-@intidx0))
set @intidx0 = @intidx1+1
-- print'coll ?'
--select @charcol7
-- 'proxy1'
set @intidx1 = (select charindex(',',@coll1,@intidx0))
set @charcol8 = (select substring(@coll1,@intidx0,@intidx1-@intidx0))
set @intidx0 = @intidx1+1
-- print'coll proxy'
--select @charcol8
-- referrer
set @intidx1 = (select charindex(',',@coll1,@intidx0))
set @charcol9 = (select substring(@coll1,@intidx0,@intidx1-@intidx0))
set @intidx0 = @intidx1+1
-- print'coll ref'
--select @charcol9
-- destination
set @intidx1 = (select charindex(',',@coll1,@intidx0))
set @charcol10 = (select substring(@coll1,@intidx0,@intidx1-@intidx0))
set @intidx0 = @intidx1+1
-- print'coll dest'
--select @charcol10
-- destination ip
set @intidx1 = (select charindex(',',@coll1,@intidx0))
set @charcol11 = (select substring(@coll1,@intidx0,@intidx1-@intidx0))
set @intidx0 = @intidx1+1
-- print'coll dest ip'
--select @charcol11
-- destination port
set @intidx1 = (select charindex(',',@coll1,@intidx0))
set @charcol11 = (select substring(@coll1,@intidx0,@intidx1-@intidx0))
set @intidx0 = @intidx1+1
-- print'coll dest port'
--select @charcol11
-- proc time
set @intidx1 = (select charindex(',',@coll1,@intidx0))
set @charcol12= (select substring(@coll1,@intidx0,@intidx1-@intidx0))
set @intidx0 = @intidx1+1
-- print'coll proc time'
--select @charcol12
-- sent
set @intidx1 = (select charindex(',',@coll1,@intidx0))
set @charcol13 =(select substring(@coll1,@intidx0,@intidx1-@intidx0))
set @intidx0 = @intidx1+1
-- print'coll sent'
--select @charcol13
-- rec
set @intidx1 = (select charindex(',',@coll1,@intidx0))
set @charcol14 =(select substring(@coll1,@intidx0,@intidx1-@intidx0))
set @intidx0 = @intidx1+1
-- print'coll rec'
--select @charcol14
-- protocol name
set @intidx1 = (select charindex(',',@coll1,@intidx0))
set @charcol15 = (select substring(@coll1,@intidx0,@intidx1-@intidx0))
set @intidx0 = @intidx1+1
-- print'coll prot name'
--select @charcol15
-- ?
set @intidx1 = (select charindex(',',@coll1,@intidx0))
set @charcol16 = (select substring(@coll1,@intidx0,@intidx1-@intidx0))
set @intidx0 = @intidx1+1
-- print'coll ?'
--select @charcol16
-- operation
set @intidx1 = (select charindex(',',@coll1,@intidx0))
set @charcol17 = (select substring(@coll1,@intidx0,@intidx1-@intidx0))
set @intidx0 = @intidx1+1
-- print'coll oper'
--select @charcol17
-- object
set @intidx1 = (select charindex(', -,',@coll1,@intidx0))
-- print'idx1'
--select @intidx1
set @charcol18 = right(@coll1,(len(@coll1)-@intidx0))
--set @charcol18 = (select substring(@coll1,@intidx0,@intidx1-@intidx0))
set @intidx0 = @intidx1+1
--print'coll obj'
--select @charcol18
insert log20030805 values(@charcol1,@charcol2,@charcol3,@charcol4,@charcol5,@charcol6, @charcol7,@charcol8,
@charcol9,@charcol10,@charcol11,@charcol12,@charcol13,@charcol14,@charcol15,@charcol16,@charcol17,@charcol18)
dump transaction isa with truncate_only
end
else
begin
set @drn = @drn + 1
print 'dropped row'
end
set @rn = @rn+1
fetch next from curname into @coll1
end
print ' dropped rows'
select @drn
close curname
deallocate curname
'''''''''''''''''''''''''''
That is what I'm trying to do with the parsing, and I would like to be able to do the same via DTS.
"172.16.240.75, anonymous, Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0), N, 8/5/2003, 0:00:01, -, PROXY1, -, http://www.google.com, -, 80, 0, 373, 2774, http, -, GET, http://www.google.com/images/res0.gif, -, -, 407, -, -, -" is a good line
"-, -, Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0) Active Cache Request, N, 8/5/2003, 0:46:01, -, PROXY1,
"
is a bad line
September 17, 2003 at 9:15 am
Check the data file. If you have copied an IIS log file (which your data looks like) that is still being used by IIS, then the file will have approx 50K of 0x00 characters, which I think is something to do with IIS buffering. This would cause the errors about lines being too long and non-whitespace characters.
Far away is close at hand in the images of elsewhere.
Anon.
September 17, 2003 at 9:18 am
No, it's an old log file.
Any way to remove the 0x00 content?
September 17, 2003 at 10:01 am
You could use a hex editor if you have one; it would also tell you whether the file contains any 0x00 characters. Normally in these situations I'd write a one-off application to validate and process the file.
Far away is close at hand in the images of elsewhere.
Anon.
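[Editor's note] The "one-off application" David suggests for stripping 0x00 bytes can be sketched as follows. This is a minimal modern Python illustration, not the tooling used in the thread; the file names and chunk size are assumptions.

```python
# Stream a log file in chunks and strip any NUL (0x00) bytes before
# handing the cleaned copy to DTS. Chunked reads keep memory flat even
# for a 500 MB file.

def strip_nulls(data: bytes) -> bytes:
    """Remove 0x00 bytes from a chunk of log data."""
    return data.replace(b"\x00", b"")

def clean_file(src: str, dst: str, chunk_size: int = 1 << 20) -> int:
    """Copy src to dst, dropping NUL bytes; return how many were removed."""
    removed = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            removed += chunk.count(b"\x00")
            fout.write(strip_nulls(chunk))
    return removed
```

Reading in binary mode matters here: the 0x00 run described above would trip up a line-oriented text reader before it ever reached the replace.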
September 17, 2003 at 2:22 pm
It looks to me like your "bad" lines don't have enough commas, yes? I have written scrubbing routines before in an ActiveX script within a DTS package that use the FileSystemObject to write the dirty file out to a clean file prior to importing into the table. You can count the number of commas in a line and respond to the count, either by skipping the line or adding commas. Do any other cleanup you need while you're in there, then use your clean text file for the import.
Just a thought.
J. Moseley
[font="Courier New"]ZenDada[/font]
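[Editor's note] The scrub-before-import approach described above - count the commas on each line, then skip or pad short lines - can be sketched like this. This is a Python illustration rather than the ActiveX script the post refers to; the expected comma count of 25 (matching the sample "good line" later in the thread) and the ", -" pad value are assumptions.

```python
# Clean a dirty comma-delimited log before import: lines with the right
# number of commas pass through, short lines are either dropped or padded
# out with "-" placeholder fields so every row has the same shape.

EXPECTED_COMMAS = 25  # the sample good line in this thread has 25 commas

def scrub_line(line, expected=EXPECTED_COMMAS, pad=False):
    """Return the cleaned line, or None if the row should be dropped."""
    commas = line.count(",")
    if commas == expected:
        return line
    if pad and commas < expected:
        # pad short lines with empty "-" fields, mirroring the log's own
        # placeholder convention
        return line + ", -" * (expected - commas)
    return None

def scrub_file(src, dst, pad=False):
    """Write a clean copy of src to dst; return the number of dropped rows."""
    dropped = 0
    with open(src) as fin, open(dst, "w") as fout:
        for raw in fin:
            cleaned = scrub_line(raw.rstrip("\r\n"), pad=pad)
            if cleaned is None:
                dropped += 1
            else:
                fout.write(cleaned + "\n")
    return dropped
```

Dropping versus padding is the same choice the DTSTransformStat_SkipRow example makes inside the package, just moved upstream of the import.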
September 18, 2003 at 4:07 am
Thanks for the responses and suggestions.
I did try to load the file into a hex-aware editor, but it's 500 MB and the time taken to load, search, etc. would be just too much.
JMoseley - commas are likely the root of it; once the file is imported as a single column, which works, I can easily detect 'short' lines with the script I posted. For the sake of time, I'm going to cleanse the data once it's in and then redo the import.
While I could count the incoming commas, it's the syntax that has me stuck - if you could give me an example, that would be much appreciated!
September 18, 2003 at 5:36 am
It depends on what software you are going to use, but here is VBScript code that will count the number of commas in each line.
Const ForReading = 1, ForWriting = 2, ForAppending = 8

Dim fso
Dim fin
Dim fout
Dim Buffer
Dim Pos
Dim ct
Dim Recct

Set fso = CreateObject("Scripting.FileSystemObject")
Set fin = fso.OpenTextFile("inputfile", ForReading)
Set fout = fso.OpenTextFile("outputfile", ForWriting, True)

Recct = 0
Do While fin.AtEndOfStream <> True
    Buffer = fin.ReadLine
    Recct = Recct + 1
    Pos = 0
    ct = 0
    Pos = Instr(Pos + 1, Buffer, ",")
    Do Until Pos = 0
        ct = ct + 1
        Pos = Instr(Pos + 1, Buffer, ",")
    Loop
    'Check ct (number of commas) here
    fout.WriteLine(Buffer)
Loop

fin.Close
fout.Close
Set fso = Nothing
Far away is close at hand in the images of elsewhere.
Anon.
September 18, 2003 at 8:13 am
Thanks, David! SMB, let us know how it goes.
J. Moseley
[font="Courier New"]ZenDada[/font]
September 26, 2003 at 1:38 pm
I have a similar situation: a 3-million-row table where about 250,000 rows have commas within the CSV fields. In some cases this pushes data into the wrong fields, which generates the error mentioned.
I simply process the file until it fails on the error (fortunately, all the errors are near the EOF). Then I capture the bad rows by resetting the start row on the transform task's Options tab and feed them into a new, smaller bad-rows text file. Although there are only 22 real fields, I have to manually add enough empty fields to the first row of the file for DTS to find the number of columns it is looking for - in this case 29 (i.e. ,"","","","" ... ). Then I transform the data by pulling the misplaced fields back to where they belong (i.e. update ... set field7=field7&field8, field8=field9, ... where ISNUMERIC(SUBSTRING((col011),1,1))=0 -- i.e. the state is in the zip column because there's a comma in an address field).
I've been meaning to automate adding the empty fields to the first row, but it is quick and simple to do manually. Perhaps I'll rethink that now, incorporating some of DavidBurrows' code - thanks, David.
By the way, TextPad (shareware) works great with large files. It takes a couple of minutes to open a 550 MB file, but then I can use Find to quickly locate what I'm looking for. Another nice feature is View|Line Numbers, which gets a count of the rows directly from the text file.
Bill
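[Editor's note] The field-shift repair described above - detect that a stray comma in an address field pushed everything one column right (the zip column no longer starts with a digit), then merge the split pieces back together - can be sketched as follows. This is a Python illustration under assumed column positions, not the actual table layout from the post.

```python
# Repair a CSV row where a comma inside one field shifted all later
# columns right by one. Detection mirrors the post's ISNUMERIC check:
# the value sitting in the zip column should start with a digit.

ZIP_COL = 10          # assumed 0-based index of the zip-code column
ADDR_COL = 6          # assumed 0-based index of the address field that split
EXPECTED_FIELDS = 22  # the post mentions 22 real fields

def repair_row(fields):
    """Merge a split address field back together if the row is shifted."""
    if (len(fields) == EXPECTED_FIELDS + 1
            and not fields[ZIP_COL][:1].isdigit()):
        # re-join the two halves of the address, restoring the comma,
        # which pulls every later field back to its proper column
        merged = fields[ADDR_COL] + "," + fields[ADDR_COL + 1]
        return fields[:ADDR_COL] + [merged] + fields[ADDR_COL + 2:]
    return fields
```

Doing this on the text file before import would also remove the need to pad the first row with empty fields, since every row would then present the expected column count to DTS.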