importing text file with bad lines

• I'm trying to import a 2-million-line file that fails on the odd bad line.

I've written a script to parse the content from a single-column table into a second 18-column table, but it is taking far too long to run.

    Is there a way to script DTS to ignore the (short) bad lines and import directly to the table, with all comma-delimited content in the right place?

I tried 'if len(DTSSource) > 200', but my knowledge of VB is too sketchy to write the error-handling code....

• I presume you are importing from a text file and delimiting by comma. If so, DTS should ignore any columns in excess of the mapping and set NULL for any that are missing (as long as the columns allow NULLs). One way for DTS to really get upset is when you have a CR/LF inside quoted text; I found this out when I tried to import the Windows event log.

Can you post the table DDL and some sample data where the problem is?

    Far away is close at hand in the images of elsewhere.
    Anon.

• I think you're on the right track there. Inside your Transform Data Task, you want to create an ActiveX Script transformation with no source or destination columns.

    Inside the Main() function, you would test certain columns to see if they contain data. If not, then the whole row can be skipped. Here's some code from one of my packages:

Function Main()
    ' test the last seven characters of the first source column
    Dummy1 = Right(DTSSource("Col001"), 7)
    If Dummy1 = "PLANNED" Then
        ' this will skip the entire row
        Main = DTSTransformStat_SkipRow
    Else
        ' this will continue processing the row
        Main = DTSTransformStat_OK
    End If
End Function

Note that there are other variations on the DTSTransformStat_??????? constant that will call error-handler routines or write the row to an exception file, if that is what you want to happen. Hope this helps...
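For example, to send the bad rows to the task's exception file instead of silently dropping them, the same test could return the exception status (a sketch; the exception file itself is set up on the Transform Data Task's Options tab):

Function Main()
    ' route rows ending in "PLANNED" to the exception file
    ' rather than skipping them outright
    If Right(DTSSource("Col001"), 7) = "PLANNED" Then
        Main = DTSTransformStat_ExceptionRow
    Else
        Main = DTSTransformStat_OK
    End If
End Function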

• Yes - it's a log file with CSV properties.

    I know DTS should do that, but it keeps hitting 'non-whitespace' characters in the broken lines and erroring out on them, rather than dropping a NULL in.

    No quoted text either.

    ''''''''''''''''''

-- clear any previous load
delete log20030805

declare @intidx1 int, @intidx0 int
declare @charcol1 char(25), @charcol2 char(25), @charcol3 char(25), @charcol4 char(25),
    @charcol5 datetime, @charcol6 datetime, @charcol7 char(25), @charcol8 char(25),
    @charcol9 char(25), @charcol10 char(255), @charcol11 char(25), @charcol12 int,
    @charcol13 int, @charcol14 int, @charcol15 char(10), @charcol16 char(25),
    @charcol17 char(25), @charcol18 varchar(2500)
declare @coll1 varchar(8000)
declare @rn int, @drn int

set dateformat mdy

declare curname cursor for select [col001] from webd20030805
open curname
fetch next from curname into @coll1

set @rn = 1   -- row counter
set @drn = 0  -- dropped-row counter

while @@fetch_status <> -1
begin
    if len(rtrim(@coll1)) > 200
    begin
        -- echo the row being processed
        select @rn
        select @coll1

        set @intidx0 = 1

        -- ip
        set @intidx1 = charindex(',', @coll1, 0)
        set @charcol1 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- user
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol2 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- agent (silently truncates to char(25))
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol3 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- auth
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol4 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- date (usa format)
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol5 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- time
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol6 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- ?
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol7 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- 'proxy1'
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol8 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- referrer
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol9 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- destination
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol10 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- destination ip
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol11 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- destination port (reuses @charcol11, overwriting the dest ip, which is always '-')
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol11 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- proc time
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol12 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- sent
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol13 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- rec
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol14 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- protocol name
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol15 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- ?
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol16 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- operation
        set @intidx1 = charindex(',', @coll1, @intidx0)
        set @charcol17 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @intidx0 = @intidx1 + 1

        -- object: the remainder of the line
        --set @intidx1 = charindex(', -,', @coll1, @intidx0)
        --set @charcol18 = substring(@coll1, @intidx0, @intidx1 - @intidx0)
        set @charcol18 = right(@coll1, len(@coll1) - @intidx0)

        insert log20030805 values (
            @charcol1, @charcol2, @charcol3, @charcol4, @charcol5, @charcol6,
            @charcol7, @charcol8, @charcol9, @charcol10, @charcol11, @charcol12,
            @charcol13, @charcol14, @charcol15, @charcol16, @charcol17, @charcol18)

        dump transaction isa with truncate_only
    end
    else
    begin
        -- short line: count it and drop it
        set @drn = @drn + 1
        print 'dropped row'
    end

    set @rn = @rn + 1
    fetch next from curname into @coll1
end

print 'dropped rows'
select @drn

close curname
deallocate curname

    '''''''''''''''''''''''''''

That is what I'm trying to do with the parsing, and I would like to be able to do the same via DTS.

    "172.16.240.75, anonymous, Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0), N, 8/5/2003, 0:00:01, -, PROXY1, -, http://www.google.com, -, 80, 0, 373, 2774, http, -, GET, http://www.google.com/images/res0.gif, -, -, 407, -, -, -" is a good line

    "-, -, Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0) Active Cache Request, N, 8/5/2003, 0:46:01, -, PROXY1,

    "

    is a bad line

• Check the data file. If you have copied an IIS log file (which is what your data looks like) while it was still in use by IIS, then the file will have approx 50K of 0x00 characters, which is something to do with IIS buffering, I think. That would cause the errors about lines being too long and non-whitespace characters.

    Far away is close at hand in the images of elsewhere.
    Anon.

• No, it's an old log file.

    Any way to remove the 0x00 content?

• You could use a hex editor if you have one; it would also tell you whether the file contains any 0x00 characters. Normally in these situations I'd write a one-off application to validate and process the file.
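A rough sketch of such a one-off in VBScript (the file names are placeholders; this assumes the log is plain ANSI text, so the FileSystemObject can walk it line by line - note that a long run of nulls with no CR/LF will come back as one very large 'line', which is harmless here since the nulls get stripped):

Const ForReading = 1, ForWriting = 2

Dim fso, fin, fout, Buffer
Set fso = CreateObject("Scripting.FileSystemObject")
Set fin = fso.OpenTextFile("C:\logs\dirty.log", ForReading)
Set fout = fso.OpenTextFile("C:\logs\clean.log", ForWriting, True)

Do While fin.AtEndOfStream <> True
    Buffer = fin.ReadLine
    ' strip any embedded 0x00 characters
    Buffer = Replace(Buffer, Chr(0), "")
    ' drop lines that were nothing but null padding
    If Len(Trim(Buffer)) > 0 Then fout.WriteLine Buffer
Loop

fin.Close
fout.Close
Set fso = Nothing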

    Far away is close at hand in the images of elsewhere.
    Anon.

  • It looks to me like your "bad" lines don't have enough commas, yes? I have written scrubbing routines before in an ActiveX script within a DTS package that use the File System Object to write the dirty file to a clean file first, prior to importing into the table. You can count the number of commas in a line and respond to the count, either by skipping the line or adding commas. Do any other cleanup you need as well while you're in there. Then use your clean text file for import.

    Just a thought.

    J. Moseley

    [font="Courier New"]ZenDada[/font]

• Thanks for the responses and suggestions.

    I did try to load the file into a hex-aware editor, but it's 500MB and the time taken to load, search, etc. would just be too much.

    TnJMoseley - commas are likely the root of it; once the file is imported into the single-column table, which works, I can easily detect 'short' lines in the script I posted above. For the sake of time, I'm going to cleanse once the data is in and then redo the import.

    Whilst I could count the incoming commas, it's the syntax that has me stuck - if you could give me an example, that would be much appreciated!

• It depends on what software you are going to use, but here is VBScript code that will count the number of commas in each line.

Const ForReading = 1, ForWriting = 2, ForAppending = 8

Dim fso
Dim fin
Dim fout
Dim Buffer
Dim Pos
Dim ct
Dim Recct

Set fso = CreateObject("Scripting.FileSystemObject")
Set fin = fso.OpenTextFile("inputfile", ForReading)
Set fout = fso.OpenTextFile("outputfile", ForWriting, True)

Recct = 0
Do While fin.AtEndOfStream <> True
    Buffer = fin.ReadLine
    Recct = Recct + 1
    ' count the commas in this line
    Pos = 0
    ct = 0
    Pos = Instr(Pos + 1, Buffer, ",")
    Do Until Pos = 0
        ct = ct + 1
        Pos = Instr(Pos + 1, Buffer, ",")
    Loop
    'Check ct (number of commas) here
    fout.WriteLine(Buffer)
Loop
fin.Close
fout.Close
Set fso = Nothing
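To answer the syntax question above, the 'Check ct' placeholder could become the skip-or-pad test TnJMoseley described. A sketch of what could replace that comment and the unconditional WriteLine (the figure of 24 commas per complete line is an assumption, inferred by counting the fields in the sample good line posted earlier):

    ' a complete line is assumed to have 24 commas (25 fields)
    If ct = 24 Then
        fout.WriteLine Buffer                         ' complete line: pass through
    ElseIf ct < 24 Then
        fout.WriteLine Buffer & String(24 - ct, ",")  ' short line: pad with empty fields
    Else
        fout.WriteLine Buffer                         ' long line: pass through; DTS drops surplus columns
    End If

(Or drop the short lines entirely by writing nothing in the ElseIf branch.)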

    Far away is close at hand in the images of elsewhere.
    Anon.

  • Thanks, David! SMB, let us know how it goes.

    J. Moseley

    [font="Courier New"]ZenDada[/font]

• I have a similar situation: a 3-million-row table with about 250,000 rows that have commas within the CSV fields. This causes the data in some cases to get pushed into different fields, which generates the error mentioned.

I simply process the file until it fails due to the error (fortunately, all the errors are near the EOF). Then I capture the bad rows by resetting the start row on the transform task's Options tab and feed them into a new, smaller badrows text file. Although there are only 22 real fields, I have to manually add enough empty fields to the first row in the file for DTS to be able to find the number of columns it is looking for - in this case 29 (ie ,"","","","" ... ). Then I transform the data by pulling the misplaced fields back to where they belong (ie update ... set field7=field7&field8, field8=field9, ... where ISNUMERIC(SUBSTRING((col011),1,1))=0 - ie the state is sitting in the zip column because there's a comma in an address field).

    I've been meaning to automate the process of adding the empty fields in the first row, but it is quick and simple to do manually. Perhaps I'll re-think that now, incorporating some of DavidBurrows' code - Thanks David.
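A minimal sketch of that automation, reusing David's FileSystemObject pattern (the file names are placeholders, and the figures - 22 real fields padded to 29 - are just the ones from this example):

Const ForReading = 1, ForWriting = 2

Dim fso, fin, fout, Buffer, EmptyField, i, FirstLine
Set fso = CreateObject("Scripting.FileSystemObject")
Set fin = fso.OpenTextFile("badrows.txt", ForReading)
Set fout = fso.OpenTextFile("badrows_padded.txt", ForWriting, True)

EmptyField = "," & Chr(34) & Chr(34)   ' the literal ,""
FirstLine = True

Do While fin.AtEndOfStream <> True
    Buffer = fin.ReadLine
    If FirstLine Then
        ' pad the first row from 22 fields out to the 29 DTS expects
        For i = 1 To 7
            Buffer = Buffer & EmptyField
        Next
        FirstLine = False
    End If
    fout.WriteLine Buffer
Loop

fin.Close
fout.Close
Set fso = Nothing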

By the way, TextPad (shareware) works great with large files. It takes a couple of minutes to open a 550MB file, but then I can use "find" to quickly locate what I'm looking for. Another nice feature is View|Line Numbers, which gives a count of the rows directly from the text file.

    Bill
