Importing a Unix file to SQL

  • I have a situation where I need to be able import a unix file into SQL. I can get the unix file to the windows environment with no issue, where I am having difficulty is the unix file does not have row terminators, for example a file which has a record length of 300 characters comes down as a single line of 2 million characters or so.

    Is there a way to tell SQL to count each X number of characters as a new record?

  • David the Import Wizard in SSMS is probably the fastest solution.

    A full blown SSIS package, bcp /bulk insert with a format file, or the Import Wizard(which is SSIS Lite, really) are your three options.

    You can simply select fixed width instead of delimited, identify the width of your columns and boom.

    So this screenshot shows the default, and you need to change the drop down for "Format" to Fixed Width, and follow the prompts on the next screen.

    if you want to use bcp or bulk insert, you have to use a format file which defines the width of each column.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Also don't forget the difference in row delimiter for Unix files. Windows typically uses CRLF as the end-of-line character. In Unix, it's typically LF.

  • The row delimiter in this case is the issue. The data file has no row delimiter. Not even the lowly LF.

  • David Lester (11/20/2012)


    I have a situation where I need to be able import a unix file into SQL. I can get the unix file to the windows environment with no issue, where I am having difficulty is the unix file does not have row terminators, for example a file which has a record length of 300 characters comes down as a single line of 2 million characters or so.

    Is there a way to tell SQL to count each X number of characters as a new record?

    What tool are you using to import the data? BCP, BULK INSERT, SSIS, DTS, Other?

    With BCP, you can define the row terminator with the -r argument.

    Most Unix files use a line feed (hex 0A ), so you can use -r 0x0A

    You can verify the hex character that is actually being used by viewing with a hex viewer.

    I usually use the "V File Viewer" available here:

    http://www.fileviewer.com/

    It works well with very large files, and lets you see the actual hex code being used.

  • David Lester (11/20/2012)


    The row delimiter in this case is the issue. The data file has no row delimiter. Not even the lowly LF.

    Are the lines fixed length without a delimiter?

  • Yes. It is a file with fixed width fields & rows, but without any delimiters of any kind.

  • Personally I would use Perl or something like that to pre-format the file before trying to load it.

    eg

    #open files for input and output

    open (IN,"<undelimited.txt") || die ("Failed to open input");

    open (OUT,">delimited.txt") || die ("Failed to open output");

    while ( read (IN, $row, 300) > 0) # read in 300 characters from the file

    {

    # further splits can be done here

    print OUT, $row.""; #output with carriage return

    };

    close IN;

    close OUT;

    excuse any errors, it's been a long time since I used perl

  • I have some code that will build an import script for a fixed width table based on the table description. I find that it generally works (and does Unix too). I'll post an early copy here (definitely not cleaned up well at all) but you might see if it gets you going. And before anyone asks, yes my version has a bulk insert UTF-8 conversion. No, this version doesn't. Also, I should mention that since datetimes might be different lengths, there is a parameter in the code where you can change the default (I use 8 for 'MMDDYYYY')

    How to use:

    DECLARE @import_script VARCHAR(MAX)

    SELECT @import_script = dbo.GenerateBulkScript(<tablename>,<SCHEMA>,<file_path>,'','Unix','FIXED','ASCII')

    EXEC sp_sqlexec @import_script

    -- =============================================

    -- Author:Alan Dykes

    -- Create date: 12/5/2011

    -- Modified: 5/24/2012

    --

    -- 5/03/2012Made changes to work with DATETIME2 as well as DATETIME. This is primarily to deal with

    --pulling UBR table structures from ORACLE which map dates to datetime2

    --

    -- 5/09/2012(ADykes) Added RowTerminatorType,FieldTerminatorType, and Encoding parameters

    --Options are:

    --RowTerminatorType:Unix (Default)

    --Windows

    --

    --FieldTerminatorType:PDF (Default) (added 5/23/2012 for UTF-8)

    --Comma

    -- FIXED (added 5/24/2012 for UTF-8)

    --

    --Encoding:ASCII (Default)

    --UTF-8

    --

    -- 5/23/2012(ADykes) Added UTF-8 style load. If encoding is set to UTF-8 then it bypasses all the current

    --code and has it's own special set. NOTE: At the end of a long day, currently only PDF UTF-8

    --will load correctly. Also, you need to use varchar for the destination tables. This function

    --now sets up the script to call QM.utf8_to_utf16a and is required for UTF-8 files to load

    --

    -- 5/24/2012(ADykes) Added code to strip UTF-8 file header if there(EF BB BF), added back in fixed width

    --functionality for UTF-8 ONLY, Fixed a nasty bug that might generate errors if tables with the

    --same name but different structure exist in the database. Also threw in a check for all temp

    --tables so they can be dropped if there was a previous script issue that left a temp table

    --before a new one is created.

    --

    -- 11/1/2012(ADykes) Changed ASCII bulk insert to use code page 1252 instead of 'OEM' so accents from

    --other languages import correctly

    --

    -- 11/8/2012(ADykes) Added fixed width loads for ASCII files that was developed for a project Scott was

    --working on. Also fixed UTF-8 fixed width loads for numbers

    --

    -- 11/13/2012(ADykes) Added a defined value for the expected length of a datetime string in the data. This

    --Allows for all of our dates to be hard coded as 8 characters in length.

    -- Description:Generate a bulk import script given the table name and file location

    -- =============================================

    CREATE FUNCTION [dbo].[GenerateBulkScript]

    (

    -- Add the parameters for the function here

    @TableName varchar(255),

    @Schema varchar(1000),

    @FileLocation varchar(1000),

    @ImportCommands varchar(3000),

    @RowTerminatorType varchar(25),

    @FieldTerminatorType varchar(25),

    @Encoding varchar(25)

    )

    RETURNS varchar(max)

    AS

    BEGIN

    -- Declare the return variable here

    DECLARE @Script as varchar(max)

    DECLARE @TableSysID as bigint

    DECLARE @ColumnName varchar(500)

    DECLARE @ColumnType varchar(50)

    DECLARE @ColumnLength int

    DECLARE @ColumnPrecision int

    DECLARE @ColumnScale int

    DECLARE @ColumnID int

    ---For Fixed width breakdowns

    DECLARE @FieldList varchar(8000)

    DECLARE @ValuesList varchar(8000)

    DECLARE @RunningStart int

    DECLARE @Work VARCHAR(8000)

    DECLARE @DateLength int

    --Hard coding length of dates as 8

    SELECT @DateLength = 8

    If @Encoding = 'UTF-8'

    BEGIN

    --implemented ONLY for fixed width

    SELECT @Encoding = @Encoding

    END

    ELSE

    BEGIN

    --Default to ASCII

    SELECT @Encoding = 'ASCII'

    END

    IF @RowTerminatorType = 'WINDOWS'

    BEGIN

    SELECT @RowTerminatorType = @RowTerminatorType

    END

    ELSE

    BEGIN

    --Default to Unix

    SELECT @RowTerminatorType = 'UNIX'

    END

    IF @FieldTerminatorType = 'COMMA'

    BEGIN

    SELECT @FieldTerminatorType = @FieldTerminatorType

    END

    ELSE IF @FieldTerminatorType = 'FIXED'

    BEGIN

    SELECT @FieldTerminatorType = @FieldTerminatorType

    END

    ELSE

    BEGIN

    --Default to PDF

    SELECT @FieldTerminatorType = 'PDF'

    END

    IF @Encoding = 'UTF-8' GOTO utf8

    IF @FieldTerminatorType = 'FIXED'

    BEGIN

    SELECT @Script = 'IF OBJECT_ID(''[#TMP_'+@TableName+'raw]'', ''U'') IS NOT NULL DROP TABLE [#TMP_'+@TableName+'raw]; CREATE TABLE [#TMP_'+@TableName+'raw](rawdata VARCHAR(MAX)); '+CHAR(13)+CHAR(10)

    SELECT @Script = @Script+'DECLARE @RowTerminator NVARCHAR(2); '

    SELECT @Script = @Script+'DECLARE @CurrentLocation BIGINT; '

    SELECT @Script = @Script+'DECLARE @NextLineEnd BIGINT; '

    SELECT @Script = @Script+'DECLARE @LoopCount BIGINT; '+CHAR(13)+CHAR(10)

    SELECT @Script = @Script+'BULK INSERT [#TMP_'+@TableName+'raw] FROM '''+@FileLocation+''' WITH(CODEPAGE = ''1252'''

    SELECT @Script = @SCript +',ROWTERMINATOR='''

    IF @RowTerminatorType = 'UNIX' SELECT @Script = @Script + '0X0A'

    ELSE IF @RowTerminatorType = 'WINDOWS' SELECT @Script = @Script +''

    SELECT @Script = @Script + '''); '

    set @RunningStart = 1

    set @FieldList = 'INSERT INTO ' + @Schema+'.['+@TableName+ '] ('

    set @ValuesList = ') Select '

    DECLARE column_cursor CURSOR FOR

    SELECT sys.columns.name AS ColumnName, sys.types.name AS ColumnType, sys.columns.max_length AS ColumnLength, sys.columns.precision AS ColumnPrecision, sys.columns.scale AS ColumnScale, sys.columns.column_id AS ColumnID

    FROM sys.columns INNER JOIN sys.types on sys.columns.user_type_id = sys.types.user_type_id

    inner join sys.tables on sys.columns.object_id = sys.tables.object_id

    INNER JOIN sys.schemas ON sys.schemas.schema_id = sys.tables.schema_id

    WHERE sys.tables.name = @TableName AND sys.schemas.name = @Schema

    ORDER BY sys.columns.column_id

    --Loop through cursor

    OPEN column_cursor

    FETCH NEXT FROM column_cursor

    INTO @ColumnName,@ColumnType,@ColumnLength,@ColumnPrecision,@ColumnScale,@ColumnID

    WHILE @@FETCH_STATUS = 0

    BEGIN

    set @fieldlist = @FieldList + '[' + @ColumnName + '] , '

    if @ColumnType like '%CHAR'

    begin

    set @Work = '(SUBSTRING(rawdata,' + cast(@RunningStart as varchar) + ', '

    + CAST( @ColumnLength as varchar) + ') ) as [' + @ColumnName + '] , '

    set @RunningStart = @RunningStart + (@ColumnLength)

    end

    else IF @ColumnType like 'datetime%'

    BEGIN

    set @Work = 'CAST((CASE WHEN SUBSTRING(rawdata,' + cast(@RunningStart as varchar) + ', '

    + CAST( @DateLength as varchar) + ') = ''00000000'' THEN NULL ELSE SUBSTRING(rawdata,' + cast(@RunningStart as varchar) + ', '

    + CAST( @DateLength as varchar) + ') END) AS '+@ColumnType+') as [' + @ColumnName + '] , '

    set @RunningStart = @RunningStart + 8

    END

    ELSE

    begin

    set @Work = 'CAST((CAST(SUBSTRING(rawdata,' + cast(@RunningStart as varchar) + ', '

    IF @ColumnScale = 0 OR @ColumnType = 'Money' OR @ColumnType LIKE '%time%'

    BEGIN

    SELECT @Work = @Work+CAST( (@ColumnPrecision) as varchar)

    SELECT @Work = @Work + ') AS VARCHAR(MAX))) AS '+@ColumnType+'('+CAST(@ColumnPrecision AS VARCHAR)+')) as [' + @ColumnName + '] , '

    set @RunningStart = @RunningStart + @ColumnPrecision

    END

    ELSE

    BEGIN

    SELECT @Work = @Work+CAST( (@ColumnPrecision) as varchar)

    SELECT @Work = @Work + ')AS NUMERIC('+CAST((@ColumnPrecision+@ColumnScale) AS VARCHAR)+',0))/'+CAST((POWER(10,@ColumnScale)) AS VARCHAR)+') AS '+@ColumnType+'('+CAST(@ColumnPrecision AS VARCHAR)+','+CAST(@ColumnScale AS VARCHAR)+')) as [' + @ColumnName + '] , '

    set @RunningStart = @RunningStart + @ColumnPrecision

    END

    end

    set @ValuesList = @ValuesList + @Work

    FETCH NEXT FROM column_cursor

    INTO @ColumnName,@ColumnType,@ColumnLength,@ColumnPrecision,@ColumnScale,@ColumnID

    END

    CLOSE column_cursor

    DEALLOCATE column_cursor

    set @FieldList = substring(@fieldlist,1, len(@fieldlist)-2)

    set @ValuesList = substring(@valueslist,1, len(@valueslist)-2)

    set @Script = @Script +CHAR(13)+CHAR(10)+ @FieldList + @ValuesList + ' FROM [#TMP_'+@TableName+'raw]; '

    SELECT @Script = @Script + 'DROP TABLE [#TMP_'+@TableName+'raw]; '

    END

    ELSE

    BEGIN

    SELECT @TableSysID = t.OBJECT_ID FROM sys.tables t INNER JOIN sys.schemas s ON s.schema_id = t.schema_id WHERE t.name=@TableName AND s.name=@Schema

    SELECT @Script = 'IF OBJECT_ID(''[#TMP_'+@TableName+']'', ''U'') IS NOT NULL DROP TABLE [#TMP_'+@TableName+']; CREATE TABLE [#TMP_'+@TableName+'] ('+CHAR(13)+CHAR(10)

    -- Cursor to create temp table

    DECLARE column_cursor CURSOR LOCAL READ_ONLY FOR

    SELECT sys.columns.name AS ColumnName, sys.types.name AS ColumnType, sys.columns.max_length AS ColumnLength, sys.columns.precision AS ColumnPrecision, sys.columns.scale AS ColumnScale, sys.columns.column_id AS ColumnID

    FROM sys.columns INNER JOIN sys.types on sys.columns.user_type_id = sys.types.user_type_id

    WHERE object_id = @TableSysID

    ORDER BY sys.columns.column_id

    --Loop through cursor

    OPEN column_cursor

    FETCH NEXT FROM column_cursor

    INTO @ColumnName,@ColumnType,@ColumnLength,@ColumnPrecision,@ColumnScale,@ColumnID

    WHILE @@FETCH_STATUS = 0

    BEGIN

    --Dates need to have 0s converted to NULL

    IF @ColumnType = 'datetime' OR @ColumnType = 'datetime2'

    BEGIN

    --For Date to null conversions in temp table

    SELECT @Script = @Script+@ColumnName+' CHAR('+CAST(@DateLength AS VarChar)+')'

    END

    ELSE IF @ColumnPrecision <> 0

    BEGIN

    IF @ColumnScale = 0 OR @ColumnType = 'Money' OR @ColumnType LIKE '%time%'

    BEGIN

    --Numbers like INT and bit with no scale

    SELECT @Script = @Script+@ColumnName+' '+@ColumnType

    END

    ELSE

    BEGIN

    --Numbers like numeric

    SELECT @Script = @Script+@ColumnName+' '+@ColumnType+'('+CAST(@ColumnPrecision AS VarChar)+','+CAST(@ColumnScale AS VarChar)+')'

    END

    END

    ELSE

    BEGIN

    --Text

    IF @ColumnLength > 0

    BEGIN

    --Standard text types

    SELECT @Script = @Script+@ColumnName+' '+@ColumnType+'('+CAST(@ColumnLength AS VarChar)+')'

    END

    ELSE

    BEGIN

    --Special text types like XML or geography

    SELECT @Script = @Script+@ColumnName+' '+@ColumnType

    END

    END

    IF @@FETCH_STATUS = 0

    BEGIN

    SELECT @Script = @Script+','+CHAR(13)+CHAR(10)

    END

    FETCH NEXT FROM column_cursor

    INTO @ColumnName,@ColumnType,@ColumnLength,@ColumnPrecision,@ColumnScale,@ColumnID

    END

    CLOSE column_cursor

    DEALLOCATE column_cursor

    SELECT @Script = @Script+')'

    --Add Bulk Insert Statement

    SELECT @Script = @Script+CHAR(13)+CHAR(10)+'BULK INSERT [#TMP_'+@TableName+']'

    SELECT @Script = @Script+' FROM '''+@FileLocation+''''

    SELECT @Script = @Script+' WITH (CODEPAGE = 1252,FIELDTERMINATOR='''

    IF @FieldTerminatorType = 'PDF' SELECT @Script = @Script + '|'

    ELSE IF @FieldTerminatorType = 'COMMA' SELECT @Script = @Script + ','

    SELECT @Script = @SCript +''',ROWTERMINATOR='''

    IF @RowTerminatorType = 'UNIX' SELECT @Script = @Script + '0X0A'

    ELSE IF @RowTerminatorType = 'WINDOWS' SELECT @Script = @Script +''

    SELECT @Script = @Script + ''')'

    --SELECT @Script = @Script + ' GO'+CHAR(13)+CHAR(10)

    --Do Date Updates

    -- Cursor to create temp table

    DECLARE column_cursor CURSOR LOCAL READ_ONLY FOR

    SELECT sys.columns.name AS ColumnName,sys.types.name AS ColumnType

    FROM sys.columns INNER JOIN sys.types on sys.columns.user_type_id = sys.types.user_type_id

    WHERE object_id = @TableSysID

    --Loop through cursor

    OPEN column_cursor

    FETCH NEXT FROM column_cursor

    INTO @ColumnName, @ColumnType

    WHILE @@FETCH_STATUS = 0

    BEGIN

    IF @ColumnType = 'datetime' OR @ColumnType = 'datetime2'

    BEGIN

    --For Date to null conversions in temp table

    SELECT @Script = @Script+' UPDATE [#TMP_'+@TableName+'] SET ['+@ColumnName+']=NULL WHERE ['+@ColumnName+']=''00000000'' OR LEN(RTRIM([' + @ColumnName + '])) = 0'+CHAR(13)+CHAR(10)

    END

    FETCH NEXT FROM column_cursor

    INTO @ColumnName, @ColumnType

    END

    CLOSE column_cursor

    DEALLOCATE column_cursor

    --

    --SELECT @Script = @Script+' UPDATE #TMP_'+@TableName+' SET '+@ColumnName+'=NULL WHERE '+@ColumnName+'=''00000000'''+CHAR(13)+CHAR(10)

    SELECT @Script = @Script+' '+REPLACE(@ImportCommands,'?','[#TMP_'+@TableName+']')+CHAR(13)+CHAR(10)

    --

    SELECT @Script = @Script+' INSERT INTO '+@Schema+'.['+@TableName+'] SELECT * FROM [#TMP_'+@TableName+']'+CHAR(13)+CHAR(10)

    SELECT @Script = @Script+' DROP TABLE [#TMP_'+@TableName+']'+CHAR(13)+CHAR(10)

    --SELECT @Script = @Script+' GO'

    END

    GOTO return_script

    utf8:

    return_script:

    -- Return the result of the function

    RETURN @Script

    END

    GO

  • Oh, I neglected to mention that the data files I import don't have decimals (the decimal is implied based off of the field precision\scale) so.... it might not work if your numbers have decimals although, you could just make all fields VARCHAR for your destination table... that would work fairly well.

  • Thank you all. I think with the responses so far I can get this working.

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply