November 20, 2012 at 9:28 am
I have a situation where I need to be able import a unix file into SQL. I can get the unix file to the windows environment with no issue, where I am having difficulty is the unix file does not have row terminators, for example a file which has a record length of 300 characters comes down as a single line of 2 million characters or so.
Is there a way to tell SQL to count each X number of characters as a new record?
November 20, 2012 at 10:04 am
David the Import Wizard in SSMS is probably the fastest solution.
A full blown SSIS package, bcp /bulk insert with a format file, or the Import Wizard(which is SSIS Lite, really) are your three options.
You can simply select fixed width instead of delimited, identify the width of your columns and boom.
So this screenshot shows the default, and you need to change the drop down for "Format" to Fixed Width, and follow the prompts on the next screen.
if you want to use bcp or bulk insert, you have to use a format file which defines the width of each column.
Lowell
November 20, 2012 at 10:09 am
Also don't forget the difference in row delimiter for Unix files. Windows typically uses CRLF as the end-of-line character. In Unix, it's typically LF.
November 20, 2012 at 10:11 am
The row delimiter in this case is the issue. The data file has no row delimiter. Not even the lowly LF.
November 20, 2012 at 10:13 am
David Lester (11/20/2012)
I have a situation where I need to be able import a unix file into SQL. I can get the unix file to the windows environment with no issue, where I am having difficulty is the unix file does not have row terminators, for example a file which has a record length of 300 characters comes down as a single line of 2 million characters or so.Is there a way to tell SQL to count each X number of characters as a new record?
What tool are you using to import the data? BCP, BULK INSERT, SSIS, DTS, Other?
With BCP, you can define the row terminator with the -r argument.
Most Unix files use a line feed (hex 0A ), so you can use -r 0x0A
You can verify the hex character that is actually being used by viewing with a hex viewer.
I usually use the "V File Viewer" available here:
It works well with very large files, and lets you see the actual hex code being used.
November 20, 2012 at 10:15 am
David Lester (11/20/2012)
The row delimiter in this case is the issue. The data file has no row delimiter. Not even the lowly LF.
Are the lines fixed length without a delimiter?
November 20, 2012 at 10:17 am
Yes. It is a file with fixed width fields & rows, but without any delimiters of any kind.
November 20, 2012 at 12:17 pm
Personally I would use Perl or something like that to pre-format the file before trying to load it.
eg
#open files for input and output
open (IN,"<undelimited.txt") || die ("Failed to open input");
open (OUT,">delimited.txt") || die ("Failed to open output");
while ( read (IN, $row, 300) > 0) # read in 300 characters from the file
{
# further splits can be done here
print OUT, $row.""; #output with carriage return
};
close IN;
close OUT;
excuse any errors, it's been a long time since I used perl
November 20, 2012 at 1:48 pm
I have some code that will build an import script for a fixed width table based on the table description. I find that it generally works (and does Unix too). I'll post an early copy here (definitely not cleaned up well at all) but you might see if it gets you going. And before anyone asks, yes my version has a bulk insert UTF-8 conversion. No, this version doesn't. Also, I should mention that since datetimes might be different lengths, there is a parameter in the code where you can change the default (I use 8 for 'MMDDYYYY')
How to use:
DECLARE @import_script VARCHAR(MAX)
SELECT @import_script = dbo.GenerateBulkScript(<tablename>,<SCHEMA>,<file_path>,'','Unix','FIXED','ASCII')
EXEC sp_sqlexec @import_script
-- =============================================
-- Author:Alan Dykes
-- Create date: 12/5/2011
-- Modified: 5/24/2012
--
-- 5/03/2012Made changes to work with DATETIME2 as well as DATETIME. This is primarily to deal with
--pulling UBR table structures from ORACLE which map dates to datetime2
--
-- 5/09/2012(ADykes) Added RowTerminatorType,FieldTerminatorType, and Encoding parameters
--Options are:
--RowTerminatorType:Unix (Default)
--Windows
--
--FieldTerminatorType:PDF (Default) (added 5/23/2012 for UTF-8)
--Comma
-- FIXED (added 5/24/2012 for UTF-8)
--
--Encoding:ASCII (Default)
--UTF-8
--
-- 5/23/2012(ADykes) Added UTF-8 style load. If encoding is set to UTF-8 then it bypasses all the current
--code and has it's own special set. NOTE: At the end of a long day, currently only PDF UTF-8
--will load correctly. Also, you need to use varchar for the destination tables. This function
--now sets up the script to call QM.utf8_to_utf16a and is required for UTF-8 files to load
--
-- 5/24/2012(ADykes) Added code to strip UTF-8 file header if there(EF BB BF), added back in fixed width
--functionality for UTF-8 ONLY, Fixed a nasty bug that might generate errors if tables with the
--same name but different structure exist in the database. Also threw in a check for all temp
--tables so they can be dropped if there was a previous script issue that left a temp table
--before a new one is created.
--
-- 11/1/2012(ADykes) Changed ASCII bulk insert to use code page 1252 instead of 'OEM' so accents from
--other languages import correctly
--
-- 11/8/2012(ADykes) Added fixed width loads for ASCII files that was developed for a project Scott was
--working on. Also fixed UTF-8 fixed width loads for numbers
--
-- 11/13/2012(ADykes) Added a defined value for the expected length of a datetime string in the data. This
--Allows for all of our dates to be hard coded as 8 characters in length.
-- Description:Generate a bulk import script given the table name and file location
-- =============================================
CREATE FUNCTION [dbo].[GenerateBulkScript]
(
-- Add the parameters for the function here
@TableName varchar(255),
@Schema varchar(1000),
@FileLocation varchar(1000),
@ImportCommands varchar(3000),
@RowTerminatorType varchar(25),
@FieldTerminatorType varchar(25),
@Encoding varchar(25)
)
RETURNS varchar(max)
AS
BEGIN
-- Declare the return variable here
DECLARE @Script as varchar(max)
DECLARE @TableSysID as bigint
DECLARE @ColumnName varchar(500)
DECLARE @ColumnType varchar(50)
DECLARE @ColumnLength int
DECLARE @ColumnPrecision int
DECLARE @ColumnScale int
DECLARE @ColumnID int
---For Fixed width breakdowns
DECLARE @FieldList varchar(8000)
DECLARE @ValuesList varchar(8000)
DECLARE @RunningStart int
DECLARE @Work VARCHAR(8000)
DECLARE @DateLength int
--Hard coding length of dates as 8
SELECT @DateLength = 8
If @Encoding = 'UTF-8'
BEGIN
--implemented ONLY for fixed width
SELECT @Encoding = @Encoding
END
ELSE
BEGIN
--Default to ASCII
SELECT @Encoding = 'ASCII'
END
IF @RowTerminatorType = 'WINDOWS'
BEGIN
SELECT @RowTerminatorType = @RowTerminatorType
END
ELSE
BEGIN
--Default to Unix
SELECT @RowTerminatorType = 'UNIX'
END
IF @FieldTerminatorType = 'COMMA'
BEGIN
SELECT @FieldTerminatorType = @FieldTerminatorType
END
ELSE IF @FieldTerminatorType = 'FIXED'
BEGIN
SELECT @FieldTerminatorType = @FieldTerminatorType
END
ELSE
BEGIN
--Default to PDF
SELECT @FieldTerminatorType = 'PDF'
END
IF @Encoding = 'UTF-8' GOTO utf8
IF @FieldTerminatorType = 'FIXED'
BEGIN
SELECT @Script = 'IF OBJECT_ID(''[#TMP_'+@TableName+'raw]'', ''U'') IS NOT NULL DROP TABLE [#TMP_'+@TableName+'raw]; CREATE TABLE [#TMP_'+@TableName+'raw](rawdata VARCHAR(MAX)); '+CHAR(13)+CHAR(10)
SELECT @Script = @Script+'DECLARE @RowTerminator NVARCHAR(2); '
SELECT @Script = @Script+'DECLARE @CurrentLocation BIGINT; '
SELECT @Script = @Script+'DECLARE @NextLineEnd BIGINT; '
SELECT @Script = @Script+'DECLARE @LoopCount BIGINT; '+CHAR(13)+CHAR(10)
SELECT @Script = @Script+'BULK INSERT [#TMP_'+@TableName+'raw] FROM '''+@FileLocation+''' WITH(CODEPAGE = ''1252'''
SELECT @Script = @SCript +',ROWTERMINATOR='''
IF @RowTerminatorType = 'UNIX' SELECT @Script = @Script + '0X0A'
ELSE IF @RowTerminatorType = 'WINDOWS' SELECT @Script = @Script +''
SELECT @Script = @Script + '''); '
set @RunningStart = 1
set @FieldList = 'INSERT INTO ' + @Schema+'.['+@TableName+ '] ('
set @ValuesList = ') Select '
DECLARE column_cursor CURSOR FOR
SELECT sys.columns.name AS ColumnName, sys.types.name AS ColumnType, sys.columns.max_length AS ColumnLength, sys.columns.precision AS ColumnPrecision, sys.columns.scale AS ColumnScale, sys.columns.column_id AS ColumnID
FROM sys.columns INNER JOIN sys.types on sys.columns.user_type_id = sys.types.user_type_id
inner join sys.tables on sys.columns.object_id = sys.tables.object_id
INNER JOIN sys.schemas ON sys.schemas.schema_id = sys.tables.schema_id
WHERE sys.tables.name = @TableName AND sys.schemas.name = @Schema
ORDER BY sys.columns.column_id
--Loop through cursor
OPEN column_cursor
FETCH NEXT FROM column_cursor
INTO @ColumnName,@ColumnType,@ColumnLength,@ColumnPrecision,@ColumnScale,@ColumnID
WHILE @@FETCH_STATUS = 0
BEGIN
set @fieldlist = @FieldList + '[' + @ColumnName + '] , '
if @ColumnType like '%CHAR'
begin
set @Work = '(SUBSTRING(rawdata,' + cast(@RunningStart as varchar) + ', '
+ CAST( @ColumnLength as varchar) + ') ) as [' + @ColumnName + '] , '
set @RunningStart = @RunningStart + (@ColumnLength)
end
else IF @ColumnType like 'datetime%'
BEGIN
set @Work = 'CAST((CASE WHEN SUBSTRING(rawdata,' + cast(@RunningStart as varchar) + ', '
+ CAST( @DateLength as varchar) + ') = ''00000000'' THEN NULL ELSE SUBSTRING(rawdata,' + cast(@RunningStart as varchar) + ', '
+ CAST( @DateLength as varchar) + ') END) AS '+@ColumnType+') as [' + @ColumnName + '] , '
set @RunningStart = @RunningStart + 8
END
ELSE
begin
set @Work = 'CAST((CAST(SUBSTRING(rawdata,' + cast(@RunningStart as varchar) + ', '
IF @ColumnScale = 0 OR @ColumnType = 'Money' OR @ColumnType LIKE '%time%'
BEGIN
SELECT @Work = @Work+CAST( (@ColumnPrecision) as varchar)
SELECT @Work = @Work + ') AS VARCHAR(MAX))) AS '+@ColumnType+'('+CAST(@ColumnPrecision AS VARCHAR)+')) as [' + @ColumnName + '] , '
set @RunningStart = @RunningStart + @ColumnPrecision
END
ELSE
BEGIN
SELECT @Work = @Work+CAST( (@ColumnPrecision) as varchar)
SELECT @Work = @Work + ')AS NUMERIC('+CAST((@ColumnPrecision+@ColumnScale) AS VARCHAR)+',0))/'+CAST((POWER(10,@ColumnScale)) AS VARCHAR)+') AS '+@ColumnType+'('+CAST(@ColumnPrecision AS VARCHAR)+','+CAST(@ColumnScale AS VARCHAR)+')) as [' + @ColumnName + '] , '
set @RunningStart = @RunningStart + @ColumnPrecision
END
end
set @ValuesList = @ValuesList + @Work
FETCH NEXT FROM column_cursor
INTO @ColumnName,@ColumnType,@ColumnLength,@ColumnPrecision,@ColumnScale,@ColumnID
END
CLOSE column_cursor
DEALLOCATE column_cursor
set @FieldList = substring(@fieldlist,1, len(@fieldlist)-2)
set @ValuesList = substring(@valueslist,1, len(@valueslist)-2)
set @Script = @Script +CHAR(13)+CHAR(10)+ @FieldList + @ValuesList + ' FROM [#TMP_'+@TableName+'raw]; '
SELECT @Script = @Script + 'DROP TABLE [#TMP_'+@TableName+'raw]; '
END
ELSE
BEGIN
SELECT @TableSysID = t.OBJECT_ID FROM sys.tables t INNER JOIN sys.schemas s ON s.schema_id = t.schema_id WHERE t.name=@TableName AND s.name=@Schema
SELECT @Script = 'IF OBJECT_ID(''[#TMP_'+@TableName+']'', ''U'') IS NOT NULL DROP TABLE [#TMP_'+@TableName+']; CREATE TABLE [#TMP_'+@TableName+'] ('+CHAR(13)+CHAR(10)
-- Cursor to create temp table
DECLARE column_cursor CURSOR LOCAL READ_ONLY FOR
SELECT sys.columns.name AS ColumnName, sys.types.name AS ColumnType, sys.columns.max_length AS ColumnLength, sys.columns.precision AS ColumnPrecision, sys.columns.scale AS ColumnScale, sys.columns.column_id AS ColumnID
FROM sys.columns INNER JOIN sys.types on sys.columns.user_type_id = sys.types.user_type_id
WHERE object_id = @TableSysID
ORDER BY sys.columns.column_id
--Loop through cursor
OPEN column_cursor
FETCH NEXT FROM column_cursor
INTO @ColumnName,@ColumnType,@ColumnLength,@ColumnPrecision,@ColumnScale,@ColumnID
WHILE @@FETCH_STATUS = 0
BEGIN
--Dates need to have 0s converted to NULL
IF @ColumnType = 'datetime' OR @ColumnType = 'datetime2'
BEGIN
--For Date to null conversions in temp table
SELECT @Script = @Script+@ColumnName+' CHAR('+CAST(@DateLength AS VarChar)+')'
END
ELSE IF @ColumnPrecision <> 0
BEGIN
IF @ColumnScale = 0 OR @ColumnType = 'Money' OR @ColumnType LIKE '%time%'
BEGIN
--Numbers like INT and bit with no scale
SELECT @Script = @Script+@ColumnName+' '+@ColumnType
END
ELSE
BEGIN
--Numbers like numeric
SELECT @Script = @Script+@ColumnName+' '+@ColumnType+'('+CAST(@ColumnPrecision AS VarChar)+','+CAST(@ColumnScale AS VarChar)+')'
END
END
ELSE
BEGIN
--Text
IF @ColumnLength > 0
BEGIN
--Standard text types
SELECT @Script = @Script+@ColumnName+' '+@ColumnType+'('+CAST(@ColumnLength AS VarChar)+')'
END
ELSE
BEGIN
--Special text types like XML or geography
SELECT @Script = @Script+@ColumnName+' '+@ColumnType
END
END
IF @@FETCH_STATUS = 0
BEGIN
SELECT @Script = @Script+','+CHAR(13)+CHAR(10)
END
FETCH NEXT FROM column_cursor
INTO @ColumnName,@ColumnType,@ColumnLength,@ColumnPrecision,@ColumnScale,@ColumnID
END
CLOSE column_cursor
DEALLOCATE column_cursor
SELECT @Script = @Script+')'
--Add Bulk Insert Statement
SELECT @Script = @Script+CHAR(13)+CHAR(10)+'BULK INSERT [#TMP_'+@TableName+']'
SELECT @Script = @Script+' FROM '''+@FileLocation+''''
SELECT @Script = @Script+' WITH (CODEPAGE = 1252,FIELDTERMINATOR='''
IF @FieldTerminatorType = 'PDF' SELECT @Script = @Script + '|'
ELSE IF @FieldTerminatorType = 'COMMA' SELECT @Script = @Script + ','
SELECT @Script = @SCript +''',ROWTERMINATOR='''
IF @RowTerminatorType = 'UNIX' SELECT @Script = @Script + '0X0A'
ELSE IF @RowTerminatorType = 'WINDOWS' SELECT @Script = @Script +''
SELECT @Script = @Script + ''')'
--SELECT @Script = @Script + ' GO'+CHAR(13)+CHAR(10)
--Do Date Updates
-- Cursor to create temp table
DECLARE column_cursor CURSOR LOCAL READ_ONLY FOR
SELECT sys.columns.name AS ColumnName,sys.types.name AS ColumnType
FROM sys.columns INNER JOIN sys.types on sys.columns.user_type_id = sys.types.user_type_id
WHERE object_id = @TableSysID
--Loop through cursor
OPEN column_cursor
FETCH NEXT FROM column_cursor
INTO @ColumnName, @ColumnType
WHILE @@FETCH_STATUS = 0
BEGIN
IF @ColumnType = 'datetime' OR @ColumnType = 'datetime2'
BEGIN
--For Date to null conversions in temp table
SELECT @Script = @Script+' UPDATE [#TMP_'+@TableName+'] SET ['+@ColumnName+']=NULL WHERE ['+@ColumnName+']=''00000000'' OR LEN(RTRIM([' + @ColumnName + '])) = 0'+CHAR(13)+CHAR(10)
END
FETCH NEXT FROM column_cursor
INTO @ColumnName, @ColumnType
END
CLOSE column_cursor
DEALLOCATE column_cursor
--
--SELECT @Script = @Script+' UPDATE #TMP_'+@TableName+' SET '+@ColumnName+'=NULL WHERE '+@ColumnName+'=''00000000'''+CHAR(13)+CHAR(10)
SELECT @Script = @Script+' '+REPLACE(@ImportCommands,'?','[#TMP_'+@TableName+']')+CHAR(13)+CHAR(10)
--
SELECT @Script = @Script+' INSERT INTO '+@Schema+'.['+@TableName+'] SELECT * FROM [#TMP_'+@TableName+']'+CHAR(13)+CHAR(10)
SELECT @Script = @Script+' DROP TABLE [#TMP_'+@TableName+']'+CHAR(13)+CHAR(10)
--SELECT @Script = @Script+' GO'
END
GOTO return_script
utf8:
return_script:
-- Return the result of the function
RETURN @Script
END
GO
November 20, 2012 at 1:52 pm
Oh, I neglected to mention that the data files I import don't have decimals (the decimal is implied based off of the field precision\scale) so.... it might not work if your numbers have decimals although, you could just make all fields VARCHAR for your destination table... that would work fairly well.
November 20, 2012 at 2:00 pm
Thank you all. I think with the responses so far I can get this working.
Viewing 11 posts - 1 through 10 (of 10 total)
You must be logged in to reply to this topic. Login to reply