Loading a
series of flat files
Introduction
I recently
faced a requirement to load a number of flat files to a SQL Server 2000 database
on a nightly basis. The customer also wanted the ability to load a single file
as an ad-hoc request. The files were placed within a directory on the database
server by a third party piece of software. I have seen a number of questions to
deal with similar tasks in threads on this site, here is how I solved this task
using the extended stored procedures xp_DirTree, xp_FileExist and xp_CmdShell
(which are available within the master database although the first two are
undocumented in Books Online) together with a DTS package. My original solution
included various error logging that I have omitted here for brevity.
Methodology
The first
stored procedure is sp_LoadAll the purpose of which is to iterate through all
files in a specified directory after first validating that the directory exists.
For each file within the directory the stored procedure sp_LoadSingle is called
to load the file. To schedule the loading of files on a nightly basis this
stored procedure would be called from a SQL Server job. The methods to retrieve
the files and validate the directory exists are outlined further on.
SET QUOTED_IDENTIFIER OFF
GO
SET ANSI_NULLS OFF
GO
if exists (select * from
dbo.sysobjects where id = object_id(N'[dbo].[sp_LoadAll]') and OBJECTPROPERTY(id,
N'IsProcedure') = 1)
drop procedure [dbo].[sp_LoadAll]
GO
--
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-- Name :
sp_LoadAll
--
-- Parameters: None
--
-- Returns :
See returns from sub-procedures.
--
-- Description :
Loads all files in a specified directory.
--
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
create procedure sp_LoadAll
AS
set nocount on
--Variables
declare @rtn int
,
@Folder varchar(1000)
, @File
varchar(1000)
--Assign defaults
select @rtn =
0 --Executed with no errors.
--Assign the folder
select @Folder =
'c:\Load\'
--Does the folder
exist
exec
@rtn =
sp_FolderFileExist
@File =
@Folder
, @Switch
= 0
if @rtn !=
0 --Error
begin
goto
QuitWithErrLog
end
--Get the list of
files
exec
@rtn =
sp_GetFiles
@Folder
= @Folder
if @rtn !=
0 --Error
begin
goto
QuitWithErrLog
end
--Load the files.
declare cFiles cursor
local fast_forward for select @Folder + [file] from ##Files order by [file]
open cFiles
FetchNext:
fetch next from cFiles
into @File
while @@Fetch_Status =
0
begin
exec
@rtn
= sp_LoadSingle
@File = @File
goto
FetchNext
end
--Clean up cursor
close cFiles
deallocate cFiles
Quit:
set nocount off
return @rtn
QuitWithErrLog:
--*** Log errors here
***
goto Quit
GO
SET QUOTED_IDENTIFIER OFF
GO
SET ANSI_NULLS ON
GO
The second stored procedure is
sp_LoadSingle, this is called by sp_LoadAll for each file to load and it is also
called when one file is required to be loaded as an ad-hoc task. This procedure
firstly validates that the file to be loaded and the DTS package (stored as a
structured storage file) exist.
After this the data is ready to
be loaded, I think the neatest way to do this is to firstly load the data into a
holding table, perform all validation via TSQL queries and once successfully
validated, load the data into the live table(s). The holding table is simply a
series of varchar columns and should contain the data exactly as it appears
within the flat file i.e. no validation or filtering should be performed by the
DTS package its only purpose is to load the data into the holding table via a
data pump task.
The DTS package is
called via the extended stored procedure xp_CmdShell and the command line
utility Dtsrun (type Dtsrun /? At the command line for info.). It contains
global variables for the filename to be loaded, the server name and database
name, these are passed to the pacakge via switches in the Dtsrun command, see
below.
I have omitted the
stored procedure sp_Validate as the content of this depends on the requirements.
Some I faced were data being too long i.e. 6 characters in a field when only 5
were allowed, data type validation i.e. char data in numeric fields and a total
record count – the header record of each file contains a total record count
which had to match the actual record count. The customer also wanted a log
(within a SQL server table) of the exact field within the file in which the
error occurred together with the error. All this is achieved via TSQL statements
within the stored procedure sp_Validate.
The definitions of the two
tables used in this example are: -
create table [dbo].[Holding]
(
[c1] varchar(4)
, [c2] varchar(4)
, [c3] varchar(4)
)
on [primary]
go
create table [dbo].[Live]
(
[c1] int
, [c2] int
, [c3] int
)
on [primary]
go
SET QUOTED_IDENTIFIER OFF
GO
SET ANSI_NULLS OFF
GO
if exists (select * from
dbo.sysobjects where id = object_id(N'[dbo].[sp_LoadSingle]') and
OBJECTPROPERTY(id, N'IsProcedure') = 1)
drop procedure [dbo].[sp_LoadSingle]
GO
--
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-- Name :
sp_LoadSingle
--
-- Parameters :
@File varchar(1000) --The full path and file name.
--
-- Returns :
See returns from sub-procedures.
--
-- Description :
Loads a single file
--
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
create procedure sp_LoadSingle
@File
varchar(1000) --The full path and file name.
AS
set nocount on
--Variables
declare @rtn int
--Int to return
, @DTSPath
varchar(1000)
, @cmd
varchar(1000)
select @rtn =
0 --Executed with no errors.
--Validate the file
exists.
exec
@rtn =
sp_FolderFileExist
@File =
@File
, @Switch
= 1
if @rtn != 0
begin
goto
QuitWithErrLog
end
--Assign the full path
and filename of the ci.DTS package
select @DTSPath =
'c:\DTS\Load.DTS'
--Does CI.DTS exist on
the file server?
exec
@rtn =
sp_FolderFileExist
@File = @DTSPath
, @Switch
= 1
if @rtn != 0
begin
goto
QuitWithErrLog
end
--Clear the holding
table
truncate table Holding
--Load the file.
select @cmd = 'Dtsrun
/f "' + @DTSPath + '" /A File:8="' + @File + '" /A Server:8="' + @@servername +
'" /A Database:8="' + db_name() + '" /W TRUE'
exec
master..xp_cmdshell @cmd
--Validate the data
-- exec
-- @rtn =
sp_Validate
--
-- if @rtn != 0
-- begin
-- goto
QuitWithErrLog
-- end
--Load the data into
live
insert Live
(
[c1]
, [c2]
, [c3]
)
select
[c1]
, [c2]
, [c3]
from
Holding
Quit:
set nocount off
return @rtn
QuitWithErrLog:
--*** Log errors here
***
goto Quit
GO
SET QUOTED_IDENTIFIER OFF
GO
SET ANSI_NULLS ON
GO
Below is an
overview of the DTS package used to load the data. There is a database
connection and a flat file connection, a dynamic properties task assigns the
global variables (file connection and the server and database for the database
connection) then a data pump task loads the data.
The stored
procedure sp_GetFiles accepts a directory as a parameter and then populates the
temporary table ##Files with all the files within that directory via the
extended stored procedure xp_DirTree.
SET QUOTED_IDENTIFIER OFF
GO
SET ANSI_NULLS OFF
GO
if exists (select * from
dbo.sysobjects where id = object_id(N'[dbo].[sp_GetFiles]') and
OBJECTPROPERTY(id, N'IsProcedure') = 1)
drop procedure [dbo].[sp_GetFiles]
GO
--
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--
-- Name :
sp_GetFiles
--
-- Parameters:
@Folder
--
-- Returns : -1
No files to process.
--
-- Description :
Populates ##Files with all files in a specified directory
--
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
create procedure sp_GetFiles
@Folder
varchar(1000)
AS
--Variables
declare @rtn int
, @rowcount
int
select @rtn =
0 --Executed with no errors.
--Does the folder
exist
exec
@rtn =
sp_FolderFileExist
@File =
@Folder
, @Switch
= 0
if @rtn != 0
begin
goto Quit
end
--Drop the temp tables
to remove table and data.
if exists (select *
from tempdb.dbo.sysobjects where name = '##DirTree')
begin
drop table
##DirTree
end
if exists (select *
from tempdb.dbo.sysobjects where name = '##Files')
begin
drop table
##Files
end
--Create the temp
tables.
create table ##DirTree
(
[subdirectory]
varchar(255)
, [depth] int
, [file] int
)
create table ##Files
(
[File]
varchar(1000)
)
--Populate ##DirTree
insert ##DirTree
exec
master..xp_dirtree
@Folder
, 1
, 1
--Populate
##Files
insert ##Files
(
[File]
)
select
subdirectory
from
##DirTree
where
[file] =
1
select @rowcount = @@rowcount
if @rowcount = 0
begin
select @rtn
= -1
end
drop table ##DirTree
Quit:
return @rtn
GO
SET QUOTED_IDENTIFIER OFF
GO
SET ANSI_NULLS ON
GO
The stored
procedure sp_FolderFileExist validates if a folder or a file exists on the
database server via the extended stored procedure xp_FileExist. There are two
parameters, @File which is the folder or file to validate and @Switch, 0 for a
folder, 1 for a file.
SET QUOTED_IDENTIFIER OFF
GO
SET ANSI_NULLS OFF
GO
if exists (select * from
dbo.sysobjects where id = object_id(N'[dbo].[sp_FolderFileExist]') and
OBJECTPROPERTY(id, N'IsProcedure') = 1)
drop procedure [dbo].[sp_FolderFileExist]
GO
--
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-- Name :
sp_FolderFileExist
--
-- Parameters :
@File, @Switch
--
-- Returns : -1
= Folder or file does not exist.
--
-- Description :
Does a folder or a file exist.
--
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
create procedure
sp_FolderFileExist
@File
varchar(1000) --The folder or file name.
, @Switch
bit --0 = Folder; 1 = File
as
--Variables
declare @rtn int
select @rtn =
0 --Executed with no errors.
--Create a temp table.
create table #FolderFileExists
(
[File
Exists] int
, [File is a
Directory] int
, [Parent Directory
Exists] int
)
insert #FolderFileExists
exec
master..xp_FileExist
@File
if not exists
(
select *
from
#FolderFileExists
where
case @Switch
when 0
then [File is a Directory]
when 1
then [File Exists]
end = 1
)
begin
select @rtn
= -1
end
Quit:
return @rtn
GO
SET QUOTED_IDENTIFIER OFF
GO
SET ANSI_NULLS ON
GO
Conclusion
There are a
number of different ways to tackle this problem. A couple of which are: -
Iterate
through the files using VB and the filesystemobject calling a DTS package via
the COM interface to load the data.
Loop
through the files within a DTS package calling another DTS package to load the
data.
I liked this
method because it keeps everything within the database (except for the DTS
package) making development and installation simpler. Has anyone tackled a
similar problem in a different / better way, please let me know.