Microsoft SQL Server has numerous ways of importing and working with text files including bulk insert, DTS (SQL Server 2000), and SSIS (SQL Server 2005). Each technique has advantages and disadvantages. One of the most overall flexible ways of handling text files is through the use of the OpenDataSource command.
OpenDataSource allows ad hoc access to any OLE DB accessible file, and it offers a great deal of versatility. It will allow full access to files which bulk insert will not process properly, and unlike using DTS or SSIS it can be conducted entirely within T-SQL allows full access to the data without fully importing it into the server as a table. In most instances, bulk insert will be the more efficient technique; however, OpenDataSource will handle situations where bulk insert simply will not work such as reading text files where the number of columns is not equal in each row and it allows temporary access to the information immediately rather than through creating a temporary table.
The basic structure of OpenDataSource in reading a text file is:
select *
from
OpenDataSource('Microsoft.Jet.OLEDB.4.0',
'Data Source = C:\;
Extended Properties = "Text;HDR=NO;"')...test1#txt
The actual OpenDataSource statement is treated as the server that is used in a select statement with the actual file to be read being listed as the table, periods as always serve as placeholders for the qualified table name. Note that the period in the filename is replaced by the octothorpe (#) since periods are already a control character in SQL Server. The data source option lists the directory from which SQL Server should read. The Extended Properties must include the option Text. Optionally, it can include the HDR=NO option, which will instruct SQL Server not to use the first row as the column headers. If HDR = NO is on, then it will use F1 - F... as the column names.
Greater control can be gained through the use of a schema.ini file located in the same directory with the text file being read. The schema.ini permits defining the exact format of the file, including how it is delimited, as well as the data type and column name for every column. Every entry in the schema.ini must begin by specifying the filename in brackets [ ].
Then, it can specify the format with Format = options and specify if the first row has the column names with ColNameHeader =. The most common options for the Format option are TabDelimited, CSVDelimited for comma separated values, Delimited() with the character used to delimit between the parenthesis, or FixedLength. If the FixedLength option is used, the length of each column will be specified with the column information. If a format option is not specified or if there is no schema.ini file, the ISAM will default to comma delimited.
After that, it will specify the column names and data types for the ISAM reader to use. The column information is optional for delimited files, but is required for fixed width files. Each line of column information will be formatted as ColN = ColumnName datatype width n, with the width not being used on delimited files. The datatypes must be one of the Microsoft Jet data types, rather than the SQL Server column types, and would commonly include text for any character data, integer, double for floating point data.
A sample schema.ini entry would look like:
[test1.txt]
Format=TABDelimited
ColNameHeader=False
Col1=num1 text
col2=num2 text
col3=num3 text
Each file can have any number of entries, but each entry must begin with the specific file name. Though the schema.ini must identify the files it will apply to by name, it can be used with any file by dynamically copying the file to the name the schema.ini expects and then deleting it, or a schema.ini file can be created dynamically in a script through use of xp_cmdshell commands for the particular file. The schema.ini file is fully documented on MSDN.
One of the strengths of using OpenDataSource over bulk insert is its ability to handle imperfectly formatted files. For instance, it will read missing columns in the file as nulls rather than exiting out with an error or improperly combining lines of data. For instance, this text file:
a
b b
c c c
with the above select statement and schema.ini file reads in as:
NUM1 | NUM2 | NUM3 |
---|---|---|
a | NULL | NULL |
b | b | NULL |
c | c | c |
under Microsoft SQL Server 2005. Where trying to use bulk insert results in everything being merged into one column line as:
Col1 | Col2 | Col3 |
---|---|---|
a b | b c | c c |
due to the way it reads lines without all columns.
In addition, it will permit insert statements against OpenDataSource with a text file, but it will surround character data with the text delimiter, which by default is the quotation mark ("). This provides one way of appending to the file dynamically. But, it does not allow update or delete.
As with most tools within SQL server, the security settings must be set properly. By default, Ad Hoc Distributed queries are disabled in Microsoft SQL Server, which prevents using OpenDataSource this way. To enable it, an administrator must execute:
GO
EXEC sp_configure 'show advanced option', '1';
GO
Reconfigure
GO
EXEC sp_configure 'Ad Hoc Distributed Queries', '1'
GO
Reconfigure
With its ability to handle files that bulk insert cannot as well as to work with ad hoc queries immediately and to append to the text file, OpenDataSource used with text files is a potent tool for certain instances.
Further Reading:
http://msdn2.microsoft.com/en-us/library/aa276845(SQL.80).aspx - The MSDN general OpenDataSource reference page.
http://msdn2.microsoft.com/en-us/library/ms709353.aspx - The MSDN reference on the schema.ini file.