As SQL Server DBAs/Developers, we are often asked to import data from Excel spreadsheets into database tables. There are many different ways to achieve this:
- SSIS: use data flow task and chose Excel as data source;
- Ad hoc queries: use OPENROWSET/OPENDATASOURCE or Linked server with Microsoft Jet OLEDB provider;
- Excel COM methods: Use sp_OA family stored procedures to call the Excel COM methods
- Bulk load: Save the Excel spreadsheets as CSV files and use BCP or BULK INSERT
However, we sometimes still face challenges:
· There is no 64 bit Microsoft OLE DB Provider for Jet. The provider is required for linked server or OPENROWSET/OPENDATASOURCE etc to import Excel spreadsheets, Access tables, CSV files etc. Scripts, like the following one, used to work in SQL 32bit don’t work anymore in 64bit SQL Server:
SELECT* FROMOpenDataSource('Microsoft.Jet.OLEDB.4.0', 'Data Source="C:\TestExcel.xls";User ID=Admin;Password=;Extended properties="Excel 8.0;IMEX=1;HDR=Yes"')...[Sheet1$]
List 1
- Sometimes we need to handle dynamic spreadsheets, i.e. we don’t know the number of columns and their names in ahead, which is not easy to handle in SSIS or adhoc queries;
- Sometimes the security policy does not allow us to use sp_OA family stored procedures or OPENROWSET/OPENDATASOURCE adhoc queries. They are turned off by administrators.
This article discusses a way to import Excel spreadsheet by native T-SQL of SQL Server 2005 or later, so it works well in both 32bit and 64bit platforms.
Overview
Since 2003 MS Office supports XML spreadsheet, i.e. an Excel workbook can be saved as an xml document. The xml spreadsheet has “Excel.Sheet” as processing instruction so when you double click it, it will be opened in Excel instead of internet explorer or other xml processor.
<?xml version="1.0"?> <?mso-application progid="Excel.Sheet"?> <Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
List 2
The basic idea is to save Excel spreadsheets as xml spreadsheets.
- Use the BULK rowset provider of OPENROWSET to load the xml file to SQL Server;
- Process the loaded xml to parse the header and values for each cell;
- Pivot the cells and output the data table;
A stored procedure is provided as a generic tool to import Excel xml spreadsheet:
[dbo].[uspImportExcelSheet]
@ExcelFileName nvarchar(255),
@WorkSheetName nvarchar(255),
@OutputTableName nvarchar(257)=N'',
@FirstRowIsHeader bit=0,
@IdentityColumn sysname=N'',
@FirstRowToCheckDataType int=1,
@RowsToCheckDataType int=8,
@FirstRowToImportData int=0,
@RowsToImport int=0,
@Debug tinyint=0
Parameters:
@ExcelFileName | The Excel workbook file full path name (UNC format for remote file). It must be saved in xml format |
@WorkSheetName | The name of the worksheet to import |
@OutputTableName | The output data table. If empty string, return a result set. If specified, the table can either be pre-created or let the stored procedure to create, in this case, the table must be a global temporary table or a physical table. |
@FirstRowIsHeader | 1, The first row in the worksheet is header; 0, no headers; Default is 0; |
@IdentityColumn | Only applicable when @OutputTableName is specified and the table is not pre-created. Add the identity column with the specified name when creating the table. Default is empty string; |
@FirstRowToImportData | The row number starting to import the data. Default is @FirstRowIsHeader+1 |
@RowsToImport | Number of rows to import the data. Default is 0 (all rows except the header) |
@FirstRowToCheckDataType | The row number starting to check the data type. Default is @FirstRowToImportData. Not applicable if @OutputTableName is pre-created. |
@RowsToCheckDataType | Number of rows to read to check the data type. Default is 8. Not applicable if @OutputTableName is pre-created. |
@Debug | Whether to execute the SP in debug mode, which will output intermediate results. 0, no; 1, yes; |
BULK load xml spreadsheet to SQL Server
Since SQL Server 2005, we can use OPENROWSET with the BULK rowset provider to load a file from OS into SQL Server. The file can be on the SQL Server local machine or a remote machine. The following script loads the “TestBook.xml“file from a shared folder “Files” on the machine “RemoteMachine”
SELECTCAST(BulkColumn as xml)FROM OPENROWSET (BULK'\\RemoteMachine\Files\TestBook.xml', SINGLE_BLOB) A
The file name must be a constant in the OPENROWSET BULK function. Since we need to import Excel files with any name, we can use dynamic SQL:
DECLARE@WS xml, @SQL nvarchar(2000), @ExcelFileName nvarchar(255); SET@WS=NULL; SET@SQL=N'SET @WorkSheet=NULL; SELECT @WorkSheet=CAST(BulkColumn as xml) FROM OPENROWSET (BULK '''+@ExcelFileName+N''', SINGLE_BLOB ) A;' execsp_executesql @SQL,N'@WorkSheet xml OUTPUT',@WorkSheet=@WS OUTPUT; SELECT@WS;
List 3
Note: The OPENROWSET BULK is not an adhoc distributed query and it is not controlled by the surface area “Ad Hoc Distributed Queries”. So it still works after “Ad Hoc Distributed Queries” is turned off by:
sp_configure'show advanced options',1 RECONFIGURE sp_configure'Ad Hoc Distributed Queries',0 RECONFIGURE
Parse and Shred the XML
First let’s see an example, for the spreadsheet in table 1, part of the xml is in list 2.
Table 1: Sample xml spreadsheet
A | B | C | D | |
1 | TextCol | NumCol | DateCol | Text2Col |
2 | AAA | 10.005 | 10/1/2009 1:00 | Test1 |
3 | BBB | Test2 | ||
4 | CCC | 30 | 10/3/2009 3:00 | |
5 | 40 | Test4 | ||
6 | 12/3/2009 3:00 | Test5 |
We can see that MS office has very flexible schema for XML spreadsheet:
- If a cell does not have value, the cell is not present in the xml, e.g. B3 in table 1.
- For a cell with value, if it is the first cell, or the previous cell has value too, the attribute “ss:Index“ for the element “Cell” does not present.
- “ss:Index“ only presents for cells with value when the previous cell does not have value, e.g. the xml piece in red in list 4.
<Table ss:ExpandedColumnCount="4" ss:ExpandedRowCount="6" x:FullColumns="1" x:FullRows="1" ss:DefaultRowHeight="15"> <Column ss:Index="2" ss:Width="48.75"/> <Column ss:Width="78"/> <Row ss:AutoFitHeight="0"> <Cell><Data ss:Type="String">TextCol</Data></Cell> <Cell><Data ss:Type="String">NumCol</Data></Cell> <Cell><Data ss:Type="String">DateCol</Data></Cell> <Cell><Data ss:Type="String">Text2Col</Data></Cell> </Row> <Row ss:AutoFitHeight="0"> <Cell><Data ss:Type="String">AAA</Data></Cell> <Cell><Data ss:Type="Number">10.005000000000001</Data></Cell> <Cell ss:StyleID="s62"><Data ss:Type="DateTime">2009-10-01T01:00:00.000</Data></Cell> <Cell><Data ss:Type="String">Test1</Data></Cell> </Row> <Row ss:AutoFitHeight="0"> <Cell><Data ss:Type="String">BBB</Data></Cell> <Cell ss:Index="3" ss:StyleID="s62"/> <Cell><Data ss:Type="String">Test2</Data></Cell> </Row> <Row ss:AutoFitHeight="0"> <Cell><Data ss:Type="String">CCC</Data></Cell> <Cell><Data ss:Type="Number">30</Data></Cell> <Cell ss:StyleID="s62"><Data ss:Type="DateTime">2009-10-03T03:00:00.000</Data></Cell> </Row> <Row ss:AutoFitHeight="0"> <Cell ss:Index="2"><Data ss:Type="Number">40</Data></Cell> <Cell ss:Index="4"><Data ss:Type="String">Test4</Data></Cell> </Row> <Row ss:AutoFitHeight="0"> <Cell ss:Index="3" ss:StyleID="s62"><Data ss:Type="DateTime">2009-12-03T03:00:00.000</Data></Cell> <Cell><Data ss:Type="String">Test5</Data></Cell> </Row> </Table>
List 4
Get Row Number and Column Number: xquery is used to get the row number and column number for the specified spreadsheet "@WorkSheetName".
;WITHXMLNAMESPACES(DEFAULT'urn:schemas-microsoft-com:office:spreadsheet', 'urn:schemas-microsoft-com:office:office'AS o, 'http://www.w3.org/TR/REC-html40' AS html, 'urn:schemas-microsoft-com:office:excel'AS x, 'urn:schemas-microsoft-com:office:spreadsheet'AS ss) SELECT @ColNum=R.c.value(N'@ss:ExpandedColumnCount','int'), @RowNum=R.c.value(N'@ss:ExpandedRowCount','int') FROM@WS.nodes('Workbook/Worksheet[@ss:Name=sql:variable("@WorkSheetName")]/Table') R(c);
List 5
To get the headers (column names) of the spreadsheet:
The first row can be the headers instead of data in a spreadsheet. Xquery is used to read the value for each column for the first row.
To parse and shred the xml, OPENXML is used:
- OPENXML has better performance than xquery on large xml ;
- Row order in the xml is enforced by the position of the <Row> element. However we cannot get the value of the position() function in SQL Server xquery, it can only be used in predicts (please check this MS connect). If any cell does not have value, it will be hard to assign row number and column number (CellIndex) to cells with values. OPENXML has metadata (unique ID and parent ID) for each node in the xml. We can use the IDs to associate the cells with the rows.
List 6 is the code to shred the xml:
- Raw cell data is saved in a temporary table #Raw$ExcelData
- The xpath in OPENXML does not support the sql:variable() function as xquery does so we need to use dynamic SQL in order to import the specified worksheet name;
- The xpath position() function is used to limit the specified rows to import;
- If we need to detect data type of the columns, all rows are shredded into the raw data table;
- If we do not need to detect data type, only rows specified to import are shredded into the raw data table.
- The id (@mp:id) and parentid (@mp:parentid) of each cell is used to group the cells with the rows and persist the row ordering.
- DENSE_RANK () is used to generate the consecutive Row number based on the parentid metadata, which is not consecutive.
- To prevent potential memory leaking, the xml document is removed immediately after the OPENXML and the handle @iDoc is set to 0; In the CATCH block of the stored procedure, the handle is checked and released if it is not 0.
SET@iDoc=0; EXEC sp_xml_preparedocument @iDoc OUTPUT, @WS,'<Workbook xmlns:P="urn:schemas-microsoft-com:office:spreadsheet" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40"/>'; SET @SQL=N' INSERT #Raw$ExcelData(Data,CellIndex,RowID) SELECT ISNULL(Data,N''''),ISNULL(CellIndex,0),DENSE_RANK ( ) OVER (ORDER BY RowID)'+ CASE WHEN @DetectDataType=1 THEN N' FROM OPENXML(@iDoc,''P:Workbook/P:Worksheet[@ss:Name="'+@WorkSheetName+N'"]/P:Table/P:Row/P:Cell'',10)' ELSE N' FROM OPENXML(@iDoc,''P:Workbook/P:Worksheet[@ss:Name="'+@WorkSheetName+N'"]/P:Table/P:Row'+ CASE WHEN @FirstRowToImportData>1 OR @LastRowToImportData<@RowNum THEN N'[position()>='+CAST(@FirstRowToImportData as nvarchar(30))+N' and position()<='+CAST(@LastRowToImportData as nvarchar(30))+N']' ELSE N'' END+N'/P:Cell'',10)' END + N' WITH (Data nvarchar(max) ''P:Data[1]'', CellIndex int ''@ss:Index'', RowID int ''@mp:parentid'', CellID int ''@mp:id'') ORDER BY RowID,CellID' IF @Debug=1 PRINT @SQL; exec sp_executesql @SQL,N'@iDoc int',@iDoc=@iDoc; EXEC sp_xml_removedocument @iDoc SET @iDoc=0;
List 6
Table 1: Shredded Raw Data for the Sample xml spreadsheet
RID | Data | CellIndex | RowID | NewCellIndex |
1 | TextCol | 0 | 1 | 1 |
2 | NumCol | 0 | 1 | 2 |
3 | DateCol | 0 | 1 | 3 |
4 | Text2Col | 0 | 1 | 4 |
5 | AAA | 0 | 2 | 1 |
6 | 10.005 | 0 | 2 | 2 |
7 | 2009-10-01T01:00:00.000 | 0 | 2 | 3 |
8 | Test1 | 0 | 2 | 4 |
9 | BBB | 0 | 3 | 1 |
10 | Test2 | 4 | 3 | 4 |
11 | CCC | 0 | 4 | 1 |
12 | 30 | 0 | 4 | 2 |
13 | 2009-10-03T03:00:00.000 | 0 | 4 | 3 |
14 | 40 | 2 | 5 | 2 |
15 | Test4 | 4 | 5 | 4 |
16 | 2009-12-03T03:00:00.000 | 3 | 6 | 3 |
17 | Test5 | 0 | 6 | 4 |
To assign the valid column number (CellIndex):
The column “CellIndex” in table 2 is the value after shredding the spreadsheet in table 1. We need to assign the correct CellIndex as showed in “NewCellindex”. It is a typical T-SQL challenge similar to running total. I use the code snippet in list 7. The worst scenario is that all cells have values and the Update statement will run the same times as the number of the columns in the spreadsheet (I am sure there must be faster ways to do it and this can be a challenge to the readers).
DECLARE@Rows int -- Set the CellIndex to 1 for the first cell for each row UPDATED SET CellIndex=1 FROM#Raw$ExcelData D,( SELECT R.RowID,MIN(R.RID)AS RID FROM #Raw$ExcelData R GROUP BY R.RowID ) F WHERE D.RowID=F.RowID AND D.RID=F.RID AND D.CellIndex=0 SET @Rows=ISNULL(@@ROWCOUNT,0) -- Set the CellInded to be 1 plus the CellIndex of the previous cell WHILE@Rows>0 BEGIN UPDATE D SET CellIndex=P.CellIndex+1 FROM #Raw$ExcelData D,#Raw$ExcelData P WHERE D.RowID=P.RowID AND D.RID=P.RID+1 AND D.CellIndex=0 AND P.CellIndex>0 SET @Rows=ISNULL(@@ROWCOUNT,0) END
List 7
To detect data type of the columns in the spreadsheet:
In the raw cell data temporary table, data column is of nvarchar(max) data type. If @OutputTableName is not provided, the stored procedure outputs all columns as string. The length of the string is determined by @FirstRowToCheckDataType and@RowsToCheckDataType:
· If all cells data length is smaller than 255 characters, the column length is the max length of the cells.
· If any cell data length is greater than 255, the column data type is defined as nvarchar(max).
If @OutputTableName is provided, the data is converted to the data type defined in the table.
Pivot the Cells and Output the Results
The raw cell data is pivoted by means of the PIVOT operator. The pivoted data can be returned to the user or saved to a table.
Import Excel Stored Procedure
The stored procedure dbo.uspImportExcelSheetcan be used to import Excel XML spreadsheet in various scenarios.
1) Import from a spreadsheet and return all columns in a resultset, @FirstRowIsHeader must be 1:
EXEC dbo.[uspImportExcelSheet] @ExcelFileName=N'C:\TestBook.xml', @WorkSheetName=N'Sheet1', @FirstRowIsHeader=1
2) Import from a spreadsheet and save all columns in a specified table. It must be a global temporary table or a physical table. @FirstRowIsHeader must be 1. This is useful if we don’t know the number of columns the spreadsheet has.
IF object_id('tempdb..##Data') ISNOT NULL DROP TABLE ##Data; EXEC dbo.[uspImportExcelSheet] @ExcelFileName=N'C:\TestBook.xml', @WorkSheetName=N'Sheet1', @OutputTableName=N'##Data', @FirstRowIsHeader=1, @IdentityColumn=N'RecordID' SELECT * FROM ##Data IF object_id('tempdb..##Data') ISNOT NULL DROP TABLE ##Data;
3) Import a worksheet with all columns and save the data into the table #Data, which is pre-created. If the @FirstRowIsHeader is 1, the table column name must match the header of the spreadsheet. If @FirstRowIsHeader is 0, the table column and spreadsheet column is mapped by position.
IF object_id('tempdb..#Data')IS NOT NULL DROP TABLE #Data; CREATE TABLE #Data ( TextCol nvarchar(30)NULL, NumCol nvarchar(50)NULL, DateCol nvarchar(50)Null, Text2Col nvarchar(30)NULL ) EXEC dbo.[uspImportExcelSheet] @ExcelFileName=N'C:\TestBook.xml', @WorkSheetName=N'Sheet1', @OutputTableName=N'#Data', @FirstRowIsHeader=0 SELECT *FROM #Data IF object_id('tempdb..#Data')IS NOT NULL DROP TABLE #Data;
4) Import a worksheet with specified columns in the table #Data @FirstRowIsHeader must be 1. Import 2 rows starting from the 3rd row:
IF object_id('tempdb..#Data')IS NOT NULL DROP TABLE #Data; CREATE TABLE #Data ( DateCol nvarchar(50)Null, Text2Col nvarchar(30)NULL ) EXEC dbo.[uspImportExcelSheet] @ExcelFileName=N'C:\TestBook.xml', @WorkSheetName=N'Sheet1', @OutputTableName=N'#Data', @FirstRowIsHeader=1, @FirstRowToImportData=3, @RowsToImport=2 SELECT *FROM #Data IF object_id('tempdb..#Data')IS NOT NULL DROP TABLE #Data;
5) Import a worksheet with all columns and save the returned results into a table variable. Detect data type length by all rows:
DECLARE @Data TABLE ( RowNumber int not null, TextCol nvarchar(30)NULL, NumCol nvarchar(50)NULL, DateCol nvarchar(50)Null, Text2Col nvarchar(30)NULL ) INSERT @Data (RowNumber,TextCol,NumCol,DateCol,Text2Col) EXEC dbo.[uspImportExcelSheet] @ExcelFileName=N'C:\TestBook.xml', @WorkSheetName=N'Sheet1', @FirstRowIsHeader=1, @RowsToCheckDataType=0 SELECT * FROM @Data