October 7, 2015 at 2:02 am
HI All,
I'm having a problem with an ETL project I'm working on. The file I receive has multiple sheets, which are all in the same format. I then need to import them into one table on the SQL server.
The problem I'm facing is that the company we receive the file from have a habit of filling "N/A", "Unknown", and other strings in numeric fields. Also some of the sheets won't have a column populated one month, as it's not applicable. This means that, according to the Excel Source, the data type of the column has changed (if they're all NULLs, then Excel suddenly thinks it's a int, not a string, for example).
What I want to therefore do, is use the excel source and treat EVERY column as a Unicode string, and then have a data conversion tasks that makes them the correct data type. Any conversions that fails will be NULL (as I don't want someone's age to be "TBC", and I actually have our Age column as a integer field on the server). When I change the DataType in the External Columns menu though, it changes it back when you open it back up. How do I change the datatypes, and actually get it to save it?
I could change all the output columns to Unicode Strings, but if I recall correctly, I had conversion errors when I tried that (I could be wrong, so correct me. This is a project that got put away for a few months and has worked it way back up to the top of the daisy chain).
Many thanks for your help!
Thom~
Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk
October 7, 2015 at 9:45 pm
Try updating the connection string to include "IMEX=1" - this should tell the OLEDB provider to treat all cells as text. FOr more info - see https://www.connectionstrings.com/excel/
October 8, 2015 at 2:00 am
happycat59 (10/7/2015)
Try updating the connection string to include "IMEX=1" - this should tell the OLEDB provider to treat all cells as text. FOr more info - see https://www.connectionstrings.com/excel/
Thanks Happy.
I had already done this, but it was still trying to parse some decimals as Currency, and others as Dates (which was inconsistent depending on which sheet I was getting the data from). I ended up using a Script Component as a source, and creating the connection in there to pull all the data through as a String, and then using a Data Conversion Transformation to get them to the right type.
Excel was a bit janky, and sometimes sent dates as ints, and ints as dates, but some tryparse's and dateadd's/maths fixed that. A struggle, but got the result I wanted.
Thom~
Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk
October 8, 2015 at 8:15 pm
I understand your issues - Excel is probably my least preferred data source. Unfortunately, Excel does not really care too much about data types - underneath the wrappers, I suspect that each cell is just a bunch of bytes and the display format is used as the type that people think is stored in the cell.
The oledb provider needs to work with that sort of type-less data and struggles big time. What it actually does is parse the first few rows (by default 8 rows but this is configurable using a registry setting) to try to work out what data type to use. This is why you get different data types for a column depending on what is in the first few rows. Not very useful when SSIS really needs the data type to be the same each time otherwise it gets upset.
October 14, 2015 at 3:23 am
I had the same issue a few weeks ago, the reason is simple,the excel source determines datatypes based on the values in the first 8 rows. IMEX on its own changes nothing about that.
IMEX=1 will only treat the column as text if you have a text value within those 8 rows.
IMEX=0 will take the datatype that seems to be prevalent the most in those 8 rows.
You can change the amount of rows it samples in the registry of the OS but I don't think you ever want to mess with the registry.
I'll post a script during my lunch break that can convert a excell tab into an csv file.
October 14, 2015 at 3:40 am
Ok I've used the following script to convert excel tabs/sheets into csv files.
Use a for each file loop to loop through the excel files,you can use an ado.net for loop to loop through the tabs/sheets.
The script takes the following variables
Read Only
User::CSV_Loc = location where you want the csv file
User::DataFile_Input_Path = location of where the excel file sits
User::Input_Share_Root = root folder of the location
User::SourceFile = name of the excel file
Read Write
User::csv_FileName = name of the csv file
written in Microsoft Visual Basic 2008
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime
Imports System.Data.OleDb
Imports System.IO
.
.
.
Public Sub Main()
Dim sourceFile As String
Dim worksheetName As String
Dim targetFile As String
Dim fileName As String
fileName = Replace(Replace(Dts.Variables("SourceFile").Value.ToString(), "filename", ""), ".xlsx", "")
sourceFile = Dts.Variables("Input_Share_Root").Value.ToString() + Dts.Variables("DataFile_Input_Path").Value.ToString() + Dts.Variables("SourceFile").Value.ToString()
targetFile = Dts.Variables("CSV_Loc").Value.ToString() + fileName + ".csv"
worksheetName = "sheetname"
convertExcelToCSV(sourceFile, worksheetName, targetFile)
Dts.Variables("csv_FileName").Value = "fileName + ".csv"
Dts.TaskResult = ScriptResults.Success
End Sub
Private Shared Sub convertExcelToCSV(ByVal sourceFile As String, ByVal worksheetName As String, ByVal targetFile As String)
Dim strConn As String = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source='" & sourceFile & "';Extended Properties=""Excel 12.0 Xml;HDR=YES;"""
Dim conn As OleDbConnection = Nothing
Dim wrtr As StreamWriter = Nothing
Dim cmd As OleDbCommand = Nothing
Dim da As OleDbDataAdapter = Nothing
Try
conn = New OleDbConnection(strConn)
conn.Open()
cmd = New OleDbCommand("SELECT * FROM [" & worksheetName & "$]", conn)
cmd.CommandType = CommandType.Text
wrtr = New StreamWriter(targetFile)
da = New OleDbDataAdapter(cmd)
Dim dt As DataTable = New DataTable()
da.Fill(dt)
Dim rowString As String = ""
For z As Integer = 0 To dt.Columns.Count - 1
rowString &= "----;"
Next z
wrtr.WriteLine(rowString)
For x As Integer = 0 To dt.Rows.Count - 1
rowString = ""
For y As Integer = 0 To dt.Columns.Count - 1
rowString &= "" & dt.Rows(x)(y).ToString() & ";"
Next y
wrtr.WriteLine(rowString)
Next x
' MessageBox.Show("Done! Your " & sourceFile & " has been converted into " & targetFile & "")
'' wrtr.Close()
conn.Close()
Catch exc As Exception
' MessageBox.Show("- An unexpected error occure" + vbNewLine + "- Please contact application administrator")
End Try
wrtr.Close()
wrtr.Dispose()
conn.Close()
conn.Dispose()
End Sub
End Class
Quickly following script in Visual C# 2008
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;
using System.Text.RegularExpressions;
using System.IO;
.
.
.
public void Main()
{
string sheetName = Dts.Variables["User::SheetName"].Value.ToString();
string pattern = Dts.Variables["User::PatternToMatch"].Value.ToString();
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
Match match = Regex.Match(sheetName, pattern);
//Dts.Variables["User::ProcessTheSheet"].Value = match.Success;
bool a_1 = Regex.IsMatch(sheetName, @"[0-9]");
if (match.Success)
{
if (a_1 == true)
{
Dts.Variables["User::ProcessTheSheet"].Value = true;
Dts.Variables["User::SheetName"].Value = sheetName.Replace("'", "");
// tw.WriteLine("Y:" + Dts.Variables["User::SheetName"].Value);
}
else
{
Dts.Variables["User::ProcessTheSheet"].Value = false;
// tw.WriteLine("N:" + Dts.Variables["User::SheetName"].Value);
}
}
else
{
Dts.Variables["User::ProcessTheSheet"].Value = false;
// tw.WriteLine("N:" + Dts.Variables["User::SheetName"].Value);
}
// tw.Close();
Dts.TaskResult = (int)ScriptResults.Success;
}
This script will give true false when the tab/sheetname matches a pattern,you can use that to filter out the tabs you want.
October 14, 2015 at 4:08 am
Thanks Resender.
I've included the script that one of the guys passed to be as well. I have a feeling he found it only, but there's no annotations to say where from (he would of left them in, so sorry to whoever created it that you don't have your kudos):
#region Namespaces
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.IO;
using System.Data.SqlClient;
using System.Data.OleDb;
using System.Diagnostics;
#endregion
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
private OleDbDataReader excelReader;
private OleDbConnection excelConnection;
private OleDbCommand excelCommand;
public override void PreExecute()
{
base.PreExecute();
// Open connection
GetDataFromExcelToReader(Variables.FilePath);
}
public override void PostExecute()
{
base.PostExecute();
excelReader.Close();
excelConnection.Close();
}
private void GetDataFromExcelToReader(string p_strFileName)
{
string l_strConnectionString;
if (File.Exists(p_strFileName))
{
string extension = Path.GetExtension(p_strFileName);
extension = extension.Replace(".", "");
if (extension.ToLower() == "xlsx")
{
l_strConnectionString = "Provider=Microsoft.ACE.OLEDB.12.0;" +
"Data Source=" + p_strFileName + ";Extended Properties=\"Excel 12.0;HDR=NO;IMEX=1\";";
}
else
{
l_strConnectionString = "Provider=Microsoft.Jet.OLEDB.4.0;" +
"Data Source=" + p_strFileName + ";Extended Properties=\"Excel 8.0;HDR=NO;IMEX=1\";";
}
excelConnection = new OleDbConnection(l_strConnectionString);
excelConnection.Open();
excelCommand = excelConnection.CreateCommand();
excelCommand.CommandText = "SELECT * FROM [" + "RTT$A8:BO" /*Variables.SheetName.ToString()*/ + "]";
excelCommand.CommandType = CommandType.Text;
excelReader = excelCommand.ExecuteReader();
}
}
public override void CreateNewOutputRows()
{
int intInputDate;
DateTime dtOutputDate;
int IntOutputDate;
DateTime dtInputDate;
int intLineNumber = 7;
while (excelReader.Read())
{
intLineNumber++;
ComponentMetaData.FireInformation(0, "RTT - FD", "Reading Line " + intLineNumber.ToString(), string.Empty, 0, true);
RTTBuffer.AddRow();
if (excelReader.FieldCount > 0)
{
if (DateTime.TryParse(excelReader[0].ToString(), out dtInputDate))
{
dtInputDate = DateTime.Parse(excelReader[0].ToString());
IntOutputDate = int.Parse(((dtInputDate - DateTime.Parse("01/01/1900")).TotalDays).ToString());
RTTBuffer.YOA = IntOutputDate.ToString();
}
else
{
RTTBuffer.YOA = excelReader.FieldCount > 0 ? excelReader[0].ToString() : "";
}
}
RTTBuffer.ClaimNo = excelReader.FieldCount > 1 ? excelReader[1].ToString() : "";
RTTBuffer.ClaimNumber = excelReader.FieldCount > 2 ? excelReader[2].ToString() : "";
RTTBuffer.Ref = excelReader.FieldCount > 3 ? excelReader[3].ToString() : "";
RTTBuffer.PolicyNumber = excelReader.FieldCount > 4 ? excelReader[4].ToString() : "";
if (excelReader.FieldCount > 5)
{
if (int.TryParse(excelReader[5].ToString(), out intInputDate))
{
intInputDate = int.Parse(excelReader[5].ToString());
dtOutputDate = DateTime.Parse("01/01/1900");
dtOutputDate = dtOutputDate.AddDays(intInputDate);
RTTBuffer.PolicyInceptionDate = dtOutputDate.ToString();
}
else
{
RTTBuffer.PolicyInceptionDate = excelReader.FieldCount > 5 ? excelReader[5].ToString() : "";
}
}
}
}
}
There's a part commented out in the GetDataFromExcelToReader, so that if you'd like to have the sheet name in a variable, rather than hard coded, you can simply replace that part. The if statement is in there, as the Reader had problems with some integers and Dates (it sometimes processed dates as ints, and ints as dates). Therefore it has a tryparse to check if it comes back as I expect it. The first if statement if hopeing for an Integer, and the second a DateTime.
If you have questions, feel free to ask, and i'll try and answer.
Thom~
Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk
Viewing 7 posts - 1 through 6 (of 6 total)
You must be logged in to reply to this topic. Login to reply