This post is part of a series on this blog that will help me, and hopefully you, pass exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012. So far, we’ve covered:
- Tables and Schemas in the data warehouse
- Dimensions and Slowly Changing Dimensions
- Fact tables and measures
- Intro to columnstore indexes
- Columnstore indexes and partitioning
- Introduction to SQL Server Data Tools (SSDT)
- Connection Managers
- Control Flow Tasks and Containers
The most critical part of any SSIS project is the data flow. It is, after all, the heart of your ETL process. In this post, we begin looking at the components of the data flow more closely. We’ll start with the data flow source adapters.
Data Flow Source Adapters
The data flow source adapter is used to pull data out of a source repository for use within your package. There are several source adapters to choose from, depending on where your source data resides.
- ADO NET source – allows connectivity to a data source using an ADO.NET provider. For example, I’ve used this adapter to pull data out of MySQL databases.
- CDC source – introduced in SQL Server 2012, this adapter allows you to pull changed data out of a Change Data Capture (CDC)-enabled table. The CDC source uses an ADO .NET connection manager.
- Excel source – extract data from an Excel spreadsheet
- Flat File source – pull data from a delimited or fixed-width file
- ODBC source – connect to a source using native ODBC
- OLE DB source – connect to OLE DB providers like SQL Server and Oracle (as mentioned in a previous post, Microsoft is moving away from OLE DB and recommends using ODBC sources instead)
- Raw File source – raw files are native files generated by SSIS using the Raw File destination. The Raw File source reads in these files very quickly with little parsing or translation required.
- XML source – extract data from an XML file using a defined XML schema
Configuring the Source Adapter
When you add a data source adapter, you’ll need to configure it. This usually entails specifying a connection manager, an access mode (table or view, SQL command, etc.), columns, and any other properties that are specific to the selected adapter.
Let’s look more closely at the ODBC Source adapter as an example. If I add the adapter to my data flow and double-click it to edit, the first page I see is the Connection Manager page. Here I can select an ODBC connection manager that I’ve already created in my package, or create a new one. For the data access mode, I have a choice of a table or a SQL Command. Depending on which mode I select, I would then specify a table/view (views don’t seem to show up in the drop down list, although you can manually enter the view name) or type in my SQL query.
Once I’ve specified my source, I can then move to the Columns page to specify which columns I want to pull. It’s a best practice to limit the columns to those you actually need to extract, to avoid consuming unnecessary resources.
Further configuration can be performed using the Advanced Editor (right-click on your source and select “Show Advanced Editor…” In the Component Properties tab you can configure the same properties you would see in the properties window, including the ValidateExternalMetadata and BatchSize properties. ValidateExternalMetadata tells SSIS whether to validate the data source’s metadata at design time or wait until runtime. Setting this property to false is useful when the specified source table doesn’t exist now, but will be created by the package itself. The BatchSize property specifies the number of rows per batch.
On the Input and Output Properties tab of the Advanced Editor, you can see and change how SSIS is mapping the source data types to SSIS data types. By clicking on the ODBC Source Output node, you can also set the IsSorted property to True if you’re using a source query with an ORDER BY clause. This will help eliminate the need to sort data inside your SSIS package when using certain transformations (but more on that in a later post).
Summary
As with everything in SSIS, it’s important to use the right tool for the job. While you may be able to use the ADO .NET, OLE DB, or ODBC source for a particular connection, each offers slightly different capabilities. Use the appropriate source for your specific scenario. Next time, we’ll talk about the other side of the ETL equation: the destination.