February 26, 2008 at 7:36 am
This may or may not be an odd one. I've been struggling with it for a while, though, so help is appreciated.
I've got a contracts table in my main OLTP database. The contracts table isn't terribly normalized. It contains Customer Name, Customer Address info and Contract financial info. The OLTP contracts table tracks unique records with an surrogate identity key called ContractID. There is no associated CustomerID or AddressID in this table.
I need to separate out all three sets of this data into different tables in my datawarehouse. The customer name will be going to a Customers table, the address info will go to a generic Address table (so we can track historically updated address information) and the financials will be going to a Contracts table.
Now, I want to put the customer name into my Customers table, which has a surrogate CustomerKey as the PK (identity). Then I want to pull this CustomerKey from that table and associated it with the Address information and the Contract information. I know how to separate out the data itself, but I have no idea how, in SSIS, I should go about pulling back the identity values I just created and doing this association.
I'm trying to avoid populating the DW with the ContractID for each of the tables. I'm also trying to avoid populating the OLTP database with two new key values that will only be good for the datawarehouse.
Any thoughts? Or is what I'm asking too difficult?
February 26, 2008 at 8:22 am
Without giving away company specific info, can you provide a simple schema of how the tables relate in your OLTP database and in your datawarehouse? This will help me at least see what you are trying to accomplish.
😎
February 26, 2008 at 9:16 am
Lynn,
Here are the create statements (sanitzed). The OLTP table does contain more columns but I'm only concerned with a small few.
Create Table OLTPContracts (ContractID int identity(1,1), FirstName varchar(25), LastName varchar(25), Address varchar(50), City varchar(30), State char(2), Zipcode varchar(10), Amt money, EstTotalAmt money, ContractDate datetime, Term tinyint, MonthlyPymt smallmoney,
CertNumber varchar(20))
------------
Create Table DWContract (ContractKey int identity(1,1), CertNumber varchar(20), Amt money, EstTotalAmt money, ContractDate datetime, Term tinyint, MonthlyPymt smallmoney)
Create Table DWCustomer (CustomerKey int identity(1,1), FirstName varchar(25), LastName varchar(25), MiddleInitial char(1))
Create Table DWAddress (AddressKey int identity(1,1), Addr1 varchar(50), Addr2 varchar(50), City varchar(30), State char(2), Zipcode varchar(10), phone varchar(17), EntityTypeKey int, EntityTypeNameKey)
Create Table DWEntityTypeName (EntityTypeNameKey int identity(1,1), EntityTypeName varchar(20))
Create Table DWEntity (EntityKey int identity(1,1), CustomerKey int, DealerKey int, ContractKey int, AgentKey int....)
----------------------------
I have several different "entities". Customers who buy, Dealers who sell, Agents who broker the sale, etc. Each of these entities has its own table, but rather than repeat address information in each table, I've created a separate Address table where all address info for every entity is stored. This way, we can track historical address information (in case an address change was made wrongly) or create mailing lists as our needs arrive. Address is rarely reported on, so it makes sense to keep it separated this way.
The DWEntityTypeName defines if the entity in question is a Customer, Dealer or Agent. I use the key for that table in the Address table and in a few other tables.
The DWEntity table is basically one big fact table which connects all the possible combinations of Dealers & Agents & Customers in one place.
I threw those two tables in so you'd have a better understanding of what is higher up on the heirarchy for the Customer & Contract tables.
Does this make sense?
February 26, 2008 at 9:19 am
I should mention, I don't keep AddressKey in the fact table since people can have multiple addresses. Mailing address, street address, etc.
February 26, 2008 at 9:41 am
Based on what I have read, I have another question. How does the OLTP system relate Contracts to Customers since it does not have an apparent FK relationship defined?
Could you also provide a sanitized version of the OLTP Customer and Address tables?
😎
February 26, 2008 at 10:26 am
Lynn,
That's the problem. The OLTP doesn't have a Customers or Address table. All the customer info is stored in the OLTPContract table.
If I had separate tables in the OLTP, I wouldn't have issues translating it to the datawarehouse. Unfortunately, because the OLTP contains all that information in one table, and I'm breaking it out for the datawarehouse, I only have the ContractID to use as a key for any of them.
February 26, 2008 at 10:32 am
Okay, that helps. But to paraphase a bit to be sure:
Contract is being split into Customer, Address, and Contract in the DW.
Is this correct?
Edit:
Also, can a single customer have more than 1 contract?
😎
February 27, 2008 at 4:21 am
Yes, that is correct.
And yes, a customer can have more than one contract.
February 27, 2008 at 6:20 am
Okay. And, tell me if I am wrong, this also leads to another potential problem in the data. I could be in there multiple times as Lynn Pettis, Lynn A Pettis, L Pettis and with multiple addresses; 123 Woodstock St, 123 Wood Stock St, 123 Woodstock Street,..., correct?
Well I have enough info to start looking at how I'd map the data to the three tables. I'll work on it around my normal work and see what I come up with.
😎
February 27, 2008 at 6:45 am
You've hit the nail pretty much on the head.
We have two OLTP systems where from our SQL Server we export the data to the mainframe. The mainframe system has customers & contracts that SS doesn't have, because the mainframe services a number of divisions that the SS doesn't. However, there is some overlap. And because it's a mainframe, characters that work in SQL have to be scrubbed out of our export so they would be rejected in the mainframe.
So now I have to pull data from both the mainframe and our SQL Server database in this DW.
And this gets really fun when we get to the common names like... John Smith. Do you know how many John Smiths there are on this planet? :crazy:
February 27, 2008 at 7:48 am
More than one. You also have the variations on the name as well: John Smithe for example. Is it really Smithe or is it a typo and should really be Smith? Especially true if Smithe and Smith have the same address.
Not sure how much time I will be able to spend on this during the day (but I will look at it some as I have ideas popping in my head), but I definately have more time this evening.
😎
February 27, 2008 at 7:55 am
You mentioned pulling the data from both SQL Server and the mainframe, if the layout of the data from the mainframe the same as in SQL Server?
😎
February 28, 2008 at 8:08 am
The mainframe data for customer names is close. Firstname is 10 characters, LastName is 15 characters and they keep middle initials (1 character) that we don't track in SQL.
I came up with a thought this morning. Maybe I should add package variable for the LastCustKey, then populating it with a T-SQL Statement (i.e., Select Max(CustomerKey) from DwCustomers) then doing some sort of global temp table or ADO recordset where I pre-populate the Key. At which point, when I go to throw the records in the DW, I turn on Identity_Insert.
But that makes having an Identity value rather redundant, doesn't it? Should I keep the Key columns as identities or just force the population in the package?
February 28, 2008 at 8:26 am
Hmm. The flaw with the variable idea is if there are no records yet, even if you use an isnull(max(customerkey),0), it will fail the task as unable to set the variable because there is no result set.
So, onward to the next thing...
February 28, 2008 at 8:36 am
Sounds to me that you have a whole data scrubbing process to handle before even worrying about the DW. We unfortunately have a whole set of DB's dedicated to just that. with a mapping table of sorts ("these three very similar sets of key info map into this ONE person"), exception handling, search routines to come up with possible dupes, etc.
Proximity searches will be helpful in picking out potential candidates, but it will only get you so far. You will likely also need a way to keep "evil twins" separate once you determine that you have done (we actually had to pull all sorts of tricks, thanks to have two separate Mark A. Smith, MD, both gynecologists, and LIVING WITH EACH OTHER (father and son)).
These are tough problems to crack - and most of the time can't be done purely with automation. Cutting the list down to size is one thing, but the hard cases usually will require human intervention, IMO.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
Viewing 15 posts - 1 through 15 (of 20 total)
You must be logged in to reply to this topic. Login to reply