October 5, 2007 at 2:35 pm
Hi,
I am involved in the creation of a data mart to model credit card transactions. The grain of our Transaction_Fact fact table is "one row per transaction processed". The dimensions surrounding this fact table are: Date, Merchant, Response, Card (one row per card processed), etc.
The problem we are having is that the card number has a very high cardinality. We could say that one same card is processed no more than 3 times. It means that I will need to insert a new row into Card dimension every 3 rows I insert into Transaction_Fact. That makes my Card dimension a monster dimension that grows nearly at the same speed of the fact table. We could forget the Card dimension and bring the card number into the fact table, but we have several other attributes in the Card dimension (expiration date, card holder data, etc.) and I don't know what to do with them.
How would you recommend we change the design to accommodate the rapidly changing Card dimension? Also, could you please point me to an article discussing the topic, other than the classic order header-detail case of study?
Thanks,
jcoira
// jcoira
October 6, 2007 at 8:10 pm
If you have to tract all that other stuff about a card, why worry about how many rows you have... it's gonna have to be stored somewhere. Perhaps a separate table (fact table?) for all of that?
--Jeff Moden
Change is inevitable... Change for the better is not.
October 8, 2007 at 1:35 pm
Thanks for replying Jeff.
I am worried because of everything I've read about the performance issues of monster dimensions. I know it is not good to have a dimension that grows along with your fact table. I just dont know how to avoid it in this case.
jcoira
// jcoira
October 8, 2007 at 4:27 pm
Guess my other question would be... why do you even need this type of info in a fact table or dimension? This should be simple lookup data.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 9, 2007 at 8:59 am
As others have said, the data has to exist somewhere relationally. As for dimension/fact design, it all depends upon what type of analytics you want to do. Would you really want to browse a dimension all the way down to the card level? My guess is the card itself is not important to regular OLAP analytics. I doubt you would expose the card numbers themselves for browsing, CISP/security requirements notwithstanding. The card is almost a customer dimension.
There are different types of analytics you can do, each with its own potential designs and frequency of use, and source.
OLAP Analytics by card type can be done daily/weekly/monthly for gross reconciliation. They can be done monthly or ad hoc for analysis of processing fees or cost projections for promotions by card type. Reconciliation to the transaction would likely be at the relational level only.
OLAP analytics by customer can be done weekly/monthly for evaluation of loyalty patterns. Data mining queries can be done monthly/ad hoc to identify customers who may churn or be open to promotions.
Each of these is potentially a different dimension design. While you do want to be concerned about performance, in my opinion your first concern should be meeting your business requirements.
Larry
Larry
October 9, 2007 at 10:48 pm
I would agree with Larry, the key here is business requirements - what question are you trying to answer?
Joe
October 10, 2007 at 4:15 pm
Lets analyze two possible scenarios:
1- Analysts do need to browse to the card number or cardholder level when creating or modifying anti-fraud rules. Hence, I need to store the data at the most granular level, which is the card number (this is not my scenario, but I am very interested in how it could be designed)
2- Analysts do NOT need to go down to those levels. Cart type would do for them. However, I still need to support relational reports involving the card number and cardholder data (this is my scenario)
How would you suggest I deal with each of them?
I really appreciate your help on this. I have been struggling to come up with a "clean" design for some time now.
// jcoira
October 11, 2007 at 8:39 am
With each fact row representing a credit card tender, there is nothing wrong with storing both a tender type FK and a tender (card number) id FK on the fact row. Whether or not you create an OLAP dimension for the card depends upon usage. I personally recommend storing a surrogate key only and not the card number on the fact row. My reasons have to do with PCI compliance - let whatever cardnumber masking/encryption/hashing you are doing happen on a narrow table that is not heavily used.
CREATE TABLE TenderFact (
TranId int NOT NULL, --could be surrogate key or
--whatever fields it takes to uniquely id a tran
TranDateId int NOT NULL, --HHMM-level analysis beyond scope of this reply
TenderId int NOT NULL, --the credit card FK
TenderTypeId int NOT NULL, --Visa, MC, etc
TenderExpirationDateId int NOT NULL,
TenderAmount money NOT NULL,
ServicerTranId varchar(xx) NULL,
AuthCode varchar(xx) NOT NULL,
AuthType int NOT NULL, --automated, online, outsourced,
--manual, none (accepted without auth)
IsReturnFlag tinyint NOT NULL,
TaxExemptTranFlag tinyint NOT NULL,
DebitCreditOptionFlag tinyint NOT NULL,
--other flags of interest
--Processing fee-related fields
AddDate datetime,
AddSourceKey ...
)
For scenario 1, I claim you do not need to use a standard OLAP dimension to browse to the card number. What you are talking about is data mining. You are going to set up your "cases" specific to mining and train your model accordingly. DMX queries would then be used to ask the likelihood a particular transaction is fraudulent. For users manually "reviewing fraud rules" I can see browsing some transactions relationally, but not needing an OLAP dimension to be able to peruse all cards. The fact rows having a surrogate foreign key for the card number likely works just fine, unless you are trying to detect fraud for "cards with the following digits in them" or other similar analysis. I don't see enough value for random, occasional searches to justify a dimension with millions of members exposed. YMMV.
For scenario 2, the fact row design could be the same and still support the stated reporting need. But your other relational uses of the data may push your design in a different direction. Again - what are your other requirements?
I guess I don't see the issue. Yes, I would worry about end-user performance browsing a credit card dimension in ProClarity/pivot tables, etc. But considering that PCI compliance issues would likely prevent a normal end-user from seeing the entire card number anyway, the point may be moot. You might as well just display the surrogate key. Should you choose to display a high cardinality dimension in general, see if there are ways to use groupings (manual in 2000, somewhat automated in 2005).
Am I missing something else in your question?
Larry
Viewing 8 posts - 1 through 7 (of 7 total)
You must be logged in to reply to this topic. Login to reply