July 30, 2012 at 2:04 am
Hi all,
I have to perform some joins between a Fact table (with millions of records) and some contextual tables. My problem is that some tables have an auto incremental field as the PK, but this key is not useful for join operations. The business key is stored in a string field; moreover, this field is not defined as unique and is a String field. When the value of the field is unknown you get a "U". In the following image an example of this situation can be seen:
My first approach was to create new tables with the business key as the index and an ETL to load this table, but the managers of the project have discard this option and I need a clever solution to overcome this drawback.
Any comments or suggestions will be appreciated.
Kind Regards,
Paul
July 30, 2012 at 2:44 am
Hi Paul
Apart from the fact that a simple join between the two tables on konsumfeld
SELECT ...
FROM Tableb b
LEFT JOIN Tablea a ON a.konsumfeld = b.konsumfeld
would eliminate rows from tableb where konsumfeld = 'U', I can't see what the problem is. Why would you want to make another copy of the table in order to index konsumfeld?
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
July 30, 2012 at 3:36 am
Hi Chris,
Thanks so much for your quick answer.
My problem is I have to perform 3 Joins with 2 tables with the same situation and 1 table with a numeric key. I don't have control over the source tables, but the business keys always hold numeric values and they are defined as Strings. For a fact table with 482.866 records the query takes about 6 minutes to execute. I don't know how powerful the server is but is this a reasonable amount of time? This fact table contains only delta data (records for 1 day) but imagine the historical table that has more than 165 million of rows.
July 30, 2012 at 4:04 am
You're welcome Paul.
paul.pauldj54 (7/30/2012)
...the business keys always hold numeric values and they are defined as Strings...
A column containing values with leading zero's can only be a string.
...For a fact table with 482.866 records the query takes about 6 minutes to execute. I don't know how powerful the server is but is this a reasonable amount of time? ...
I'd say this is slow. Can you post the actual execution plan of the query? Save it as a .sqlplan file and post it here as an attachment.
Is the business key column the same datatype in all of the tables?
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
July 30, 2012 at 5:39 am
Hi Chris,
Attached you can find the execution plan (I've never used this feature, it seems to be very cool).
If you look at the first figure (from my 1st post) there is a column called "KUNDE_KEY" which is a business key, and a numerical field.
You are absolutely right, leading-zero values must be strings, I haven't notice 😀
Now I'll dig deep into execution plans :alien:
Thanks for your time,
Paul
July 30, 2012 at 5:44 am
Hi Paul, can you post the Actual plan please, rather than the estimated plan? There are often differences between the two - for instance, the estimated plan you've just posted indicates that your output from this query is 56 million rows.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
July 30, 2012 at 7:45 am
Hi Chris,
Attached the Actual Excecution Plan,
For other readers, I found this link very useful to understan execution plans from SQL SERVER: http://www.simple-talk.com/sql/performance/execution-plan-basics/
Kind Regards,
Paul
July 30, 2012 at 8:33 am
Hi Paul, thanks for the plan.
Points:
1. No clustered indexes on any of the 4 tables, a suitable clustered index on DeltaFAKT_DETAIL would probably help the expensive SORT in the plan. I'd start with the following columns in that clustered index, in the order shown:
FD.FILIALE_KEY, -- join column
FD.BRANCHE, -- join column
FD.KONSUMFELD, -- join column
FD.KUNDE_KEY,
FD.EK_DATUM,
FD.ABT,
FD.UN_3
I'd also change the GROUP BY to match this order:
GROUP BY
FD.FILIALE_KEY, --
FD.BRANCHE, --
FD.KONSUMFELD, --
FD.KUNDE_KEY,
FD.EK_DATUM,
FD.ABT,
FD.UN_3,
FI.FILIALE,
FI.FILIALE_KTEXT,
FI.FILIALE_VERKAUFSREGION,
FI.FILIALE_SAP_KEY,
BR.BRANCHE,
BR.BRANCHE_BEZ,
BR.BRANCHE_KBEZ,
KON.KONSUMFELD,
KON.KONSUMFELD_BEZ,
KON.KONSUMFELD_KBEZ;
Indexes on the other tables would promote more efficient joins, but their rowcounts are so low that the improvement might not be noticeable.
2. There's a 1:1 join between FILIALE and DeltaFAKT_DETAIL; however the 482,866 row product of this join, once joined with BRANCHE, has 10x the number of rows. Is this correct? Is there a join predicate missing here?
3. Likewise with the join to table KONSUMFELD; the rowcount goes up by a factor of 10.
4. The final output set of 537,859 rows is sufficiently similar to the rowcount of DeltaFAKT_DETAIL to raise an eyebrow.
I'd begin by checking that the join predicates are correct.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
July 30, 2012 at 8:46 am
Chris, you're simply amazing!!!
I've learned a lot from you today.
I'll study carefully your last reply, implement the changes and then write back to the forum to share my results.
Kind Regards,
Paul
July 30, 2012 at 9:03 am
That's very kind Paul but I'm just a jobbing TSQL developer! Thanks anyway.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
July 31, 2012 at 6:55 am
Hi Chris:
I've found one of the problems. I was joining data from BRANCHE and KONSUMFELD using the columns branche and konsumfeld respectively. These columns are not unique, that's the reason for the increment in the row numbers of the resultset. After some tests I realize that there is another column, named UN_3, which should be use together with branche (the same case for konsumfeld) to obtain a unique key.
This database was created by a 3rd party and is still in a development phase and I don't have enough information and metadata. At the moment I have to discover everything by myself (and my best friend, the SQL Management Studio), and my colleagues from sqlservercentral.
I think also that the indexes are disabled because they can cause low performance during loadings.
Thanks for your help
Kind Regards
July 31, 2012 at 7:03 am
Excellent investigative work, Paul.
It's not uncommon for indexes to be disabled during loads - but best practice is to enable them again afterwards 😉
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Viewing 12 posts - 1 through 11 (of 11 total)
You must be logged in to reply to this topic. Login to reply