April 3, 2022 at 12:47 am
Overview:
Table D (20 mil records, 200 columns)
Table S (400k records, 4 columns)
on coalesce(D.[O_NAME_04],'') = coalesce(S.[O_NAME_04],'')
And coalesce(D.[O_NAME_05],'') = coalesce(S.[O_NAME_05],'')
And coalesce(D.[O_NAME_06],'') = coalesce(S.[O_NAME_06],'');
All the above columns may contain null values
This query takes at least 8-9 hours to produce 355,753,582 rows.
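For context, the ON clause above presumably belongs to a join shaped roughly like this (only the ON clause and the table sizes are from the post; the select list, schema, and table names are assumptions):

```sql
-- Hedged sketch of the offending join; select list and table names assumed.
SELECT D.*, S.*
FROM dbo.D AS D            -- 20M rows, 200 columns
JOIN dbo.S AS S            -- 400k rows, 4 columns
  ON  COALESCE(D.[O_NAME_04], '') = COALESCE(S.[O_NAME_04], '')
  AND COALESCE(D.[O_NAME_05], '') = COALESCE(S.[O_NAME_05], '')
  AND COALESCE(D.[O_NAME_06], '') = COALESCE(S.[O_NAME_06], '');
```

Note that wrapping the join columns in COALESCE makes the predicates non-SARGable (indexes on the raw columns can't be seeked) and, as discussed below, treats NULL and '' as the same matching value.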
About the environment
Microsoft SQL Server 2016 (SP1) (KB3182545) - 13.0.4001.0 (X64) Oct 28 2016 18:17:30 Copyright (c) Microsoft Corporation Enterprise Edition: Core-based Licensing (64-bit) on Windows Server 2012 R2 Standard 6.3 <X64> (Build 9600: ) (Hypervisor)
64GB RAM
HD: TBs of free space
What I have done so far:
1. To avoid the COALESCE function, I introduced 3 new columns (bigint)
2. Created a clustered index on them
3. Joined on these 3 new columns
Still no change in performance; if anything, it is taking more time.
Any suggestions to cut down the run time?
Thanks in advance
RR
April 3, 2022 at 1:45 am
All those NULLs and blanks that you're joining on? Yeah... guess what they're doing? A MASSIVE accidental CROSS JOIN, more politely called a "many-to-many" join, on those alone.
Second, are the name "triplets" in either of the two tables actually unique within those tables?
Other than that, without you posting the CREATE TABLE and CREATE INDEX statements for both tables, telling us what you're trying to do, and posting the rest of the code for the offending query, there's not much else we can do to help.
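A quick way to size the problem (column names from the post; table names assumed): count how many rows on each side have all three keys NULL or blank. The join then produces (D blanks) × (S blanks) rows from the blank "key" alone.

```sql
-- How many rows on each side have all three join keys NULL or blank?
SELECT 'D' AS Tbl, COUNT(*) AS BlankKeyRows
FROM dbo.D
WHERE COALESCE([O_NAME_04], '') = ''
  AND COALESCE([O_NAME_05], '') = ''
  AND COALESCE([O_NAME_06], '') = ''
UNION ALL
SELECT 'S', COUNT(*)
FROM dbo.S
WHERE COALESCE([O_NAME_04], '') = ''
  AND COALESCE([O_NAME_05], '') = ''
  AND COALESCE([O_NAME_06], '') = '';
```

For example, if 10,000 rows in D and 300 rows in S have all-blank keys, those rows alone contribute 3,000,000 rows to the result.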
--Jeff Moden
Change is inevitable... Change for the better is not.
April 3, 2022 at 11:54 am
Create two temporary tables which are copies of D and S but with the additional columns of [O_NAME_04_NOT_NULL], [O_NAME_05_NOT_NULL], [O_NAME_06_NOT_NULL] , then add a clustered index on the name columns.
e.g. for table D:
SELECT *,
coalesce(D.[O_NAME_04],'') AS [O_NAME_04_NOT_NULL],
coalesce(D.[O_NAME_05],'') AS [O_NAME_05_NOT_NULL],
coalesce(D.[O_NAME_06],'') AS [O_NAME_06_NOT_NULL]
INTO #D
FROM D;
CREATE CLUSTERED INDEX IX_#D_1 on #D([O_NAME_04_NOT_NULL], [O_NAME_05_NOT_NULL], [O_NAME_06_NOT_NULL]);
Do the same for table S.
Then join these two temporary tables using the _NOT_NULL columns:
ON D.[O_NAME_04_NOT_NULL] = S.[O_NAME_04_NOT_NULL]
AND D.[O_NAME_05_NOT_NULL] = S.[O_NAME_05_NOT_NULL]
AND D.[O_NAME_06_NOT_NULL] = S.[O_NAME_06_NOT_NULL]
Hopefully you will then get a merge join which will be about as fast as it can get.
April 3, 2022 at 5:40 pm
I'm thinking that's still going to result in the massive accidental cross join I previously spoke of, for the rows where all 3 of the join criteria are blank or NULL.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 3, 2022 at 8:50 pm
Thanks Jonathan and Jeff for your guidance and support.
Jonathan: Those 3 newly introduced columns contain KEYs I created, and I'm using them for the join. Yes, it is a clustered index.
Jeff: You are right, it is a massive data explosion.
Let me explain why the data exploded. The main table (join keys + data) is the data table, and the reference table (user details + join keys) is the security table. The goal is to figure out who can view what. The output generates a data set for each user. For example, if you and I are at the same level and have the same privilege to see a data set, this logic will produce that data set for you as well as for me. This is used at the reporting level, with an AD-based filter applied.
This was created four years ago by a consultant; however, it is now slowing down owing to data growth.
What am I doing?
When this came to my attention, I noticed the issues above and introduced two intermediate tables.
Even after that, I wasn't able to find any performance improvement. That's a surprise to me.
Hope I have explained enough!
Thanks & Regards
RR
April 3, 2022 at 9:52 pm
Ok... so the first and very important step is to "know thy enemy". With that, we have to know the "gazintas" for both tables. Run the following query once against each of your two tables. You might want to store the results of each run in a working table so that you can do additional analysis later...
SELECT Name_04, Name_05, Name_06, Cnt = COUNT(*)
INTO dbo.SomeWorkingTable --TODO: Change the table name here
FROM dbo.OneOfTheTwoTables --TODO: Change the table name here
GROUP BY Name_04, Name_05, Name_06
;
Then figure out what it really is that you're after from any group of rows identified in the working table as having more than 1 row in the group.
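For example, once the working table exists, the problem groups can be pulled out directly (working-table name from the query above):

```sql
-- Groups with more than one row: each of these multiplies the row
-- count of every matching group on the other side of the join.
SELECT Name_04, Name_05, Name_06, Cnt
FROM dbo.SomeWorkingTable
WHERE Cnt > 1
ORDER BY Cnt DESC;
```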
--Jeff Moden
Change is inevitable... Change for the better is not.
April 4, 2022 at 2:04 am
to produce 355753582 rows
Producing 355 million rows cannot be fast, no matter what indexes are used in a query.
And any suggestions regarding a COALESCE replacement might be generally correct but are totally useless in this case.
You need to rethink the whole approach to the query.
How is that huge recordset used downstream?
If at the end you need to figure out the set of privileges for a specific user, then you need to build a query for a specific user, not for all of them at once.
Then that indexing and other optimization techniques might come to play.
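As a hedged sketch of that per-user shape (the UserName column, table names, and parameter are assumptions; keep the COALESCE wrappers from the original query only if blank/NULL keys are genuinely meant to match):

```sql
DECLARE @UserName sysname = N'some.user';   -- hypothetical parameter

-- Filter the small security table to one user first, then join out
-- to the big data table; no 355M-row intermediate result.
SELECT D.*
FROM dbo.S AS S
JOIN dbo.D AS D
  ON  D.[O_NAME_04] = S.[O_NAME_04]
  AND D.[O_NAME_05] = S.[O_NAME_05]
  AND D.[O_NAME_06] = S.[O_NAME_06]
WHERE S.UserName = @UserName;
```

With appropriate indexes on the three key columns, this touches only the rows one user is entitled to see.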
_____________
Code for TallyGenerator
April 4, 2022 at 2:03 pm
(1) You need to verify that the 3 values are unique on at least one table; otherwise you will get the equivalent of a CROSS JOIN for all non-unique groups. If the 3 values are not unique in one table, add $IDENTITY to the key to make it unique without SQL having to "uniquify" it.
(2) Are you encoding the char values to convert them to a bigint? (3) I don't see why you need a bigint rather than just an int, given that you have fewer than 60M distinct values.
(4) What is the fragmentation of the tables like?
(5) What is the tables' DDL, the query text and the query plan (estimated is fine for now)?
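On point (1), a sketch of the idea (table and column definitions are assumptions): append the identity column to the clustered index key so the index can be declared UNIQUE and SQL Server doesn't have to add a hidden uniquifier to non-unique clustered keys.

```sql
CREATE TABLE #D
(
    ID      int IDENTITY(1,1) NOT NULL,
    Name_04 varchar(100) NOT NULL,
    Name_05 varchar(100) NOT NULL,
    Name_06 varchar(100) NOT NULL
    -- ... remaining columns ...
);

-- Unique because ID is appended; no hidden uniquifier is added.
CREATE UNIQUE CLUSTERED INDEX IX_D
    ON #D (Name_04, Name_05, Name_06, ID);
```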
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
April 4, 2022 at 8:42 pm
Hey Jeff,
"SELECT Name_04, Name_05, Name_06, Cnt = COUNT(*)"
I executed the above query against the reference (security) table and the findings are below: 1,271 rows in total.
Cnt values: 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 326, 328, 337
April 4, 2022 at 8:49 pm
Hi Sergiy
Thanks for your time. I think so. It's a new learning for me, as the indexed process took more time than the non-indexed query 🙂
The whole idea is to create a data set for each individual user + their access level. The output of these joins brings an additional column, a username, on each row. This username is the key value each user filters on to see only the data they're entitled to.
Thanks & Regards
RR
April 5, 2022 at 3:52 am
The whole idea is to create a data set for each individual user + their access level. The output of these joins brings an additional column, a username, on each row. This username is the key value each user filters on to see only the data they're entitled to.
So, you'd better put those joins into an iTVF (aka "parameterized view") and invoke the function for each particular user/access privilege.
If built correctly, such a function will be lightning fast and very effective.
Building those huge "overall" recordsets is a terrible, though common, approach to this kind of task.
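A hypothetical iTVF sketch along those lines (function name, UserName column, and table names are all assumptions; the three join keys are from the original query):

```sql
-- Inline table-valued function: resolves one user's data set on
-- demand instead of materializing 355M rows for everyone at once.
CREATE FUNCTION dbo.GetUserDataSet (@UserName sysname)
RETURNS TABLE
AS
RETURN
    SELECT D.*
    FROM dbo.S AS S
    JOIN dbo.D AS D
      ON  D.[O_NAME_04] = S.[O_NAME_04]
      AND D.[O_NAME_05] = S.[O_NAME_05]
      AND D.[O_NAME_06] = S.[O_NAME_06]
    WHERE S.UserName = @UserName;
GO
-- Usage from the reporting layer:
-- SELECT * FROM dbo.GetUserDataSet(N'some.user');
```

Because an iTVF is inlined into the calling query, the optimizer can use the tables' indexes directly, exactly as if the per-user WHERE clause had been written inline.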
_____________
Code for TallyGenerator
April 5, 2022 at 10:26 pm
Hey Jeff,
"SELECT Name_04, Name_05, Name_06, Cnt = COUNT(*)"
I executed the above query against the reference (security) table and the findings are below: 1,271 rows in total.
Cnt 299 300 301 302 303 304 305 306 307 308 309 311 312 313 314 315 316 317 318 319 320 321 322 323 326 328 337
So, what is in that table that you need to query all 20 million rows instead of just the 1,271 rows? Keep in mind that you've told us absolutely nothing about what is in the tables other than the 3 columns you're joining on, and you've posted nothing of your query except the join criteria. You've not even explained what exactly it is that you're looking for, other than the very nebulous description of "reference (security) table" and the equally nebulous requirement of "The whole idea is to create a data set for each individual users + their access level."
The code I posted is step 1, and that is to identify the individual users. But, like I said, everything else is a total guess because of how little you've provided. You didn't even say how long the query I provided took to run. :p
--Jeff Moden
Change is inevitable... Change for the better is not.
April 7, 2022 at 8:19 pm
My apologies; COVID delayed my progress.
I'm starting to work on it again slowly.
Regards
RR
April 8, 2022 at 12:44 am
My apologies; COVID delayed my progress.
I'm starting to work on it again slowly.
Regards
RR
You came down with COVID? Sorry to hear and glad you're still around. I know I sound like a Grandma but take care of yourself first and get some good sleep.
--Jeff Moden
Change is inevitable... Change for the better is not.