Can this Query be tuned? God knows!!!

  • There is a clustered index on Termination_Date; the reason is that DailyFeed20060801 needs to be joined to the temp table, and the temp table itself also has an index on it.

     

     

     

    INSERT DailyFeed20060801
           (tbMRFileStatus_id, CDRFingerPrint, EventType, StartDate, StartTime,
            UserProvCPN, RouteIg, RouteOg, CalledPN, CallingPN, ElapsedTime,
            ContCalledNum, CallRedirInd, CalledPNType, TelServInd, SwitchName,
            TypeOfCall, Termination_Date, Location_id, TelephoneIndex_id,
            Switch_id, ClientRecognitionMethod, Client_id, Rate_id,
            ClientService_id, Overhead, OverheadType, Overhead_id, Charge,
            ChargeType, ChargePrice_id, Invoice_id, Status, MinimumCost,
            CallAddition, Export_id, ATelephoneIndex_id, ALocation_id,
            Rate1Duration, Rate1Charge, Rate2Duration, Rate2Charge,
            Rate3Duration, Rate3Charge)
    SELECT p.tbMRFileStatus_id, p.CDRFingerPrint, p.EventType, p.StartDate,
           p.StartTime, p.UserProvCPN, p.RouteIg, p.RouteOg, p.CalledPN,
           p.CallingPN, p.ElapsedTime, p.ContCalledNum, p.IDACode,
           p.CalledPNType, p.TelServInd, p.SwitchName, p.TypeOfCall,
           CONVERT(DATETIME, p.Termination_Date), p.CULocation_id,
           p.TelephoneIndex_id, p.Switch_id, p.CUClientRecognitionMethod,
           p.CUClient_id, p.CURate_id, p.CUClientService_id, p.CUOverhead,
           p.CUOverheadType, p.CUOverhead_id, p.CUCharge, p.CUChargeType,
           p.CUChargePrice_id, p.CUInvoice_id, p.CUStatus, p.CUMinimumCost,
           p.CUCallAddition, p.CUExport_id, p.CUATelephoneIndex_id,
           p.CUALocation_id, p.CURate1Duration, p.CURate1Charge,
           p.CURate2Duration, p.CURate2Charge, p.CURate3Duration,
           p.CURate3Charge
      FROM #CDRDataProcess p
      LEFT JOIN DailyFeed20060801 ToInsertTable
             ON p.Termination_Date = ToInsertTable.Termination_Date
            AND p.EventType        = ToInsertTable.EventType
            AND p.CDRFingerPrint   = ToInsertTable.DAILYFingerPrint
     WHERE ToInsertTable.Termination_Date IS NULL
       AND p.EventType & 4 <> 0
       AND p.tbMRFileStatus_id = 974618  -- this value changes on every run
       AND RTRIM(CONVERT(VARCHAR(10), p.Termination_Date, 112)) = '20060801'
     ORDER BY p.Termination_Date, p.EventType, p.DAILYFingerPrint

     

     


    Kindest Regards,

    John Burchel (Trainee Developer)

  • Sure it can be tuned. A couple of things.

    The LEFT JOIN on DailyFeed20060801 ToInsertTable can be a problem. Figure out a way to write this so you don't need the left join; try a NOT EXISTS in the WHERE clause instead (see the sketch after the date snippet below).

    Also, applying a CONVERT function to a column in the WHERE clause can be a problem, because it prevents the optimizer from using an index seek on that column.

    Try this

    where p.Termination_Date >= convert(datetime, '20060801') 

      and p.Termination_Date < dateadd(dd,1,convert(datetime, '20060801'))
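    And here is a minimal sketch of the NOT EXISTS rewrite (the alias d is mine, and the elided select list is the same as in the original query):

    SELECT ...  -- same select list as the original query
      FROM #CDRDataProcess p
     WHERE NOT EXISTS (SELECT 1
                         FROM DailyFeed20060801 d
                        WHERE d.Termination_Date = p.Termination_Date
                          AND d.EventType        = p.EventType
                          AND d.DAILYFingerPrint = p.CDRFingerPrint)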

  • The problem is that '20060801' equates to the current date, so it's not a constant.

    Also, the query is a dynamic one: the date could just as well be '20060729', and it can move either forwards or backwards.

     


    Kindest Regards,

    John Burchel (Trainee Developer)

  • John,

    What format is the p.Termination_Date column in?  Because it's a CDR, I'm thinking it's probably in the YYMMDD format but want to find out for sure before I spend any time on this one...

    {edit} I also need to know why you are joining the "p" table to the same table you are inserting into...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff

    Thanks for your interest.

    Yes, the format of Termination_Date is YYMMDD.

    Also, I have had to join on the table itself because I don't want to insert what has already been inserted. The data comes from a raw text file, so I have to make checks to ensure it isn't already in the table.

    The other way is to insert the contents of DailyFeed20060801 into a temp table and then do a left join on that table, but the problem is that DailyFeed20060801 might already contain 2 million records. Do I really want to be putting 2 million records into a temp table every 5 minutes?

    Thanks in advance


    Kindest Regards,

    John Burchel (Trainee Developer)

  • I suspect that part of the reason it's taking so long is the self join during the insert... the criteria may have to be reevaluated after every record is inserted (I've seen it happen).  One way to avoid the need for the join would be to put an index on the daily feed table that requires UNIQUE records for the columns in the join but is also set up to ignore dupes, which would eliminate the need for the left outer join on the big table altogether.  That still might be a bit slow, though... more on that further down.

    In the meantime, Ray's good method for checking the Termination_Date column still stands as a great idea, but with a few changes to accommodate the YYMMDD format and to automatically accommodate today's date... this will also make the comparison match by datatype, which may allow your index to be used (and I would also expand that index to include the other criteria columns)...

    where p.Termination_Date >= CONVERT(CHAR(6), GETDATE(), 12)
      and p.Termination_Date <  CONVERT(CHAR(6), GETDATE() + 1, 12)

    For more flexibility, you could declare a variable at the beginning called something like @DesiredDate DATETIME and set it in one spot, either to GETDATE() or to the desired date you want to work with.  Then, the above snippet would look like...

    where p.Termination_Date >= CONVERT(CHAR(6), @DesiredDate, 12)
      and p.Termination_Date <  CONVERT(CHAR(6), @DesiredDate + 1, 12)
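    For completeness, a hypothetical declaration to go with that snippet:

    DECLARE @DesiredDate DATETIME
    SET @DesiredDate = GETDATE()  -- or whichever date this run should process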

    The other thing that is likely just crushing your performance is the ORDER BY.  Sorting is absolutely one of the most expensive things you can possibly do.  And, guess what? You have nothing in the daily feed table to preserve that order because your clustered index doesn't cover the other two columns.

    I don't know how many rows a daily feed table will eventually have in a given day, nor how many dupes you could have in a given day (I certainly understand you will have dupes because of the nature of CDRs, but the number should be small), but a really good way to get some speed out of this is to get rid of the join altogether and forget about the dupes, temporarily.  The key is to capture the data in a permanent working table first.  At the end of the day, you could simply do a SELECT DISTINCT/INTO into a more permanent final table.  Even with as many as ten million records, that would be a lot faster than trying to do a self outer join on the table you're trying to insert into, and it would also (maybe) be faster than the idea of using a UNIQUE IGNORE DUPES index on the daily feed table.
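    Something like this, with hypothetical table names (SELECT...INTO creates the target table, so it must not already exist):

    SELECT DISTINCT *
      INTO DailyFeedFinal   -- hypothetical final table
      FROM DailyFeedWork    -- hypothetical permanent working table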

    If you are taking a feed from a switch produced table that grows every day and resets at the end of the day, not considering the dupes may not be practical because you'll end up with millions of dupes, as you said.

    Lemme know what the condition of your source data is so we can figure out the best way to handle potential dupes... is it a daily growing table?  Is it a table that holds a certain number of records and only the oldest records are discarded as new records arrive (many switches do it this way)? 

    In the meantime, like I said, the clustered index on your target table doesn't really preserve the order you imply with your ORDER BY... I'd get rid of the ORDER BY and change the Termination_Date criteria as I suggested.  Once that's done, you may want to run the SELECT through an Estimated Execution Plan and, possibly, through the Index Tuning Wizard.  The Tuning Wizard isn't a panacea but it sometimes helps.

    I gotta go to work... I'll take another peek tonight...

    --Jeff Moden



  • Hi Jeff,

    Thanks for your reply.

    To answer some of your questions: there is a unique clustered primary key index on Termination_Date, so I believe this would filter out dupes.

    Also, the way the table works is this.

    A BCP job fetches data from disk to a temporary table.

    That temporary table is called CDRDataProcess.

    So potentially there could be 20,000 text files to be BCP'ed into this table. Once the BCP'ed data goes into CDRDataProcess, it is inserted into the daily table; then the next text file is BCP'ed in and inserted into the DailyFeed table, and so on.
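    For illustration, a hypothetical bcp invocation for one of those files (the database name, server name, and file path are made up; note that bcp can only load a permanent staging table, not a session-scoped #temp table):

    bcp MyDb.dbo.CDRDataProcess in C:\feeds\cdr_0001.txt -c -T -S MYSERVER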

    Bearing this in mind, how do you think it can best work?

     


    Kindest Regards,

    John Burchel (Trainee Developer)

  • The following code tells me that your clustered index on Termination_Date isn't covering enough of the columns... if it were, then why are you joining on EventType and CDRFingerPrint as well?  If your index only excludes records by termination date, couldn't you have CDRs with the same termination date that still need to be inserted because of differences in EventType and CDRFingerPrint?  Also, a clustered index by itself doesn't exclude any data... it must be a UNIQUE index as well... that's why I think it needs to be 3 columns...

    FROM #CDRDataProcess p
    LEFT JOIN DailyFeed20060801 ToInsertTable
           ON p.Termination_Date = ToInsertTable.Termination_Date
          AND p.EventType        = ToInsertTable.EventType
          AND p.CDRFingerPrint   = ToInsertTable.DAILYFingerPrint

    Also, if you have a clustered index and things are a bit out of order, it will take extra time to do the insert.  So will an ORDER BY.  I believe that if you insist on deleting dupes during the INSERT and having an index, then you need to make it a non-clustered UNIQUE with IGNORE DUPES on the 3 columns in the join above (see the sketch below).  Then, get rid of the join and the ORDER BY and let it work.
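    A sketch of such an index (the index name is made up; this is the SQL Server 2000 syntax, while 2005 and later write it as WITH (IGNORE_DUP_KEY = ON)):

    CREATE UNIQUE NONCLUSTERED INDEX IX_DailyFeed_NoDupes
        ON DailyFeed20060801 (Termination_Date, EventType, DAILYFingerPrint)
        WITH IGNORE_DUP_KEY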

     

    I guess my big question would be: why don't you load all 20,000 files into a single table (CDRDataProcess) all at the same time, with an ignore-dupes index?  Also, most indexes only help a bit if you have no primary key... what would the primary key on this table be?  THAT should be the WITH IGNORE DUPES index, and PLEASE don't make it a clustered one... that's only going to slow things down due to the unordered nature of the input data.

     

    I'm thinking that you didn't really read my last post... You still owe me some info... what's the max number of rows that the daily feed table should get to and if we let the dupes occur, how many dupes would it contain?  AND, what column or columns would you use as a PRIMARY KEY?  We have to know the condition of the data before we proceed with how to handle it.

     

     

    --Jeff Moden



  • Hi Jeff

    The column CDRFingerPrint is the primary key, and it also holds the clustered index.

    The index specification is as follows

    PK_DailyFeed20060801 | clustered, unique, primary key | Termination_Date, EventType, CDRFingerPrint

    So the 3 key columns together are unique, form the clustered index, and are also the primary key of the table.

    The maximum number of rows in the table can get up to 10 million. As for dupes, it's hard to tell, because the data that goes into this table comes from raw text files, so one has to count the table.

    As the 3 columns above are the primary key, they have a clustered index on them by default.
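    For reference, that specification corresponds to a definition along these lines (a reconstruction, not the actual DDL):

    ALTER TABLE DailyFeed20060801
      ADD CONSTRAINT PK_DailyFeed20060801
          PRIMARY KEY CLUSTERED (Termination_Date, EventType, CDRFingerPrint)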

    Any more info, let me know.

     


    Kindest Regards,

    John Burchel (Trainee Developer)

  • John,

    Just one more question... is the CDRFingerPrint unique no matter how many CDR's you load?

    --Jeff Moden



  • Hi Jeff,

    Yes, CDRFingerprint is unique.

    Thanks in advance


    Kindest Regards,

    John Burchel (Trainee Developer)


  • Crud, I asked the wrong question...

    If the CDRFingerPrint is duplicated, does that give an indication that the same CDR is in the temp table more than once? 

    --Jeff Moden



  • I guess that doesn't even matter...

    Whatever you use to uniquely identify a CDR with (1 column or multiple columns), here's what I'd do...

    First, I just don't understand the need for "daily feed" tables that are named after dates... in the call accounting packages I wrote in the past, I would load the raw CDR files into a temp table, just as you have.  I'd add an extra column called "Keeper" and set it to 1 for every record.  Then, I'd do a single update/join to my 90-day table using an INNER join and mark all the records in the temp table it found with Keeper = 0.  Then, I'd simply insert all the records from the temp table that had Keeper = 1.  (See the sketch below.)
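    A minimal sketch of that mark-and-insert pattern, using the tables from this thread (the Keeper column is the placeholder described above, and the elided column lists are the ones from the original INSERT):

    -- flag every incoming row as a keeper
    ALTER TABLE #CDRDataProcess ADD Keeper BIT NOT NULL DEFAULT 1

    -- un-flag anything already present in the target table
    UPDATE p
       SET Keeper = 0
      FROM #CDRDataProcess p
     INNER JOIN DailyFeed20060801 d
        ON d.Termination_Date = p.Termination_Date
       AND d.EventType        = p.EventType
       AND d.DAILYFingerPrint = p.CDRFingerPrint

    -- insert only the rows still flagged
    INSERT DailyFeed20060801 (...)  -- same column list as the original INSERT
    SELECT ...                      -- same select list as the original SELECT
      FROM #CDRDataProcess p
     WHERE p.Keeper = 1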

    In other words, if you want to make this fast, stop trying to tune one big assed query... break it into smaller parts.  You'll be amazed at how breaking a large problem into several smaller steps will increase the performance of your queries.  Trying to check for dupes AS you do an insert is "Death by SQL"... check the records for dupes first and then mark the dupes.  Then, insert everything that isn't marked as a dupe.  It'll run a LOT faster for multiple reasons.

    You can do the same with your daily feed tables, but that's another story... you saw what I posted on your other posting... I don't believe there's a need for tables that are date-named.  If you need to create a daily feed table just to trim down what you need to work with, create it without a date, copy the required data from your big table (it only takes about 2.5 minutes for 10 million rows), add an index or two, work with the data, create your export files, and then drop the table.  And use the archiving method I proposed in the other post...

    --Jeff Moden



  • Thanks for your reply Jeff. With regards to naming the tables daily, the problem is that we need to know what transactions came in on a daily basis; that's the reason the tables are named by day.

    Also, we use a reporting platform in which you can query on a daily, weekly, or monthly basis. These tables also come in different types, received daily:

    DailyFeedtypeA20060805

    DailyFeedtypeB20060805

    DailyFeedtypeC20060805

    So the idea of having one table isn't going to work. These 3 types of tables come to around 25 GB in size, my maintenance jobs run on the daily tables rather than on one big table, and the daily tables also reside in filegroups according to the month.

    E.g. Filegroup200608 for August, Filegroup200607 for July, etc.

    I am not looking for wholesale changes, just a quick fix. I look forward to hearing from you.


    Kindest Regards,

    John Burchel (Trainee Developer)
