August 5, 2006 at 7:11 pm
Thanks for your reply Jeff. With regards to naming files daily, the problem is that we need to know what transactions came in on a daily basis; that's the reason why the tables are named daily. Also, w...
That's what Celko and I have been talking about... that's what datetime columns are for. But, forget about that... as you said, that would be a major rewrite.
Anyway, the short fix is as I said previously... make a unique index with the ignore dupes option on whatever columns in your temp table that make a CDR unique. As you BCP the data in, duplicate rows will be treated as errors which, if you want, can actually be squirreled off to an errata file. Check Books on line for the necessary switches...
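For reference, the switch that squirrels rejected rows off to an errata file is -e on the bcp command line. A hypothetical invocation (the server, database, table, and path names here are made up... substitute your own) would look something like this:

```
bcp MyDatabase.dbo.TempCDRTable in "\\server\share\cdrfile.dat" -c -t "~" -e "\\server\share\cdr_errors.log" -S MyServer -T
```

The -c and -t switches match the character format and "~" field terminator you're already using; -e names the file where error rows (including dupes rejected by the unique index) get logged, ready to be imported into a side table if you want them.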
OR... If you don't want to do it on the temp table, you can add the unique/ignore dupe index on the daily feed table... it'll just reject duplicate records without causing the whole batch to abort. It'll let you know if dupes were ignored for each batch but it won't abort.
Then, change your query just to do a joinless INSERT/SELECT from one table to the other. I tried it with a table more narrow than yours but it inserted 10.1 million records, 100,000 of which were dupes (and properly rejected), in just over 7.5 minutes. It'll do 10,000 records in the blink of an eye. And that's on my destop single processor box with IDE drives... imagine what it'll do on a properly configured server... it'll fly.
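If it helps to see the shape of it, here's a sketch of the pattern with made-up table and column names (substitute whatever columns actually make a CDR unique):

```
--===== Unique index with ignore-dupes on the target... names are hypothetical
CREATE UNIQUE INDEX UIX_DailyFeed_IgnoreDupes
    ON dbo.DailyFeedStage (CDRFingerPrint, EventType, Termination_Date)
  WITH IGNORE_DUP_KEY

--===== Straight, joinless copy... dupes are rejected row-by-row, the batch keeps going
 INSERT INTO dbo.DailyFeedStage (CDRFingerPrint, EventType, Termination_Date)
 SELECT CDRFingerPrint, EventType, Termination_Date
   FROM #TempLoaderTable
```

That's the whole trick... the index does the dupe checking for you so the INSERT/SELECT needs no join at all.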
That's all you need to do...
--Jeff Moden
Change is inevitable... Change for the better is not.
August 6, 2006 at 1:58 pm
Hi Jeff,
Thanks for your reply.
If you look at the code I initially posted, you will see that the insert happens in chunks. In the original code I sent, there is a predicate like this:
p.tbMRFileStatus_id = 974618
That value changes all the time, so WHERE p.tbMRFileStatus_id = 974618 could mean that 2000 records get inserted into the daily table, while on the next iteration of the cursor p.tbMRFileStatus_id could be 974619, which in turn could mean that 1000 records get inserted into the daily table from the temporary table.
Also, this is how the temp table is created and loaded. This is the actual code behind it:
BULK INSERT #temPLOADERTABLE
FROM '\\FileLocation'
WITH (FIELDTERMINATOR = '~', DATAFILETYPE = 'CHAR', ROWTERMINATOR = '\n', KEEPIDENTITY, TABLOCK)

ALTER TABLE #temPLOADERTABLE
ADD CONSTRAINT Indexx_temptable PRIMARY KEY CLUSTERED (SeizeDate, EventType, CDRFingerPrint)
Having seen this code, please let me know where your suggestions fit in. If you can also provide code samples, that would be really helpful, as I find it easier to decode code than explanations! Weird, I know.
Thanks
August 6, 2006 at 2:10 pm
That's why I did a test on 10 million records... your chunks will be easy.
Your temp table load is fine...
I've suggested it 3 times now, John... for the temp table to daily feed table copy, remove the clustered primary key... it's just slowing you down... add a non-clustered unique index with the ignore-dupes option on the columns that make a CDR unique. Change your query to a straight insert with no joins.
If you can't write the query, lemme know...
--Jeff Moden
Change is inevitable... Change for the better is not.
August 6, 2006 at 4:46 pm
August 6, 2006 at 6:23 pm
Ok... here we go... let's NOT do it on the BULK INSERT side... let's do it where you tried to insert into the daily feed file from the temp table as in your first post...
I don't know what you have that creates the daily feed file(s), but the index you want on them is as follows... and, whatever you do... DO NOT PUT A CLUSTERED INDEX OF ANY TYPE ON THE DAILY FEED FILES!!! IT WILL KILL THE PERFORMANCE ON THESE HIGHLY TRANSACTIONAL / FRAGMENTED TABLES DURING INSERTS!!! If you add a Primary Key, make sure that you don't add it until you are done loading ALL data from ALL files into a daily feed table. ONLY THEN SHOULD YOU ADD A CLUSTERED KEY!!!
Here's what the NON-CLUSTERED UNIQUE WITH IGNORE DUPES index looks like...
CREATE UNIQUE INDEX UIX_IgnoreDupes
ON dbo.insertdailyfeedtablenamehere (DAILYFingerPrint,EventType,Termination_Date)
WITH IGNORE_DUP_KEY
GO
With a little forethought on your part, you might even be able to make it dynamic in a manner similar to what I've done below...
Here's a stored proc to do your insert into the daily feed table, whatever it is... please notice that there are two parameters you can pass... one for the value of tbMRFileStatus_id and one for the date part of the date-named daily feed table. If you leave the second parameter off, it will default to whatever today is.
Note that dynamic SQL is required (no offense... well, maybe to the guy that designed such a woeful system) because your daily feed table name changes on a whim and at least once a day.
CREATE PROCEDURE dbo.AddToDailyFeed
@tbMRFileStatus_id INT, -- This first parameter is required.
@DesiredDate DATETIME = NULL -- Optional parameter... defaults to now (GETDATE() is not allowed as a parameter default, so it's applied below)
AS
--===== Default the optional date parameter to now if it was not passed
IF @DesiredDate IS NULL
SET @DesiredDate = GETDATE()
--===== Declare local variables
DECLARE @DateYYYYMMDD VARCHAR(8) --Date for daily feed file name
DECLARE @DateYYMMDD VARCHAR(6) --Date to compare against Termination_Date
DECLARE @DynSQL VARCHAR(8000) --Holds the dynamic SQL required because of the date-named table (ack!)
DECLARE @MyRowCount INT --Hold resulting rowcount for display feedback
DECLARE @MyError INT --Hold resulting error value for display feedback and error checking by main routine
--===== Set the date variables
SET @DateYYYYMMDD = CONVERT(VARCHAR(8),@DesiredDate,112) --For the file name
SET @DateYYMMDD = CONVERT(VARCHAR(8),@DesiredDate, 12) --For the termination date comparison
--===== Identify the Start of the proc to the output device... whatever it may be
PRINT REPLICATE('=',78)
PRINT 'AddToDailyFeed started at ' + CONVERT(VARCHAR(30),GETDATE(),109)
PRINT 'tbMRFileStatus_id: ' + CAST(@tbMRFileStatus_id AS VARCHAR(10))
PRINT 'Daily feed name: DailyFeed' + @DateYYYYMMDD
PRINT REPLICATE('-',78)
--===== Create the dynamic SQL to support date-named files, etc.
SET @DynSQL = '
INSERT INTO DailyFeed' + @DateYYYYMMDD + '
(
tbMRFileStatus_id,
CDRFingerPrint,
EventType,
StartDate,
StartTime,
UserProvCPN,
RouteIg,
RouteOg,
CalledPN,
CallingPN,
ElapsedTime,
ContCalledNum,
CallRedirInd,
CalledPNType,
TelServInd,
SwitchName,
TypeOfCall,
Termination_Date,
Location_id,
TelephoneIndex_id,
Switch_id,
ClientRecognitionMethod,
Client_id,
Rate_id,
ClientService_id,
Overhead,
OverheadType,
Overhead_id,
Charge,
ChargeType,
ChargePrice_id,
Invoice_id,
Status,
MinimumCost,
CallAddition
,Export_id,
ATelephoneIndex_id,
ALocation_id,
Rate1Duration,
Rate1Charge,
Rate2Duration,
Rate2Charge,
Rate3Duration,
Rate3Charge
)
SELECT
tbMRFileStatus_id,
CDRFingerPrint,
EventType,
StartDate,
StartTime,
UserProvCPN,
RouteIg,
RouteOg,
CalledPN,
CallingPN,
ElapsedTime,
ContCalledNum,
IDACode ,
CalledPNType,
TelServInd,
SwitchName,
TypeOfCall,
CONVERT(DATETIME,Termination_Date) AS Termination_Date,
CULocation_id,
TelephoneIndex_id,
Switch_id,
CUClientRecognitionMethod,
CUClient_id,
CURate_id,
CUClientService_id,
CUOverhead,
CUOverheadType,
CUOverhead_id,
CUCharge,
CUChargeType,
CUChargePrice_id,
CUInvoice_id,
CUStatus,
CUMinimumCost,
CUCallAddition,
CUExport_id,
CUATelephoneIndex_id,
CUALocation_id,
CURate1Duration ,
CURate1Charge ,
CURate2Duration,
CURate2Charge ,
CURate3Duration,
CURate3Charge
FROM #CDRDataProcess --table must already exist before you call this proc!!!!!!
WHERE EventType & 4 <> 0
AND tbMRFileStatus_id = ' + CAST(@tbMRFileStatus_id AS VARCHAR(10)) + '
AND Termination_Date = ''' + @DateYYMMDD +'''' --Quotes required around string date here...
--===== All set, execute the dynamic SQL with the calculated filename and all...
EXEC (@DynSQL)
--===== Capture both the rowcount and the error value, if any
SELECT @MyRowCount = @@ROWCOUNT,
@MyError = @@ERROR
--===== Identify the End of the proc to the output device... whatever it may be
PRINT REPLICATE('-',78)
PRINT 'AddToDailyFeed ended at ' + CONVERT(VARCHAR(30),GETDATE(),109)
PRINT 'Rowcount: ' + CONVERT(VARCHAR(10),@MyRowCount)
PRINT 'Error value: ' + CONVERT(VARCHAR(10),@MyError)
PRINT REPLICATE('=',78)
--===== Exit the proc using the error value (good or bad) as the return code
RETURN @MyError
GO
--===== Grant execute permissions on the new proc to everybody
GRANT EXECUTE ON dbo.AddToDailyFeed TO PUBLIC
GO
If you need more than this, I'm going to have to start charging for production code ... after all, YOU are the "Senior Developer" here...
I gotta say it again just so I can get the bad taste of the vomit code I just wrote for you out of my mouth... the overall design of your system is terrible. Any system that requires daily date named tables is just wrong and the designers of the system need to be made to suffer a public execution. THAT's how bad it is...
--Jeff Moden
Change is inevitable... Change for the better is not.
August 8, 2006 at 4:40 am
Hello Jeff,
Thanks very much for your help. I wondered how much you would charge a Trainee Developer (as you can tell from my code)...
I want to ask one other question: with the IGNORE_DUP_KEY option, how can I write the duplicate entries off to a side table somewhere?
Looking forward to hearing from you.
Thanks
John
August 8, 2006 at 6:03 am
Your humility is surpassed only by your good nature, John. Well done.
The only time you can do something like what you ask automatically is during the BCP process where you can log all errors (dupes, in this case) to a separate file and then import that file to a separate table. Although it's quite effective, I think it would be a bit much for this project with all the date-named files you have and all...
You could do it using a separate INSERT into a separate table either before or after the insert from the temp table to the daily feed table (Insert won't do any logging like BCP can in the presence of an IGNORE_DUPES index). The following select will find the dupes in the temp table that you load the daily feed tables from... Works best if you have a decent index on the temp table that covers the columns in the WHERE clause... Just add it to an INSERT to do what you ask...
SELECT DISTINCT t.*,d.DupeCount
FROM #CDRDataProcess t,
(--==== Derived table "d" finds dupes by key and provides count, as well
SELECT CDRFingerPrint,EventType,Termination_Date,COUNT(*) AS DupeCount
FROM #CDRDataProcess
GROUP BY CDRFingerPrint,EventType,Termination_Date
HAVING COUNT(*) > 1
) d --End derived table "d"
WHERE t.CDRFingerPrint = d.CDRFingerPrint
AND t.EventType = d.EventType
AND t.Termination_Date = d.Termination_Date
Just in case you didn't know... a "derived table" is nothing more than a correctly formed sub-query (NOT a correlated sub-query which can be quite slow) whose return is used as if it were a table. They're pretty easy to troubleshoot compared to correlated sub-queries, as well.
To make one, you write a query that returns what you want correctly, wrap it in parenthesis, give it a table alias ("d" in my example), and put it in a FROM clause of another query as if it were a table. Some folks also call these "inline views" and they can be quite useful in dividing a larger problem into more manageable pieces. They can also be Nasty fast and the sum of the individual run times usually outweighs the conglomerate run time by several times. In other words, the whole runs faster than the sum of the parts, usually.
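For example, here's a bare-bones version of the pattern (the table and column names are made up for illustration):

```
 SELECT c.CustomerID, c.OrderCount
   FROM (--==== Derived table "c" counts orders per customer
         SELECT CustomerID, COUNT(*) AS OrderCount
           FROM dbo.Orders
          GROUP BY CustomerID
        ) c --End derived table "c"
  WHERE c.OrderCount > 10
```

The inner query runs and works on its own; wrapped and aliased, the outer query treats its result set exactly as if it were a table.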
Considering the width of your CDR records, you may want to remove the "DISTINCT" in the code above for the sake of speed. Try it both ways to see which way you want to go...
--Jeff Moden
Change is inevitable... Change for the better is not.
August 10, 2006 at 10:30 am
Hello Jeff / Hello All
Thanks for your last post. I have another quick question for you about the WITH IGNORE_DUP_KEY option: I just realized that, should duplicate values be inserted into a table, the statement terminates. Is there anything one can do to ignore these duplicate values and carry on working?
My concern is that if one was loading data in the middle of the night and an insert failure happened, it would get stuck there until someone intervened.
I look forward to hearing from you.
Thanks
August 13, 2006 at 2:17 pm
Not sure why you think the statement terminates...
--Jeff Moden
Change is inevitable... Change for the better is not.
August 28, 2006 at 3:22 am
Hi All,
I have been running this for some time now and found that sometimes the inserts are a bit slow.
INSERT INTO DailyFeed' + @DateYYYYMMDD + '
(
tbMRFileStatus_id,
CDRFingerPrint,
EventType,
StartDate,
StartTime,
UserProvCPN,
RouteIg,
RouteOg,
CalledPN,
CallingPN,
ElapsedTime,
ContCalledNum,
CallRedirInd,
CalledPNType,
TelServInd,
SwitchName,
TypeOfCall,
Termination_Date,
Location_id,
TelephoneIndex_id,
Switch_id,
ClientRecognitionMethod,
Client_id,
Rate_id,
ClientService_id,
Overhead,
OverheadType,
Overhead_id,
Charge,
ChargeType,
ChargePrice_id,
Invoice_id,
Status,
MinimumCost,
CallAddition
,Export_id,
ATelephoneIndex_id,
ALocation_id,
Rate1Duration,
Rate1Charge,
Rate2Duration,
Rate2Charge,
Rate3Duration,
Rate3Charge
)
SELECT
tbMRFileStatus_id,
CDRFingerPrint,
EventType,
StartDate,
StartTime,
UserProvCPN,
RouteIg,
RouteOg,
CalledPN,
CallingPN,
ElapsedTime,
ContCalledNum,
IDACode ,
CalledPNType,
TelServInd,
SwitchName,
TypeOfCall,
CONVERT(DATETIME,Termination_Date) AS Termination_Date,
CULocation_id,
TelephoneIndex_id,
Switch_id,
CUClientRecognitionMethod,
CUClient_id,
CURate_id,
CUClientService_id,
CUOverhead,
CUOverheadType,
CUOverhead_id,
CUCharge,
CUChargeType,
CUChargePrice_id,
CUInvoice_id,
CUStatus,
CUMinimumCost,
CUCallAddition,
CUExport_id,
CUATelephoneIndex_id,
CUALocation_id,
CURate1Duration ,
CURate1Charge ,
CURate2Duration,
CURate2Charge ,
CURate3Duration,
CURate3Charge
FROM #CDRDataProcess --table must already exist before you call this proc!!!!!!
WHERE EventType & 4 <> 0
AND tbMRFileStatus_id = ' + CAST(@tbMRFileStatus_id AS VARCHAR(10)) + '
AND Termination_Date = ''' + @DateYYMMDD +'''' --Quotes required around string date here...
I was wondering whether it might be worth having an index on #CDRDataProcess; if so, which columns and what kind of indexes would suit here?
September 2, 2006 at 10:39 am
Try running the SELECT portion by itself... if the performance is similar to the performance of the whole INSERT/SELECT, then you might try using the INDEX TUNING WIZARD (and a couple of "swags") to see if the temp table should get an index (you WILL have to temporarily convert the temp table to a perm table to use the tuning wizard on the SELECT portion).
The other thing that could be slowing things down is the set of indexes on the Daily_Feed (target of Insert) table. If you have a clustered index on anything other than an Identity column, that would cause performance issues. Having lots of indexes would also cause performance issues because they all need to be updated on every Insert.
Other problems with performance might be some of the negative logic you've included... for example, this...
WHERE EventType & 4 <> 0
...should be written as this...
WHERE EventType & 4 = 4
...to eliminate the need for the negative logic. (Sorry I missed a couple of these in past postings... was concentrating on a different problem).
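If you want to convince yourself the two predicates return the same rows for a single-bit mask, here's a throwaway test... the derived table just supplies some sample EventType values:

```
 SELECT EventType
   FROM (SELECT 1 UNION ALL SELECT 4 UNION ALL
         SELECT 5 UNION ALL SELECT 6) x (EventType)
  WHERE EventType & 4 = 4 --Positive logic... returns 4, 5, and 6
```

Any row with bit 4 set passes (4, 5, 6); the row with value 1 fails... exactly the same rows the <> 0 version returns, but stated as positive logic.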
--Jeff Moden
Change is inevitable... Change for the better is not.
September 3, 2006 at 12:51 am
September 3, 2006 at 12:53 am
September 3, 2006 at 11:44 am
An hour for 30k records? Obviously, that's not good... I'm thinking that the ORDER BY is what is really slowing things down, but I could be wrong... the UNIQUE IGNORE DUPLICATE key isn't clustered so I'd be a bit surprised if it were that. Still, a different tack may be in order... would you post your latest insert code and I'll take a look at it sometime this weekend? Thanks...
--Jeff Moden
Change is inevitable... Change for the better is not.
September 4, 2006 at 12:17 am
INSERT INTO DailyFeed' + @DateYYYYMMDD + '
(
tbMRFileStatus_id,
CDRFingerPrint,
EventType,
StartDate,
StartTime,
UserProvCPN,
RouteIg,
RouteOg,
CalledPN,
CallingPN,
ElapsedTime,
ContCalledNum,
CallRedirInd,
CalledPNType,
TelServInd,
SwitchName,
TypeOfCall,
Termination_Date,
Location_id,
TelephoneIndex_id,
Switch_id,
ClientRecognitionMethod,
Client_id,
Rate_id,
ClientService_id,
Overhead,
OverheadType,
Overhead_id,
Charge,
ChargeType,
ChargePrice_id,
Invoice_id,
Status,
MinimumCost,
CallAddition
,Export_id,
ATelephoneIndex_id,
ALocation_id,
Rate1Duration,
Rate1Charge,
Rate2Duration,
Rate2Charge,
Rate3Duration,
Rate3Charge
)
SELECT
tbMRFileStatus_id,
CDRFingerPrint,
EventType,
StartDate,
StartTime,
UserProvCPN,
RouteIg,
RouteOg,
CalledPN,
CallingPN,
ElapsedTime,
ContCalledNum,
IDACode ,
CalledPNType,
TelServInd,
SwitchName,
TypeOfCall,
CONVERT(DATETIME,Termination_Date) AS Termination_Date,
CULocation_id,
TelephoneIndex_id,
Switch_id,
CUClientRecognitionMethod,
CUClient_id,
CURate_id,
CUClientService_id,
CUOverhead,
CUOverheadType,
CUOverhead_id,
CUCharge,
CUChargeType,
CUChargePrice_id,
CUInvoice_id,
CUStatus,
CUMinimumCost,
CUCallAddition,
CUExport_id,
CUATelephoneIndex_id,
CUALocation_id,
CURate1Duration ,
CURate1Charge ,
CURate2Duration,
CURate2Charge ,
CURate3Duration,
CURate3Charge
FROM #CDRDataProcess --table must already exist before you call this proc!!!!!!
WHERE EventType & 4 = 4
AND tbMRFileStatus_id = ' + CAST(@tbMRFileStatus_id AS VARCHAR(10)) + '
AND Termination_Date = ''' + @DateYYMMDD +''''
Non clustered Ignore Duplicate key on the 3 key fields ( CDRFingerPrint, EventType, StartDate)
Clustered index on #CDRDataProcess( CDRFingerPrint, EventType, StartDate)