July 11, 2007 at 8:52 am
Hi all,
I am writing a stored procedure to populate a table nightly with approximately 2,700,000 records. Each record has a ranking field and total field that must be populated (e.g. 5 out of 25). The rankings are based on groupings of the records. The groups are zip-sic where sic is a category used in the application. Last time I ran the procedure, it took 8 hrs to finish.
My issue is that when the procedure begins it is much faster. As the cursor steps through the zip-sic combinations it gets slower and slower. During the first minute 74,046 rows were inserted yielding an estimated 36 minutes to completion. With each passing minute this figure decreases thus increasing the time to completion. These are the figures I got for the first 4 minutes:
1- 74,046 (36 min)
2- 59,854 (40 min)
3- 42,197 (46 min)
4- 38,193 (50 min)
It took 57 minutes to insert 979,906 records making the estimated time to completion 157 minutes.
I thought that an index on the insertion table def_rankings, might be the cause if it was continually re-indexing. So I delete the index and then rebuild it at the end of the procedure. That helped but the times above are without the index. My thought is that I am using the temporary tables in the wrong way so that the memory never gets released as the cursor moves forward. Either that or maybe there is some other kind of statistic that is being recalculated with each insert that could be turned off until the cursor is finished. I have also tried getting rid of the cursor and adding an ID to the zip-sic results so that I could loop over them with a while-loop. This actually is slower than the cursor with the first minute completing only 22,878 insertions. And the while-loop still has the problem of slowing down as the procedure continues. I have also tried FORWARD_ONLY STATIC and FAST_FORWARD cursors. They both have the slowing problem.
I know enough to be dangerous, so there may be some obvious corrections in the way I have coded this. Any assistance would be most appreciated. Here is the procedure:
IF EXISTS (SELECT name FROM sysobjects
WHERE name = 'sp_CreateRankings2' AND type = 'P')
DROP PROCEDURE sp_CreateRankings2
GO
/* Stored procedure to create rankings */
CREATE PROC sp_CreateRankings2
AS
--CLEAR OLD RESULTS
truncate table def_rankings
--DELETE THE INDEX FOR FASTER INSERTIONS
DROP INDEX [dbo].[def_rankings].[Index_BE]
--CREATE A TEMP TABLE TO UNIQUELY HOLD THE CITY-STATE FOR EACH ZIP (chooses longest name for each city where duplicates exist)
select *
INTO #temp_map_city2zip
from map_city2zip c2z
where 1=1
and city = (select top 1 city from map_city2zip where zip = c2z.zip order by len(city) desc)
order by zip
--DECLARE MyCursor CURSOR FORWARD_ONLY STATIC FOR
DECLARE MyCursor CURSOR FAST_FORWARD FOR
-- Select all mapped city-zips crossed with all industries
SELECT distinct c2z.zip, zf.sic
FROM #temp_map_city2zip c2z INNER JOIN zipfindzips2 zfz on c2z.zip = zfz.zip
inner join zipfind zf on zfz.adjacentzip = zf.zip
and c2z.zip IN (SELECT zip FROM zipsupdated)
DECLARE @zip varchar(10), @sic int
OPEN MyCursor
FETCH NEXT FROM MyCursor INTO @zip, @sic
WHILE (@@FETCH_STATUS = 0)
BEGIN
--THE FIRST TEMP INSERT IS USED TO GET THE RIGHT ORDER
SELECT zfz.zip, be_no, sic,(points1 + 60-distance) as score, pp_bid, c.city, c.state, distance
INTO #TempRankings2
FROM zipfind z INNER JOIN zipfindzips2 zfz ON z.zip = zfz.adjacentzip
inner join #temp_map_city2zip c on c.zip = zfz.zip
WHERE distance <= 60
AND zfz.zip = @zip
AND z.sic = @sic
ORDER BY pp_bid desc, score desc, distance, be_no
--ONLY THEN CAN WE ASSIGN THE RIGHT RANK
SELECT zip, be_no, sic,score, pp_bid, city, state, distance, identity(int, 1, 1) as rank, 0 as total
INTO #TempRankings
FROM #TempRankings2
--ONLY THEN CAN WE INSERT THE TOTAL NUMBER
INSERT INTO def_rankings (zip, be_no, sic, score, pp_bid, city, state, distance, rank, total)
SELECT zip, be_no, sic, score, pp_bid, city, state, distance, rank, @@IDENTITY
FROM #TempRankings
DROP TABLE #TempRankings
DROP TABLE #TempRankings2
FETCH NEXT FROM MyCursor INTO @zip, @sic
END
DROP TABLE #temp_map_city2zip
--REBUILD THE INDEX
CREATE CLUSTERED INDEX [Index_BE] ON [dbo].[def_rankings]([be_no], [distance]) WITH FILLFACTOR = 90 ON [PRIMARY]
DEALLOCATE MyCursor
GO
Thanks again for any help you can offer.
Sincerely,
Danny
July 11, 2007 at 9:20 am
There is probably a set based solution to this process, but I don't see it yet. One change I think may help is in the declaration of the cursor. Try this and see if it produces the same results for your cursor:
DECLARE MyCursor CURSOR FAST_FORWARD FOR
-- Select all mapped city-zips crossed with all industries
SELECT DISTINCT
c2z.zip,
zf.sic
FROM
#temp_map_city2zip c2z
INNER JOIN zipfindzips2 zfz
on (c2z.zip = zfz.zip )
INNER JOIN zipfind zf
on (zfz.adjacentzip = zf.zip)
INNER JOIN zipsupdated zu
on (c2z.zip = zu.zip)
It eliminates the subquery and may help. All you can do is test, test, and test some more. I am sure others will look at this may provide other alternatives.
July 11, 2007 at 9:20 am
Try to run a profiler when running your procedure in order to see which part of your sp is resource consuming.
Hope this helps
By the way Group by is known to be resource consuming
July 11, 2007 at 9:25 am
Hi Danny,
Another thing that you could try is to created indexes on your temp tables this may help the query joins the tables start to get bigger and bigger.
----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley
:w00t:
Posting Best Practices[/url]
Numbers / Tally Tables[/url]
July 11, 2007 at 9:36 am
Also, If you have to use a cursor (no one comes up with a set-based solution), you should try creating the the two # temp tables you use inside your while loop outside the loop and use insert into and truncate on the tables in the loop, then drop the # temp tables after the loop is done (or just let SQL Server drop them when the stored procedure finishes).
July 11, 2007 at 9:39 am
I would have to agree with Lynn on this one.
How may loops is your cursor doing?
It's not recomended to user cursors all the time as they have been know to lock up memory and records.
----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley
:w00t:
Posting Best Practices[/url]
Numbers / Tally Tables[/url]
July 11, 2007 at 9:50 am
It is desireable to eliminate cursors if possible. Sometimes, however, they are the proper choice for a specific task. Not having your data handy to test against, it is difficult to know. As I look at your procedure I think I see a way to eliminate the cursor, but it may not work in your situation as it appears you may be using SQL Server 2000. If true, are you or your company looking at upgradingto SQL Server 2005 in the foreseable future?
July 11, 2007 at 10:01 am
Thanks! For the great posts. I am using SQL2000 and as far as I know there are no plans to update. The cursor currently loops over about 120,000 records, and that is bound to grow. My non-cursor version looks something like this excerpt:
SELECT distinct c2z.zip, zf.sic, identity(int, 1, 1) as ID
INTO #zipsic
FROM #temp_map_city2zip c2z INNER JOIN zipfindzips2 zfz on c2z.zip = zfz.zip
inner join zipfind zf on zfz.adjacentzip = zf.zip
and c2z.zip IN (SELECT zip FROM zipsupdated)
WHILE EXISTS (SELECT TOP 1 * FROM #zipsic)
BEGIN
SELECT TOP 1 @id = ID FROM #zipsic
--THE FIRST TEMP INSERT IS USED TO GET THE RIGHT ORDER
SELECT zfz.zip, be_no, zs.sic,(points1 + 60-distance) as score, pp_bid, c.city, c.state, distance
INTO #TempRankings2
FROM zipfind z INNER JOIN zipfindzips2 zfz ON z.zip = zfz.adjacentzip
inner join #temp_map_city2zip c on c.zip = zfz.zip
inner join #zipsic zs on zs.zip = c.zip and zs.sic = z.sic and zs.id = @id
WHERE distance <= 60
ORDER BY pp_bid desc, score desc, distance, be_no
--ONLY THEN CAN WE ASSIGN THE RIGHT RANK
SELECT zip, be_no, sic,score, pp_bid, city, state, distance, identity(int, 1, 1) as rank, @@ROWCOUNT as total
INTO #TempRankings
FROM #TempRankings2
--ONLY THEN CAN WE INSERT THE TOTAL NUMBER
INSERT INTO def_rankings (zip, be_no, sic, score, pp_bid, city, state, distance, rank, total)
SELECT zip, be_no, sic, score, pp_bid, city, state, distance, rank, total --@@IDENTITY
FROM #TempRankings
DROP TABLE #TempRankings
DROP TABLE #TempRankings2
DELETE FROM #zipsic WHERE ID = @id
END
Even though I am able to get rid of the cursor as shown above, I still have to loop over the records one at a time. I guess that does not qualify as a set-based treatment.
Morpheus, I haven't used the profiler before so I'll read up on how to do so. This sounds like it could be a big help in figuring out what is going on. Even though I conceptually do the rankings by group, I don't actually use the GROUP keyword in the procedure. But definitely good to know!
Lynn, I will take your improved select for a test run. I would love to hear if you think of a way to do this task set-wise. Also, I am not sure I can use INSERT INTO on both temp tables because I get an error saying IDENTITY() can only be used with a SELECT with an INTO clause. But I will use this hint wherever it lets me. Just need to look up how to declare a temp table before actually using it.
Christopher, I will also try indexes on the temp tables. But the only table that should be getting larger is the def_rankings table, and it is only used for INSERTS. Since the #temp_map_city2zip and the cursor do not change, and #temprankings and #temprankings2 are dropped each time through the cursor, I wonder if this will solve the slowing problem. No doubt that it will help shave time off the cursor overall, though.
Thanks again for the great hints! Let me know if there is some set-related solution that someone can envision.
Danny
July 11, 2007 at 10:09 am
You can define the identity column in the create table when creating the # temp tables, then you won't need the IDENTITY() function in the insert statement. This will also be reset when you truncate the # temp table at the end of each iteration of the loop.
July 11, 2007 at 10:12 am
Sweet! What a great idea Lynn!
July 11, 2007 at 10:14 am
Also, you may want to put an index on this # temp table: #temp_map_city2zip.
July 11, 2007 at 10:20 am
One last change to suggest. In your procedure you are using @@IDENTITY, you may want to change that to SCOPE_IDENTITY(). If, during your processing an insert to another table with an identity field occurs you may get that value instead of the one you expect.
You may want to read up on these in BOL.
July 11, 2007 at 10:36 am
Could you post the DDL for the tables used in the procedure and some sample data, perhapes enough for 3 quick iterations of representative data?
July 11, 2007 at 12:14 pm
Hi Lynn, Sorry for the slow response. I am not sure how to get you the info you requested except to paste it into the body. I cannot figure out how to attach files. These tables are all large (even for 3 zips) so I will separate the data into different posts for each table. I just output csv files so that is what I am sending. If there is another way to get a scripted set of data, I am unfamiliar with how to do that.
if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[map_city2zip]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
drop table [dbo].[map_city2zip]
GO
if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[ZipFindZips2]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
drop table [dbo].[ZipFindZips2]
GO
if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[ZipsUpdated]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
drop table [dbo].[ZipsUpdated]
GO
if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[ZipFind]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
drop table [dbo].[ZipFind]
GO
if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[def_rankings]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
drop table [dbo].[def_rankings]
GO
CREATE TABLE [dbo].[map_city2zip] (
[city] [varchar] (100) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ,
[state] [varchar] (2) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ,
[zip] [varchar] (10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[ZipFindZips2] (
[zip] [varchar] (5) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ,
[adjacentZip] [varchar] (5) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ,
[distance] [tinyint] NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[ZipsUpdated] (
[zip] [varchar] (5) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ,
[getdate] [smalldatetime] NOT NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[ZipFind] (
[BE_NO] [int] NOT NULL ,
[zip] [varchar] (5) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ,
[points1] [float] NULL ,
[points2] [float] NULL ,
[points3] [float] NULL ,
[sic] [int] NOT NULL ,
[tid] [int] NOT NULL ,
[pp_bid] [money] NOT NULL ,
[dateCreated] [smalldatetime] NOT NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[def_rankings] (
[zip] [char] (10) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ,
[be_no] [int] NOT NULL ,
[sic] [int] NOT NULL ,
[score] [int] NOT NULL ,
[pp_bid] [money] NULL ,
[city] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[state] [char] (2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[distance] [int] NULL ,
[rank] [int] NULL ,
[total] [int] NULL ,
[dateCreated] [smalldatetime] NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[map_city2zip] WITH NOCHECK ADD
CONSTRAINT [PK_map_city2zip] PRIMARY KEY CLUSTERED
(
[city],
[state]
) WITH FILLFACTOR = 90 ON [PRIMARY]
GO
ALTER TABLE [dbo].[ZipFindZips2] WITH NOCHECK ADD
CONSTRAINT [PK_ZipFindZips2] PRIMARY KEY CLUSTERED
(
[zip],
[adjacentZip]
) WITH FILLFACTOR = 90 ON [PRIMARY]
GO
ALTER TABLE [dbo].[ZipsUpdated] WITH NOCHECK ADD
CONSTRAINT [PK_ZipsUpdated] PRIMARY KEY CLUSTERED
(
[zip]
) WITH FILLFACTOR = 90 ON [PRIMARY]
GO
ALTER TABLE [dbo].[ZipFind] WITH NOCHECK ADD
CONSTRAINT [PK_ZipFind] PRIMARY KEY CLUSTERED
(
[BE_NO],
[zip],
[sic],
[tid]
) WITH FILLFACTOR = 90 ON [PRIMARY]
GO
ALTER TABLE [dbo].[ZipsUpdated] WITH NOCHECK ADD
CONSTRAINT [DF_ZipsUpdated_getdate] DEFAULT (getdate()) FOR [getdate]
GO
ALTER TABLE [dbo].[ZipFind] WITH NOCHECK ADD
CONSTRAINT [DF__zipFind__pp_bid__6E4C3B47] DEFAULT (0) FOR [pp_bid],
CONSTRAINT [DF__zipFind__dateCre__6F405F80] DEFAULT (getdate()) FOR [dateCreated]
GO
ALTER TABLE [dbo].[def_rankings] WITH NOCHECK ADD
CONSTRAINT [DF_danny_dateCreated] DEFAULT (getdate()) FOR [dateCreated]
GO
CREATE INDEX [Index_ZipFindZips2_zip] ON [dbo].[ZipFindZips2]([zip], [distance]) WITH FILLFACTOR = 90 ON [PRIMARY]
GO
CREATE INDEX [Index_ZipFindZips2_adj_zip_distance] ON [dbo].[ZipFindZips2]([adjacentZip], [distance]) WITH FILLFACTOR = 90 ON [PRIMARY]
GO
CREATE INDEX [Index_ZipFindZips2_all] ON [dbo].[ZipFindZips2]([zip], [adjacentZip], [distance]) WITH FILLFACTOR = 90 ON [PRIMARY]
GO
CREATE INDEX [Index_ZipFindZip2_zip_only] ON [dbo].[ZipFindZips2]([zip]) WITH FILLFACTOR = 90 ON [PRIMARY]
GO
July 11, 2007 at 12:15 pm
map_city2zip
==================
miami,FL,33101
miami beach,FL,33109
new york,NY,10001
new york city,NY,10001
ny,NY,10001
nyc,NY,10001
Viewing 15 posts - 1 through 15 (of 23 total)
You must be logged in to reply to this topic. Login to reply