June 4, 2009 at 10:58 am
[font="Courier New"]I have a table of purchases with 50 million records. Each purchase has an ID INT Identity(1,1) field, a ProductTypeID INT field, a PurchaseDateTime DATETIME field, and a TimeBetweenPurchasesOfProductType field. The purchases are inserted with a NULL in the TimeBetweenPurchasesOfProductType field. I'm trying to come up with a way, without using a costly cursor on such a huge table, to join this table and a second, aliased, copy of this table dealigned by one purchase of that product type. The problem, is that the last purchase of a product type doesn't happen sequentially. An Apple may have been purchased 309 purchases ago, so I can't just say ID-1 to find the previous apple purchase to calculate the difference in PurchaseDateTime for the TimeBetweenPurchasesOfProductType field. This problem is absolutely baffling me. I'm employing a faster solution using a trigger, and a TOP 1 ... WHERE Last.ID < Current.ID AND Last.ProductTypeID = Current.ProductTypeID ORDER BY DESC, but I still have to fill in all of the TimeBetweenPurchasesOfProductType fields in already existing data.
I could implement an O(n^2) cursor, but that is just bad business. I want to do this right.
Here is a sample of what I'm trying to get at.
(ProductType table is irrelevant, but imagine Apple = ProductType.ID 1, Beans = ProductType.ID 2, Roses = ProductType.ID 3, etc.)
DECLARE @Purchases TABLE
(
ID INT Identity(1,1) PRIMARY KEY,
ProductTypeID INT NOT NULL,
PurchaseDateTime DATETIME NOT NULL,
TimeBetweenPurchasesOfProductType DATETIME
)
INSERT INTO @Purchases
-- UNION ALL rather than UNION: the rows are distinct, so there is nothing to dedupe
SELECT 1 [ProductTypeID], '6/1/2009 11:40:00.000' [PurchaseDateTime], NULL [TimeBetweenPurchasesOfProductType]
UNION ALL
SELECT 2 [ProductTypeID], '6/1/2009 11:47:00.000' [PurchaseDateTime], NULL [TimeBetweenPurchasesOfProductType]
UNION ALL
SELECT 2 [ProductTypeID], '6/1/2009 11:51:00.000' [PurchaseDateTime], NULL [TimeBetweenPurchasesOfProductType]
UNION ALL
SELECT 3 [ProductTypeID], '6/1/2009 12:11:00.000' [PurchaseDateTime], NULL [TimeBetweenPurchasesOfProductType]
UNION ALL
SELECT 1 [ProductTypeID], '6/1/2009 13:39:00.000' [PurchaseDateTime], NULL [TimeBetweenPurchasesOfProductType]
UNION ALL
SELECT 3 [ProductTypeID], '6/1/2009 17:39:00.000' [PurchaseDateTime], NULL [TimeBetweenPurchasesOfProductType]
/*
Resultant Purchases table:
ID, ProductTypeID, PurchaseDateTime, TimeBetweenPurchasesOfProductType
--------------------------------------------------------------------------
1 1 (Apple) '6/1/2009 11:40:00.000' NULL
2 2 (Beans) '6/1/2009 11:47:00.000' NULL
3 2 (Beans) '6/1/2009 11:51:00.000' NULL
4 3 (Roses) '6/1/2009 12:11:00.000' NULL
5 1 (Apple) '6/1/2009 13:39:00.000' NULL
6 3 (Roses) '6/1/2009 17:39:00.000' NULL
I want to join this table with itself so that each row matches up with the previous purchase of the same ProductTypeID, letting me compute the time difference (ISNULLs, DATEDIFFs, etc. were left out for brevity):
A.ID ... B.ID ... A.PurchaseDateTime - B.PurchaseDateTime [TimeBetweenPurchasesOfProductType]
1 ... NULL ... NULL = 0
2 ... NULL ... NULL = 0
3 ... 2 ... '6/1/2009 11:51:00.000' - '6/1/2009 11:47:00.000' = 0:04
4 ... NULL ... NULL = 0
5 ... 1 ... '6/1/2009 13:39:00.000' - '6/1/2009 11:40:00.000' = 1:59
6 ... 4 ... '6/1/2009 17:39:00.000' - '6/1/2009 12:11:00.000' = 5:28
*/
Matching up the two ID fields is my only concern; I can deal with the calculations and reintegrating the actual data into the join after the fact. I'm just stumped on how to join this table with itself to get the previous row of the same type! I keep getting back-referencing issues in almost everything I come up with. And a cursor is undesirable here because of the table's size and the number of places I'll want to reuse this back-search for additional calculations on a running database.
Any help you can provide would be outstanding! I'm stumped!!! You can E-mail me at mraarone et yah00 d0t c0m if you'd like.
June 4, 2009 at 12:00 pm
For some reason, I can't post the code directly. Please see the attached file for something I hope helps.
June 4, 2009 at 12:07 pm
Lynn Pettis (6/4/2009)
For some reason, I can't post the code directly. Please see the attached file for something I hope helps.
Hey Lynn, I've been trying to do this query for the past hour with a recursive CTE; wouldn't that be faster on such a big table?
I know I've seen a post somewhere with either Jeff or Grant posting something really neat for self recursive queries like this one...
My bad, can't get it to work, 😉
Cheers,
J-F
June 4, 2009 at 12:11 pm
J-F Bergeron (6/4/2009)
Lynn Pettis (6/4/2009)
For some reason, I can't post the code directly. Please see the attached file for something I hope helps.
Hey Lynn, I've been trying to do this query for the past hour with a recursive CTE; wouldn't that be faster on such a big table?
I know I've seen a post somewhere with either Jeff or Grant posting something really neat for self recursive queries like this one...
My bad, can't get it to work, 😉
No. This is a variation of the Running Totals problem. However, thinking about it, there is a way using a CTE and the row_number() function that would probably work as well.
June 4, 2009 at 12:30 pm
Here is a variation using a CTE and the row_number() function. Not wanting to fight with it, I am just sending it up as an attachment as well.
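The general shape is something like this (a sketch, assuming the sample table above is materialized as dbo.Purchases with the gap stored as minutes in an INT column; not necessarily the attached code):
;WITH OrderedPurchases AS
(
  -- Number each purchase within its ProductTypeID, oldest first.
  SELECT ID, ProductTypeID, PurchaseDateTime,
         ROW_NUMBER() OVER (PARTITION BY ProductTypeID
                            ORDER BY PurchaseDateTime, ID) AS rn
  FROM dbo.Purchases
)
UPDATE p
SET TimeBetweenPurchasesOfProductType =
      DATEDIFF(mi, prev.PurchaseDateTime, cur.PurchaseDateTime)
FROM dbo.Purchases p
INNER JOIN OrderedPurchases cur ON cur.ID = p.ID
LEFT JOIN OrderedPurchases prev
       ON prev.ProductTypeID = cur.ProductTypeID
      AND prev.rn = cur.rn - 1 -- the immediately preceding purchase of the same type
The first purchase of each type has no prev row, so DATEDIFF returns NULL and the column stays NULL.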
June 4, 2009 at 1:06 pm
Lynn Pettis (6/4/2009)
Here is a variation using a CTE and the row_number() function. Not wanting to fight with it, I am just sending it up as a attachment as well.
Ahh, thanks Lynn. I was trying to use the ID directly, but with row_number you always know the gap is exactly 1. That explains why I was not able to build the query.
Thanks again,
Cheers,
J-F
June 4, 2009 at 2:41 pm
Thank you all very much! Absolutely useful! This was all mind-boggling; I had to do some research and experimentation to understand what is going on in your innovative code.
Here's my take on the trigger part that handles real-time inserts with calculations. The clustered index will reorder my huge table, understood. That works well for efficient by-ProductTypeID activities; specifically, it is useful for an after-hours update to initially fix the data already in the table. Future real-time calculations will have to be handled another way, though, since I can't reindex the table between inserts. So I wrote an example trigger that exploits the fact that there are only a few thousand ProductTypes in the (irrelevant) ProductTypes table, which holds the dereference of the ProductTypeID. I'll post what I've got below.
This issue is similar, but external, to the indexing solutions above. The real-time calculation is still a crux for overall speed, since every INSERT INTO of a few hundred rows forces a MAX/GROUP BY calculation over the entire 50-million-record table. You guys are so much smarter than me at this. Even though the MAX calculation doesn't take that long, across real-time inserts it adds up. What can I do to integrate some of the concepts you used above into the following? (This is a problem that's made me bang my head against my keyboard for three and a half weeks, mind you.)
(PS: Lynn, I possibly just had the same pasting problem you did. Fixed it by pasting my code from Management Studio into notepad, and could copy/paste it into this window without any issues.)
--==== Create the Purchases table.
CREATE TABLE dbo.Purchases
(
ID INT Identity(1,1) PRIMARY KEY,
ProductTypeID INT NOT NULL,
PurchaseDateTime DATETIME NOT NULL,
TimeBetweenPurchasesOfProductType INT NULL -- Modified to hold the gap in minutes
);
GO
--==== Insert some initial data.
INSERT INTO dbo.Purchases (ProductTypeID, PurchaseDateTime, TimeBetweenPurchasesOfProductType)
SELECT 1, '6/1/2009 11:40:00.000', NULL
INSERT INTO dbo.Purchases (ProductTypeID, PurchaseDateTime, TimeBetweenPurchasesOfProductType)
SELECT 2, '6/1/2009 11:47:00.000', NULL
INSERT INTO dbo.Purchases (ProductTypeID, PurchaseDateTime, TimeBetweenPurchasesOfProductType)
SELECT 2, '6/1/2009 11:51:00.000', NULL
INSERT INTO dbo.Purchases (ProductTypeID, PurchaseDateTime, TimeBetweenPurchasesOfProductType)
SELECT 3, '6/1/2009 12:11:00.000', NULL
INSERT INTO dbo.Purchases (ProductTypeID, PurchaseDateTime, TimeBetweenPurchasesOfProductType)
SELECT 1, '6/1/2009 13:39:00.000', NULL
INSERT INTO dbo.Purchases (ProductTypeID, PurchaseDateTime, TimeBetweenPurchasesOfProductType)
SELECT 3, '6/1/2009 17:39:00.000', NULL
GO
SELECT * FROM dbo.Purchases
GO
--==== Create the trigger: snapshot the greatest ID for each ProductTypeID, join
--==== that back to Purchases to fetch each ProductTypeID's most recent purchase,
--==== then LEFT JOIN the "inserted" pseudo-table to it (LEFT, so the first
--==== purchase of a brand-new ProductTypeID is still inserted, with a NULL gap).
--==== Finally, compute the difference between the current and previous records
--==== and insert the amended rows into Purchases.
CREATE TRIGGER trInsteadOfInsertWithCalculations
ON dbo.Purchases
INSTEAD OF INSERT
AS
BEGIN
INSERT INTO dbo.Purchases (ProductTypeID, PurchaseDateTime, TimeBetweenPurchasesOfProductType)
SELECT
  inserted.ProductTypeID,
  inserted.PurchaseDateTime,
  DATEDIFF(mi, PreviousPurchases.PurchaseDateTime, inserted.PurchaseDateTime) [TimeBetweenPurchasesOfProductType]
FROM inserted
LEFT JOIN
(
  SELECT ProductTypeID, MAX(ID) [PreviousPurchaseID]
  FROM dbo.Purchases
  GROUP BY ProductTypeID
) PreviousPurchaseIDs ON PreviousPurchaseIDs.ProductTypeID = inserted.ProductTypeID
LEFT JOIN dbo.Purchases PreviousPurchases ON PreviousPurchases.ID = PreviousPurchaseIDs.PreviousPurchaseID
-- LEFT JOINs keep the first purchase of a brand-new ProductTypeID, which has
-- no previous row; its DATEDIFF comes out NULL, as desired.
END
GO
--==== There will not be repeats of the same ProductTypeID in each insert.
--==== Notice this is a mass insert, rather than row-by-row. Test the trigger.
INSERT INTO dbo.Purchases (ProductTypeID, PurchaseDateTime, TimeBetweenPurchasesOfProductType)
SELECT 1, '6/1/2009 15:09:00.000', NULL UNION ALL
SELECT 2, '6/1/2009 14:15:00.000', NULL UNION ALL
SELECT 3, '6/1/2009 19:12:00.000', NULL
GO
--==== Display what the above inserts, and subsequent trigger "INSTEAD OF" inserts produced.
SELECT * FROM dbo.Purchases
GO
--==== Drop the trigger.
DROP TRIGGER trInsteadOfInsertWithCalculations
GO
--==== Drop the dbo.Purchases table.
DROP TABLE dbo.Purchases
GO
June 4, 2009 at 2:59 pm
Hi,
I'm not sure I get what you are looking for here. You say you now have an after-hours solution to fix your data (the time between purchases). Do you really need to also implement a real-time solution?
I would guess your situation is one or the other: either you implement it on the fly as a real-time solution, or you do it daily, after hours.
If you are looking for a trigger that does that, and you can say without any doubt that no two rows with the same ProductTypeID will be inserted at once, then your solution seems to work perfectly.
Is it only the performance of your solution that bothers you? Because it will be hard to get an on-the-fly solution for a 50-million-row table...
Cheers,
J-F
June 4, 2009 at 3:19 pm
Yes sir, you are correct. This is another part of the solution, though not disconnected. I HAVE 50 million rows that need to be fixed, and the above solutions work outstandingly well for that! But this is an ongoing data collection: I need these calculations to keep coming in, real-time and up-to-date, without having to reindex to get the calculated data. The trigger was mentioned in my original post.
The trigger does work, but it's only the best I could come up with performance-wise. I don't completely understand how much performance you can gain from proper index usage. Considering the speed of your responses and your level of understanding, I was hoping to gain more insight into working with something this huge. I'm on the fence about reindexing the table every night just for running totals; it seems like something that should be done on the fly if you have the option. And having real-time data throughout the day is much preferred to only being able to report statistics and calculated fields the following day.
*sigh* It is a 50-million-row table... I'm just hoping there's a better solution; I'm maxed out.
June 4, 2009 at 3:44 pm
The first thing I'd like to emphasize is that having your clustered index on the ID (identity) column is not necessarily the best idea. You also need to look at how the data is accessed on a regular basis. If a large number of queries access the table by the ProductTypeID or date columns, especially range-based queries, then that may be where you want to spend your one and only clustered index.
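For illustration, that reshuffle might look something like this (a sketch only; PK_Purchases is a made-up name, since an inline PRIMARY KEY gets a system-generated constraint name you'd have to look up first):
-- Move the primary key off the clustered index...
ALTER TABLE dbo.Purchases DROP CONSTRAINT PK_Purchases -- hypothetical constraint name
ALTER TABLE dbo.Purchases ADD CONSTRAINT PK_Purchases PRIMARY KEY NONCLUSTERED (ID)
-- ...then cluster on the columns the range queries actually use.
CREATE CLUSTERED INDEX IX_Purchases_ProductType_Date
    ON dbo.Purchases (ProductTypeID, PurchaseDateTime)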
I didn't actually catch the trigger part of your post, as I focused on the mass update of the table. I'll have to look more closely at the use of a trigger, since that hopefully narrows the number of records that need updating, depending on how many records are inserted in a single batch. Are the records inserted one row at a time, or in multi-row inserts?
June 4, 2009 at 4:36 pm
Here is my test code creating a trigger to handle realtime updates. Please note that I create the trigger after inserting and updating the initial test data first.
Unfortunately, I still can't post directly to this thread even using Notepad first.
June 4, 2009 at 5:35 pm
Lynn,
The inserts come a couple hundred at a time, which should help, since it scales back the number of computations per insert by that amount. Each row in a batch is also of a distinct ProductType, which is great: I don't have to consider using any cursors on insert.
I am considering how to reconfigure the table's indexes to help the insert trigger. I believe you hit the nail on the head with indexing by ProductTypeID and date. The actual tables I'm using are much wider than the examples I'm giving here, which is part of my consideration; I need to read up on indexing to determine whether multiple indexes are feasible. The trigger finds the MAX over the Purchases ID, which is why it's not too slow right now. Once I alter the indexing, however, I expect that to change, and that's my greatest concern. But I do want to reindex, because all other calculations and statistics will focus on the ProductTypeID and date fields.
After thinking it through, I'm sure the current trigger and the reclustering/reindexing will not be compatible performance-wise. On one hand, reclustering gives me faster ProductTypeID-based calculations, but the inserts get slower, which hurts real-time performance. If I don't recluster, the MAX(ID)-based trigger stays fast, but the global Purchases computations take the hit. That's why I believe I need a new trigger built around the index fields ProductTypeID and PurchaseDateTime instead of the ID in the MAX(ID) part of the trigger.
June 4, 2009 at 7:05 pm
Have you looked at my trigger code? It doesn't use MAX() and it is an AFTER INSERT trigger, not an INSTEAD OF trigger.
The inserts may be somewhat slower, but with the requirement of updating the "Time between ProductType Purchases" column, I think it may be worthwhile. The only thing you can do is put it in a test environment and test it to see what happens.
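The attachment isn't reproduced in this thread, but an AFTER INSERT trigger along those lines might be shaped roughly like this (a sketch under the assumptions above, not necessarily the attached code):
CREATE TRIGGER trPurchases_AfterInsert
ON dbo.Purchases
AFTER INSERT
AS
BEGIN
  SET NOCOUNT ON
  -- Stamp only the rows just inserted, looking back for the most recent
  -- earlier purchase of the same product type (no MAX(), no INSTEAD OF).
  UPDATE p
  SET TimeBetweenPurchasesOfProductType =
        DATEDIFF(mi, prev.PurchaseDateTime, p.PurchaseDateTime)
  FROM dbo.Purchases p
  INNER JOIN inserted i ON i.ID = p.ID
  OUTER APPLY
  (
    SELECT TOP 1 older.PurchaseDateTime
    FROM dbo.Purchases older
    WHERE older.ProductTypeID = p.ProductTypeID
      AND older.ID < p.ID
    ORDER BY older.ID DESC
  ) prev
END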
June 4, 2009 at 7:22 pm
Lynn,
I am so sorry; I posted that message after it sat on my desktop for an hour, and I've been trying to edit it via iPhone for the past hour and a half. Your code is fantastic. I'm going to run some tests on it this weekend, but I believe that is EXACTLY what I was looking for. Long day; I can't believe I didn't think to take the code from your second attachment and put it in a FOR AFTER INSERT UPDATE clause. I'm slapping my forehead: for almost a month I've been dead set on INSTEAD OF INSERT, and that tripped me up.
I'm going to start testing tomorrow night, I'll post another reply then to keep you up to date. I'm very excited, and extremely appreciative for all of your help, I really needed it.
Thank you,
Aaron
June 10, 2009 at 3:33 pm
Okay, I was able to test the above solutions. First off, the CTE above uses about 4.8 GB of RAM and causes SQL Server to buffer about 7.5 GB of data to disk for the 50-million-row table. It takes an hour to do the bulk running-totals calculations, which is not too bad considering a cursor would likely take a lot longer.
The trigger from above is not a possibility for thousands of real-time inserts per second. So what I did is create a "RunningTotals" table with a 1:1 relationship to ProductTypes and turn each incoming purchase into an UPDATE of that table instead of an insert. A FOR UPDATE trigger then applies the same calculation as above (the previous row of data is available in the trigger's "deleted" table) and inserts that previous row into the master "Purchases" table. That brings the speed up to real-time.
Any statistical data that needs to be calculated on each subsequent purchase of the same ProductTypeID can then be accumulated compactly in the "RunningTotals" table before only the minimal amount of data is inserted into the very large Purchases table. And if a current snapshot of averages, standard deviations, variances, or custom statistical models is required, it can be accumulated via the same FOR UPDATE trigger.
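Roughly, the shape is this (a much-simplified sketch; the names are illustrative and my production tables are far wider):
-- One row per product type, holding its latest purchase.
CREATE TABLE dbo.RunningTotals
(
  ProductTypeID INT NOT NULL PRIMARY KEY, -- 1:1 with ProductTypes
  PurchaseDateTime DATETIME NOT NULL,
  TimeBetweenPurchasesOfProductType INT NULL
)
GO
CREATE TRIGGER trRunningTotals_AfterUpdate
ON dbo.RunningTotals
AFTER UPDATE
AS
BEGIN
  SET NOCOUNT ON
  -- The row being replaced is the previous purchase of this type;
  -- archive it into the master table.
  INSERT INTO dbo.Purchases (ProductTypeID, PurchaseDateTime, TimeBetweenPurchasesOfProductType)
  SELECT ProductTypeID, PurchaseDateTime, TimeBetweenPurchasesOfProductType
  FROM deleted
  -- Stamp the incoming row with the gap since that previous purchase. With
  -- RECURSIVE_TRIGGERS OFF (the default), this UPDATE does not re-fire the trigger.
  UPDATE rt
  SET TimeBetweenPurchasesOfProductType =
        DATEDIFF(mi, d.PurchaseDateTime, i.PurchaseDateTime)
  FROM dbo.RunningTotals rt
  INNER JOIN inserted i ON i.ProductTypeID = rt.ProductTypeID
  INNER JOIN deleted d ON d.ProductTypeID = rt.ProductTypeID
END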
That is the leanest way I can find to calculate this. Even the previous trigger I posted performs at competitive speed when the inner SELECT (or the P1 CTE from Lynn's example) is limited to a few thousand product types times two, instead of 50 million rows times two.
The production code is much more complicated than the example above, so altering it to match will take some time, but I will post the result for completion.
-- Aaron