October 15, 2009 at 5:15 pm
Hi all,
Can someone send me in the right direction regarding the following issue:
I have a table with companies, employees and datetime fields. Each record contains one trip made by the employee. What I am trying to create is the fastest query to calculate the pause between 2 trips.
The table:
- ID (Unique)
- Company_ID
- Employee_ID
- Start_Trip
- Stop_Trip
To calculate the pause between 2 trips I have to join a record with the next record in chronological order. I cannot enter an indexed number in the table because trips are deleted, changed and added not in a chronological order. The number of records in the table is around 3M
What I tried so far:
Select top 10000 t1.Employee_ID, t1.id, t1.Stop_Trip , Next_Start_Trip =
(Select min(t2.Start_Trip) From Trips t2
Where t1.Company_ID = t2.Company_ID
And t1.Employee_ID = t2.Employee_ID
And t1.Stop_Trip < t2.Start_trip)
Execution time 00:07:02
Secondly I tried to create a view with a row number. The downside of this approach is that I cannot create an index on the view because of the Row_Number.
CREATE VIEW [dbo].[vFieldLog_Visits] AS
Select Company_ID, Employee_ID, id, Start_trip, Stop_trip,
ROW_NUMBER() OVER (Partition BY Company_ID,Employee_ID ORDER BY Start_Trip) AS rownum
From dbo.Trips Where Employee_ID != '0000000000'
Select top 10000 currow.Employee_ID, currow.id,
currow.Stop_Trip as Stop_Trip,
nextrow.Start_Trip as Next_Start_Trip
From vFieldLog_Visits currow
Inner Join vFieldLog_Visits nextrow
On currow.Company_ID = nextrow.Company_ID
And currow.Employee_ID = nextrow.Employee_ID
And currow.rownum = (nextrow.rownum - 1)
Execution time 00:02:32
I also tried using a CTE:
WITH Trips_X AS (
SELECT Company_ID,Employee_ID,Start_Trip,Stop_Trip,id,
ROW_NUMBER() OVER (Partition BY Company_ID,Employee_ID ORDER BY Start_Trip) AS rownum
FROM Trips Where Employee_ID != '0000000000')
Select top 10000 currow.Employee_ID, currow.id,
currow.Stop_Trip as Stop_Trip,
nextrow.Start_Trip as Next_Start_Trip
From Trips_X currow
Inner Join Trips_X nextrow
On currow.Company_ID = nextrow.Company_ID
And currow.Employee_ID = nextrow.Employee_ID
And currow.rownum = nextrow.rownum - 1
Execution time: 00:02:48
The queries are ok when you are trying to calculate a huge amount of pauses at once, but if you try to calculate a small amount of pauses then the execution times are not acceptable.
I tried to find a way to add a dynamic row number to the table on which I can create an index. Maybe even a view with an index. But what ever I try there is always something preventing me from achieving this.
Any ideas?
October 15, 2009 at 6:44 pm
Thanks for providing what you've already tried. Take a look at the article in my signature on Running Totals. I believe this to be the perfect method for this problem (aside from extraordinarily complex string manipulation methods).
Jeff is working on that article at the moment and there's not a lot of information, but if you'd provide some table schema and sample data (click the link in my signature on how to provide sample data to see how to do so in a manner we can use), I'll help you write the query.
It would be a stored procedure though, this method can't be used as a view. Not sure if that is acceptable in your situation.
October 15, 2009 at 8:36 pm
Hi, Thank you Garadin for willing to help me!
A stored procedure is no problem.
Sample data:
--===== Create the test table with
CREATE TABLE dbo.Trips2(
[id] [bigint] NOT NULL,
[Company_ID] [int] NOT NULL,
[Employee_ID] [nvarchar](50) NOT NULL,
[Start_Trip] [datetime] NOT NULL,
[Stop_Trip] [datetime] NOT NULL)
--===== Insert the test data into the test table
INSERT INTO dbo.Trips2
(id, Company_ID, Employee_ID, Start_Trip, Stop_Trip)
SELECT '114',5000,'0000000001','Oct 17 2007 1:00AM','Oct 17 2007 1:02AM' UNION ALL
SELECT '137',5000,'0000000001','Oct 17 2007 1:13AM','Oct 17 2007 1:17AM' UNION ALL
SELECT '144',5000,'0000000001','Oct 17 2007 1:21AM','Oct 17 2007 1:57AM' UNION ALL
SELECT '154',5000,'0000000001','Oct 17 2007 2:00AM','Oct 17 2007 2:13AM' UNION ALL
SELECT '155',5000,'0000000001','Oct 17 2007 2:31AM','Oct 17 2007 2:35AM' UNION ALL
SELECT '181',5000,'0000000002','Oct 17 2007 2:36AM','Oct 17 2007 2:53AM' UNION ALL
SELECT '186',5000,'0000000002','Oct 17 2007 3:01AM','Oct 17 2007 3:21AM' UNION ALL
SELECT '196',5000,'0000000003','Oct 17 2007 3:31AM','Oct 17 2007 3:45AM' UNION ALL
SELECT '214',5001,'0000000001','Oct 17 2007 1:00AM','Oct 17 2007 1:02AM' UNION ALL
SELECT '237',5001,'0000000001','Oct 17 2007 1:13AM','Oct 17 2007 1:17AM' UNION ALL
SELECT '244',5001,'0000000001','Oct 17 2007 1:21AM','Oct 17 2007 1:57AM' UNION ALL
SELECT '254',5001,'0000000001','Oct 17 2007 2:00AM','Oct 17 2007 2:13AM' UNION ALL
SELECT '255',5001,'0000000001','Oct 17 2007 2:31AM','Oct 17 2007 2:35AM' UNION ALL
SELECT '281',5001,'0000000001','Oct 17 2007 2:36AM','Oct 17 2007 2:53AM' UNION ALL
SELECT '286',5001,'0000000001','Oct 17 2007 3:01AM','Oct 17 2007 3:21AM' UNION ALL
SELECT '296',5001,'0000000001','Oct 17 2007 3:31AM','Oct 17 2007 3:45AM' UNION ALL
SELECT '314',5001,'0000000001','Oct 17 2007 4:00AM','Oct 17 2007 4:02AM' UNION ALL
SELECT '337',5001,'0000000001','Oct 17 2007 4:13AM','Oct 17 2007 4:17AM' UNION ALL
SELECT '344',5001,'0000000002','Oct 17 2007 4:21AM','Oct 17 2007 4:57AM' UNION ALL
SELECT '354',5001,'0000000002','Oct 17 2007 5:00AM','Oct 17 2007 5:13AM' UNION ALL
SELECT '355',5001,'0000000002','Oct 17 2007 5:31AM','Oct 17 2007 5:35AM' UNION ALL
SELECT '381',5001,'0000000002','Oct 17 2007 5:36AM','Oct 17 2007 5:53AM' UNION ALL
SELECT '386',5001,'0000000003','Oct 17 2007 6:01AM','Oct 17 2007 6:21AM' UNION ALL
SELECT '396',5001,'0000000003','Oct 17 2007 6:31AM','Oct 17 2007 6:45AM'
I am not sure if this will be enough for peromance testing but is will be enough for validating a query. If you need more please let me know!
I am very interessted in your 'running totals' article. I will definitely read it when it is ready.
October 15, 2009 at 10:36 pm
a composite index with company Id, employeeId and start_trip in it should speed it up quite a bit.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
October 15, 2009 at 11:14 pm
You are right! I am sorry for leaving this information out...
In dbo.Trips I have 2 indexes:
- Unique nonclustered index on ID
- Unique clustered index on Company_ID, Employee_ID, Start_trip
October 16, 2009 at 12:15 am
Excellent, thanks Tim, I'll whip something up for you in the morning. Is adding a field to the table an issue? Ordinarily, I'd do without, but it looks like you already have the index I need to do this, and that'd speed it up considerably. I fully understand that you can't rely on a sequence column to stay accurate over time because of changes made to the base data, but this solution would actually re-populate that field each time.
Small point of clarification though, it's Jeff's article, not mine ;-).
October 16, 2009 at 4:40 am
No, adding a column is not a problem. What should be concidered is that the amount of io's to this table is pretty huge. Mostly selects and inserts. Not so many updates.
If the extra column would mean a sequential index, will this not become a problem when inserting old data? It looks to me like this would slow down the table enormously. If not... Then this would be a great solution.
October 16, 2009 at 8:30 am
The new column(s) do not need to be added to the index, they just allow you to use this method without creating a temporary table. I've given you two methods below. The first just sequences, the second both sequences and calculates the difference in minutes between the two trips. I used #trips2 and #trips so that you could run this as is to see how it works.
--===== Create the test table with
CREATE TABLE dbo.#Trips2(
[id] [bigint] NOT NULL,
[Company_ID] [int] NOT NULL,
[Employee_ID] [nvarchar](50) NOT NULL,
[Start_Trip] [datetime] NOT NULL,
[Stop_Trip] [datetime] NOT NULL,
[Seq]int)-- Added Column
-- Clustered index that already exists
CREATE CLUSTERED INDEX Quirk ON #Trips2 (Company_ID, Employee_ID, Start_Trip)
--===== Insert the test data into the test table
INSERT INTO dbo.#Trips2
(id, Company_ID, Employee_ID, Start_Trip, Stop_Trip)
SELECT '114',5000,'0000000001','Oct 17 2007 1:00AM','Oct 17 2007 1:02AM' UNION ALL
SELECT '137',5000,'0000000001','Oct 17 2007 1:13AM','Oct 17 2007 1:17AM' UNION ALL
SELECT '144',5000,'0000000001','Oct 17 2007 1:21AM','Oct 17 2007 1:57AM' UNION ALL
SELECT '154',5000,'0000000001','Oct 17 2007 2:00AM','Oct 17 2007 2:13AM' UNION ALL
SELECT '155',5000,'0000000001','Oct 17 2007 2:31AM','Oct 17 2007 2:35AM' UNION ALL
SELECT '181',5000,'0000000002','Oct 17 2007 2:36AM','Oct 17 2007 2:53AM' UNION ALL
SELECT '186',5000,'0000000002','Oct 17 2007 3:01AM','Oct 17 2007 3:21AM' UNION ALL
SELECT '196',5000,'0000000003','Oct 17 2007 3:31AM','Oct 17 2007 3:45AM' UNION ALL
SELECT '214',5001,'0000000001','Oct 17 2007 1:00AM','Oct 17 2007 1:02AM' UNION ALL
SELECT '237',5001,'0000000001','Oct 17 2007 1:13AM','Oct 17 2007 1:17AM' UNION ALL
SELECT '244',5001,'0000000001','Oct 17 2007 1:21AM','Oct 17 2007 1:57AM' UNION ALL
SELECT '254',5001,'0000000001','Oct 17 2007 2:00AM','Oct 17 2007 2:13AM' UNION ALL
SELECT '255',5001,'0000000001','Oct 17 2007 2:31AM','Oct 17 2007 2:35AM' UNION ALL
SELECT '281',5001,'0000000001','Oct 17 2007 2:36AM','Oct 17 2007 2:53AM' UNION ALL
SELECT '286',5001,'0000000001','Oct 17 2007 3:01AM','Oct 17 2007 3:21AM' UNION ALL
SELECT '296',5001,'0000000001','Oct 17 2007 3:31AM','Oct 17 2007 3:45AM' UNION ALL
SELECT '314',5001,'0000000001','Oct 17 2007 4:00AM','Oct 17 2007 4:02AM' UNION ALL
SELECT '337',5001,'0000000001','Oct 17 2007 4:13AM','Oct 17 2007 4:17AM' UNION ALL
SELECT '344',5001,'0000000002','Oct 17 2007 4:21AM','Oct 17 2007 4:57AM' UNION ALL
SELECT '354',5001,'0000000002','Oct 17 2007 5:00AM','Oct 17 2007 5:13AM' UNION ALL
SELECT '355',5001,'0000000002','Oct 17 2007 5:31AM','Oct 17 2007 5:35AM' UNION ALL
SELECT '381',5001,'0000000002','Oct 17 2007 5:36AM','Oct 17 2007 5:53AM' UNION ALL
SELECT '386',5001,'0000000003','Oct 17 2007 6:01AM','Oct 17 2007 6:21AM' UNION ALL
SELECT '396',5001,'0000000003','Oct 17 2007 6:31AM','Oct 17 2007 6:45AM'
--------------- Option 1 : Sequence the Trips ------------
DECLARE @Seq int,
@CompanyID int,
@EmployeeID varchar(20)
UPDATE #Trips2
SET @Seq = Seq = CASE WHEN Company_ID = @CompanyID AND Employee_ID = @EmployeeID THEN @Seq + 1 ELSE 1 END,
@CompanyID = Company_ID,
@EmployeeID = Employee_ID
FROM #Trips2 OPTION (MAXDOP 1)
-- Show the results
SELECT * FROM #Trips2
GO
--===== Create the test table with
CREATE TABLE dbo.#Trips(
[id] [bigint] NOT NULL,
[Company_ID] [int] NOT NULL,
[Employee_ID] [nvarchar](50) NOT NULL,
[Start_Trip] [datetime] NOT NULL,
[Stop_Trip] [datetime] NOT NULL,
[Seq]int,-- Added Column
[Diff]int)-- Added Column
-- Clustered index that already exists
CREATE CLUSTERED INDEX Quirk ON #Trips (Company_ID, Employee_ID, Start_Trip)
--===== Insert the test data into the test table
INSERT INTO dbo.#Trips
(id, Company_ID, Employee_ID, Start_Trip, Stop_Trip)
SELECT '114',5000,'0000000001','Oct 17 2007 1:00AM','Oct 17 2007 1:02AM' UNION ALL
SELECT '137',5000,'0000000001','Oct 17 2007 1:13AM','Oct 17 2007 1:17AM' UNION ALL
SELECT '144',5000,'0000000001','Oct 17 2007 1:21AM','Oct 17 2007 1:57AM' UNION ALL
SELECT '154',5000,'0000000001','Oct 17 2007 2:00AM','Oct 17 2007 2:13AM' UNION ALL
SELECT '155',5000,'0000000001','Oct 17 2007 2:31AM','Oct 17 2007 2:35AM' UNION ALL
SELECT '181',5000,'0000000002','Oct 17 2007 2:36AM','Oct 17 2007 2:53AM' UNION ALL
SELECT '186',5000,'0000000002','Oct 17 2007 3:01AM','Oct 17 2007 3:21AM' UNION ALL
SELECT '196',5000,'0000000003','Oct 17 2007 3:31AM','Oct 17 2007 3:45AM' UNION ALL
SELECT '214',5001,'0000000001','Oct 17 2007 1:00AM','Oct 17 2007 1:02AM' UNION ALL
SELECT '237',5001,'0000000001','Oct 17 2007 1:13AM','Oct 17 2007 1:17AM' UNION ALL
SELECT '244',5001,'0000000001','Oct 17 2007 1:21AM','Oct 17 2007 1:57AM' UNION ALL
SELECT '254',5001,'0000000001','Oct 17 2007 2:00AM','Oct 17 2007 2:13AM' UNION ALL
SELECT '255',5001,'0000000001','Oct 17 2007 2:31AM','Oct 17 2007 2:35AM' UNION ALL
SELECT '281',5001,'0000000001','Oct 17 2007 2:36AM','Oct 17 2007 2:53AM' UNION ALL
SELECT '286',5001,'0000000001','Oct 17 2007 3:01AM','Oct 17 2007 3:21AM' UNION ALL
SELECT '296',5001,'0000000001','Oct 17 2007 3:31AM','Oct 17 2007 3:45AM' UNION ALL
SELECT '314',5001,'0000000001','Oct 17 2007 4:00AM','Oct 17 2007 4:02AM' UNION ALL
SELECT '337',5001,'0000000001','Oct 17 2007 4:13AM','Oct 17 2007 4:17AM' UNION ALL
SELECT '344',5001,'0000000002','Oct 17 2007 4:21AM','Oct 17 2007 4:57AM' UNION ALL
SELECT '354',5001,'0000000002','Oct 17 2007 5:00AM','Oct 17 2007 5:13AM' UNION ALL
SELECT '355',5001,'0000000002','Oct 17 2007 5:31AM','Oct 17 2007 5:35AM' UNION ALL
SELECT '381',5001,'0000000002','Oct 17 2007 5:36AM','Oct 17 2007 5:53AM' UNION ALL
SELECT '386',5001,'0000000003','Oct 17 2007 6:01AM','Oct 17 2007 6:21AM' UNION ALL
SELECT '396',5001,'0000000003','Oct 17 2007 6:31AM','Oct 17 2007 6:45AM'
---------------- Option 2 : Seq + Calc the Time Diff -------------
DECLARE @Seqint,
@CompanyIDint,
@EmployeeID varchar(20),
@PrevEnddatetime,
@Diffint
UPDATE #Trips
SET @Seq = Seq = CASE WHEN Company_ID = @CompanyID AND Employee_ID = @EmployeeID THEN @Seq + 1 ELSE 1 END,
@Diff = Diff = CASE WHEN Company_ID = @CompanyID AND Employee_ID = @EmployeeID THEN DATEDIFF(n,@PrevEnd,Start_Trip) ELSE NULL END,
@CompanyID = Company_ID,
@EmployeeID = Employee_ID,
@PrevEnd = Stop_Trip
FROM #Trips OPTION (MAXDOP 1)
-- View the Results
SELECT * FROM #Trips
October 16, 2009 at 5:30 pm
Thanks a lot. I am going to test this tomorrow and post the results.
Regards, Tim
October 18, 2009 at 1:35 pm
:unsure: Sorry if I am mistaking but, this way it looks like I need to recalculate and update 30 Million records each time before I can query the needed time differences. The sequence of the records can change any moment records are not update and inserted chronological. I doubt if 30 million updates will be the fastest way to handle this issue.
For example:
To query the sum of all time differences per employee (15.000 employees) of 10 days ago:
Select Employee_ID, Sum(TimeDiff)
From (
Select t1.Employee_ID, Datediff("ss", t1.Stop_Trip,
(Select min(t2.Start_Trip) From Trips t2
Where t1.Company_ID = t2.Company_ID
And t1.Employee_ID = t2.Employee_ID
And t1.Stop_Trip < t2.Start_trip)) as TimeDiff
From Trips t1
Where cast(cast(t1.Start_Trip as integer) as datetime)
= cast(cast(DateAdd("dd", -10, getdate()) as integer) as datetime)
) As RefTable
Group By Employee_ID
This will work but it takes me over 3 minutes to get the result.
If I had a sequential index, like you adviced, it will take me a few seconds, but the problem is maintaining this column. When users are changing the order of the records in high rate the sequential index should be recalculated. This will slow down the database.
Regards,
Tim
October 18, 2009 at 2:42 pm
If the total of the timediffs is the intended result - why go through all of that trouble? In every case - that's going to be:
Max(stop time) - Min(start time) -sum(stop - start)
forget the sequential index in that case. what I WOULD go for is a calculated column storing the datediff of stop and start times. persist that, and you should have a fairly fast option with an index.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
October 18, 2009 at 3:16 pm
Hi there,
trying this out on around 9 million rows and just updating those records that I needed to execution completed in a couple of seconds.
DECLARE @Seq int,
@CompanyID int,
@EmployeeID varchar(20),
@PrevEnd datetime,
@Diff int
UPDATE Trips
SET @Seq = Seq = CASE WHEN Company_ID = @CompanyID AND Employee_ID = @EmployeeID THEN @Seq + 1 ELSE 1 END,
@Diff = Diff = CASE WHEN Company_ID = @CompanyID AND Employee_ID = @EmployeeID THEN DATEDIFF(n,@PrevEnd,Start_Trip) ELSE NULL END,
@CompanyID = Company_ID,
@EmployeeID = Employee_ID,
@PrevEnd = Stop_Trip
FROM Trips
WHERE
Start_Trip >= DATEADD(DD, 0, DATEDIFF(DD, 0, DATEADD(DD, -10, GETDATE())))
AND Start_Trip < DATEADD(DD, 0, DATEDIFF(DD, 0, DATEADD(DD, -9, GETDATE())))
OPTION (MAXDOP 1)
SELECT
Employee_ID, SUM(Diff)
FROM Trips
WHERE
Start_Trip >= DATEADD(DD, 0, DATEDIFF(DD, 0, DATEADD(DD, -10, GETDATE())))
AND Start_Trip < DATEADD(DD, 0, DATEDIFF(DD, 0, DATEADD(DD, -9, GETDATE())))
GROUP BY Employee_ID
Updating the entire c. 9,000,000 record table took under 30 seconds on my desktop machine.
Allister
October 18, 2009 at 3:31 pm
@Allister: Are you able to make this query without the need for the new columns sec or diff? If I do need to recalculate these column each time I make the query, I do not need to store them right?
October 18, 2009 at 4:05 pm
Updating the entire c. 9,000,000 record table took under 30 seconds on my desktop machine.
Did you really update 9M records in 30 seconds? That is fast. How come? When I try to update our tables simply changing a true or false will take more than 5 minutes for 9M records. Does it have somethink to do with the "OPTION (MAXDOP 1)" ?
Regards,
Tim
Viewing 15 posts - 1 through 15 (of 17 total)
You must be logged in to reply to this topic. Login to reply