July 31, 2007 at 8:08 am
Hi
First of all, I'm very far from being an expert on any SQL at all, so my code probably looks unnecessarily complicated and this is probably a newbie question, even if a bit of a hassle to solve...
I am limited in the respect that I'm only able to preform queries, I do not have the right to post data, create views nor functions.
I have a fairly large table to work with (>3.000.000 rows), and I'm trying to extract statistics from it.
The table covers "events" and contains the time of the events and the station it concerns, and so on...
"system update" is event_type_id=5, and I want to calculate statistics for crashes (event_type_id=2) ONLY for those stations that have been updated.
(maths: amount of crashes divided by the amount of days it has been a certain version, over the entire population of stations that have been updated)
Here is my code, it's one of two, of which the other covers the other version (614 instead of 597)
declare @stop datetime
set @stop='2007/07/03'
/*the last day the database was open for inserting events*/
select CAST(count(id) AS float)/SUM(daysOld)
from events,
(
SELECT station_id, MIN(event_date_time) as start,
MIN(change_date) as change,
/*If any of the days count is null they will be replaced by the total amount of days or 0*/
ISNULL(CAST(@stop AS float)-CAST(MIN(change_date) AS float),0)
AS daysNew,
ISNULL(CAST(MIN(change_date) AS float)-CAST(MIN(event_date_time) AS float),
CAST(@stop AS float)-CAST(MIN(event_date_time) AS float))
AS daysOld
from events
LEFT JOIN
(
SELECT station_id as changed_station, event_date_time as change_date
FROM events
WHERE event_type_id=5
)
change
ON station_id=change.changed_station
group by station_id
)
tempus
WHERE event_type_id=2 AND version=597 AND tempus.station_id in
(select distinct station_id from events where event_type_id=5)
I gave this five minutes of execution before I interrupted it. If I remove the "in (select)", it completes the query in a matter of seconds.
It does not have any problems when I set tempus.station_id to equal a specific number in the WHERE clause, only when I add the "in(select)"
Please note that this code is in "trial and horror" phase, so I'm doing some unnecessary selects merely because I used them in a previous attempt and I'm not ready to file them away quite yet.
I would appreciate any ideas of how to solve this, because I have spent too much worktime on it already.
July 31, 2007 at 9:49 am
Marie-Sofie
Does this get you any closer? I've left it to you to do the manipulation of the figures. Obviously it's not tested since there was no sample data, and performance will depend on indexing and other factors.
WITH
UpdatedStations (station_id, DaysSinceUpdate) AS (
SELECT station_id, DATEDIFF(day, MAX(event_date_time), GETDATE())
FROM events
WHERE event_id = 5
GROUP BY station_id)
select station_id, DaysSinceUpdate, count(*)
from UpdatedStations u
join events e
on u.station_id = e.station_id
where e.event_id = 2
group by station_id, DaysSinceUpdate
John
July 31, 2007 at 1:00 pm
Try exists instead
WHERE event_type_id=2 AND version=597 AND tempus.station_id in
(select distinct station_id from events where event_type_id=5)
WHERE event_type_id=2 AND version=597
AND EXISTS (select * from events as p9 where p9.event_type_id=5 and tempus.station_id = p9.station_id)
N 56°04'39.16"
E 12°55'05.25"
July 31, 2007 at 1:35 pm
-- Prepare sample data
DECLARE @Events TABLE (StationID INT, EventDateTime DATETIME, EventTypeID INT, Version INT)
INSERT @Events
SELECT 1, '20070101', 5, 597 UNION ALL
SELECT 1, '20070301', 2, NULL UNION ALL
SELECT 1, '20070304', 5, 597 UNION ALL
SELECT 1, '20070720', 2, NULL UNION ALL
SELECT 1, '20070721', 2, NULL UNION ALL
SELECT 1, '20070725', 5, 614 UNION ALL
SELECT 2, '20070211', 5, 322 UNION ALL
SELECT 2, '20070212', 5, 597 UNION ALL
SELECT 2, '20070302', 2, NULL UNION ALL
SELECT 2, '20070303', 5, 597 UNION ALL
SELECT 2, '20070529', 2, NULL UNION ALL
SELECT 2, '20070611', 2, NULL UNION ALL
SELECT 2, '20070701', 5, 614
;WITH Install (StationID, thisDT, Item, Grp, Version)
AS (
SELECT StationID,
EventDateTime,
ROW_NUMBER() OVER (PARTITION BY StationID ORDER BY EventDateTime) AS Item,
ROW_NUMBER() OVER (ORDER BY StationID, EventDateTime) AS Grp,
Version
FROM @Events
WHERE EventTypeID = 5
)
,Mtrx (StationID, Version, thisDT, nextDT, Today, Grp)
AS (
SELECT i1.StationID,
i1.Version,
i1.thisDt,
COALESCE(i2.thisDt, CURRENT_TIMESTAMP),
CASE
WHEN i2.thisDT IS NULL THEN 1
ELSE 0
END AS Today,
i1.Grp
FROM Install AS i1
LEFT JOIN Install AS i2 ON i2.StationID = i1.StationID
AND i2.Item = i1.Item + 1
)
,Streak (StationID, Version, DaysPassed, Today)
AS (
SELECT m.StationID,
m.Version,
DATEDIFF(DAY, m.thisDT, COALESCE(e.EventDateTime, m.nextDT)) AS DaysPassed,
m.Today
FROM Mtrx AS m
LEFT JOIN @Events AS e ON e.StationID = m.StationID
AND e.EventDateTime BETWEEN m.thisDT AND m.nextDT
AND e.EventTypeID = 2
)
-- Show some statistics
SELECT StationID,
Version,
MAX(DaysPassed) AS LongestStreak,
MAX(CASE WHEN Today = 1 THEN DaysPassed ELSE 0 END) AS DaysSinceLastInstall
FROM Streak
GROUP BY StationID,
Version
ORDER BY StationID,
Version
N 56°04'39.16"
E 12°55'05.25"
August 1, 2007 at 3:26 am
Thank you for your help.
The EXISTS did not behave differently from IN, and I can't say I even understood the last answer, though it seems a much more thorough solution than my own.
I altered the WITH solution slightly, and it seems to have worked (though if my results are correct I cannot be certain... they do look as if they could be though)
This is my final solution, for after the change and before the change, respectively.
WITH UpdatedStations (station_id, DaysSinceUpdate) AS (
SELECT station_id, CAST(GETDATE() AS float) - CAST(MAX(event_date_time) AS float)
FROM events
WHERE event_type_id = 5
GROUP BY station_id)
select CAST(count(*) AS float) / SUM(DaysSinceUpdate)
from UpdatedStations u
join events e
on u.station_id = e.station_id
where e.event_type_id = 2 AND version=614;
WITH NotUpdatedStations (station_id, DaysBeforeUpdate) AS (
SELECT station_id, CAST(MAX(event_date_time) AS float) - oldestEvent
FROM events,
(
SELECT CAST(MIN(event_date_time) AS float) AS oldestEvent FROM events
)
oldEvents
WHERE event_type_id = 5
GROUP BY station_id, oldestEvent)
select CAST(count(*) AS float) / CAST(SUM(DaysBeforeUpdate) AS float)
from NotUpdatedStations u
join events e
on u.station_id = e.station_id
where e.event_type_id = 2 AND version=597
August 1, 2007 at 4:58 am
Ingen fara och lycka till.
Är det något mera så "post sample data and expected output" så det blir lättare för oss att förstå vad du är ute efter.
N 56°04'39.16"
E 12°55'05.25"
August 1, 2007 at 7:38 am
I've just realised that it doesn't work at all, for several reasons. The one being least my fault is the fact that
select count(t1.id) from table1 t1, table2 t2
will count the rows of table1 multiplied by the rows of table2, instead of only the rows of table1... (which I noticed while trying to show the total amount of rows for each table, in the same query)
There are far more things wrong with my statements, the dates are way off target, being limited by a WHERE clause that's irrelevant for it. (the MAX(event_date_time) should be for all events, not only the version changes)
I was simply staring myself so blind at it that I didn't see the error when I wrote it.
I'll try to post sample data, but since I have a generator doing it over 3.000.000 posts for me, I'm not sure it will make sense when I try to immitate the relevant bits.
I haven't been able to find any help section about how to get indentation and color codes straight here, so I apologize for how messy my samples are:
station_id - event_type_id - event_date_time - version
110010 - 2 - 2007/07/09 - 597
110014 - 5 - 2007/07/09 - N/A
110016 - 5 - 2007/07/09 - N/A
110010 - 2 - 2007/07/08 - 597
110014 - 2 - 2007/07/08 - 597
110015 - 2 - 2007/07/08 - 614
110016 - 2 - 2007/07/08 - 597
110017 - 2 - 2007/07/08 - 615
110018 - 2 - 2007/07/08 - 614
110010 - 2 - 2007/07/05 - 597
110014 - 2 - 2007/07/05 - 597
110015 - 5 - 2007/07/05 - N/A
110016 - 2 - 2007/07/05 - 597
110017 - 5 - 2007/07/05 - N/A
110018 - 5 - 2007/07/05 - N/A
(the version field isn't N/A for the version updates, but it looked a bit OT in the summary, because it is not to be used for those particular event types)
The result should show how many times per day a certain version is expected to crash (event_type_id=2)
The version changes over time, therefor you need the total amount of crashes for a specific version, and then sum up the amount of days-per-station a version is live. The days are summed up over all the stations because it's easier than to divide the crashes and the days over the changing amount of stations that run a certain version.
Say we have two stations, 1 and 2. After half the time, station 1 changes its version. Then we take the full time + half the time, and the total amount of crashes for both stations.
Crashes / time, gives us the average, and we can add to both sides of the division to get it over several stations.
The data generator that fills the database is supposed to generate data that simulates there being only 50% as many crashes (not 50% more, but half as many) for the new version as for the old.
However, this generator had a speed fix to only update about 85% of the stations. This fix did not include fixing the crash possibility for the stations that were no longer updated. Therefor the statistics will currently only be "correct" if you only calculate them based on the stations that have been updated.
The "starting date" to calculate time from, for the old versions, is simply the first registered date that exists in the database (in my example that would be 2007/07/05)
The "ending date", for the new version, can be taken either to be the current date or the last one registered in the database - it doesn't really matter in this case.
I am casting the dates to float to include differences in hh:mm:ss as well as in days (I didn't write them here though, it's not that relevant for the example)
Using my example, the results should show that the new one has had (approximately) 1 crash over 2 days (or 3 crashes over 6 days, when summed up - it gives the same result)
The old one however has had 2 crashes over 2 days
The ones with station_id=110010 has never had an event_type_id=5 event, and so it should be ignored in the query.
The query should also, preferably, be able to do everything directly and answer in one resultset, because I need to send the query and accept the results in another programming language.
I realise that might be unrealistic, but the option would be to stray off the ordered path and do this in the other programming language (which would be far easier for me, but isn't the goal of this task)
I hope this made some sort of sense, because I don't feel overly coherent at the moment.
August 1, 2007 at 9:15 am
With the test data posted above, what is the expected result?
N 56°04'39.16"
E 12°55'05.25"
August 2, 2007 at 2:51 am
Oh, sorry if I didn't make that clear enough in my post
The expected result is
0.3 or similar for entering version=614 in the query and
approximately 0.6 for entering version=597 in the query
I haven't quite calculated this perfectly, but for version 597 there should be approximately twice the amount of crashes as for 614. (The generator says that for each random crash, if it's a 614 it has a 50% chance of not having occurred... like getting to roll the dice a second time.)
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply