Why does this stall?

  • Hi

    First of all, I'm very far from being an expert on any SQL at all, so my code probably looks unnecessarily complicated and this is probably a newbie question, even if a bit of a hassle to solve...

    I am limited in the respect that I'm only able to preform queries, I do not have the right to post data, create views nor functions.

    I have a fairly large table to work with (>3.000.000 rows), and I'm trying to extract statistics from it.

    The table covers "events" and contains the time of the events and the station it concerns, and so on...

    "system update" is event_type_id=5, and I want to calculate statistics for crashes (event_type_id=2) ONLY for those stations that have been updated.

    (maths: amount of crashes divided by the amount of days it has been a certain version, over the entire population of stations that have been updated)

    Here is my code, it's one of two, of which the other covers the other version (614 instead of 597)

    declare @stop datetime

    set @stop='2007/07/03'

    /*the last day the database was open for inserting events*/

    select CAST(count(id) AS float)/SUM(daysOld)

    from events,

    (

    SELECT station_id, MIN(event_date_time) as start,

    MIN(change_date) as change,

    /*If any of the days count is null they will be replaced by the total amount of days or 0*/

    ISNULL(CAST(@stop AS float)-CAST(MIN(change_date) AS float),0)

    AS daysNew,

    ISNULL(CAST(MIN(change_date) AS float)-CAST(MIN(event_date_time) AS float),

    CAST(@stop AS float)-CAST(MIN(event_date_time) AS float))

    AS daysOld

    from events

    LEFT JOIN

    (

    SELECT station_id as changed_station, event_date_time as change_date

    FROM events

    WHERE event_type_id=5

    )

    change

    ON station_id=change.changed_station

    group by station_id

    )

    tempus

    WHERE event_type_id=2 AND version=597 AND tempus.station_id in

    (select distinct station_id from events where event_type_id=5)

    I gave this five minutes of execution before I interrupted it. If I remove the "in (select)", it completes the query in a matter of seconds.

    It does not have any problems when I set tempus.station_id to equal a specific number in the WHERE clause, only when I add the "in(select)"

    Please note that this code is in "trial and horror" phase, so I'm doing some unnecessary selects merely because I used them in a previous attempt and I'm not ready to file them away quite yet.

    I would appreciate any ideas of how to solve this, because I have spent too much worktime on it already.

  • Marie-Sofie

    Does this get you any closer?  I've left it to you to do the manipulation of the figures.  Obviously it's not tested since there was no sample data, and performance will depend on indexing and other factors.

    WITH

    UpdatedStations (station_id, DaysSinceUpdate) AS (

    SELECT station_id, DATEDIFF(day, MAX(event_date_time), GETDATE())

    FROM events

    WHERE event_id = 5

    GROUP BY station_id)

    select station_id, DaysSinceUpdate, count(*)

    from UpdatedStations u

    join events e

    on u.station_id = e.station_id

    where e.event_id = 2

    group by station_id, DaysSinceUpdate

    John

  • Try exists instead

     

    WHERE event_type_id=2 AND version=597 AND tempus.station_id in

    (select distinct station_id from events where event_type_id=5)

    WHERE event_type_id=2 AND version=597

    AND EXISTS (select * from events as p9 where p9.event_type_id=5 and tempus.station_id = p9.station_id)

     


    N 56°04'39.16"
    E 12°55'05.25"

  • -- Prepare sample data

    DECLARE          @Events TABLE (StationID INT, EventDateTime DATETIME, EventTypeID INT, Version INT)

     

    INSERT           @Events

    SELECT           1, '20070101', 5, 597 UNION ALL

    SELECT           1, '20070301', 2, NULL UNION ALL

    SELECT           1, '20070304', 5, 597 UNION ALL

    SELECT           1, '20070720', 2, NULL UNION ALL

    SELECT           1, '20070721', 2, NULL UNION ALL

    SELECT           1, '20070725', 5, 614 UNION ALL

    SELECT           2, '20070211', 5, 322 UNION ALL

    SELECT           2, '20070212', 5, 597 UNION ALL

    SELECT           2, '20070302', 2, NULL UNION ALL

    SELECT           2, '20070303', 5, 597 UNION ALL

    SELECT           2, '20070529', 2, NULL UNION ALL

    SELECT           2, '20070611', 2, NULL UNION ALL

    SELECT           2, '20070701', 5, 614

     

    ;WITH Install (StationID, thisDT, Item, Grp, Version)

    AS (

                     SELECT           StationID,

                                       EventDateTime,

                                       ROW_NUMBER() OVER (PARTITION BY StationID ORDER BY EventDateTime) AS Item,

                                       ROW_NUMBER() OVER (ORDER BY StationID, EventDateTime) AS Grp,

                                       Version

                     FROM             @Events

                     WHERE            EventTypeID = 5

    )

    ,Mtrx (StationID, Version, thisDT, nextDT, Today, Grp)

    AS (

                     SELECT                             i1.StationID,

                                                        i1.Version,

                                                        i1.thisDt,

                                                        COALESCE(i2.thisDt, CURRENT_TIMESTAMP),

                                                        CASE

                                                                          WHEN i2.thisDT IS NULL THEN 1

                                                                          ELSE 0

                                                        END AS Today,

                                                        i1.Grp

                     FROM                               Install AS i1

                     LEFT JOIN        Install AS i2 ON i2.StationID = i1.StationID

                                                                          AND i2.Item = i1.Item + 1

    )

    ,Streak (StationID, Version, DaysPassed, Today)

    AS (

                     SELECT                             m.StationID,

                                                        m.Version,

                                                        DATEDIFF(DAY, m.thisDT, COALESCE(e.EventDateTime, m.nextDT)) AS DaysPassed,

                                                        m.Today

                     FROM                               Mtrx AS m

                     LEFT JOIN        @Events AS e ON e.StationID = m.StationID

                                                                          AND e.EventDateTime BETWEEN m.thisDT AND m.nextDT

                                                                          AND e.EventTypeID = 2

    )

     

    -- Show some statistics

    SELECT                             StationID,

                                       Version,

                                       MAX(DaysPassed) AS LongestStreak,

                                       MAX(CASE WHEN Today = 1 THEN DaysPassed ELSE 0 END) AS DaysSinceLastInstall

    FROM                               Streak

    GROUP BY         StationID,

                                       Version

    ORDER BY         StationID,

                                       Version


    N 56°04'39.16"
    E 12°55'05.25"

  • Thank you for your help.

    The EXISTS did not behave differently from IN, and I can't say I even understood the last answer, though it seems a much more thorough solution than my own.

    I altered the WITH solution slightly, and it seems to have worked (though if my results are correct I cannot be certain... they do look as if they could be though)

    This is my final solution, for after the change and before the change, respectively.

    WITH UpdatedStations (station_id, DaysSinceUpdate) AS (

    SELECT station_id, CAST(GETDATE() AS float) - CAST(MAX(event_date_time) AS float)

    FROM events

    WHERE event_type_id = 5

    GROUP BY station_id)

    select CAST(count(*) AS float) / SUM(DaysSinceUpdate)

    from UpdatedStations u

    join events e

    on u.station_id = e.station_id

    where e.event_type_id = 2 AND version=614;

    WITH NotUpdatedStations (station_id, DaysBeforeUpdate) AS (

    SELECT station_id, CAST(MAX(event_date_time) AS float) - oldestEvent

    FROM events,

    (

    SELECT CAST(MIN(event_date_time) AS float) AS oldestEvent FROM events

    )

    oldEvents

    WHERE event_type_id = 5

    GROUP BY station_id, oldestEvent)

    select CAST(count(*) AS float) / CAST(SUM(DaysBeforeUpdate) AS float)

    from NotUpdatedStations u

    join events e

    on u.station_id = e.station_id

    where e.event_type_id = 2 AND version=597

  • Ingen fara och lycka till.

    Är det något mera så "post sample data and expected output" så det blir lättare för oss att förstå vad du är ute efter.

     


    N 56°04'39.16"
    E 12°55'05.25"

  • I've just realised that it doesn't work at all, for several reasons. The one being least my fault is the fact that

    select count(t1.id) from table1 t1, table2 t2

    will count the rows of table1 multiplied by the rows of table2, instead of only the rows of table1... (which I noticed while trying to show the total amount of rows for each table, in the same query)

    There are far more things wrong with my statements, the dates are way off target, being limited by a WHERE clause that's irrelevant for it. (the MAX(event_date_time) should be for all events, not only the version changes)

    I was simply staring myself so blind at it that I didn't see the error when I wrote it.

    I'll try to post sample data, but since I have a generator doing it over 3.000.000 posts for me, I'm not sure it will make sense when I try to immitate the relevant bits.

    I haven't been able to find any help section about how to get indentation and color codes straight here, so I apologize for how messy my samples are:

    station_id - event_type_id - event_date_time - version

    110010 - 2 - 2007/07/09 - 597

    110014 - 5 - 2007/07/09 - N/A

    110016 - 5 - 2007/07/09 - N/A

    110010 - 2 - 2007/07/08 - 597

    110014 - 2 - 2007/07/08 - 597

    110015 - 2 - 2007/07/08 - 614

    110016 - 2 - 2007/07/08 - 597

    110017 - 2 - 2007/07/08 - 615

    110018 - 2 - 2007/07/08 - 614

    110010 - 2 - 2007/07/05 - 597

    110014 - 2 - 2007/07/05 - 597

    110015 - 5 - 2007/07/05 - N/A

    110016 - 2 - 2007/07/05 - 597

    110017 - 5 - 2007/07/05 - N/A

    110018 - 5 - 2007/07/05 - N/A

    (the version field isn't N/A for the version updates, but it looked a bit OT in the summary, because it is not to be used for those particular event types)

    The result should show how many times per day a certain version is expected to crash (event_type_id=2)

    The version changes over time, therefor you need the total amount of crashes for a specific version, and then sum up the amount of days-per-station a version is live. The days are summed up over all the stations because it's easier than to divide the crashes and the days over the changing amount of stations that run a certain version.

    Say we have two stations, 1 and 2. After half the time, station 1 changes its version. Then we take the full time + half the time, and the total amount of crashes for both stations.

    Crashes / time, gives us the average, and we can add to both sides of the division to get it over several stations.

    The data generator that fills the database is supposed to generate data that simulates there being only 50% as many crashes (not 50% more, but half as many) for the new version as for the old.

    However, this generator had a speed fix to only update about 85% of the stations. This fix did not include fixing the crash possibility for the stations that were no longer updated. Therefor the statistics will currently only be "correct" if you only calculate them based on the stations that have been updated.

    The "starting date" to calculate time from, for the old versions, is simply the first registered date that exists in the database (in my example that would be 2007/07/05)

    The "ending date", for the new version, can be taken either to be the current date or the last one registered in the database - it doesn't really matter in this case.

    I am casting the dates to float to include differences in hh:mm:ss as well as in days (I didn't write them here though, it's not that relevant for the example)

    Using my example, the results should show that the new one has had (approximately) 1 crash over 2 days (or 3 crashes over 6 days, when summed up - it gives the same result)

    The old one however has had 2 crashes over 2 days

    The ones with station_id=110010 has never had an event_type_id=5 event, and so it should be ignored in the query.

    The query should also, preferably, be able to do everything directly and answer in one resultset, because I need to send the query and accept the results in another programming language.

    I realise that might be unrealistic, but the option would be to stray off the ordered path and do this in the other programming language (which would be far easier for me, but isn't the goal of this task)

    I hope this made some sort of sense, because I don't feel overly coherent at the moment.

  • With the test data posted above, what is the expected result?

     


    N 56°04'39.16"
    E 12°55'05.25"

  • Oh, sorry if I didn't make that clear enough in my post

    The expected result is

    0.3 or similar for entering version=614 in the query and

    approximately 0.6 for entering version=597 in the query

    I haven't quite calculated this perfectly, but for version 597 there should be approximately twice the amount of crashes as for 614. (The generator says that for each random crash, if it's a 614 it has a 50% chance of not having occurred... like getting to roll the dice a second time.)

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply