Time Table For simple ETL

  • Hi,

    I am trying to build a simple time table and pre-populate the rows for lets say 1 year. Basically, I will be extracting the data froma source based on the boundaries of this table. eg of this table as I can visualize:

    Date StartTime Endtime IsProcessed

    2014-06-17 2014-06-17 00:00:00 2014-06-17 01:00:00 0

    2014-06-17 2014-06-17 01:00:00 2014-06-17 02:00:00 0

    2014-06-17 2014-06-17 02:00:00 2014-06-17 03:00:00 0

    Lets call this TimeTable and we have our souce and destination table

    The ETL process will pick the start time and endtime, will pass these as parameters to the source and extract the data as follows:

    Insert into Destination

    select * from source where entrydate>=TimeTable.StartTime and entrydate<TimeTable.EndTime

    Once this data gets loaded, the IsProcessed field is set to 1 so that next time it won't be considered.

    So as you see, I am trying an incremental load process here.As I do not have much help from dev team as they are little busy so I am trying to give it a shot myself and am little puzzled at this. Please suggest.

    Thanks

    Chandan

  • Just my 2 cents.

    Wouldn't it be easier to just store the maximum entrydate you loaded in the last run.

    Say you ran the initial load and the max entrydate is 2014-05-17.

    Next time, you load all data where entrydate > 2014-05-17.

    (or do you want to chop an existing data load into multiple pieces?)

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Koen Verbeeck (6/17/2014)


    Just my 2 cents.

    Wouldn't it be easier to just store the maximum entrydate you loaded in the last run.

    Say you ran the initial load and the max entrydate is 2014-05-17.

    Next time, you load all data where entrydate > 2014-05-17.

    (or do you want to chop an existing data load into multiple pieces?)

    thanks for your inputs.With only date format, it will look simpler but then I might have to put more logic to fetch out hourly data or lets say I want to pull data for a particular hour belonging to a date few days back in history. As of now, i do have the flexibility of making this too simple and I can even afford to truncate the destination and load all over again but then I will miss out the opportunity to learn to do this in a right way.

    Thanks

    Chandan

  • OK, here some code using a tally table.

    DECLARE @YearStartDATETIME2 = '2014-01-01 00:00:00';

    DECLARE @YearEndDATETIME2 = '2015-01-01 00:00:01';

    WITH CTE_Tally AS

    (

    SELECT TOP 8785 n = ROW_NUMBER() OVER (ORDER BY o1.object_id)-- 8784: 24 hours for 366 days (including leap years).

    -- Added 1 extra because 1 hour will be substracted to start from 0.

    FROMmaster.SYS.objects o1

    CROSS JOINmaster.SYS.objects o2

    CROSS JOINmaster.SYS.objects o3

    )

    SELECT

    [Date]= CONVERT(DATE,DATEADD(HOUR,n-1,@YearStart))

    ,StartTime= DATEADD(HOUR,n-1,@YearStart)

    ,Endtime= DATEADD(HOUR,n,@YearStart)

    FROM CTE_Tally

    WHERE DATEADD(HOUR,n,@YearStart) < @YearEnd;

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Thanks for your great solution using recursion. Now, I am in a spot of bother due to my own question. If I pre-populate the time frames lets say 1 year in advance, even though indexes will be there and it will return 0 rows for the time frame which has not occured, but theoretically this does not look elegant.

    I then go back and populate this table at the time the job runs eg. if the job runs at 17 June 2014 2:30 pm, let it populate the start time as 1 and end time as 2 pm and get the data from the source between this hour but then what if my server crashes for 2 days? when it resumes, it will pick up the timestamp when it starts and will have no way to get my source data which has been running for these 2 days also.

    sorry for so much of confusion.

    thanks

    chandan

  • chandan_jha18 (6/17/2014)


    Thanks for your great solution using recursion. Now, I am in a spot of bother due to my own question. If I pre-populate the time frames lets say 1 year in advance, even though indexes will be there and it will return 0 rows for the time frame which has not occured, but theoretically this does not look elegant.

    I then go back and populate this table at the time the job runs eg. if the job runs at 17 June 2014 2:30 pm, let it populate the start time as 1 and end time as 2 pm and get the data from the source between this hour but then what if my server crashes for 2 days? when it resumes, it will pick up the timestamp when it starts and will have no way to get my source data which has been running for these 2 days also.

    Technically it is not recursion, it is a set-based operation (even faster than recursion in TSQL).

    Anyway, regarding the pre-population. It kinds of depends on how you are trying to control your load.

    Personally I would pre-populate a lot and when the job runs it checks what has not yet been loaded.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply