flag records that occur x days after the original

  • hi,

    i have a table of data such as..

    Id Customer Product Date

    ----------- ----------- ---------- -----------------------

    1 1 Book 2011-01-01 00:00:00.000

    2 1 Book 2011-04-01 00:00:00.000

    3 1 Book 2011-05-01 00:00:00.000

    4 1 Book 2011-07-01 00:00:00.000

    5 1 Book 2011-09-01 00:00:00.000

    all (hah!) that i need to do is flag all records that occur after x days of the 'original', the original is defined as the first instance of a customer and product combination..

    say for example x was 45 days the records 1, 2, 4, 5 would be flagged because 1 is the first knowledge of this combination, and the others appear 45 days after the last one was flagged

    similarly if x was 120 days then only records 1, and 4 would be flagged, again due to 1 being the first knowledge of this combination, and 4 being the only record that is over 120 days away from the previously flagged record...

    i hope that makes sense, i am thinking i need to be doing something with recursion / CTEs but it's sending me around the bend

    please ask if any further clarification is required!

    cheers guys!! 🙂

  • That's certainly doing but I don't understand why you need to flag those. I can't help to think that there's more to it than what you're letting us know.

    Maybe we can help rethink the whole process rather than just this little part.

  • Ninja's_RGR'us (5/13/2011)


    That's certainly doing ...

    ??

    Ninja's_RGR'us (5/13/2011)


    I can't help to think that there's more to it than what you're letting us know.

    Maybe we can help rethink the whole process rather than just this little part.

    fair comment, you are entirely correct that there is more to this 🙂

    as a caveat yes, i know it's a bit of a strange requirement, but it's a metric that the business i am working with require on their sales records.. i.e. if the same customer enquires about the same product multiple times within the specified period x, they only want to count it once.

    As you might (might not) have worked out, this is a sub set of a Fact table, and this bit flag is to say "yes, count these records" or "no, ignore these records" it's just the logic to determine which records need to be flagged for counting which is kinda strange (perhaps not as strange as the business wanting to measure enquiries in this manner, but nm)

    i hope this put things into perspective a little?

    thanks

  • SOrry for the incomplete phrase... but you got the idea anyways.

    How much data in this table (rows and gb in the table)?

    Can we have the table definition including keys and indexes?

    I preffer to give a fully tested script rather than weird pseudo code for this one.

  • Ninja's_RGR'us (5/13/2011)


    SOrry for the incomplete phrase... but you got the idea anyways.

    How much data in this table (rows and gb in the table)?

    Can we have the table definition including keys and indexes?

    I preffer to give a fully tested script rather than weird pseudo code for this one.

    the fields above are the only relevant ones, even if they have been given new names to protect the innocent 😉

    there is only a clustered index on the id column and contains around 1bn records in the table currently, but that is only around 1/8th of the required data. there is scope to create indexes so long as they are dropped again straight after completion, the reason we don't have them is due to the size, access speed it not a concern as only the latest data is processed into the cube

    this script is intended to be an 'oh no we messed up a data load, lets fix this flag' this is the only thing that relies on a continuous stream of data (i.e. daily loads) in order..

    as it stands i have a stored procedure that will do a single day, and that i am currently running.. ( should be finished now actually) executes it in a loop. it's been running for 2 days, but it works, just after something a bit more elegant.

    ty.

  • Ok so you only need to process the new rows added today, but in a table with over 1 Billion rows? How many rows do you need to process daily?

    What's the default # of days range you need to work with?

  • Ninja's_RGR'us (5/13/2011)


    Ok so you only need to process the new rows added today, but in a table with over 1 Billion rows? How many rows do you need to process daily?

    What's the default # of days range you need to work with?

    Hi, no that is not correct, i already process the rows that are added today, each day is about 10million.. the issue i am having here is that in a table that already contains multiple days data i need to, for want of a better phrase, regenerate the bit column that identifies that row/record 😉 as the 'first' within that 10, 20, 30, 40 whatever day period..

    for arguments sake, we can say that the range is 30 days, that fact is arbitrary at this point imo.

    CELKO (5/13/2011)


    Please post real DDL. Learn to use ISO-11179 rules for the data element names, avoid needless dialect and use ISO-8601 temporal formats, codes and so forth. People cannot read your mind, so post your code and clear specs if you really want help.

    CREATE TABLE Sales

    (sales_nbr INTEGER NOT NULL PRIMARY KEY,

    customer_nbr CHAR(10) NOT NULL,

    upc CHAR(13) NOT NULL,

    sales_date DATE DEFAULT CURRENT_TIMESTAMP NOT NULL);

    >> I need to do is flag all records [sic: rows are not records [sic] that occur after x days of the 'original', the original is defined as the first instance of a customer and product combination. <<

    NO! NO! Flags are assembly language and not SQL. And until you learn what a row is, you will never write good SQL. We put this stuff in VIEWs. Let's find the first occurrences:

    CREATE VIEW Initial_Sales

    AS

    SELECT customer_nbr, upc, MIN (sales_date) AS initial_sale_date

    FROM Sales

    GROUP BY customer_nbr, upc;

    Now use the DATEADD in other VIEWs. This could also be a CTE

    CREATE VIEW Initial_45_Sales

    AS

    SELECT customer_nbr, upc, initial_sale_date,

    DATEADD (DAY, 45, initial_sale_date) AS boundary_sales_date

    FROM Initial_Sales;

    SELECT S.*

    FROM Sales AS S, Initial_45_Sales AS S45

    WHERE S.sale_date NOT BETWEEN S45.initial_sale_date AND S45.boundary_sales_date

    AND S.customer_nbr = S45.customer_nbr

    AND S.upc = S45.upc;

    hi, i was trying to be as specific as possible to my problem, but i can understand people wanting to know all the surrounding nuggets of information, but imo it really isn't necessary 🙂

    Now you have posted that code, i can clearly see you have approached it from another angle to me. i was self joining the table in order to try and work out if the diff between them was greater than a defined range and falling over a bit, i like your approach, it should help me move forward.

    a couple of points on the side though..

    * if it's not a flag what should it be? myself and the company i work for and practically every other Microsoft developer i've ever had the (mis)fortune to meet refers to a bit column as a "bit flag", ok.. it might be slang, but so be it :hehe:

    * the temporal format i have posted in the OP is the default output of SQL Server 2008 R2, I'm not sure what additional notation ISO-8601 adds, perhaps you could elaborate?

    * lastly, this is a bit of a personal one i guess.. but if i started work on a db and saw abbreviated field names such as cust_nbr etc in a database, i'm going to hunt down and stab the developer ;-). Micorosft best practices are to use camel case full names are they not? Perhaps i am misinformed on that one.. but i was always taught CustomerNumber is the correct choice.

    Anywho, thanks for your help! :w00t:

  • um.. something i just realised, guess i must have overlooked this point when i replied to the thread at 2.30am 😀

    after the first 45 days (which start date differs for each unique customer and product combination) the marking of the relevant record as the 'one to count' needs to restart.. see OP for examples of this

    it's actually this part that i was/am having most trouble with, i speculate that it could be done in a recursive CTE but am struggling to visualise it, let alone write it.

    i think i will likely have another stab this evening, and post up some code

    cheers guys

  • hi all, just thought i should follow this up with some detail in case anyone else needs to do something similar 🙂

    big thanks to the guys that tried to help, especially the super patronising CELKO.. much appreciated :rolleyes:

    so anyway, what i have done in the end is to use a SSIS script transformation to compare row by row the previous row to the next one..

    select distinct

    Id

    ,Customer

    ,Product

    ,Date

    ,dateadd(dd,90,Date) ExpiryDate

    from myTable

    order by Customer, Product, Id

    I then simply use a script task in SSIS to 'throw away' the next record if its [Created]Date is less than, or equal to the previous records ExpiryDate. Works a treat.

    please post if you want more detail 😀

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply