DataFlow: Data comparisons

  • Hi,

    I have a package that is validating some data and I'm not sure how best to go about it.

    At the moment I have a dataflow with some script tasks doing the bulk of validations and now I have an additional requirement.

    The following is a sample of the data coming in from the source in the dataflow; there are other columns in the actual data, but they aren't relevant to this requirement.

    PKEY  DateKey   Activity  Duration
    1     20100927  Item1     3.00
    2     20100927  LEAVE     3.00
    3     20100927  Item2     2.00
    4     20100928  Item1     2.00
    5     20100928  Item3     0.00
    6     20100928  LEAVE     7.60
    7     20100929  Work      7.00
    8     20100930  LEAVE     8.90
    9     20101001  Work      8.00
    10    20101101  Leave     8.00
    11    20101101  Work      2.00
    12    20101102  Leave     7.60
    13    20101103  etcetc    7.00
    14    20101104  etcetc    1.00

    As you can see, the data is for time entries.

    The rules are:

    1. A leave day is considered to be 7.6 hours

    2. A leave entry does not span multiple days.

    3. Therefore, if someone is taking leave they cannot use up more than 7.6 hours in a day.

    4. Someone may have made a booking to take leave but ended up working all or part of the day and mistakenly recorded both.

    5. The business ruling is that a leave entry submitted where [leave] + [hours worked] is greater than 7.6 should be recorded as (7.6 - [hours worked]).

    6. If hours worked is greater than 7.6 then 0 leave is recorded.

    This would mean that in the test data:

    row 2 Duration would become 2.6

    row 6 Duration would become 5.6

    row 8 Duration would become 7.6

    row 10 Duration would become 7.6
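
    To make the rules concrete, here is a minimal T-SQL sketch of rules 5 and 6 applied to the sample data above (the table variable, column types and names are illustrative only, not the real schema; in the package this logic has to happen in the dataflow rather than in SQL):

    -- Minimal sketch: load the sample rows, then apply rules 5 and 6 per day.
    -- Table and column definitions are illustrative assumptions, not the real schema.
    DECLARE @TimeEntry TABLE (PKEY int, DateKey int, Activity varchar(50), Duration decimal(5,2));

    INSERT INTO @TimeEntry VALUES
        (1,  20100927, 'Item1',  3.00), (2,  20100927, 'LEAVE',  3.00),
        (3,  20100927, 'Item2',  2.00), (4,  20100928, 'Item1',  2.00),
        (5,  20100928, 'Item3',  0.00), (6,  20100928, 'LEAVE',  7.60),
        (7,  20100929, 'Work',   7.00), (8,  20100930, 'LEAVE',  8.90),
        (9,  20101001, 'Work',   8.00), (10, 20101101, 'Leave',  8.00),
        (11, 20101101, 'Work',   2.00), (12, 20101102, 'Leave',  7.60),
        (13, 20101103, 'etcetc', 7.00), (14, 20101104, 'etcetc', 1.00);

    -- Rule 6: if non-leave hours for the day are >= 7.6, leave becomes 0.
    -- Rule 5: otherwise, if leave + worked > 7.6, leave becomes 7.6 - worked.
    UPDATE lv
    SET    lv.Duration = CASE WHEN wk.Worked >= 7.6 THEN 0
                              ELSE 7.6 - wk.Worked END
    FROM   @TimeEntry AS lv
    CROSS APPLY (SELECT ISNULL(SUM(t.Duration), 0) AS Worked   -- non-leave hours that day
                 FROM   @TimeEntry AS t
                 WHERE  t.DateKey = lv.DateKey
                 AND    UPPER(t.Activity) <> 'LEAVE') AS wk
    WHERE  UPPER(lv.Activity) = 'LEAVE'
    AND    lv.Duration + wk.Worked > 7.6;

    SELECT * FROM @TimeEntry ORDER BY PKEY;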

    In a test package I got this logic to work with the Pivot task, but that doesn't work in reality because the Activity values are not static, as the Pivot task requires.

    Any tips on how to go about this?

    Thanks,

    Sam

    You mention Activity isn't static; that's going to cause some headaches. Am I correct in assuming LEAVE is always static, but the rest of the Activity values are things like "Modified head gasket on car4" 2:00:00, "Chucked Doug, the boss, off cliff" 0:05:00?

    If that's so, there might be some workarounds, but the trick will be generating another column on the fly as a grouping/comparison mechanism to indicate leave/non-leave time.
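
    As a rough T-SQL sketch of that on-the-fly category (in the package it would be a Derived Column transform; dbo.TimeEntry and ActivityGroup are just illustrative names):

    -- Rough sketch: collapse every activity value into a leave/non-leave flag
    -- that is safe to group and pivot on, instead of the free-form Activity text.
    -- dbo.TimeEntry and ActivityGroup are illustrative names, not your real objects.
    SELECT PKEY, DateKey, Activity, Duration,
           CASE WHEN UPPER(Activity) = 'LEAVE'
                THEN 'LEAVE' ELSE 'OTHER' END AS ActivityGroup
    FROM   dbo.TimeEntry;

    Everything downstream then groups and compares on the flag rather than on the raw Activity text.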


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Yes, it's safe to assume Leave is static (one of 3 values) and also that activity is potentially anything.

    If I take the suggestion of adding another column to group on (say, using a derived column to add a 'leave' or 'other' value), then I can use an aggregation (or is that redundant?) and the pivot approach would work, because I could then specify the new columns?

    If so, I could then do the appropriate calculations, and I would just need a way to apply the results back to the original dataset. That may be difficult if there were two types of leave taken in a single day?
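
    Something like this is what I have in mind for the derived-flag-plus-aggregate step, as a hedged T-SQL sketch (dbo.TimeEntry, ActivityGroup, and the extra leave codes are made-up names, not our real ones); two leave types on the same day would simply roll up into one figure:

    -- Hedged sketch of the derived flag + aggregate step. dbo.TimeEntry,
    -- ActivityGroup, and the specific leave codes are illustrative assumptions.
    -- Two different leave types on one day just sum into a single LEAVE total.
    SELECT DateKey,
           CASE WHEN UPPER(Activity) IN ('LEAVE', 'ANNUAL LEAVE', 'SICK LEAVE')
                THEN 'LEAVE' ELSE 'OTHER' END AS ActivityGroup,
           SUM(Duration) AS TotalHours
    FROM   dbo.TimeEntry
    GROUP BY DateKey,
           CASE WHEN UPPER(Activity) IN ('LEAVE', 'ANNUAL LEAVE', 'SICK LEAVE')
                THEN 'LEAVE' ELSE 'OTHER' END;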

    Am I on the right track? I would have hoped there'd be an easier way.

    I'll play around with getting this to work some more today.

  • sam.dahl (12/1/2010)


    Yes, it's safe to assume Leave is static (one of 3 values) and also that activity is potentially anything.

    Definitely a new column then...

    If I take the suggestion of adding another column to group on (say, using a derived column to add a 'leave' or 'other' value), then I can use an aggregation (or is that redundant?) and the pivot approach would work, because I could then specify the new columns?

    Not redundant, really, but depends on how you read it. 🙂 I knew what you meant. Yes, pivot then would work nicely, allowing you to recognize employee/date combinations with an overflow. You would then use that information to link back into your data and adjust the necessary leave items. That'll be a second computation, because you'll need to prioritize which leave loses time first (you'll want to standardize that). The other option would be that you simply use that report as an indicator to a live body to go and adjust the time records.

    Am I on the right track? I would have hoped there'd be an easier way.

    You're on the right track. The problem with this is you're really missing a category function in this table to handle all the random bits. Can't group on something that's inconsistent.
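
    If a day can end up with more than one leave row, the "which leave loses time first" piece could look something like this rough T-SQL sketch, where the tie-break is simply PKEY order (that's the bit you'd standardize); dbo.TimeEntry is an illustrative name, and you'd join the result back to the source on PKEY to apply the adjustment:

    -- Rough sketch: work out what each leave row should become when the day has
    -- a 7.6-hour budget, trimming leave rows in PKEY order.
    -- dbo.TimeEntry is an illustrative name; assumes 'LEAVE' marks the leave rows.
    SELECT lv.PKEY, lv.DateKey, lv.Duration AS OldDuration,
           CASE
             WHEN 7.6 - wk.Worked - pl.PriorLeave <= 0 THEN 0          -- day already full
             WHEN lv.Duration > 7.6 - wk.Worked - pl.PriorLeave
                  THEN 7.6 - wk.Worked - pl.PriorLeave                 -- trim to what's left
             ELSE lv.Duration                                          -- fits as booked
           END AS NewDuration
    FROM   dbo.TimeEntry AS lv
    CROSS APPLY (SELECT ISNULL(SUM(w.Duration), 0) AS Worked           -- non-leave hours that day
                 FROM   dbo.TimeEntry AS w
                 WHERE  w.DateKey = lv.DateKey
                 AND    UPPER(w.Activity) <> 'LEAVE') AS wk
    CROSS APPLY (SELECT ISNULL(SUM(p.Duration), 0) AS PriorLeave       -- leave already claimed by earlier rows that day
                 FROM   dbo.TimeEntry AS p
                 WHERE  p.DateKey = lv.DateKey
                 AND    UPPER(p.Activity) = 'LEAVE'
                 AND    p.PKEY < lv.PKEY) AS pl
    WHERE  UPPER(lv.Activity) = 'LEAVE';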


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Craig Farrell (12/1/2010)


    The other option would be that you simply use that report as an indicator to a live body to go and adjust the time records.

    At present we have a warning notification that goes to a live body; this next step is to automate the data processing further.

    You're on the right track. The problem with this is you're really missing a category function in this table to handle all the random bits. Can't group on something that's inconsistent.

    Funny thing that, we/I originally had a richer data structure but the boss wanted to keep it simple...

    Thanks for your help Craig! I'll have a go at implementing something in the coming days when time allows.

    Much appreciated.

  • sam.dahl (12/1/2010)


    Funny thing that, we/I originally had a richer data structure but the boss wanted to keep it simple...

    There's such a thing as oversimplifying if you're going to do something actually useful with the data later... 😀

    Thanks for your help Craig! I'll have a go at implementing something in the coming days when time allows.

    Much appreciated.

    No problem, Sam, a pleasure to help. For future note, take a look at the first link in my signature. It'll help you describe DDL and sample data in a quickly consumable format for us to give you better-tested code. It didn't matter in this case, since the actual data wasn't really the issue; the issue was the Activity information being highly variable for the grouping.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA
