Aggregation

  • I'm confused about how to aggregate a value in SSIS. Let's say I have a task that simply exports the employees from the DB to a flat file. I also have to keep track of MAX(EmployeeId) so that the next time I do this, I start with all the new employees that have been added since the last time.

    How/where do I handle that aggregate? I just want a way to get that value in a variable. My first instinct was to throw an aggregate on the data flow, but I don't think that's right. That will get executed once for every record in the source. That's not what I want. I could also just drop a SQL Task on the control flow and handle it that way, but technically speaking, that's an entirely different query. It's not actually based on the Employees in the data flow.

    Maybe a better way to ask the question is: how do I get a MAX value from a data source into a user variable?

  • Use an Execute SQL Task, as you suggest. You can get the result of a single-valued T-SQL query into a variable (say, MaxID) directly from the Execute SQL Task.

    Then use that returned value in your source query for the data flow (pseudo-code: SELECT f1, f2, ... FROM EMPLOYEE WHERE ID <= User::MaxID). That way you know that things are consistent ...

    ...but, getting to the crux of the matter, how does knowing the MaxID now (for the current extract) help you when the job next runs?
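
    A minimal T-SQL sketch of the two steps described above, assuming a hypothetical Employee table keyed on EmployeeId (the temp table and sample rows are invented for illustration). In a package, the first query would typically run in the Execute SQL Task with its single-row result mapped to User::MaxID, and the second would be the data flow source query with that variable supplied as a parameter; here a local variable stands in for both.

    ```sql
    -- Hypothetical stand-in for the real source table.
    CREATE TABLE #Employee (
        EmployeeId int PRIMARY KEY,
        LastName   varchar(50) NOT NULL
    );
    INSERT INTO #Employee (EmployeeId, LastName)
    VALUES (1, 'Adams'), (2, 'Baker'), (3, 'Clark');

    -- Step 1 (the Execute SQL Task query): capture the current maximum key.
    DECLARE @MaxID int;
    SELECT @MaxID = MAX(EmployeeId) FROM #Employee;

    -- Step 2 (the data flow source query): extract only rows up to that key,
    -- so rows inserted while the extract is running are left for the next run.
    SELECT EmployeeId, LastName
    FROM #Employee
    WHERE EmployeeId <= @MaxID;

    DROP TABLE #Employee;
    ```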

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Thanks Phil, this is really not about employees at all. I just thought that might make my example a little easier to explain. This is really about exporting data to a vendor once per month. Each month, I have to make sure I don't send anything that was sent the previous month, so I keep track of the MAX record id every month and use that as a starting point the next month.
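
    The month-to-month bookkeeping described here could be sketched roughly as follows. The table names (#SourceRecord, #ExportWatermark), the column names and the 'VendorExport' label are all hypothetical; the only idea taken from the post is storing the last MAX record id and using it as the starting point for the next export.

    ```sql
    -- Hypothetical source table and a one-row-per-export watermark table.
    CREATE TABLE #SourceRecord (
        RecordId int PRIMARY KEY,
        Payload  varchar(50) NOT NULL
    );
    CREATE TABLE #ExportWatermark (
        ExportName varchar(50) PRIMARY KEY,
        LastSentId int NOT NULL
    );

    INSERT INTO #SourceRecord (RecordId, Payload)
    VALUES (1, 'sent last month'), (2, 'sent last month'), (3, 'new'), (4, 'new');
    INSERT INTO #ExportWatermark (ExportName, LastSentId)
    VALUES ('VendorExport', 2);

    DECLARE @LastSentId int, @MaxId int;

    -- Where the previous run stopped.
    SELECT @LastSentId = LastSentId
    FROM #ExportWatermark
    WHERE ExportName = 'VendorExport';

    -- Upper bound for this run; falls back to the old mark if the source is empty.
    SELECT @MaxId = ISNULL(MAX(RecordId), @LastSentId)
    FROM #SourceRecord;

    -- Rows to send this month (with the sample data: RecordId 3 and 4).
    SELECT RecordId, Payload
    FROM #SourceRecord
    WHERE RecordId > @LastSentId
      AND RecordId <= @MaxId;

    -- Once the export has succeeded, advance the watermark for next month.
    UPDATE #ExportWatermark
    SET LastSentId = @MaxId
    WHERE ExportName = 'VendorExport';

    DROP TABLE #SourceRecord;
    DROP TABLE #ExportWatermark;
    ```

    Fixing the upper bound before the extract starts means that rows arriving mid-export are neither sent without being recorded nor recorded without being sent.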

    I'll give your example a try. So anything I put in the data flow gets executed once for each record. Do I understand that correctly?

  • Correct.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • BSavoie (1/23/2010)


    Thanks Phil, this is really not about employees at all. I just thought that might make my example a little easier to explain. This is really about exporting data to a vendor once per month. Each month, I have to make sure I don't send anything that was sent the previous month, so I keep track of the MAX record id every month and use that as a starting point the next month.

    I'll give your example a try. So anything I put in the data flow gets executed once for each record. Do I understand that correctly?

    I think you need to consider this a bit more. In your example, what happens if you send the data for Employee A last month and that employee is terminated this month? Since the record has already been sent, you won't be sending it again.

    In other words, what about changes in the source system that need to be propagated to the downstream systems? How are you going to identify those?

    Find the column that identifies the last updated date for that entity (or creation date). Then use that date to filter for any records that have been modified since the last time you extracted the data.
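
    As a rough illustration of that suggestion, here is a sketch that assumes a hypothetical ModifiedDate column which the application keeps up to date on every insert and update. The table, the sample rows and the two boundary dates are invented; in practice the upper bound would more likely be the current time (for example SYSDATETIME()) and the lower bound would be read from wherever the previous extract time is stored.

    ```sql
    -- Hypothetical source table with a last-modified timestamp.
    CREATE TABLE #Employee (
        EmployeeId   int PRIMARY KEY,
        Status       varchar(20) NOT NULL,
        ModifiedDate datetime2   NOT NULL
    );
    INSERT INTO #Employee (EmployeeId, Status, ModifiedDate)
    VALUES (1, 'Terminated', '2010-01-20'),  -- changed since the last extract
           (2, 'Active',     '2009-11-05');  -- unchanged since the last extract

    -- Boundaries of this extract window (fixed values for illustration only).
    DECLARE @LastExtract datetime2 = '2009-12-31';
    DECLARE @ThisExtract datetime2 = '2010-01-31';

    -- Picks up new rows and rows updated since the previous extract
    -- (with the sample data: EmployeeId 1 only).
    SELECT EmployeeId, Status, ModifiedDate
    FROM #Employee
    WHERE ModifiedDate > @LastExtract
      AND ModifiedDate <= @ThisExtract;

    DROP TABLE #Employee;
    ```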

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll


  • Thanks, Jeffrey. I was just using Employees as an example to make things easier. What I'm really sending is all the guest reservations for stays at a hotel over the last month, so it's not really a volatile entity like an employee. That was probably a bad example. Thanks for the feedback!

  • If it's as simple as a date range, why not use the following query? Or, if you're looking to calculate reservations over time, group by month:

    SELECT *
    FROM Reservations
    WHERE ReservationDate BETWEEN @StartDate AND @EndDate

    Or, for a rolling 12-month trend:
    -- Count reservations per month over the last 12 months
    SELECT YEAR(ReservationDate) AS ResYear, MONTH(ReservationDate) AS ResMonth, COUNT(*) AS ResCount
    FROM Reservations
    WHERE ReservationDate > DATEADD(year, -1, GETDATE())
    GROUP BY YEAR(ReservationDate), MONTH(ReservationDate)
    ORDER BY YEAR(ReservationDate), MONTH(ReservationDate)

  • Tim Curtin - Thursday, February 22, 2018 6:28 AM

    If it's as simple as a date range, why not use the following query? Or, if you're looking to calculate reservations over time, group by month:

    SELECT *
    FROM Reservations
    WHERE ReservationDate BETWEEN @StartDate AND @EndDate

    Or, for a rolling 12-month trend:
    -- Count reservations per month over the last 12 months
    SELECT YEAR(ReservationDate) AS ResYear, MONTH(ReservationDate) AS ResMonth, COUNT(*) AS ResCount
    FROM Reservations
    WHERE ReservationDate > DATEADD(year, -1, GETDATE())
    GROUP BY YEAR(ReservationDate), MONTH(ReservationDate)
    ORDER BY YEAR(ReservationDate), MONTH(ReservationDate)

    Note that you are responding to a thread which is 8 years old 🙂

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin
