T-SQL quandary

  • Hello,
    I've got a SQL issue that set-based languages don't handle very well.
     
    I have a table with a PK and X numerical fields. For each record I want to get the maximum value in these X fields.
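
    For concreteness, imagine something like this hypothetical shape (all names made up):

    CREATE TABLE dbo.YourTable (
        pk INT PRIMARY KEY,
        c1 INT, c2 INT, c3 INT, c4 INT  -- the "X" numerical fields
    );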
     
    Any ideas about how to accomplish this efficiently? Or, more accurately, does anyone know what the least inefficient way of doing it is?
    I don't want to have to UNION the table with itself X times and then do a MAX aggregation cos there's a lot of data.
     
    cheers
    Jamie
  • >>I've got a SQL issue that set-based languages don't handle very well

    Set-based languages handle it really well if the database design is correctly normalized and doesn't have repeating elements in the same record. It's not a SQL issue, it's a design issue.

    You could write your own UDF to take 2 input parameters and return the max, then call the function multiple times:

    SELECT dbo.YourUDF(dbo.YourUDF(dbo.YourUDF(Column1, Column2), Column3), Column4) AS MaxValue

    Not sure that's going to be any more efficient than a UNION.
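
    A minimal sketch of what such a UDF might look like (the name and the INT types are assumptions):

    CREATE FUNCTION dbo.YourUDF (@a INT, @b INT)
    RETURNS INT
    AS
    BEGIN
        -- Return the larger of the two inputs
        RETURN CASE WHEN @a >= @b THEN @a ELSE @b END;
    END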

  • Ooo, cutting!

    Well, I didn't design it so I can pass the buck there. However, I'll stick up for the designer by saying that in this circumstance it is a valid problem. The table is for decision support, so it's a denormalised structure.

    I've gone with the UDF solution for now. I was just wondering if anyone has come across this before and what they did to solve it. And was their solution any quicker than using a UNION?

    Thanks for the reply.

     

  • 1. It should be UNION ALL (not UNION) - a sketch of this pattern follows below.

    2. I would do it with a combination of inline functions instead of a UDF if the set is large.

    3. SQL is optimized for ROW processing, not for column processing, and that's why this kind of design can get you into problems.
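
    For reference, the UNION ALL pattern in question might look like this (table and column names are hypothetical):

    SELECT pk, MAX(val) AS MaxValue
    FROM (SELECT pk, Column1 AS val FROM dbo.YourTable
          UNION ALL
          SELECT pk, Column2 FROM dbo.YourTable
          UNION ALL
          SELECT pk, Column3 FROM dbo.YourTable
          UNION ALL
          SELECT pk, Column4 FROM dbo.YourTable) AS unpivoted
    GROUP BY pk;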

    Cheers, 


    * Noel

  • Thanks for the comments guys. And yes, I was testing this using a UNION ALL, not a UNION.

    We have ~40000 rows so it's not too massive.

    I completely agree about SQL being for row processing... however, in this example the design is correct because the table is denormalised for reporting.

    After a bit of testing it turns out that a nested CASE statement is the quickest. The UNION ALL and scalar function methods took ~800ms, the nested CASE took ~140ms. Case closed.
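
    For anyone curious, the nested CASE was along these lines (column names are placeholders; this assumes non-NULL values):

    SELECT pk,
           CASE WHEN c1 >= c2
                THEN CASE WHEN c1 >= c3
                          THEN CASE WHEN c1 >= c4 THEN c1 ELSE c4 END
                          ELSE CASE WHEN c3 >= c4 THEN c3 ELSE c4 END
                     END
                ELSE CASE WHEN c2 >= c3
                          THEN CASE WHEN c2 >= c4 THEN c2 ELSE c4 END
                          ELSE CASE WHEN c3 >= c4 THEN c3 ELSE c4 END
                     END
           END AS MaxValue
    FROM dbo.YourTable;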

     

     

  • >>however in this example the design is correct because the table is denormalised for reporting.

    Maybe the design isn't correct, since the process that creates the denormalised data (presumably from a normalized source) doesn't create the required aggregates at the same time?

    If you're going to the coding effort of making a physical table design match a specific report, why not take it to its conclusion and persist the required aggregates at the optimal time, when the normalized data is being processed row-wise?

  • I think PW is on the right track. At INSERT/UPDATE time you could pass the value needed (max/min/arcosh..) from the client; then your query will be a simple SELECT.
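
    As a sketch, assuming a dedicated MaxValue column (all names hypothetical) - a computed column would achieve much the same thing:

    -- Keep the precomputed maximum up to date at write time
    UPDATE dbo.YourTable
    SET MaxValue = CASE WHEN c1 >= c2 AND c1 >= c3 AND c1 >= c4 THEN c1
                        WHEN c2 >= c3 AND c2 >= c4 THEN c2
                        WHEN c3 >= c4 THEN c3
                        ELSE c4
                   END;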

     


    * Noel

  • Hi,

    Thanks for the continued advice.

    We have 4 values that we have to calculate the max of. Think of them as 4 facts (for that is what they are).

    The 4 values come from disparate sources hence I am literally dumping the values into the fact table instead of building a mega-complicated SQL statement that UNIONs data from 4 sources and then aggregates that UNIONed data. In effect I am dumping the 4 values into the fact table as an interim step for simplification. I populate from source1 and then run UPDATEs against that data for the other 3 values.
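
    Roughly, the interim flow looks like this (all names made up for illustration):

    -- Populate from the first source...
    INSERT INTO dbo.fact_table (pk, c1)
    SELECT pk, value1 FROM dbo.Source1;

    -- ...then UPDATE in the remaining values
    UPDATE f
    SET c2 = s.value2
    FROM dbo.fact_table AS f
    JOIN dbo.Source2 AS s ON s.pk = f.pk;
    -- (similar UPDATEs for Source3 and Source4)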

    This process is instigated via a web front-end and returns a message to a process log (viewable via the front-end) after each step. This is a user requirement: they don't want to see just 2 messages, STARTED & FINISHED; they want to see notification of the interim steps.

    This is the reason that I need to calculate the MAX value from 4 existing values in the same row.

    Does this make sense? I'm not very good at explaining things.

     

  • Well, you'll get several different points of view on this, depending on which data warehousing "religious camp" you listen to, but this process sounds like a "staging" or "back office" type task. This is where you pull together data from disparate sources, integrate them, cleanse them etc., before loading the cleansed, integrated set of facts into a fact table.

    The "fact" table is the presentation portion, and should be the only part visible to your end user or reporting tool, neither of which should see a "fact" table as a "work in progress" subject to partial loads, then subsequent updates.

    Of course, if you're constrained by an existing design & processes, there's not much you can do except optimise within those constraints, and it looks like you've found the optimal solution.

     

  • Yeah, you'll have to believe me (cos I'm not going into detail) when I say that in this instance it's not possible. I used the term fact table liberally... in real terms this user-instigated process IS the back-end processing prior to us pushing to the DW. The table I am dealing with is, in actual fact, the staging_fact table. Of course, to the users they are directly pushing to the DW; to them, the front-end and the warehouse DB they get their reports from are one and the same, when of course physically this isn't actually true.

    I didn't want to say that at the outset so as to avoid howls of derision about why users are involved in our back-end ETL. This isn't your common-or-garden DW as you've no doubt realised.

    This was fun... I enjoy discussions like these about slightly-left-of-field situations and "out-there" user requirements. Thanks guys.
