For a "typical" datawarehouse, does it make sense to add databricks to the ETL

  • Hi All,

    First off, I hate that I put typical in there.  But it does kinda fit.   We have what you might call a typical DW.  Lots of operational data, most data gets loaded overnight, but we have more than a few individual feeds showing up during the day.  Think customer sales, point of sale traffic, loyalty program data, etc.  Data arrives via SAP bods, text/csv files, some realtime(ish) api's, etc.  The total DW is fairly large, 14TB, but I wouldn't consider it big data.  It's hosted in an azure vm AG, with a readable secondary and a failover node on the opposite coast.  Speed of some of the processing could be faster of course, but anything too slow can usually be optimized to make it acceptable.

    Now for the real question.  We currently use SSIS for almost all etl(maybe 100 to 130 active packages).  They are all very simple.  Load text file with minimal transformations, run a stored proc, lather, rinse, repeat.  Reporting is via Power BI and webi at the moment, but webi is going away.

    I'm looking for use cases from the community where people have started integrating databricks / datalakes into the ETL framework.  Are you storing fact data in the datalake that never makes it to the warehouse?  Are you using databricks for processing where you need to assign keys from dimensional tables that are in the warehouse.  Did you switch to using ADF from SSIS.  How did that go?

    The current environment I'm in is pushing to move to datalake/databricks, but I actually don't know what problem we are trying to solve with the tech.  As a side note, cost probably should be in the mix, but it's not.

    Thanks all.

  • You said that the SSIS packages are all "very simple" and your followup after that seems like everything "works fine, fails safe, and drains to the bilge", to put it in favorable sailor terms.

    That being said, let's not talk about any kind of technology, yet.  Instead, let's ask my favorite Brent Ozar question... What problem are you trying to solve?

    From your description, there doesn't seem to be a problem that needs solving except to scratch someone's "latest and greatest" itch.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Exactly Jeff.  But you didn't read down far enough 🙂

    The current environment I'm in is pushing to move to datalake/databricks, but I actually don't know what problem we are trying to solve with the tech.  As a side note, cost probably should be in the mix, but it's not.

    What I'm hoping for is someone to say "Oh yeah, SSIS is dead technology.  We switched to all databricks/deltalake/etc.  and are way happier and everything runs in a fraction of the time. "  Maybe not exactly that, but at least a nod to newer, better, faster.

    From your description, there doesn't seem to be a problem that needs solving except to scratch someone's "latest and greatest" itch.

    And yes, this makes me nervous.

    Thanks for the comment.

  • To add to what Jeff (and you) said, I wouldn't throw technology at a problem unless you know that it is the solution. And if you don't know what the problem is, how will you know if you picked the right solution?

    Step 1 here in my opinion would be to investigate and research your options such as data bricks. A nice comparison of SSIS, ADF, and Data bricks can be found here:

    https://www.mssqltips.com/sqlservertip/6438/azure-data-factory-vs-ssis-vs-azure-databricks/

    BUT I would still recommend doing your own research to determine the correct tool for the "problem". Next, I would meet with the team that is pushing for data bricks and get them to explain why the current process isn't working and how data bricks will solve it.

    The reason I recommend researching yourself first is so that you can have a bit of an understanding of what the tool does and how it is different than what you currently have. Once you understand how it is different, it will be easier to discuss with the team pushing for it. Imagine someone coming along and telling you to switch to Oracle for your database because it is better. Their main argument "it is more expensive, so it must be a better tool". Then you come back and say "have you used SQL developer? Have you used SSMS? which is the better tool?".

    Another thing you could do is take one of your more simple SSIS loads and re-do it in data bricks. Is it better or worse and why? Makes for a good presentation to the end users as to if you should switch or not.

    THEN you get into the "who will support this?" debate too. You and your team know SSIS, learning data bricks is something I'm sure you can do, but is it worth your time and effort?

    Now as an overall opinion - I am a big fan of the "if it ain't broke, don't fix it" mentality. If SSIS is working and not causing any issues, why "make work"? Now, that being said, I am not saying don't learn data bricks if that is of interest to you and/or your company, I am just saying I wouldn't spend time converting a completely working process with several (I hope) SMEs (subject matter experts) in house to a technology nobody on your team knows how to use or support.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • daytonbrown2 wrote:

    Exactly Jeff.  But you didn't read down far enough 🙂

    The current environment I'm in is pushing to move to datalake/databricks, but I actually don't know what problem we are trying to solve with the tech.  As a side note, cost probably should be in the mix, but it's not.

    What I'm hoping for is someone to say "Oh yeah, SSIS is dead technology.  We switched to all databricks/deltalake/etc.  and are way happier and everything runs in a fraction of the time. "  Maybe not exactly that, but at least a nod to newer, better, faster.

    From your description, there doesn't seem to be a problem that needs solving except to scratch someone's "latest and greatest" itch.

    And yes, this makes me nervous.

    Thanks for the comment.

    Ah... my apologies...  you're correct.  I didn't read far enough down.  Seems like we agree, there. 😀

    To be honest, I'm not a big fan of data warehouses or the ETL to support them.  While I admit that my demo was "only" on a single table, I recently gave my local user group a demo of some fairly intense reporting reporting requirements (some crazy forms of periodic aggregations) on a 100 million row table.  It takes only takes about 63 milli-seconds to execute the report and the data source is always automatically up-to-date with no ETL involved.

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply