Best approach in ADF to mimic stored procedure logic against parquet files

  • Current environment

    • We use stored procedures (in Production) to query our source application tables (hourly) and load fact and dimension tables for our reporting environment
    • Our reporting environment (Power BI tabular model) points to the fact and dimension tables

    Goal

    • Mimic the existing SQL logic (which currently resides in multiple stored procedures/views) to transform these parquet files into fact and dimension tables (or parquet files) for our reporting environment

      • There is a significant amount of data manipulation required to do this, and that logic currently sits inside our stored procedures

    Our skillsets

    • Traditional ETL development using stored procedures, structured SQL and SSIS
    • We do not use SSIS packages in our current environment so we do not have any packages to migrate to ADF

    Our Data Modernization platform progress to date

    • Using ADF, our application tables are extracted incrementally each hour into parquet files in Azure Data Lake (Gen2)
    • The parquet files with the incremental changes are merged and loaded into individual folders (tables) in Azure Data Lake (Gen2)

    Next steps that we need guidance on

    • How do we transform the merged parquet files (e.g., mimic the stored procedure logic we currently use in our SQL managed instance) in Azure Data Factory (ADF)?  Here are some options we are considering:

      1. Should we be using Notebooks in Databricks to replicate our stored procedure logic (based on our existing skillsets) and push to a Delta Lake?
      2. Should we look at using Data Flow activities in ADF?  This activity seems too limited for the type of transformations we perform on the data.
      3. What about using the U-SQL activity from Data Lake Analytics in ADF against the parquet files?  Will this give us the flexibility and performance we need?
      4. Should the final destination of the fact and dimension tables reside in parquet files or a SQL instance?


    What are your thoughts on the best approach?

    Thanks very much



  • Databricks notebooks could be a valid option. U-SQL is basically dead, so I wouldn't do any new development with it. You could also look into creating an Azure Synapse workspace and using the serverless SQL pool; every Synapse workspace comes with one, and you can choose whether or not to provision a dedicated SQL pool on top of it. See this link for more info. The serverless SQL pool handles parquet files pretty nicely.
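
    For example, a serverless SQL pool can query a parquet folder in the lake directly with OPENROWSET. This is just a minimal sketch; the storage account, container, and folder names below are placeholders rather than your actual paths:

        -- Serverless SQL pool (T-SQL): ad hoc query over a merged parquet folder.
        -- <storageaccount>, <container>, and the folder name are placeholders.
        SELECT TOP 100 *
        FROM OPENROWSET(
                 BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/silver/Orders/*.parquet',
                 FORMAT = 'PARQUET'
             ) AS orders;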

  • Thank you for your comments, Meagan, and for the great link.  It's good to know that we shouldn't invest any time into U-SQL.

    Things I like about Synapse are the following points (from your link):

    • A full data warehousing solution allowing for a full relational data model, stored procedures, etc.
    • Provides all the SQL features any BI developer is used to, including a full standard T-SQL experience
    • Brings together the best SQL technologies, including columnar indexing
    • You can use Power BI directly from Synapse Studio

    The Azure Synapse approach sounds like a possibility; however, our datasets are very small (less than 10 GB, something I should have clarified in my post). Does it still make sense, then, to use Synapse?

    Thanks

    Phil

  • I think the choice of data lake parquet files vs SQL database should be based upon your long-term plans for usage of the data as well as the skillset of the teams that will use and maintain it. A data lakehouse approach is valid, and that can be done with Synapse or Databricks. But if you have lots of end users who will access the dimensional model, consider their comfort level and the types of queries they run. If it's just there to populate Power BI models or to send data to other systems, that makes things simpler. The query engines in Databricks and Synapse are getting pretty good, but sometimes they are not quite as snappy as a well built SQL database. So if you need crazy fast performance, and you don't expect your data to grow exponentially in the foreseeable future, you might lean toward a database. Moving the data one more time adds complexity simply because you have added another technology to the solution. That might be worth it if that system provides extra value (like better performance for your workload or familiarity for end users, etc.).

    The marketing side of Azure Synapse pushes the big data thing pretty hard. A dedicated SQL pool would be overkill for your data as it is an MPP engine. But the serverless SQL pool might work OK. It probably would not be as snappy as if you loaded the data into a SQL database. It's similar to how SSIS is pretty snappy for moving and transforming small data, whereas ADF is often slower (but can handle a lot more data and more parallelism, and isn't so tightly coupled to data schema). I'm not sure if that will matter to you. I think the Synapse serverless SQL pool is still an option to consider, but I would try some of your queries and Power BI model refreshes to make sure the query times are acceptable. Do a quick POC and make sure it has the features you need.
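
    As part of that POC, one lightweight pattern is to define views in the serverless SQL pool over the silver parquet folders and point a Power BI model at the workspace's serverless SQL endpoint. A rough sketch, assuming the database already exists and the workspace identity can read the storage account (database, path, and column names are illustrative only):

        -- Serverless SQL pool (T-SQL): a view that Power BI can query like any table.
        -- Database, path, and column names are hypothetical placeholders.
        CREATE VIEW dbo.DimCustomer
        AS
        SELECT CustomerKey, CustomerName, Region
        FROM OPENROWSET(
                 BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/silver/Customer/*.parquet',
                 FORMAT = 'PARQUET'
             ) AS c;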

    From a long-term product perspective, we can see Microsoft putting a big focus on Synapse and Power BI and making those two tools work well together. But you'll have to decide if all the features you need/want are there today with the serverless SQL pool.

  • Thanks for your reply, Meagan.  I just wanted to update you on the approach we have decided to go with. Here is a summary:

    • Bronze (raw data): Using pipelines in Azure Data Factory, our application tables are extracted incrementally each hour into parquet files in Azure Data Lake (Gen2)
    • Silver (enrichment): Using pipelines in Azure Data Factory, the parquet files with the incremental changes are merged and loaded into individual folders (tables) in Azure Data Lake (Gen2)
    • Gold (data warehouse): Using notebooks in Azure Databricks, we will perform the transformations (mimicking the stored procedure logic from our traditional SQL instance) and store the dimension and fact tables in Delta Lake (Lakehouse architecture); a rough sketch of this step follows below
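
    To illustrate the gold step, here is a minimal Spark SQL sketch of the kind of notebook cell we have in mind. The paths, table, and column names are placeholders rather than our real schema, and the actual logic will be considerably more involved:

        -- Databricks notebook cell (Spark SQL); paths and columns are placeholders.
        -- Expose the merged silver parquet folder to SQL.
        CREATE OR REPLACE TEMPORARY VIEW silver_orders
        USING PARQUET
        OPTIONS (path 'abfss://silver@<storageaccount>.dfs.core.windows.net/Orders/');

        -- Gold database; a LOCATION clause could place it in the lake instead of DBFS.
        CREATE DATABASE IF NOT EXISTS gold;

        -- Rebuild the fact table in Delta each run, mirroring the old stored procedure's SELECT.
        CREATE OR REPLACE TABLE gold.FactOrders
        USING DELTA
        AS
        SELECT  o.OrderId,
                o.CustomerId,
                CAST(o.OrderDate AS DATE)    AS OrderDate,
                o.Quantity * o.UnitPrice     AS ExtendedAmount
        FROM    silver_orders AS o
        WHERE   o.OrderStatus <> 'Cancelled';

    If a full rebuild ever becomes too heavy, Delta also supports MERGE INTO for incremental upserts instead.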

    While our data volumes are this low (currently 10 GB), we will point our Power BI tabular data model at the Delta Lake tables.  We will eventually move to Azure Synapse when our data volumes warrant the move to a more scalable architecture.

    Would be curious to hear your feedback.

    Thanks again.

