Current environment
Goal
Our skillsets
Our Data Modernization platform progress to date
Next steps that we need guidance on
What are your thoughts on the best approach?
Thanks very much
November 12, 2021 at 6:20 pm
Databricks notebooks could be a valid option. U-SQL is basically dead, so I wouldn't do any new development with that. You could also look into creating an Azure Synapse workspace and using the serverless SQL pool that comes with it; you can choose whether or not to provision a dedicated SQL pool as well. See this link for more info. The serverless SQL pool handles parquet files pretty nicely.
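If it helps, here's a rough sketch of what the Databricks notebook option might look like for reading your parquet files with PySpark. The storage path and column names below are placeholders, not anything from your environment, and `spark` is the session Databricks predefines in a notebook:

    from pyspark.sql import functions as F

    # Read the raw parquet files into a DataFrame (path is illustrative).
    sales = spark.read.parquet(
        "abfss://datalake@youraccount.dfs.core.windows.net/curated/sales/"
    )

    # Quick sanity check: row count and a simple aggregate.
    print(sales.count())
    sales.groupBy("order_date").agg(F.sum("amount").alias("total_amount")).show()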
November 14, 2021 at 8:11 pm
Thank you for your comments, Meagan, and for the great link. It's good to know that we shouldn't invest any time in U-SQL.
Things I like about Synapse are the following points (from your link):
The Azure Synapse approach sounds like a possibility; however, our datasets are very small (less than 10 GB...something I should have clarified in my post). Does it still make sense, then, to use Synapse?
Thanks
Phil
I think the choice of data lake parquet files vs. a SQL database should be based on your long-term plans for usage of the data as well as the skillset of the teams that will use and maintain it. A data lakehouse approach is valid, and that can be done with Synapse or Databricks. But if you have lots of end users who will access the dimensional model, consider their comfort level and the types of queries they run. If it's just there to populate Power BI models or to send data to other systems, that makes things simpler.

The query engines in Databricks and Synapse are getting pretty good, but sometimes they are not quite as snappy as a well-built SQL database. So if you need crazy fast performance, and you don't expect your data to grow exponentially in the foreseeable future, you might lean toward a database. Moving the data one more time adds complexity simply because you have added another technology to the solution. That might be worth it if that system provides extra value (like better performance for your workload or familiarity for end users).
The marketing side of Azure Synapse pushes the big data thing pretty hard. A dedicated SQL pool would be overkill for your data, as it is an MPP engine. But the serverless SQL pool might work OK. It probably would not be as snappy as if you loaded the data into a SQL database. It's similar to how SSIS is pretty snappy for moving and transforming small data, whereas ADF is often slower (but can handle a lot more data and more parallelism, and isn't so tightly coupled to data schema). I'm not sure if that will matter to you. I think the Synapse serverless SQL pool is still an option to consider, but I would try some of your queries and Power BI model refreshes to make sure the query times are acceptable. Do a quick POC and make sure it has the features you need.
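For the POC, something along these lines would give you a feel for serverless SQL pool query times from Python. Everything here is a placeholder you'd swap for your own values: the workspace name, the credentials (you may use Azure AD auth instead of SQL auth), and the storage path. OPENROWSET over parquet is the standard serverless SQL pool pattern:

    import time
    import pyodbc

    # Connect to the serverless (on-demand) endpoint of a hypothetical workspace.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=yourworkspace-ondemand.sql.azuresynapse.net;"
        "DATABASE=master;UID=your_user;PWD=your_password"
    )

    # Query parquet files directly from the lake (path is illustrative).
    query = """
    SELECT TOP 10 *
    FROM OPENROWSET(
        BULK 'https://youraccount.dfs.core.windows.net/datalake/curated/sales/*.parquet',
        FORMAT = 'PARQUET'
    ) AS rows;
    """

    # Time the query to see whether the latency is acceptable for your users.
    start = time.time()
    cursor = conn.cursor()
    cursor.execute(query)
    print(cursor.fetchall())
    print(f"Query took {time.time() - start:.2f}s")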
From a long-term product perspective, we can see Microsoft putting a big focus on Synapse and Power BI and making those two tools work well together. But you'll have to decide if all the features you need/want are there today with the serverless SQL pool.
November 16, 2021 at 9:39 pm
Thanks for your reply, Meagan. I just wanted to update you on the approach that we have decided to go with. Here is a summary of our overall approach:
While our data volumes are this low (currently under 10 GB), we will point our Power BI tabular data model at the Delta Lake tables. We will eventually move to Azure Synapse when our data volumes warrant the move to a more scalable architecture.
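For anyone finding this thread later, the Delta tables we point Power BI at are written from PySpark roughly like the sketch below. The database, table, and path names are simplified for illustration, not our real ones:

    # Create a database for the curated dimensional model if it doesn't exist.
    spark.sql("CREATE DATABASE IF NOT EXISTS gold")

    # Read a staged parquet dataset (path is illustrative).
    dim_customer = spark.read.parquet(
        "abfss://datalake@youraccount.dfs.core.windows.net/staged/dim_customer/"
    )

    # Write it as a managed Delta table that Power BI can reach
    # through the Databricks connector.
    (dim_customer.write
        .format("delta")
        .mode("overwrite")
        .saveAsTable("gold.dim_customer"))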
Would be curious to hear your feedback.
Thanks again.