First of all, wow! What an incredible event! Loved seeing everyone and meeting so many wonderful new people. Thank you to everyone who organized, spoke, and attended this event! Below are my notes from the sessions I attended as well as a link to my GitHub with all the resources from my talk. Thank you again to all those who came to my session; you were an amazing crowd and I really enjoyed learning with you all!
Link to the event and speakers: https://sqlsaturday.com/2024-02-10-sqlsaturday1071/#schedule
Introduction to Fabric Lakehouse by Shabnam Watson
For SQL Server users, a lakehouse will typically be used as the landing zone and the data warehouse for the silver and gold layers. Currently, there is no demonstrated difference in price/performance between the lakehouse and the warehouse. Raw/Bronze will likely mean a lot of small files, which is a good fit for a lakehouse. There are lots of ways to do (or not do) medallion architecture.
Cost is now capacity based. If you go over your capacity, there is some smoothing, but if you're still over, performance will be throttled until smoothing catches up with your existing capacity. There is no per-user license for Fabric items. F SKUs can be purchased in the Azure portal, but P SKUs come through M365 licenses. F2 jobs will not be killed in the short term and the experience will match an F64, which lets you play around at a cheaper price.
Parquet files are immutable (they can't be updated or deleted in place), so Delta tables are key. Delta builds on top of parquet to soft delete (it marks a parquet file as inactive). That is what makes time travel possible, but it comes at a cost since it requires a lot of storage. Fabric engines write Delta with V-Order (VertiPaq), so data is automatically optimized for the Power BI VertiPaq engine. For example, Databricks writing to a data lake will run slower in Power BI than a Fabric notebook writing to that same data lake, since the Fabric output is already optimized.
Lakehouse works directly with files in the lake (workspace). Data ingestion options: low code/no code (pipelines, Dataflows Gen2, shortcuts) and code first (notebooks, PySpark, Spark SQL). Data transformation uses the same tools. Data orchestration is through pipelines or notebooks.
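For the code-first route, here is a minimal sketch of what ingestion can look like from a Fabric notebook attached to a lakehouse; the source path and table name below are made up for illustration.

```python
# Sketch only: code-first ingestion in a Fabric notebook attached to a lakehouse.
# The source path and table name are hypothetical.
from pyspark.sql import functions as F

# Read raw CSV files landed in the lakehouse Files area
raw_df = spark.read.option("header", True).csv("Files/raw/sales/*.csv")

# Light cleanup, then write a Delta table into the lakehouse Tables area
(raw_df
    .withColumn("ingested_at", F.current_timestamp())
    .write.format("delta")
    .mode("overwrite")
    .saveAsTable("bronze_sales"))

# The same table is then queryable through the SQL analytics endpoint
display(spark.sql("SELECT COUNT(*) AS row_count FROM bronze_sales"))
```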
Data load considerations (use case – recommendation):
- Small file upload from local machine – local file upload
- Small data – dataflows
- Large data – Copy tool in pipelines
- Complex data transformations – notebook code
Use pipelines to orchestrate, not to transform. Use dataflows or notebooks to transform data. Notebooks can optimize your data files into larger files for faster runs.
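A quick sketch of what that compaction step can look like from a notebook, assuming a Delta table named bronze_sales (hypothetical); OPTIMIZE and VACUUM are standard Delta Lake maintenance commands.

```python
# Sketch: compacting many small files behind a Delta table from a notebook.
# The table name is hypothetical.
spark.sql("OPTIMIZE bronze_sales")                  # rewrite small files into larger ones
spark.sql("VACUUM bronze_sales RETAIN 168 HOURS")   # optionally clean up old files (7 days)
```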
Power BI import mode will still be fastest and DirectQuery will still be more real time, but Direct Lake will allow for near real time and be nearly as fast as import since there are no more scan operations. Limitations with Direct Lake: web modeling only (desktop is coming), and it will fall back to DirectQuery when capacity runs low, which will slow your reports.
You can tie down security for the SQL objects that are queryable via the SQL endpoint. You can query cross-workspace, so you can lock down editing of data sources in a different workspace from where people consume them. Spark runs much faster in Fabric than in Synapse because the clusters are always up in Fabric, so it doesn't need to spin up a cluster to run your workload. The lakehouse side requires notebooks to interact with it. You can toggle to the SQL analytics endpoint within the service to build relationships, create measures, stored procedures, views, etc. Visual query allows you to drag and drop tables to generate SQL behind the scenes and lets you save that SQL as a view. You can also manipulate the data using M queries. You can currently use CI/CD with semantic models, reports, and notebooks.
How to Weave DataOps into Microsoft Fabric by John Kerski
The goal is to make production easier by combining principles from DevOps, Agile, and Lean manufacturing (every transformation step needs to be tested for quality). If you check early, you'll beat the customer to finding the issue. John has a video on this on YouTube as well.
GitHub: https://github.com/kerski
GitHub for this subject: https://github.com/kerski/fabric-dataops-patterns
Principles:
- Make it reproducible using version control
- Quality is paramount (TEST!)
Pattern 1 – DAX Query View Testing Pattern
Prerequisites: DAX query view, PBIP format, Azure DevOps/Git integration (Visual Studio Code)
Use a naming convention for the tabs because we can get this out of the PBIP file later. Recommended to use Measure.Tests and Column.Tests. You'll also want to apply a schema like TestName, ExpectedValue, ActualValue, Passed. John has a template for these tests; it's essentially a union of rows. When you build the test cases, use the performance analyzer to grab the DAX from a KPI card visual that has your testing values.
Good examples of this are making sure your date table has the correct number of days or your yearly sales are within a threshold. This only really works if you have expected values. This can be achieved by having a static dataset that eliminates variables of error to ensure your DAX is or is not the issue.
You can also test whether your table still has unique values, or the percentage of null values in a table. DAX query view has a quick query feature to get the column features. AMAZING. You can now use the INFO.TABLES() DAX function that came out in December to grab the schema and see if it's changing (which can break visuals). This is also vital if a data type changes and breaks a relationship. You do need to hard code and define the expected schema of the table (you can easily create this using another script).
If you save this in the PBIP file, all your DAX queries are stored within the .dataset folder in a subfolder called DAX Queries! That means you can easily see if a test has changed during a PR and also easily add these tests to other models. It may be wise to have a sample set of tests with setup instructions that you can easily copy/paste into other existing dataset folders.
You'll need your workspace governance set up (dev, test, prod). Standardize your schema and naming conventions for the tests (name.environment.test). Build tests that check calculations, content, and schema.
Pattern 2 – Gen2, Notebook Automated Testing Pattern
DevOps stores the DAX queries > Dataflow Gen2 can call those queries > results are stored in the lakehouse. Leverage a notebook to orchestrate and output back into the lakehouse for a Power BI report to consume.
Push the PBIP into the DevOps git repo so DevOps has the DAX queries. You do need to have environment variables (workspace GUID, branch) called out in a config file (John has instructions in his GitHub). You can use the Azure DevOps Services REST APIs to grab the items in a project/repo from a dataflow; the dataflow is called API – Azure DevOps DAX Queries – Gen 2. The lakehouse consumes the data from the dataflow, then a notebook uses Semantic Link to query the dataset with those DAX queries. You do need to have the public library for semantic link preloaded in the environment instead of using pip install. John's sample contains easy variables to set for the branches.
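To give a feel for the testing step, here is a rough sketch of running a DAX test through semantic link from a notebook. The dataset name, the DAX text, and the result handling are placeholders for illustration, not the assets in John's repo (where the queries come from the lakehouse rather than a hard-coded string).

```python
# Sketch: run a DAX test query against a semantic model with semantic link.
# Dataset name and DAX text are placeholders; in the real pattern the queries
# are read from the lakehouse table populated by the dataflow.
import sempy.fabric as fabric

test_query = """
EVALUATE
ROW(
    "TestName", "Date table has 365 days in 2023",
    "ExpectedValue", 365,
    "ActualValue", COUNTROWS(FILTER('Date', 'Date'[Year] = 2023)),
    "Passed", COUNTROWS(FILTER('Date', 'Date'[Year] = 2023)) = 365
)
"""

results = fabric.evaluate_dax(dataset="Sales Model", dax_string=test_query)

# Result column names come back bracketed (e.g. "[Passed]"), so look it up defensively
passed_col = next(c for c in results.columns if "Passed" in c)
failed = results[results[passed_col] == False]  # noqa: E712
if not failed.empty:
    raise Exception(f"{len(failed)} DAX test(s) failed")
```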
Data Factory in Fabric isn't used to call the APIs here since a number of APIs are still missing for that product at the moment, but you can use a pipeline to run the dataflow and notebook on a schedule. You can also work with a DevOps pipeline to call this as part of a PR requirement.
Data Activator could be used to trigger alerts. Be sure someone is actually monitoring the reporting tool, since email alone is not super reliable.
Modernizing ETL: Leveraging Azure Databricks for Enhanced Data Integration by Joshua Higginbotham
GitHub: https://github.com/Jhiggin/Common_ETL_Patterns_Using_DataBricks/tree/main
Not everything here is transferable to Fabric, but a lot of it is.
What is Azure Databricks? Unified data analytics platform. Contains data engineering, analytics, ML, AI.
Common architecture: ingest using ADF > store in ADLS > prep and train in Databricks > model and serve with Azure Synapse and AAS.
Microsoft recommends that ADF do the extraction (E) and Databricks the transformation (T), but there are reasons to do more of the E within Databricks.
Delta Lake architecture: bronze/raw > silver/filtered > gold/business-level aggregates. Delta Lake allows you to incrementally improve the quality of data until it's ready for consumption. Silver could contain a data lake; there are lots of ways to do this medallion process.
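As a concrete illustration, a minimal bronze-to-silver hop might look something like the sketch below (table and column names are invented, not from Josh's demo).

```python
# Sketch of a bronze -> silver step; table and column names are illustrative only.
from pyspark.sql import functions as F

bronze = spark.read.table("bronze.orders_raw")

silver = (bronze
    .dropDuplicates(["order_id"])                # basic de-duplication
    .filter(F.col("order_date").isNotNull())     # drop rows missing a key field
    .withColumn("order_date", F.to_date("order_date")))

(silver.write.format("delta")
    .mode("overwrite")
    .saveAsTable("silver.orders"))
```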
New to Databricks? Check out DataCamp and MS learning pathways. Josh can provide some resources for his favorite people to follow.
You can deploy Databricks into a set environment/region, which can make it easy to avoid egress charges or latency that would be caused by having the tenant in a region different from the data.
You have to set up access to the workspace and the items within it; using security groups for this is recommended. You can link this to source control (GitHub and Azure DevOps) and do everything, including PRs, within the Databricks UI.
Compute has lots of options. All-purpose compute can do everything, but it may not be the fastest nor cheapest option. The summary tells you how many workers, the runtime, memory, cores, and which features are enabled. Photon-enabled is much faster. Tags are super useful for assigning accounting details so you can attribute spend to teams. Don't mess with your configuration unless you know what you're doing. One useful config is Azure Data Lake Storage credential passthrough, but it's now recommended to use Unity Catalog instead. SQL warehouses now have a serverless option that allows quick spin-ups and auto shutdowns when not in use to save money. Policies set up limits for scale or who can use which compute.
A mount point tells Databricks where the data is stored within ADLS. You can mount it once and make your code much more readable and secure without the full ADLS path. dbutils.secrets.get allows you to grab information from Key Vault without any access to the actual values.
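A hedged sketch of what that can look like; the storage account, container, secret scope, key names, and tenant id are all placeholders, and as noted above Unity Catalog is now the recommended route over mounts/passthrough.

```python
# Sketch: mounting an ADLS Gen2 container and pulling secrets from Key Vault.
# Storage account, container, secret scope, key names, and tenant id are placeholders.
storage_account = "mystorageaccount"
container = "raw"

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("kv-scope", "sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("kv-scope", "sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source=f"abfss://{container}@{storage_account}.dfs.core.windows.net/",
    mount_point="/mnt/raw",
    extra_configs=configs,
)

# Code can now reference /mnt/raw instead of the full ADLS path
df = spark.read.parquet("/mnt/raw/sales/")
```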
Recommendation – explicitly set your schema; don't allow Databricks to guess.
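For instance, a sketch with an explicit schema (column names invented):

```python
# Sketch: define the schema explicitly instead of letting Spark infer it.
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, DateType, DecimalType)

orders_schema = StructType([
    StructField("order_id", IntegerType(), nullable=False),
    StructField("customer_id", StringType(), nullable=True),
    StructField("order_date", DateType(), nullable=True),
    StructField("amount", DecimalType(18, 2), nullable=True),
])

orders = (spark.read
    .schema(orders_schema)           # no inference pass over the data
    .option("header", True)
    .csv("/mnt/raw/orders/"))
```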
Recommendation – don't worry about partitioning delta lake tables until you reach around 7 million records, since the gains are minimal before that.
When you use delta tables, you can look at the delta log to see changes and time travel.
You can use %sql DESCRIBE DETAIL on a table to see its metadata.
You can use Partner Connect to open data in Power BI Desktop. Keep in mind that if it's too large it will connect with DirectQuery, which means your cluster will always be up and costing you a lot of money.
%sql DESCRIBE HISTORY will let you see what the data looked like as of a certain date. Very dependent on how much data you're storing/pruning.
You can create CLONEs of a delta table, so you can clone the table as of a certain time. This can also create snapshots.
Does a clone persist through VACUUM? A shallow clone is a metadata-only copy; deep is the default, which also copies the full data. Unsure if either is impacted by VACUUM, but the idea with clone is that it typically isn't persisted.
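Pulling the Delta inspection commands together, a small sketch (table names, the version number, and the timestamp are illustrative):

```python
# Sketch: inspect Delta metadata and history, time travel, and clone via Spark SQL.
display(spark.sql("DESCRIBE DETAIL silver.orders"))    # file count, size, location
display(spark.sql("DESCRIBE HISTORY silver.orders"))   # every change, by version

# Time travel to an earlier version of the table
old = spark.sql("SELECT * FROM silver.orders VERSION AS OF 12")

# Snapshot the table as of a point in time with a shallow (metadata-only) clone
spark.sql("""
    CREATE TABLE silver.orders_snapshot
    SHALLOW CLONE silver.orders TIMESTAMP AS OF '2024-02-01'
""")
```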
Configuration_Driven_Development_Demo is another notebook he walked us through (see GitHub). He has a config file in YAML that holds the configuration for a certain client (file path, schema for the table, and transformations). It does a lot of things, including mapping SQL data types to Spark data types. This pipeline is for more repeatable ETL patterns.
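To give a flavor of the idea, here is a rough sketch of a configuration-driven load; the YAML layout and type mapping are invented for illustration and do not mirror the notebook in the repo.

```python
# Sketch of a configuration-driven load. The YAML layout and type mapping are
# invented for illustration; see the repo for the real pattern.
import yaml
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

config_text = """
client: contoso
source_path: /mnt/raw/contoso/orders/
target_table: silver.contoso_orders
columns:
  - {name: order_id, type: int}
  - {name: customer_id, type: varchar}
  - {name: order_date, type: date}
"""

SQL_TO_SPARK = {"int": IntegerType(), "varchar": StringType(), "date": DateType()}

config = yaml.safe_load(config_text)
schema = StructType([
    StructField(col["name"], SQL_TO_SPARK[col["type"]], True)
    for col in config["columns"]
])

df = spark.read.schema(schema).option("header", True).csv(config["source_path"])
df.write.format("delta").mode("overwrite").saveAsTable(config["target_table"])
```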
Power BI and External Tools: This is the way! by Jason Romans
Goal is to make you a jedi of Power BI.
History of tools – back in the day you would start with a project in Visual Studio. Behind the scenes it was mostly editing the JSON file. Tabular Editor came out as a glorified JSON editor.
SSMS (SQL Server Management Studio): dependable and very useful for SQL. No DAX formatting, no column names displayed. It can process (full refresh, recalculate, etc.) premium datasets and you can schedule that using SQL Agent jobs. It can't connect to PBI Desktop. Great for a DBA.
Azure Data Studio (ADS): shiny, but not the tool you're looking for. It cannot connect to Analysis Services nor PBI Desktop.
DAX Studio: the analyst. Has IntelliSense, links to the DAX guide, DAX formatter, and performance tuning. You can see hidden date tables. Select "Server Timings" in DAX Studio, then run a query to get all the server timings and understand where it's spending its time. You can also use DAX Studio to compare two versions of a measure to see which runs faster.
Tabular Editor: the builder or tinkerer. There is a free and a paid version. Create and edit measures and calculation groups. You can also do C# scripting and automation (my favorite!) and command line integration! It can help with source control. The Tabular Editor GitHub has tons of easy-to-use scripts for common BI tasks (Tabular Editor GitHub and Useful Scripts Blog). The best practice analyzer can easily apply fixes and script out a fix.
PBI Explorer – the navigator. Able to explore the changes. The PBIP format is new and lets you compare projects using VS Code. A gitignore file is included in PBIP to tell git to ignore the data itself and only bring in metadata. PBI Explorer can visualize the report differences. You don't need the project file format to use PBI Explorer. To run it, you have to run the executable – it's not located in the External Tools tab in Power BI Desktop. There's additional functionality to explore and analyze the visuals as well.
Bravo – https://bravo.bi/. The jack of all trades; this one seems to fill in the gaps. It can help optimize the data model by looking at unused columns, size, and cardinality. Allows adding custom date tables. Exports data. The first screen analyzes the model for you. It also allows you to format DAX. You'll get a warning for two things on date tables: 1) if you still have auto date/time enabled, and 2) if you already have a Date or Holiday table in the model. If you give the Bravo date table a new name, it will resolve issue number two. The Bravo date table is a DAX calculated table.
Semantic Link – A Magic Cauldron to Unlock Power BI with Python by Stephanie Bruno & Stacey Rudoy
Blog – https://data-witches.com/
Content at – bit.ly/ILDpres
Check out fabric.guru for the inspiration behind this session. Here’s the specific blog post relating to semantic link and a data catalog (https://fabric.guru/fabric-semantic-link-and-use-cases#heading-what-about-power-bi-developers).
Semantic link is a Fabric feature announced in Dec 2023. https://learn.microsoft.com/en-us/fabric/data-science/semantic-link-overview
Features – we can evaluate DAX queries, get metadata, and take actions like kicking off a refresh.
Two ways to install – pip install semantic-link or create an environment. When you create an environment, you can add libraries and configure Spark properties. Once created, you can set a notebook to use it.
Why use it? Document the tenant (semantic model catalog), save snapshot of semantic model, download usage metrics.
Data quality – there's a find-dependencies capability that allows you to see which columns rely on others. You can also use the Great Expectations library.
For REST APIs, use the FabricRestClient to call the Fabric REST endpoints.
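A minimal sketch of that, assuming a Fabric notebook with semantic-link available; the endpoint shown lists workspaces, and other endpoints follow the Fabric REST docs.

```python
# Sketch: call a Fabric REST endpoint with FabricRestClient from a notebook.
from sempy.fabric import FabricRestClient

client = FabricRestClient()
response = client.get("/v1/workspaces")   # list workspaces the caller can see
for ws in response.json().get("value", []):
    print(ws["displayName"], ws["id"])
```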
Usage metrics! With semantic link, you can loop through all the workspaces and query the usage metrics model to dump the data directly into a data lake! You cannot connect to the Admin monitoring workspace using semantic link (not supported currently). So excited to be able to loop through all the workspaces one has access to!
Before running a notebook in Fabric, you need to create an environment. Add a Python library for semantic-link, then save and publish, and you can use this within the notebook. If you don't use an environment, you'll have to pip install semantic-link. There is a Learn module about the semantic link library (sempy fabric read table). If you call datasets() with nothing in the parentheses, it will grab the datasets within the workspace the notebook currently lives in. You can also plot relationships.
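A small sketch of those calls (the dataset name is a placeholder, and the plotting helper needs graphviz in the environment):

```python
# Sketch: basic semantic link metadata calls from a Fabric notebook.
import sempy.fabric as fabric
from sempy.relationships import plot_relationship_metadata

# Datasets in the workspace the notebook lives in
display(fabric.list_datasets())

# Relationships for one model (dataset name is a placeholder), then plot them
relationships = fabric.list_relationships("Sales Model")
plot_relationship_metadata(relationships)   # requires graphviz in the environment
```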
You can also query the DMVs if needed.
Ooo, there's an option for read_table so you don't need to evaluate a DAX expression to return a Fabric dataframe. To send to the lakehouse, all you need is the to_lakehouse_table function, which creates or appends data to a table in the lakehouse. Keep in mind you'll want to remove spaces from names for lakehouse functionality. You can only use to_lakehouse_table if your dataframe is a Fabric dataframe; otherwise, you'll need to use write.format("delta").mode("overwrite").
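A quick sketch of that path (dataset, table, and target names are placeholders):

```python
# Sketch: pull a model table as a FabricDataFrame and land it in a lakehouse table.
# Dataset, table, and target names are placeholders; spaces removed from the target name.
import sempy.fabric as fabric

sales = fabric.read_table("Sales Model", "Sales")       # returns a FabricDataFrame
sales.to_lakehouse_table("sales_snapshot", mode="overwrite")

# For a plain Spark dataframe, the Delta writer is used instead:
# spark_df.write.format("delta").mode("overwrite").saveAsTable("sales_snapshot")
```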
Phil Seamark has a great blog about this as well: https://dax.tips/2023/12/05/visualize-power-bi-refresh-using-sempy/.
Power BI – Performing Data Quality Checks Using Python & SQL by me
Abstract:
Picture this, you have a report in Power BI that someone passes off to you for data quality checks. Simple enough until they clarify they need every measure checked against the source database.
There are a few ways to make sure the measures match what is in the source data system, but you need something easily repeatable and self documenting. In this presentation, I will walk through how to use python and excel to perform our data quality checks in one batch and document any differences. In order to do that, we are going to build a python script that can run Power BI REST APIs, connect to a SQL Server, and read/write to Excel.
By the end of this session, attendees will have a plug and play tool to check data from SQL against Power BI.
Link to GitHub with Power Point and final scripts: https://github.com/Anytsirk12/DataOnWheels/tree/main/SQL%20Saturday%20ATL%20-%20BI%202024
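For anyone who wants the shape of the approach without opening the repo, here is a stripped-down sketch of the moving parts described in the abstract (Power BI REST API, SQL Server, Excel). It is not the script from the repo; names, IDs, the token, and both queries are placeholders.

```python
# Stripped-down sketch of the workflow: run a DAX query via the Power BI REST API,
# run the matching SQL query, and document the comparison in Excel.
# Names, IDs, the access token, and both queries are placeholders.
import requests
import pandas as pd
import pyodbc

access_token = "<AAD token acquired separately, e.g. with msal>"
workspace_id = "<workspace-guid>"
dataset_id = "<dataset-guid>"

# 1. Run a DAX query against the semantic model via executeQueries
url = (f"https://api.powerbi.com/v1.0/myorg/groups/{workspace_id}"
       f"/datasets/{dataset_id}/executeQueries")
body = {"queries": [{"query": 'EVALUATE ROW("Total Sales", [Total Sales])'}]}
pbi_rows = requests.post(
    url, json=body, headers={"Authorization": f"Bearer {access_token}"}
).json()["results"][0]["tables"][0]["rows"]
pbi_value = list(pbi_rows[0].values())[0]

# 2. Run the matching query against the source SQL Server
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};Server=myserver;"
    "Database=mydb;Trusted_Connection=yes;"
)
sql_value = pd.read_sql("SELECT SUM(SalesAmount) FROM dbo.Sales", conn).iloc[0, 0]

# 3. Document the comparison in Excel (requires openpyxl)
pd.DataFrame([{
    "Measure": "Total Sales",
    "PowerBI": pbi_value,
    "SQL": sql_value,
    "Match": abs(pbi_value - sql_value) < 0.01,
}]).to_excel("data_quality_checks.xlsx", index=False)
```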