DataTune 2024 was a huge success! Thank you so much to the organizers who spent countless months planning and making this an incredible conference! It was so great to meet so many people and hang out with so many fellow data nerds!
GitHub link for my session: https://github.com/Anytsirk12/DataOnWheels/tree/main/DataTune%202024
Conference link: https://www.datatuneconf.com/sessions.html
Microsoft Fabric for Power BI Users by Alex Powers
Put your data in OneLake and do everything you want with it. OneLake is the "OneDrive for data", and every org has ONE OneLake. You can share data across workspaces via the lake.
There are tons of new Fabric items, so what is important for Power BI users?
Goal: Get data (OneDrive) > ingest (Dataflow Gen 2) > store (Lakehouse/semantic model) > deliver insights (PBI) with data pipelines to orchestrate.
Start by creating a lakehouse. The Tables section stores data in the optimized delta/parquet format, while the Files section is for unstructured files. You can create subfolders within both. There's also a OneLake file explorer that lets you drag and drop files into OneLake.
Next is a dataflow. If you want to copy and paste M queries, you can hold down Ctrl to select the queries, then Ctrl+C and Ctrl+V them directly into a cloud dataflow! When merging tables, there's a light bulb in the top right that will suggest the join option for you so you don't have to scroll anymore. The visual dataflow view lets you see the lineage of the various tables. Be sure to stage the changes, then publish.
Next, publish to the lakehouse we made before. You get a chance to double check the sources and make sure all the data types work with the parquet format. You can also choose to append or replace on refresh, and you can allow for dynamic schema so it'll pick up any name changes to columns! A lightning bolt on a table means it has been optimized. Right now this is all low code/no code, but there may be some announcements at the Vegas conference at the end of the month to allow for writing code instead of using the point-and-click experience.
Next, build a data pipeline. These have roughly 95% parity with ADF pipelines. Create a dataflow activity, then configure its settings to grab the new dataflow. Add a Teams activity so that when the dataflow fails, it sends a Teams notification; you just need to sign into Teams and allow access. You can send to a channel or a group chat. The UI is a lot like Power Automate.
To set up refresh, go to the pipeline and schedule it there! The scheduling options there are way easier to work with. Soon there will be an option to refresh a semantic model as well.
To create shortcuts, go to the lakehouse and create a new shortcut. You can even reference items from other workspaces/lakehouses. Right now, you can also create shortcuts to Amazon S3 and ADLS Gen2.
In the lakehouse, you can create a new semantic model. XMLA editing is supported! You can create your DAX measures there as well, and you can even create a new report from this semantic model with no need to import any data. You can now do all of this on a Mac, a Raspberry Pi, Linux, or any OS with a browser. You can also use the SQL analytics endpoint (upper right corner in the lakehouse view) for a SQL experience navigating the lakehouse; it lets you run queries and save them as views or create stored procedures. Note that it is case sensitive.
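As a minimal sketch of that SQL experience (the table and column names below are made up for illustration):

```sql
-- Hypothetical table/column names; run from the SQL analytics endpoint.
-- Remember: object and column names are case sensitive here.
CREATE VIEW dbo.vw_SalesByRegion AS
SELECT
    Region,
    SUM(SalesAmount) AS TotalSales
FROM dbo.Sales
GROUP BY Region;
```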
Microsoft Fabric + Power Platform + Azure Open AI = Magic Mix by Alex Rostan & Gaston Cruz
The goal: how to combine these technologies. The focus will be starting with a Power App and ending with data in Fabric.
Check out their YouTube channel! https://www.youtube.com/@PowerMatesMS
Start off by making a Power App using Copilot. Prompt = "Let's create a new application to audit Power BI licenses. Power BI Pro and Premium Per User licenses tied to our organization." It creates a Dataverse table within the environment. You can edit this with another prompt = "Add a new column to check the latest access for each user in our Power BI tenant." It created a new column for us! Then just say "create app", and boom, we have a functional application!
In the Tables tab, you can click Analyze and there's an option to Link to Microsoft Fabric. Dataverse is a service, so depending on the use case it'll make the right configuration to optimally grab your data. That allows you to create Power Apps that people input data into and then instantly see that data in Fabric in real time.
Now we go to Fabric. Dataverse is a native connection in data pipelines. Use a dataflow for ETL and a pipeline for orchestration. Then determine where to land the data: you can choose a Lakehouse (structured & unstructured data), a KQL database (near real-time), or a Warehouse (structured data).
Data activator is the way to go for alerting.
Once you create a semantic model, you can create a report on top of it. This will use Direct Lake mode (the perfect combo of import and DirectQuery), so no refresh is needed. The lakehouse holds the data for the model to leverage, and the report will automatically grab data from there.
The Problem with Pipelines: Building Extensible Data Code by Matthew Kosovec
The goal is to take software engineering principles and apply them to data pipeline code.
Pipelines can turn ugly fast when the requirements change or when introducing new source systems. You can easily end up with code that isn’t used.
Software Engineering concepts:
- DRY (don’t repeat yourself)
- SRP (single responsibility principle)
- Pure Functions Only
DRY – every piece of knowledge must have a single, unambiguous, authoritative representation within a system. Ways to dry off: CTEs, views/materialized tables, macros/UDFs (user defined functions), and ETL packages. Challenges: encapsulation and composability (how generic and reusable is your code snippet?), dependency management, and refactoring/regression testing. The smallest unit of work is a single SELECT statement (table, view, etc.), which is where encapsulation gets challenging.
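As a small illustration of "drying off" with a UDF (the function, table, and column names here are invented), the calculation lives in one authoritative place and every query references it:

```sql
-- Hypothetical example: one authoritative definition of the margin calculation.
CREATE FUNCTION dbo.fn_GrossMargin (@Revenue DECIMAL(18,2), @Cost DECIMAL(18,2))
RETURNS DECIMAL(18,2)
AS
BEGIN
    RETURN @Revenue - @Cost;
END;
GO

-- Any query that needs the margin reuses the single definition instead of re-implementing it.
SELECT OrderID, dbo.fn_GrossMargin(Revenue, Cost) AS GrossMargin
FROM dbo.Orders;
```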
DRY example – you need a feature engineering (FE) table that is a summation of a BI table. To keep this DRY, you can build the FE table on top of the BI table to avoid the code duplication. The con here is that FE is now dependent on the BI code, and the additional layer can also create slower refresh cycles. This may not always be possible because the relational logic may be different enough.
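A rough sketch of that pattern (view and column names are made up): the FE object is defined on top of the BI object rather than repeating its logic, which is exactly where the dependency con comes from.

```sql
-- BI layer: the authoritative sales logic.
CREATE VIEW dbo.vw_BI_CustomerSales AS
SELECT CustomerID, OrderDate, SalesAmount
FROM dbo.FactSales;
GO

-- FE layer: built on top of the BI view to stay DRY, at the cost of a dependency
-- on the BI code and an extra layer in the refresh cycle.
CREATE VIEW dbo.vw_FE_CustomerSalesTotals AS
SELECT CustomerID, SUM(SalesAmount) AS TotalSales
FROM dbo.vw_BI_CustomerSales
GROUP BY CustomerID;
```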
SRP – separation of concerns: gather together the things that change for the same reasons, and separate the things that change for different reasons. Medallion architecture is a common approach. The pros: it helps separate concerns between layers and provides a baseline for architectural governance and reviews of pipeline design. The cons: it lacks structure for the most important part of pipelines (the transformation code!) and often devolves into multiple sub-layers of storage to help with DRY or performance.
Pure Functions Only – data pipelines are typically imperative: they focus on step-by-step programs and use statements that change the program's state. The goal here is zero side effects, ultimate composability, and DRY by default. Because you can reuse everything in small bits, you don't need to rewrite your code over and over.
Examples of the function categories:
- Pure functions: modify a single row in a single column, with unlimited input columns/rows, and can create a new column. Examples: scalar calcs, UDFs, and rollups to the current grain.
- Semi-pure functions: modify multiple rows in a single column, with unlimited input columns/rows, and can create a new column. Example: window functions like ranking.
- Impure/non-functional: modify multiple rows in multiple columns and create multiple columns. Examples: CTEs/procedures, unpivot, and CDC/refresh processes.
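To make those categories concrete, here's a hedged sketch (all table and column names invented):

```sql
-- Pure: a scalar calc; each output value depends only on its own row.
SELECT OrderID, CAST(Amount AS INT) AS AmountInt
FROM dbo.Orders;

-- Semi-pure: a window function; still one new column, but each value depends on other rows.
SELECT OrderID,
       RANK() OVER (PARTITION BY CustomerID ORDER BY Amount DESC) AS AmountRank
FROM dbo.Orders;

-- Impure/non-functional: reshapes multiple rows and columns at once (e.g. UNPIVOT).
SELECT OrderID, Metric, MetricValue
FROM dbo.OrderMetrics
UNPIVOT (MetricValue FOR Metric IN (Amount, Quantity, Discount)) AS u;
```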
Making code DRY and Functional:
- Use SRP to break down SQL into small, functional, composable parts (SELECT, FROM, etc.)
- Build a system to check for DRY-ness and ensure referential integrity between objects
- Develop code generator to create runnable SQL from objects
- Use SRP to define processing steps to separate functional from non-functional (bonus step but has some serious benefits by repeating the process)
DataForge example:
Step 1 – identify the composable parts. Examples: aliasing (columnA as A), scalar calculations (i.e. CAST(A AS int)), aggregation at the existing grain (aka rollup), aggregation at a new grain, filters (WHERE), all joins (including cardinality), etc. Then sort those into object groups: rules = aliasing, scalar calcs, aggregation rollups; channel mappings = filters, aggregations at a new grain; relations = joins. This isn't comprehensive; there are tons more.
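A hypothetical SELECT annotated with those object groups (names are made up) might break down like this:

```sql
SELECT
    o.ColumnA               AS A,          -- rule: aliasing (columnA as A)
    CAST(o.Quantity AS INT) AS Qty,        -- rule: scalar calc
    SUM(o.Amount)           AS TotalAmount -- channel mapping: aggregation at a new grain
FROM dbo.Orders AS o
JOIN dbo.Customers AS c                    -- relation: join (including cardinality)
    ON o.CustomerID = c.CustomerID
WHERE c.IsActive = 1                       -- channel mapping: filter
GROUP BY o.ColumnA, CAST(o.Quantity AS INT);
```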
Step 2 – design and build a meta-repo database to store and relate the objects. Build monitoring services that check for duplicates on save and compilation. Everything gets stored there.
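Purely as an illustration of the idea (this is not DataForge's actual schema), a meta-repo might store the decomposed objects and their relationships so a service can check for duplicates on save/compile:

```sql
CREATE SCHEMA meta;
GO

-- Each decomposed piece of logic becomes a row that can be checked for duplicates.
CREATE TABLE meta.Objects (
    ObjectID    INT IDENTITY PRIMARY KEY,
    ObjectType  VARCHAR(50)   NOT NULL,  -- 'rule', 'channel mapping', 'relation'
    Expression  NVARCHAR(MAX) NOT NULL   -- e.g. 'CAST(Quantity AS INT)'
);

-- Referential integrity between objects gives you lineage for free.
CREATE TABLE meta.ObjectDependencies (
    ParentObjectID INT NOT NULL REFERENCES meta.Objects (ObjectID),
    ChildObjectID  INT NOT NULL REFERENCES meta.Objects (ObjectID)
);
```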
Step 3 – take the broken-down model and use a SQL code generator, which writes DDL (table changes), then pure SQL (all of it at once, with no need for intermediate storage), then impure SQL (everything else). Development is no longer writing the entire table; you write a function that creates the pieces you need. The system can run a bunch of checks to see whether there are any downstream effects or whether the piece already exists. Then it all gets orchestrated.
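What the generated code could look like for one new piece, again as a hedged sketch rather than DataForge's real output (table and column names invented):

```sql
-- DDL step: change the target table to make room for the new piece.
ALTER TABLE dbo.EnrichedOrders ADD Qty INT NULL;
GO

-- Pure SQL step: apply the pure rules in one pass, with no intermediate storage.
UPDATE e
SET e.Qty = CAST(o.Quantity AS INT)
FROM dbo.EnrichedOrders AS e
JOIN dbo.Orders AS o
    ON o.OrderID = e.OrderID;
```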
There are vendors who build data mappings from source data to target system and build adaptors that can work with DataForge. Matthew says this is fairly easy to do on your own as well. The goal of this is that when you make changes, you aren’t too worried since there are checks for duplication and reuse.
Step 4 – blow up the medallion architecture. The flow becomes: non-functional/imperative (ingest – parse – capture data changes) > pure (enrich) > semi-pure (refresh/recalc – output) > imperative (custom post-output). Everything goes through every step, and you end up not needing orchestration. Grouping the pure functions together removes that need and helps prevent the eventual chaos/spaghetti code.
This isn't about building the first version of something; for that, pipelines are pretty straightforward. This is more about adding to them later. You end up building your lineage through the metadata, which allows for increased observability.
DataForge has an out-of-the-box solution to this – www.dataforgelabs.com. The solution works with existing pipelines. Under the hood it is Databricks, so it can integrate with a lot of different code bases. They have two ways to do this – an IDE/UI (tons of checks with real-time feedback) and a YAML config language, which is then compiled in their system through an import process where the checks are performed. On the roadmap is a CLI to compile locally. There's a private cloud version and a SaaS version.