
SQL Saturday Oregon & SW Washington 2024


Wow, I am so late in posting these notes! But this conference was amazing, so I figured it was still worth posting my notes from the incredible sessions I attended. Hope everyone had a great holiday season with good people and great food!

Here’s a link to my slide decks and resources from my presentation on performing data quality checks using Python notebooks in Fabric: https://github.com/Anytsirk12/DataOnWheels/tree/main/2024%20Oregon%20%26%20Washington%20SQL%20Saturday

And here’s a link to the event information: https://sqlsaturday.com/2024-11-02-sqlsaturday1082/

Why Can’t AI Do My Dishes Already? by Kellyn Gorman

“I want AI to do my laundry and dishes so that I can do art and writing, not for AI to do my art and writing so that I can do my laundry and dishes” – Joanna Maciejewska.

What is AI good at? Automating repetitive, time-consuming tasks.

Is it being used for the right thing?

Concerns around AI: reputational risk. Privacy intrusion, failures in consent, algorithmic bias, and lack of explainability are all issues that can damage an organization’s reputation.

The poisoning of the data pools that AI pulls from is a big reason why it may become unusable in the coming years. It will need to evolve.

“The easiest way to mismanage a technology is to misunderstand it”.

Impractical is interesting, but the dishes are necessary. The majority of what AI can do is impractical. The AI Graveyard is evidence that thousands of AI projects have been abandoned – https://dang.ai/ai-graveyard

A pretty cool use case of AI is identifying animal species and their movements by their footprints (WildTrack). Ambio tracks boats in Zambia to notify officials of potential poachers; it also monitors water levels to see if there’s anything dangerous to animals. Other AI projects from the National Wildlife Federation identify lions, whales, seals, and trout. The overall goal is to aid animal population regrowth.

Another use case is tackling human trafficking. This one has a large ethics component. Veritone is able to do this without exposing PII.

A weirder use case is an AI trained to create pickup lines.

AI in healthcare and surgical advances is another area of improvement: 3D anatomical overlays, robotic surgery guidance, cancer detection, care center efficiency, preoperative risk assessment for organ transplant candidates, etc. It’s an amazing way to help healthcare professionals do more with less.

Another cool use case is drones and AI preventing forest fires. The drones are able to swarm and can carry water, and no human intervention is required, which protects pilots and firefighters.

AI won’t do your dishes anytime soon.

Fabric Semantic Link Labs: A Link to the Future by Jason Romans

https://github.com/thedaxshepherd/Presentations/tree/main/Events/SQL_Saturday/2024

Semantic link is intended to make a connection between Fabric items (semantic models, reports, lakehouses, workspaces, etc.) and Synapse Data Science (Fabric notebooks). Semantic link and semantic link labs are only available in Microsoft Fabric.

Semantic link labs uses semantic link and is essentially an expansion pack. It’s open source, it’s under active development, and you can clone it from the GitHub repo. Built on the bones of semantic link, semantic link labs can do a lot for you: migrate to Direct Lake, run the Best Practice Analyzer, manage reports, view broken reports, etc.

To use semantic link, import the module with import sempy.fabric as fabric. Semantic link uses pandas dataframes. You can show a relationship view (super cool) and list out all the artifacts in your environment. Semantic link labs is installed using %pip install semantic-link-labs and then imported with import sempy_labs as labs.

Instead of installing in each notebook, you can use an Environment for the workspace. Environments are an easy way to lock in specific versions of libraries and let all your notebooks have all the libraries you want installed. Be sure to publish the environment if you’re creating one – the Save button alone does nothing.
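As a rough sketch of those first steps in a Fabric notebook (the "Sales" model name is just a placeholder):

  %pip install semantic-link-labs

  import sempy.fabric as fabric
  import sempy_labs as labs

  # Semantic link returns pandas dataframes
  fabric.list_workspaces()                     # workspaces you can access
  fabric.list_datasets()                       # semantic models in the current workspace
  fabric.list_relationships(dataset="Sales")   # relationships in a (hypothetical) Sales model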

One of the big use cases is migrating to Direct Lake. If you’re going through these steps, you need to enable the Read Write XMLA setting in the admin portal. A rough code sketch follows the steps below.

  1. Create a notebook
  2. Install the package semantic link labs
  3. Import model
  4. Need a lakehouse and connect it to the notebook
  5. Set parameters
  6. Create Power Query template
    • You can sync OneLake using the OneLake app! Then you can see the file on your desktop like you would from OneDrive.
    • The template is generated for you by semantic link labs
  7. Import the Power Query template into Dataflows Gen2
    • Easiest way is to go to the lakehouse and create the dataflow from there.
  8. Run the dataflow to populate the data into the lakehouse instead of having to rebuild from scratch.
  9. Create the Direct Lake model – this will also refresh the model
  10. Show migrated/unmigrated objects
    • Great way to check and ensure all your objects got moved over correctly
  11. Rebind reports to the new data model
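A minimal sketch of what a few of these steps look like in code – the function names below are based on the semantic link labs migration notebook as I recall it, and the model/report names are placeholders, so check the repo for exact names and parameters:

  import sempy_labs as labs
  from sempy_labs import migration, report

  dataset = "Sales Import Model"     # existing import-mode model (placeholder)
  new_dataset = "Sales Direct Lake"  # new Direct Lake model (placeholder)
  workspace = "My Workspace"         # placeholder

  # Step 6: generate the Power Query Template to import into Dataflows Gen2
  migration.create_pqt_file(dataset=dataset, workspace=workspace)

  # (the migration module also has functions for the intermediate migrate/validate steps)

  # Step 9: refresh the new Direct Lake model once the dataflow has loaded the lakehouse
  labs.refresh_semantic_model(dataset=new_dataset, workspace=workspace)

  # Step 11: rebind an existing report to the new model
  report.report_rebind(report="Sales Report", dataset=new_dataset)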

There are sections of code in the notebook containing scripts that do the following (a small sketch follows this list):

  1. Show unsupported objects
  2. Schema check between semantic model versions
  3. Show DirectQuery tables
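For example, the unsupported-objects check lives (as best I recall) in the directlake module – treat the exact function name as an assumption and verify against the repo:

  from sempy_labs import directlake

  # Returns the tables, columns, and relationships that Direct Lake cannot support
  directlake.show_unsupported_direct_lake_objects(dataset="Sales Import Model", workspace="My Workspace")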

Limitations – calculated tables are not supported, and calculated columns are not supported. The column used for relationships cannot be a datetime; it needs to be an integer.

There is tons of documentation in the notebook 🙂

Current additional uses:

  • Best practice analyzer
    • collects stats for all semantic models within all accessible workspaces (labs.run_model_bpa_bulk(workspace='Name')) – see the sketch after this list
    • import sempy_labs.report
    • Creates a direct lake semantic model for analyzing the results (labs.create_model_bpa)
    • You can also automate this through a pipeline
    • You can also track improvements over time via the run id
  • Capacity migration
  • Migration to Direct Lake from Import
  • Model optimization (vertipaq analyzer)
  • Query scale out (enable/disable large semantic model format)
  • Semantic Model Refresh
  • Tabular Object Model integrations
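A quick sketch of that bulk Best Practice Analyzer run (the workspace name is a placeholder):

  import sempy_labs as labs
  import sempy_labs.report

  # Run the Best Practice Analyzer across the semantic models it can reach
  labs.run_model_bpa_bulk(workspace="My Workspace")

  # Create the Direct Lake semantic model used for analyzing the BPA results
  labs.create_model_bpa()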

https://github.com/microsoft/semantic-link-labs

How to use AI in your professional and personal life by Tonie Huizer & Peter Kruis

Considerations: Hallucination or Facts? Bias? Good Instructions? Is AI still in your context? Real-time or outdated library?

Crafting prompts:

  • Guiding the model
  • Maximize output quality
  • Support complex tasks, in different iterations

But how? Let GPT help! Use the “official ChatGPT guide” (it will be in his GitHub). Sometimes prompting for too much at once will cause it to miss certain pieces you asked for.

You can make custom GPTs in the paid version of ChatGPT. You can add files so it has a better starting point and a clean pool of data to pull from. You can also add information from APIs.

You could use ChatGPT to “reverse engineer” a stored procedure script and it’ll explain all the various aspects. Pretty cool. Then you could ask it for “a more detailed breakdown including code parts”. Having code parts is awesome because it’ll show you snippets of the original code. But make sure to validate – it can hallucinate when generating new code. Another great use case is having it do a code review, and it’ll provide potential improvements. It’s also nice to use ChatGPT for documentation with “describe the flow in technical documentation”; still validate the output. It’s very, very good at translating, so it’s great for translating documents into other languages if needed.

It can also make test cases for you, which you can use as part of your user acceptance strategy. It’s super useful for creating POCs – one example was creating an HTML webpage to gather evaluations of sessions presented at a conference, with a dark and a light theme. Pretty cool! You can ask ChatGPT to put the code in a file that you can download. Amazing to see it just work! Definitely good for POCs, but probably don’t go live with it. Repetitive work is another great use case: something like “create CRUD stored procedures for this table”, providing the table script so it knows the columns and column types.
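As an illustration (a made-up prompt, not one shown in the session), that last request might look like:

  “Create CRUD (insert, select, update, delete) stored procedures for the table below. Here is the CREATE TABLE script so you know the column names and data types: <paste table script>”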

Demystifying Delta Parquet by Jarid McKenzie

https://foundatum.ca/

OLTP is optimized with Row Order for CRUD operations. OLAP is optimized with Column Order for analytical workloads.

Parquet is column storage. Files are meant to be immutable – they are not made to be changed and they should not be changed, which enables parallel reads. Parquet uses column storage with some row ordering that aligns with files. This allows for nested data structures, concurrent reads, data compression, and fast filtering/aggregations.
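A minimal sketch of that columnar behavior using pandas (the file name and data are made up):

  import pandas as pd

  df = pd.DataFrame({
      "Quarter": ["Q1"] * 310 + ["Q2"] * 290,
      "Sales": range(600),
  })

  # Snappy is the typical (and Fabric default) compression for parquet
  df.to_parquet("sales.parquet", compression="snappy")

  # Column storage means a reader can pull only the columns it needs
  quarters = pd.read_parquet("sales.parquet", columns=["Quarter"])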

Nested data – not applicable for PBI since it’s not readable by PBI, but it’s essentially levels of data. Not very common, since most things get flattened instead of nested. Nested data isn’t really worth the squeeze.

Compression & encoding – the bread and butter of what makes parquet the future. Snappy compression is the Fabric default; it compresses both dictionary pages and data pages. Encoding is how space is saved within the files. Plain = non-compressed, think long unique text strings. Dictionary encoding and run-length encoding are also what VertiPaq in PBI uses. Dictionary encoding replaces values with a dictionary and indexes; this is typically used with higher-cardinality columns. Run-length encoding is used when there are a lot of repeated values (low-cardinality columns, think Q1, Q2 in a calendar): the value is stored with the record it starts at and a count of how many records contain it, i.e. Q1 | 1 | 310 means the Q1 value starts at record 1 and there are 310 of them. It runs very fast for filtering. Low cardinality = high efficiency.

Parquet footer – the stuff at the bottom of the file. It contains stats, mins, maxes, page indexes, etc. and points to different pieces of the file to make reads even faster. A bloom filter is a very space-efficient data structure that responds with either “definitely no” or “probably yes” – probably, because it is occasionally wrong and says yes for something that doesn’t exist.
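Continuing the sketch above, pyarrow can read that footer metadata back – per-column compression, encodings, and min/max stats – for the same (made-up) file:

  import pyarrow.parquet as pq

  md = pq.ParquetFile("sales.parquet").metadata
  col = md.row_group(0).column(0)    # first column chunk of the first row group

  print(col.compression)                         # e.g. SNAPPY
  print(col.encodings)                           # e.g. dictionary / RLE encodings
  print(col.statistics.min, col.statistics.max)  # footer stats used for fast filtering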

Delta provides ACID compliance: atomicity, consistency, isolation, and durability. The major difference between SQL and Delta is the concurrency model. SQL uses pessimistic concurrency with various locking escalations; this method expects lots of people writing to the underlying dataset. Delta uses optimistic concurrency and assumes that data is appended, not overwritten, and that there aren’t multiple users making changes. Delta is better for batch loading; real-time streaming can be harder because it hits more issues with that optimistic concurrency.

To avoid the small file problem, you can run something called “OPTIMIZE” which will collapse all the existing “live” files into a single file so there’s only one current file that the delta table is referencing. You can use DESCRIBE HISTORY to look at the history of all the transformations that have happened.

You can time travel, but do not depend on that for auditing purposes. Vacuuming (once enabled on a delta table) removes every parquet file older than 7 days by default and is essential for maintenance. The old parquet files can be useful for restoring if needed, so don’t vacuum too often. You can call a VACUUM command manually; it’s recommended to do a DRY RUN first because it will show you what files will be deleted. Even if you vacuum everything, it will keep the current parquet file, but ONLY that file. If you need to maintain history, do this as an SCD in a totally different table. Vacuum only cleans up parquet files – the delta log is still there, which means you can still run DESCRIBE HISTORY commands.
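In a Fabric notebook (where spark is the prebuilt session), these maintenance commands can be run through Spark SQL – a rough sketch with a made-up table name:

  # Compact the current "live" parquet files
  spark.sql("OPTIMIZE sales_delta")

  # Review the table's transaction history (this survives vacuuming)
  spark.sql("DESCRIBE HISTORY sales_delta").show()

  # Preview which parquet files a vacuum would delete before actually running it
  spark.sql("VACUUM sales_delta DRY RUN").show()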

For cloning, you can do shallow or deep clones. A shallow clone stores no parquet files with the table; it’s a direct reference to the original at that point in time – a metadata-only operation. If a vacuum is run on the table the shallow clone was based on, it will break the shallow-cloned table. A deep clone is a true copy of all the data and metadata, and a good option for a backup (likely in a very locked-down file location).
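A sketch of a shallow clone via Spark SQL (table names made up; the syntax comes from Delta Lake's SHALLOW CLONE support, so verify it is available on your runtime):

  # Metadata-only copy that points at the source table's current parquet files
  spark.sql("CREATE TABLE sales_delta_clone SHALLOW CLONE sales_delta")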
