Unity Catalog in Databricks provides a single place to create and manage data access policies that apply across all workspaces and users in an organization. It also provides a simple data catalog for users to explore. So when a client wanted to create a place for statisticians and data scientists to explore the data in their data lake using a web interface, I suggested we use Databricks with Unity Catalog.
New account management and roles
There are some Databricks concepts that may be new to you in Azure when you use Unity Catalog. While Databricks workspaces are mostly the same, you now have account-level (organization-level) roles. This is necessary because Unity Catalog crosses workspaces.
Users and service principals created in a workspace are synced to the account as account-level users and service principals. Workspace-local groups are not synced to the account. There are now account-level groups that can be used in workspaces.
To manage your account, you can go to https://accounts.azuredatabricks.net/. If you log in with an AAD B2B user, you’ll need to open the account portal from within a workspace: go to your workspace, select your username in the top right of the page to open the menu, and then choose Manage Account. The account portal opens in a new browser window.
Requirements
To create a Unity Catalog metastore you’ll need:
- A Databricks workspace configured to use the Premium pricing tier
- A storage account with the hierarchical namespace enabled (see the sketch after this list)
- Azure Active Directory Global Administrator privileges (at least temporarily, or borrow your AAD admin for 5 minutes during setup).
- Contributor or Owner role in a resource group where you can create an Access Connector for Azure Databricks
The pricing tier is set on the Basics page when you create the Databricks workspace.
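If you prefer to script the storage account rather than create it in the portal, here is a minimal sketch using the Azure SDK for Python (azure-identity and azure-mgmt-storage). The resource group, account name, and region are placeholders; the key detail is setting is_hns_enabled so the account supports the hierarchical namespace.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

subscription_id = "<subscription-id>"
client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

# Create an ADLS Gen2-capable storage account (hierarchical namespace enabled).
# Resource group and account name are placeholders; the account name must be globally unique.
poller = client.storage_accounts.begin_create(
    resource_group_name="rg-databricks",
    account_name="stunitycatalogmeta",
    parameters=StorageAccountCreateParameters(
        location="eastus",
        kind="StorageV2",
        sku=Sku(name="Standard_LRS"),
        is_hns_enabled=True,  # hierarchical namespace (ADLS Gen2)
    ),
)
account = poller.result()
print(account.name, account.is_hns_enabled)
```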
High-level steps
- Create a storage container in your ADLS account.
- Create an access connector for Azure Databricks.
- Assign the Storage Blob Data Contributor role on the storage account to the managed identity for the access connector for Azure Databricks.
- Assign yourself account administrator in the Databricks account console.
- Create a Unity Catalog metastore.
- Assign a workspace to the Unity Catalog metastore.
The storage container holds the metadata and any managed data for your metastore. You’ll likely want to use this Unity Catalog metastore rather than the default Hive metastore that comes with your Databricks workspace.
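Here is a small sketch of creating that container with the same Azure SDK for Python, again with placeholder resource group, account, and container names:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Create the container that will serve as the metastore's storage root.
# An empty BlobContainer body keeps the default (private) container settings.
client.blob_containers.create(
    resource_group_name="rg-databricks",
    account_name="stunitycatalogmeta",
    container_name="unity-catalog",
    blob_container={},
)
```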
The access connector will show up as a separate resource in the Azure Portal.
You don’t grant storage account access to end users; instead, you grant users access to data via Unity Catalog. The access connector allows the workspace to access the data in the storage account on behalf of Unity Catalog users. This is why you must assign the Storage Blob Data Contributor role to the access connector’s managed identity.
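As an illustration, here is a hedged sketch of that role assignment using the azure-mgmt-authorization package. The principal ID is the object ID of the access connector’s managed identity (shown on the access connector resource in the portal), and the GUID is the commonly documented built-in role definition ID for Storage Blob Data Contributor; verify both in your own tenant.

```python
import uuid
from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

subscription_id = "<subscription-id>"

# Scope the assignment to the storage account created earlier (placeholder names).
scope = (
    f"/subscriptions/{subscription_id}/resourceGroups/rg-databricks"
    "/providers/Microsoft.Storage/storageAccounts/stunitycatalogmeta"
)

# Built-in role definition ID for Storage Blob Data Contributor (verify in your tenant).
role_definition_id = (
    f"/subscriptions/{subscription_id}/providers/Microsoft.Authorization"
    "/roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe"
)

auth_client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)
auth_client.role_assignments.create(
    scope=scope,
    role_assignment_name=str(uuid.uuid4()),  # role assignment names are GUIDs
    parameters=RoleAssignmentCreateParameters(
        role_definition_id=role_definition_id,
        principal_id="<access-connector-managed-identity-object-id>",
        principal_type="ServicePrincipal",  # managed identities assign as service principals
    ),
)
```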
The confusing part of setup
If you are not a Databricks account administrator, you won’t see the option to create a metastore in the account console. If you aren’t an AAD Global Admin, you need an AAD Global Admin to log into the Databricks account console and assign your user to the account admin role. It’s quite possible that the AAD Global Admin(s) in your tenant don’t know and don’t care what Databricks or Unity Catalog is. And most data engineers are not Global Admins in their company’s tenant. If you think this requirement should be changed, feel free to vote for my idea here.
Once you have been assigned the account admin role in Databricks, you will see the button to create a metastore in the account console.
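If you would rather script this step than click the button, the Databricks account API can create the metastore as well. The sketch below uses the Databricks Python SDK; the account-level method and argument names vary by SDK version, so treat them as assumptions to check against the SDK documentation, and note that all IDs and names are placeholders.

```python
from databricks.sdk import AccountClient
from databricks.sdk.service.catalog import CreateMetastore

# Authenticate as a Databricks account admin (for example via Azure CLI credentials).
a = AccountClient(
    host="https://accounts.azuredatabricks.net",
    account_id="<databricks-account-id>",
)

# Assumption: the account-level metastores API wraps the request in metastore_info.
created = a.metastores.create(
    metastore_info=CreateMetastore(
        name="primary-metastore",
        region="eastus",
        # Storage root points at the container created earlier.
        storage_root="abfss://unity-catalog@stunitycatalogmeta.dfs.core.windows.net/",
    )
)
print(created)
```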
One or multiple metastores?
The Azure documentation recommends creating only one metastore per region and assigning that metastore to multiple workspaces. The current recommended best practice is to have one metastore that spans environments, business units, and teams. Currently, there is no technical limitation keeping you from creating multiple metastores; the recommendation is pointing you toward a single, centralized place to manage data and permissions.
Within your metastore, you can organize your data into catalogs and schemas. So it could be feasible to use only one metastore if you have all Databricks resources in one region.
In my first metastore, I’m using catalogs to distinguish environments (dev/test/prod) and schemas to distinguish business units. In my scenario, each business unit owns its own data. Within those schemas are external tables and views. Because someone most likely needs to see all of the data for a specific business unit, this makes it easy to grant user access at the schema level.
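To make that layout concrete, here is a small sketch of what it looks like from a notebook attached to a Unity Catalog-enabled cluster (where `spark` is predefined). The catalog, schema, and group names are placeholders standing in for my dev/test/prod and business-unit structure.

```python
# One catalog per environment, one schema per business unit (placeholder names).
for env in ("dev", "test", "prod"):
    spark.sql(f"CREATE CATALOG IF NOT EXISTS {env}")
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {env}.finance")

# Granting at the schema level gives a business unit's group everything in its schema.
spark.sql("GRANT USE CATALOG ON CATALOG prod TO `finance-analysts`")
spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA prod.finance TO `finance-analysts`")
```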
I’ll save table creation and user access for another post. This post stops at getting you through all the setup steps to create your Unity Catalog metastore.
If you prefer learning from videos, I’ve found the Unity Catalog videos on the Advancing Spark YouTube channel to be very helpful.