Batch, Streaming, and Relational Data

This is part of a series on my preparation for the DP-900 exam, Microsoft Azure Data Fundamentals, which is part of a number of certification paths. You can read the other posts I’ve created as part of this learning experience.

The first part of the DP-900 skills document has these items:

describe batch data
describe streaming data
describe the difference between batch and streaming data
describe the characteristics of relational data

These concepts are important for this exam. I mostly blew them off when I started studying, but everyone else with study guides and practice tests puts a lot of focus here. I’m glad I spent time on them.

This post covers these concepts a bit. Note, these are more ETL/analytic concepts, not really

Batch Data

Most of my career has dealt with batch data, meaning a bunch of data that arrives at once and is imported into a system. This is different from a connection and query submitted to an OLTP system. The general idea is:

Lots of data
Processed periodically
Latency doesn’t matter

Think of these key words:

Not real-time
periodic
large/big/lots

There is an MS Docs article on this. The general idea is that you want to think about a scheduled (or some periodic) processing of lots of data for a purpose.

Examples of where batch is used:

Total up all hours worked last week for employees
Load and transform log files from all web servers each day
Import files from regional offices into a main database server
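
To make the first example concrete, here is a minimal sketch of a batch job in Python. The timesheet rows, the names, and the dates are all made up for illustration; a real job would read from files or a database and run on a schedule. The point is the shape: all the data for the period is already there, and the job processes it in one pass.

```python
from collections import defaultdict
from datetime import date

# Hypothetical timesheet rows: (employee, work_date, hours)
TIMESHEET = [
    ("ana", date(2021, 3, 1), 8.0),
    ("ana", date(2021, 3, 2), 7.5),
    ("bob", date(2021, 3, 1), 6.0),
    ("bob", date(2021, 3, 8), 8.0),  # falls in the following week
]


def total_hours(rows, week_start, week_end):
    """Sum hours per employee for work dates in [week_start, week_end]."""
    totals = defaultdict(float)
    for employee, work_date, hours in rows:
        if week_start <= work_date <= week_end:
            totals[employee] += hours
    return dict(totals)


# Run once for the whole week; nobody is waiting on an instant answer.
weekly = total_hours(TIMESHEET, date(2021, 3, 1), date(2021, 3, 7))
print(weekly)
```

Nothing here cares about latency. The job could take minutes and still be fine, which is the batch mindset.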

In the analytics space, you’d be using Azure Data Factory (ADF), HDInsight (Hive, Pig, Spark), Azure Data Lake Analytics (U-SQL), and Azure Data Lake Storage (ADLS).

Streaming Data

There is a course on this topic. When you think of streaming, think of these key words:

real-time
stream
data processed as soon as created
IoT
few transactions
monitoring or instant decision making

Streaming is really about time series, about tumbling windows, about data like a stock ticker that you need to constantly and/or quickly process.
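
A tumbling window is a fixed-size, non-overlapping slice of time. As a rough sketch of the idea (plain Python, not how any Azure service implements it), the generator below averages stock ticks per window and emits a result as soon as a window closes. The function name, the tick data, and the assumption that events arrive in timestamp order are mine.

```python
def tumbling_averages(events, window_seconds):
    """Yield (window_start, average_price) for each fixed, non-overlapping window.

    Assumes events are (timestamp_seconds, price) pairs arriving in time order.
    """
    current_start = None
    total = 0.0
    count = 0
    for ts, price in events:
        start = (ts // window_seconds) * window_seconds
        if current_start is not None and start != current_start:
            # The previous window has closed, so emit its result right away.
            yield current_start, total / count
            total, count = 0.0, 0
        current_start = start
        total += price
        count += 1
    if count:
        # Flush the last (possibly still open) window when the input ends.
        yield current_start, total / count


# Hypothetical ticks: (seconds since start, price)
TICKS = [(0, 10.0), (3, 12.0), (5, 20.0), (9, 22.0), (12, 30.0)]
results = list(tumbling_averages(TICKS, 5))
print(results)
```

Unlike the batch case, each event is handled as it arrives and results come out continuously, one per window, rather than once at the end of the day.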

Differences

These items helped me:

Lots of data – Batch
Low latency – Stream
Long latency, latency doesn’t matter, periodic work – Batch
Small, constant sets of data – Stream

Relational Data

The workload here is regular changes to data, lots of inserts, updates, and deletes, for a business process. Really this means you are thinking of some sort of CRUD application that gets and sends data to users as they work, but without strict low-latency requirements. Think of a web server or a data entry business app, something that operates on human time scales of seconds, not real-time, IoT, millisecond work.
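
As a small illustration of that kind of CRUD workload, here is a sketch using Python's built-in sqlite3 module as a stand-in for a relational database. The customers table and its rows are invented for the example, and SQLite is only there so the code runs anywhere; the same statements would look much the same against any relational system.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")

# Relational data lives in tables with a fixed set of typed columns.
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)

# Create
conn.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Ana", "Denver"))
conn.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Bob", "Boise"))

# Update
conn.execute("UPDATE customers SET city = ? WHERE name = ?", ("Austin", "Ana"))

# Delete
conn.execute("DELETE FROM customers WHERE name = ?", ("Bob",))
conn.commit()

# Read
rows = conn.execute("SELECT id, name, city FROM customers ORDER BY id").fetchall()
print(rows)
conn.close()
```

Each statement touches a row or two and finishes quickly, which is the small, frequent, human-paced pattern described above.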