FlashGenius
Databricks DEA Exam Prep · Topic 1 of 5

Databricks Intelligence Platform & Delta Lake

Architecture · Compute Types · Delta Lake · Time Travel · Liquid Clustering · Unity Catalog

The Databricks Data Intelligence Platform unifies data engineering, analytics, and AI on a single lakehouse architecture. Understanding the platform's architecture, compute model, and Delta Lake storage foundation is essential for the DEA exam.

Core Concepts

The Lakehouse Architecture

Combines the best of data lakes (low-cost storage, flexibility) and data warehouses (ACID transactions, performance, governance). Built on Delta Lake + Unity Catalog. One platform for ETL, BI, ML, and AI workloads.

Delta Lake as the Foundation

Open-source storage layer that brings ACID transactions, scalable metadata handling, and unified batch/streaming to cloud object storage (S3, ADLS, GCS). All Databricks tables are Delta tables by default.

Unity Catalog

Unified governance layer for all data and AI assets: tables, volumes, models, functions. Provides fine-grained access control, data lineage, and auditing across all workspaces in a Databricks account. Three-level namespace: catalog.schema.table.
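A minimal sketch of the three-level namespace in SQL; the names main, sales, and orders are illustrative:

-- Fully qualified reference: catalog.schema.table
SELECT * FROM main.sales.orders LIMIT 10;

-- Or set defaults so unqualified names resolve against them
USE CATALOG main;
USE SCHEMA sales;
SELECT * FROM orders LIMIT 10;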

Databricks Data Intelligence Platform

Integrates Lakeflow (orchestration), Delta Lake (storage), Unity Catalog (governance), and Databricks Mosaic AI (ML/AI) into a single platform. Exam focuses on the data engineering components.

Compute Types

Cluster Type | Use Case | Key Characteristics | Cost Model
All-Purpose Cluster | Interactive development, notebooks, ad hoc analysis | Manual start/stop; persistent; multi-user; supports multiple languages | Higher cost; charged while running
Job Cluster | Scheduled/automated jobs and pipelines | Auto-created at job start, terminated at end; isolated per run | Lower cost; charged only during job run
Serverless Compute | SQL analytics, notebooks, Lakeflow Jobs | No cluster management; instant start; scales automatically | Per-query/per-second billing; no idle cost
High-Concurrency | Multiple simultaneous SQL users | Optimized for concurrent queries; auto-scaling; supports table ACLs | Shared resource; cost-efficient for BI
Single Node | Dev/test, small datasets, ML model training | Spark runs locally on the driver; no workers; ideal for pandas/sklearn workflows | Minimal cost; no worker nodes

Compute Selection

All-Purpose vs Job Cluster

All-purpose = interactive development and shared use (costs money while idle). Job cluster = automated pipelines (created fresh per run, cheaper, isolated). Best practice: develop on all-purpose, productionize on job clusters.

Serverless Advantages

No cluster configuration needed, instant startup (no warm-up time), automatic scaling, pay only for compute used (per-second billing). Ideal for variable/unpredictable workloads and reducing idle costs.

Databricks Runtime (DBR)

Pre-configured environment with Apache Spark, Delta Lake, and optimized libraries. DBR version pins Spark version + optimizations. ML Runtime adds ML libraries (MLflow, scikit-learn, TensorFlow). Choose DBR version intentionally for reproducibility.

Auto-Scaling

Clusters automatically add/remove worker nodes based on workload demand. Min and max workers configured. Reduces cost during low-demand periods. Not recommended for streaming workloads (scale-down can disrupt streams).

Workspace Components

Databricks Workspace

Web-based UI for notebooks, clusters, jobs, data explorer, and repos. Organized by workspace (one per deployment). Supports multiple users with role-based access.

DBFS (Databricks File System)

Distributed filesystem abstraction over cloud object storage. dbfs:/ prefix. Deprecated for most use cases in favor of Unity Catalog Volumes. Still used for legacy workflows and cluster libraries.

Git Folders (formerly Repos)

Connect Databricks workspace to Git providers (GitHub, GitLab, Azure DevOps). Enable version control for notebooks and code. Supports branching, commits, PRs directly from workspace UI.

Delta Lake Architecture

Transaction Log (_delta_log)

The heart of Delta Lake. Every write to a Delta table creates a new JSON commit entry in _delta_log/. Each entry records exactly what changed (add/remove files, schema changes). This log enables ACID transactions, time travel, and concurrent reads/writes.
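A quick way to watch the log grow, assuming an illustrative demo.events table:

-- Each statement below produces one JSON commit in _delta_log/
CREATE TABLE demo.events (id INT, action STRING);   -- version 0
INSERT INTO demo.events VALUES (1, 'click');        -- version 1
INSERT INTO demo.events VALUES (2, 'view');         -- version 2

-- One row per commit: version, timestamp, operation, user
DESCRIBE HISTORY demo.events;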

ACID Transactions

Atomicity (all or nothing), Consistency (data always valid), Isolation (concurrent writes don't interfere), Durability (committed data persists). Delta Lake provides ACID guarantees that plain Parquet/CSV cannot.

Parquet + Transaction Log = Delta

Delta tables store data as Parquet files in cloud storage, with the _delta_log tracking all changes. Reading a Delta table = reading the latest snapshot (current live Parquet files as recorded by the log).

Schema Enforcement

Delta Lake rejects writes that don't match the table schema by default (schema enforcement = ON). Prevents corrupt or malformed data from silently entering your table. Raises an AnalysisException on schema mismatch.
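A minimal sketch of enforcement in action; the demo.users table and its columns are made up:

CREATE TABLE demo.users (id INT, email STRING);

-- Matches the schema: accepted
INSERT INTO demo.users VALUES (1, 'a@example.com');

-- References a column the schema does not have: rejected at analysis time
INSERT INTO demo.users (id, email, age) VALUES (2, 'b@example.com', 42);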

Delta Lake Key Features

Feature | What It Does | Key Command/Config
Time Travel | Query historical versions of a table | VERSION AS OF n or TIMESTAMP AS OF 'date'
MERGE INTO | Upsert: insert new + update/delete existing | MERGE INTO target USING source ON condition
Schema Evolution | Allow new columns to be added on write | option("mergeSchema", "true")
Table Clones | Copy a table with (DEEP) or without (SHALLOW) its data files | CREATE TABLE clone SHALLOW CLONE source
Change Data Feed | Track row-level changes (insert/update/delete) | TBLPROPERTIES (delta.enableChangeDataFeed = true)
Constraints | Enforce data quality rules on write | ALTER TABLE ... ADD CONSTRAINT name CHECK (expr)
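A hedged sketch of two features from the table above (MERGE INTO and constraints); the customers and staging_updates tables and the customer_id key are assumptions:

-- Upsert: update matching rows, insert the rest
MERGE INTO main.sales.customers AS t
USING staging_updates AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Data quality rule enforced on every future write
ALTER TABLE main.sales.customers
  ADD CONSTRAINT valid_email CHECK (email LIKE '%@%');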

Managed vs External Tables

Managed Tables

Databricks manages both metadata AND data files. Dropping the table drops the data. Stored in Unity Catalog's managed storage location. Default for most use cases; simpler lifecycle management.

External Tables

Databricks manages only metadata; data files live in a user-specified location (e.g., s3://my-bucket/path). Dropping the table does NOT drop the data. Use when data must persist beyond table lifecycle or is shared with other systems.
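The same distinction in SQL; the table names and bucket path are illustrative, and the external path must be covered by an external location configured in Unity Catalog:

-- Managed: Unity Catalog owns the files; DROP TABLE deletes the data
CREATE TABLE main.sales.orders_managed (id INT, amount DOUBLE);

-- External: files live at the given path and survive DROP TABLE
CREATE TABLE main.sales.orders_external (id INT, amount DOUBLE)
LOCATION 's3://my-bucket/path/orders';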

OPTIMIZE, ZORDER & VACUUM

OPTIMIZE

Compacts small Parquet files into larger files (default ~1GB each). Reduces the "small file problem" that degrades query performance. Run periodically on frequently written tables: OPTIMIZE table_name.

ZORDER

Co-locates related data in the same files to minimize data scanned. Best for high-cardinality columns frequently used in filters (e.g., user_id). ZORDER is a clause of OPTIMIZE, not a standalone command; they run together: OPTIMIZE table ZORDER BY (col).
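Run as one statement; the table and column names are illustrative:

-- Compact small files, then co-locate rows by a common filter column
OPTIMIZE main.sales.events
ZORDER BY (user_id);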

VACUUM

Removes old data files no longer referenced by the transaction log. Default retention = 7 days. VACUUM table RETAIN 168 HOURS. WARNING: Running VACUUM with retention < 7 days breaks time travel for vacuumed versions. Do not use RETAIN 0 HOURS in production.
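A safe pattern, again with an illustrative table name:

-- Preview which files would be deleted (nothing is removed)
VACUUM main.sales.events DRY RUN;

-- Remove files unreferenced for at least the 7-day default (168 hours)
VACUUM main.sales.events RETAIN 168 HOURS;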

Liquid Clustering

Liquid Clustering (replaces partitioning + ZORDER)

Next-generation data layout optimization. Automatically clusters data based on specified columns without rigid partition directories. More flexible than traditional partitioning; clustering columns can be changed without rewriting the table.

Liquid Clustering vs Partitioning

Traditional partitioning creates a fixed directory structure (risky with high-cardinality columns: too many small partitions). Liquid Clustering = no rigid directories, handles high-cardinality columns, supports incremental clustering. Enable with CLUSTER BY (col1, col2) at table creation.
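A sketch with illustrative names:

-- Clustering columns are declared at creation; no directory layout is fixed
CREATE TABLE main.sales.clicks (
  user_id BIGINT,
  event_date DATE,
  payload STRING
) CLUSTER BY (user_id, event_date);

-- Unlike partition columns, clustering columns can be changed later
ALTER TABLE main.sales.clicks CLUSTER BY (event_date);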

Predictive Optimization

Databricks automatically runs OPTIMIZE and VACUUM on Unity Catalog managed tables based on usage patterns; no manual scheduling needed. Enabled by default for Unity Catalog managed tables. Removes the operational burden of table maintenance.
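Where it is not already on for an account, it can be toggled per catalog or schema; the names here are illustrative:

ALTER CATALOG main ENABLE PREDICTIVE OPTIMIZATION;
ALTER SCHEMA main.sales INHERIT PREDICTIVE OPTIMIZATION;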

Time Travel Operations

Operation | Syntax | Use Case
Query by version | SELECT * FROM table VERSION AS OF 5 | Audit specific write
Query by timestamp | SELECT * FROM table TIMESTAMP AS OF '2024-01-01' | Reproduce historical state
Restore table | RESTORE TABLE table TO VERSION AS OF 3 | Roll back bad writes
View history | DESCRIBE HISTORY table | See all commits and operations
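One pattern the table above does not show: using time travel to diff the current state against an earlier version (table name illustrative):

-- Rows present now that were not in version 5
SELECT * FROM main.sales.orders
EXCEPT
SELECT * FROM main.sales.orders VERSION AS OF 5;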

Practice Quiz: 10 Questions

1. A data engineer needs to run ad hoc SQL queries interactively throughout the day on Delta tables. Multiple analysts will share the compute resource simultaneously. Which cluster type is MOST appropriate?

2. What does the Delta Lake transaction log (_delta_log) enable?

3. A team wants to query a Delta table as it existed two weeks ago for audit purposes. Which command is correct?

4. What is the key difference between a managed table and an external table in Unity Catalog?

5. A data engineer wants to improve query performance on a Delta table that is frequently filtered by customer_id. Which operation should they run?

6. What does running VACUUM on a Delta table do?

7. Which of the following BEST describes Liquid Clustering compared to traditional Hive-style partitioning?

8. A data engineer is developing a new ETL pipeline. They want to iterate quickly with interactive notebooks, then schedule the pipeline to run nightly. What is the BEST compute strategy?

9. What does schema enforcement do in Delta Lake?

10. A team enables Predictive Optimization on their Unity Catalog managed tables. What does this mean operationally?


Memory Hooks: 6 Cards

🏗️

Lakehouse = Lake + Warehouse

Hook: "Delta Lake gives a data lake a business suit"

Raw cheap storage (lake) + ACID transactions + governance (warehouse) = Lakehouse. Delta Lake is the layer that makes the difference.

📋

Transaction Log = Table's Diary

Hook: "Every write gets a diary entry in _delta_log"

Every Delta table write creates a JSON commit in _delta_log/. This log enables ACID, time travel, and concurrent access; it IS Delta Lake's brain.

🕰️

Time Travel Syntax

Hook: "VERSION or TIMESTAMP โ€” pick your time machine"

VERSION AS OF n for specific commit. TIMESTAMP AS OF 'date' for point-in-time. RESTORE TABLE to roll back. DESCRIBE HISTORY to see all versions.

🗂️

OPTIMIZE + ZORDER = File Gym

Hook: "OPTIMIZE packs files; ZORDER sorts what's inside"

OPTIMIZE compacts small files → fewer, larger Parquet files. ZORDER co-locates related rows → less data scanned per query. Run together: OPTIMIZE table ZORDER BY (col).

💧

Liquid Clustering vs Partitioning

Hook: "Partitioning is a rigid filing cabinet; Liquid Clustering is a smart search engine"

Partitioning = fixed directory structure (bad for high cardinality). Liquid Clustering = flexible, changeable, no small-file risk. Enable with CLUSTER BY.

🏛️

Unity Catalog 3-Level Namespace

Hook: "Catalog โ†’ Schema โ†’ Table (like Country โ†’ City โ†’ Street)"

catalog.schema.table: every object lives in this hierarchy. Unity Catalog governs all: tables, volumes, models, functions, with lineage and fine-grained access control.

Flashcards: 8 Cards

Flashcard 1

What are the 5 compute types and when to use each?

Answer:

All-Purpose: interactive dev (idle cost). Job Cluster: automated pipelines (cheapest for prod). Serverless: instant start, no config, pay-per-use. High-Concurrency: multiple simultaneous SQL users. Single Node: dev/test, runs Spark locally on the driver with no workers.

Flashcard 2

What is the Delta Lake transaction log and what does it enable?

Answer:

_delta_log/ directory with JSON commit files for every write. Enables: ACID transactions, time travel (query historical versions), concurrent reads/writes, schema enforcement, and table history.

Flashcard 3

OPTIMIZE vs ZORDER vs VACUUM: what does each do?

Answer:

OPTIMIZE: compacts small files into larger Parquet files. ZORDER: co-locates related rows by column value for faster filtering (run WITH OPTIMIZE). VACUUM: deletes old unreferenced data files (default 7-day retention).

Flashcard 4

Managed table vs external table in Unity Catalog

Answer:

Managed: Databricks manages data AND metadata. DROP TABLE = data gone. External: Databricks manages metadata only; data stays in your storage. DROP TABLE = data stays. Use external when data must outlive the table.

Flashcard 5

Delta Lake time travel: 3 ways to use it

Answer:

1. Query: SELECT * FROM t VERSION AS OF 5 or TIMESTAMP AS OF 'date'. 2. Restore: RESTORE TABLE t TO VERSION AS OF 3. 3. History: DESCRIBE HISTORY t to see all commits, operations, users.

Flashcard 6

What is Liquid Clustering and how does it differ from partitioning?

Answer:

Liquid Clustering: flexible data layout using CLUSTER BY (col). No rigid directories, handles high cardinality, columns can change without rewriting. Partitioning: fixed directory structure, bad for high-cardinality, creates small-file problem.

Flashcard 7

What is schema enforcement vs schema evolution in Delta Lake?

Answer:

Schema Enforcement (default ON): rejects writes that don't match table schema; prevents silent data corruption. Schema Evolution: allows new columns on write via option('mergeSchema','true'); adds columns instead of rejecting.

Flashcard 8

What does Predictive Optimization do in Unity Catalog?

Answer:

Automatically runs OPTIMIZE and VACUUM on managed Unity Catalog tables based on usage patterns. Eliminates manual scheduling of table maintenance. Enabled by default on Unity Catalog managed tables.

Study Advisor

Beginner Start here if new to Databricks
Start with the Lakehouse concept (lake + warehouse = Delta Lake + Unity Catalog). Then learn the five compute types: All-Purpose (dev), Job Cluster (prod pipelines), Serverless (instant, pay-per-use), High-Concurrency (multi-user SQL), Single Node (driver-only dev/test).
Intermediate Build deeper Delta Lake knowledge
Master Delta Lake: transaction log enables ACID + time travel. Know OPTIMIZE (compact files) vs ZORDER (co-locate data) vs VACUUM (clean old files). Understand managed vs external tables.
Advanced Tackle nuanced topics
Focus on Liquid Clustering vs traditional partitioning, schema enforcement vs evolution, Change Data Feed, table constraints, RESTORE TABLE, and Predictive Optimization behavior.
Exam Focus High-yield exam topics
High-yield: Job cluster = cheapest for production; _delta_log = ACID + time travel; VACUUM = removes unreferenced files (7-day default); Managed table drop = data deleted; OPTIMIZE ZORDER BY = run together; Liquid Clustering = no rigid partitions.
Quick Review Last-minute summary
Cluster types: All-Purpose=dev/idle cost, Job=prod/cheap, Serverless=instant/pay-per-use, High-Concurrency=multi-user SQL; Delta: OPTIMIZE=compact, ZORDER=co-locate, VACUUM=clean; Time travel: VERSION/TIMESTAMP AS OF; Managed=data deleted on drop; Liquid Clustering=flexible/no small files.