Core Concepts
The Lakehouse Architecture
Combines the best of data lakes (low-cost storage, flexibility) and data warehouses (ACID transactions, performance, governance). Built on Delta Lake + Unity Catalog. One platform for ETL, BI, ML, and AI workloads.
Delta Lake as the Foundation
Open-source storage layer that brings ACID transactions, scalable metadata handling, and unified batch/streaming to cloud object storage (S3, ADLS, GCS). All Databricks tables are Delta tables by default.
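A minimal sketch of what "Delta by default" means in practice: a plain CREATE TABLE with no USING clause produces a Delta table, which DESCRIBE DETAIL confirms by reporting the table format. The catalog, schema, and table names below are illustrative.

```sql
-- Illustrative names; no USING clause needed because Delta is the default format
CREATE TABLE main.demo.sales_bronze (
  order_id BIGINT,
  amount   DOUBLE,
  order_ts TIMESTAMP
);

-- DESCRIBE DETAIL reports format = delta plus the table's storage location
DESCRIBE DETAIL main.demo.sales_bronze;
```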
Unity Catalog
Unified governance layer for all data and AI assets: tables, volumes, models, functions. Provides fine-grained access control, data lineage, and auditing across all workspaces attached to the metastore. Three-level namespace: catalog.schema.table.
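A short sketch of the three-level namespace and fine-grained access control together; the catalog, schema, table, and group names are illustrative assumptions.

```sql
-- Fully qualified three-level name: catalog.schema.table
SELECT * FROM main.finance.transactions LIMIT 10;

-- Privileges are granted at any level of the hierarchy
GRANT USE CATALOG ON CATALOG main            TO `data_analysts`;
GRANT USE SCHEMA  ON SCHEMA  main.finance    TO `data_analysts`;
GRANT SELECT      ON TABLE   main.finance.transactions TO `data_analysts`;
```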
Databricks Data Intelligence Platform
Integrates Lakeflow (orchestration), Delta Lake (storage), Unity Catalog (governance), and Databricks Mosaic AI (ML/AI) into a single platform. Exam focuses on the data engineering components.
Compute Types
| Cluster Type | Use Case | Key Characteristics | Cost Model |
|---|---|---|---|
| All-Purpose Cluster | Interactive development, notebooks, ad hoc analysis | Manual start/stop; persistent; multi-user; supports multiple languages | Higher cost; charged while running |
| Job Cluster | Scheduled/automated jobs and pipelines | Auto-created at job start, terminated at end; isolated per run | Lower cost; charged only during job run |
| Serverless Compute | SQL analytics, notebooks (preview), Lakeflow Jobs | No cluster management; instant start; scales automatically | Per-query/per-second billing; no idle cost |
| High-Concurrency | Multiple simultaneous SQL users | Optimized for concurrent queries; auto-scaling; supports table ACLs | Shared resource; cost-efficient for BI |
| Single Node | Dev/test, small datasets, ML model training | Spark runs locally on the driver; no worker nodes; ideal for pandas/sklearn workflows | Minimal cost; no worker nodes |
Compute Selection
All-Purpose vs Job Cluster
All-purpose = interactive development and shared use (costs money while idle). Job cluster = automated pipelines (created fresh per run, cheaper, isolated). Best practice: develop on all-purpose, productionize on job clusters.
Serverless Advantages
No cluster configuration needed, instant startup (no warm-up time), automatic scaling, pay only for compute used (per-second billing). Ideal for variable/unpredictable workloads and reducing idle costs.
Databricks Runtime (DBR)
Pre-configured environment with Apache Spark, Delta Lake, and optimized libraries. DBR version pins Spark version + optimizations. ML Runtime adds ML libraries (MLflow, scikit-learn, TensorFlow). Choose DBR version intentionally for reproducibility.
Auto-Scaling
Clusters automatically add/remove worker nodes based on workload demand. Min and max workers configured. Reduces cost during low-demand periods. Not recommended for streaming workloads (scale-down can disrupt streams).
Workspace Components
Databricks Workspace
Web-based UI for notebooks, clusters, jobs, data explorer, and repos. Organized by workspace (one per deployment). Supports multiple users with role-based access.
DBFS (Databricks File System)
Distributed filesystem abstraction over cloud object storage. dbfs:/ prefix. Deprecated for most use cases in favor of Unity Catalog Volumes. Still used for legacy workflows and cluster libraries.
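For contrast with dbfs:/ paths, a hedged sketch of reading a file from a Unity Catalog Volume path; the volume and file names are assumptions, and the read_files options shown depend on the file format.

```sql
-- Recommended modern pattern: files live in a Unity Catalog Volume, not dbfs:/
-- (catalog, schema, volume, and file names are illustrative)
SELECT *
FROM read_files(
  '/Volumes/main/raw/landing/orders.csv',
  format => 'csv',
  header => true
);
```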
Git Folders (formerly Repos)
Connect Databricks workspace to Git providers (GitHub, GitLab, Azure DevOps). Enable version control for notebooks and code. Supports branching, commits, PRs directly from workspace UI.
Delta Lake Architecture
Transaction Log (_delta_log)
The heart of Delta Lake. Every write to a Delta table creates a new JSON commit entry in _delta_log/. Each entry records exactly what changed (add/remove files, schema changes). This log enables ACID transactions, time travel, and concurrent reads/writes.
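One way to see the transaction log at work (table name illustrative): every row returned by DESCRIBE HISTORY corresponds to one commit entry in _delta_log/.

```sql
-- Each history row maps to one JSON commit in the table's _delta_log/ directory.
-- Typical columns: version, timestamp, operation (WRITE, MERGE, OPTIMIZE, ...),
-- and operationMetrics (files added/removed, rows written).
DESCRIBE HISTORY main.finance.transactions;
```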
ACID Transactions
Atomicity (all or nothing), Consistency (data always valid), Isolation (concurrent writes don't interfere), Durability (committed data persists). Delta Lake provides ACID guarantees that plain Parquet/CSV cannot.
Parquet + Transaction Log = Delta
Delta tables store data as Parquet files in cloud storage, with the _delta_log tracking all changes. Reading a Delta table = reading the latest snapshot (current live Parquet files as recorded by the log).
Schema Enforcement
Delta Lake rejects writes that don't match the table schema by default (schema enforcement = ON). Prevents corrupt or malformed data from silently entering your table. Raises an AnalysisException on schema mismatch.
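A sketch of enforcement versus an explicit SQL escape hatch, with illustrative names: an insert that references a column the table does not have is rejected, while ALTER TABLE ... ADD COLUMNS evolves the schema first so the insert succeeds.

```sql
-- Illustrative two-column Delta table
CREATE TABLE main.demo.events (event_id BIGINT, event_type STRING);

-- Schema enforcement: this insert names a column the table doesn't have and fails
-- INSERT INTO main.demo.events (event_id, event_type, country) VALUES (1, 'click', 'DE');

-- Evolve the schema explicitly, then the same data can be written
ALTER TABLE main.demo.events ADD COLUMNS (country STRING);
INSERT INTO main.demo.events VALUES (1, 'click', 'DE');
```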
Delta Lake Key Features
| Feature | What It Does | Key Command/Config |
|---|---|---|
| Time Travel | Query historical versions of a table | VERSION AS OF n or TIMESTAMP AS OF 'date' |
| MERGE INTO | Upsert: insert new + update/delete existing | MERGE INTO target USING source ON condition |
| Schema Evolution | Allow new columns to be added on write | option("mergeSchema", "true") |
| Table Clones | Copy table structure with or without data | CREATE TABLE clone SHALLOW/DEEP CLONE source |
| Change Data Feed | Track row-level changes (insert/update/delete) | delta.enableChangeDataFeed = true |
| Constraints | Enforce data quality rules on write | ALTER TABLE ADD CONSTRAINT name CHECK (expr) |
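Of the commands in the table above, MERGE INTO is the one most worth memorizing. A minimal upsert sketch, assuming illustrative target and source tables that share a customer_id key:

```sql
-- Upsert: update rows that match on the key, insert rows that don't
MERGE INTO main.sales.customers AS t
USING main.sales.customers_updates AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *;
```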
Managed vs External Tables
Managed Tables
Databricks manages both metadata AND data files. Dropping the table drops the data. Stored in Unity Catalog's managed storage location. Default for most use cases; simpler lifecycle management.
External Tables
Databricks manages only metadata; data files live in a user-specified location (e.g., s3://my-bucket/path). Dropping the table does NOT drop the data. Use when data must persist beyond table lifecycle or is shared with other systems.
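A side-by-side sketch with illustrative names; the external table's LOCATION assumes an external location and storage credential have already been configured in Unity Catalog.

```sql
-- Managed table: Unity Catalog controls the storage location; DROP TABLE removes the data
CREATE TABLE main.sales.orders_managed (order_id BIGINT, amount DOUBLE);

-- External table: data lives at a path you control; DROP TABLE removes only the metadata
CREATE TABLE main.sales.orders_external (order_id BIGINT, amount DOUBLE)
LOCATION 's3://my-bucket/path/orders';
```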
OPTIMIZE, ZORDER & VACUUM
OPTIMIZE
Compacts small Parquet files into larger files (default ~1GB each). Reduces the "small file problem" that degrades query performance. Run periodically on frequently written tables: OPTIMIZE table_name.
ZORDER
Co-locates related data in the same files to minimize data scanned per query. Best for high-cardinality filter columns (e.g., user_id, date). ZORDER is not a standalone command; it runs as a clause of OPTIMIZE: OPTIMIZE table ZORDER BY (col).
VACUUM
Removes old data files no longer referenced by the transaction log. Default retention = 7 days. VACUUM table RETAIN 168 HOURS. WARNING: Running VACUUM with retention < 7 days breaks time travel for vacuumed versions. Do not use RETAIN 0 HOURS in production.
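A typical maintenance sequence as a sketch; table and column names are illustrative.

```sql
-- Compact small files and co-locate rows by the common filter column
OPTIMIZE main.sales.orders ZORDER BY (customer_id);

-- Confirm the OPTIMIZE commit in the table history
DESCRIBE HISTORY main.sales.orders;

-- Remove data files unreferenced by the log and older than the 7-day default retention
VACUUM main.sales.orders RETAIN 168 HOURS;
```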
Liquid Clustering
Liquid Clustering (replaces partitioning + ZORDER)
Next-generation data layout optimization. Automatically clusters data based on specified columns without rigid partition directories. More flexible than traditional partitioning: clustering columns can be changed without rewriting the table.
Liquid Clustering vs Partitioning
Traditional partitioning creates a fixed directory structure (risky with high-cardinality columns, which produce too many small partitions). Liquid Clustering = no rigid directories, handles high-cardinality columns, supports incremental clustering. Enable with: CLUSTER BY (col1, col2) at table creation.
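A sketch of enabling Liquid Clustering at creation and changing the clustering key later; names are illustrative.

```sql
-- Liquid Clustering declared at table creation
CREATE TABLE main.sales.events (
  event_id   BIGINT,
  user_id    BIGINT,
  event_date DATE
)
CLUSTER BY (user_id, event_date);

-- Clustering columns can be changed later without rewriting the table
ALTER TABLE main.sales.events CLUSTER BY (user_id);

-- OPTIMIZE incrementally clusters newly written data
OPTIMIZE main.sales.events;
```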
Predictive Optimization
Databricks automatically runs OPTIMIZE and VACUUM on Unity Catalog managed tables based on usage patterns, so no manual scheduling is needed. Enabled by default on Unity Catalog. Removes the operational burden of table maintenance.
Time Travel Operations
| Operation | Syntax | Use Case |
|---|---|---|
| Query by version | SELECT * FROM table VERSION AS OF 5 | Audit a specific write |
| Query by timestamp | SELECT * FROM table TIMESTAMP AS OF '2024-01-01' | Reproduce historical state |
| Restore table | RESTORE TABLE table TO VERSION AS OF 3 | Roll back bad writes |
| View history | DESCRIBE HISTORY table | See all commits and operations |
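Putting the table above together as one sketch, with an illustrative table name:

```sql
-- Query historical snapshots by version number or by timestamp
SELECT * FROM main.sales.orders VERSION AS OF 5;
SELECT * FROM main.sales.orders TIMESTAMP AS OF '2024-01-01';

-- Roll back a bad write, then confirm the RESTORE commit in the history
RESTORE TABLE main.sales.orders TO VERSION AS OF 3;
DESCRIBE HISTORY main.sales.orders;
```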
Practice Quiz: 10 Questions
1. A data engineer needs to run ad hoc SQL queries interactively throughout the day on Delta tables. Multiple analysts will share the compute resource simultaneously. Which cluster type is MOST appropriate?
2. What does the Delta Lake transaction log (_delta_log) enable?
3. A team wants to query a Delta table as it existed two weeks ago for audit purposes. Which command is correct?
4. What is the key difference between a managed table and an external table in Unity Catalog?
5. A data engineer wants to improve query performance on a Delta table that is frequently filtered by customer_id. Which operation should they run?
6. What does running VACUUM on a Delta table do?
7. Which of the following BEST describes Liquid Clustering compared to traditional Hive-style partitioning?
8. A data engineer is developing a new ETL pipeline. They want to iterate quickly with interactive notebooks, then schedule the pipeline to run nightly. What is the BEST compute strategy?
9. What does schema enforcement do in Delta Lake?
10. A team enables Predictive Optimization on their Unity Catalog managed tables. What does this mean operationally?
Memory Hooks: 6 Cards
Lakehouse = Lake + Warehouse
Raw cheap storage (lake) + ACID transactions + governance (warehouse) = Lakehouse. Delta Lake is the layer that makes the difference.
Transaction Log = Table's Diary
_delta_log"Every Delta table write creates a JSON commit in _delta_log/. This log enables ACID, time travel, and concurrent access โ it IS Delta Lake's brain.
Time Travel Syntax
VERSION AS OF n for specific commit. TIMESTAMP AS OF 'date' for point-in-time. RESTORE TABLE to roll back. DESCRIBE HISTORY to see all versions.
OPTIMIZE + ZORDER = File Gym
OPTIMIZE compacts small files into fewer, larger Parquet files. ZORDER co-locates related rows so less data is scanned per query. Run together: OPTIMIZE table ZORDER BY (col).
Liquid Clustering vs Partitioning
Partitioning = fixed directory structure (bad for high cardinality). Liquid Clustering = flexible, changeable, no small-file risk. Enable with CLUSTER BY.
Unity Catalog 3-Level Namespace
catalog.schema.table: every object lives in this hierarchy. Unity Catalog governs all: tables, volumes, models, functions, with lineage and fine-grained access control.
Flashcards
What are the main compute types and when to use each?
All-Purpose: interactive dev (idle cost). Job Cluster: automated pipelines (cheapest for prod). Serverless: instant start, no config, pay-per-use. High-Concurrency: multiple simultaneous SQL users. Single-Node: dev/test, Spark runs locally on the driver with no workers.
What is the Delta Lake transaction log and what does it enable?
_delta_log/ directory with JSON commit files for every write. Enables: ACID transactions, time travel (query historical versions), concurrent reads/writes, schema enforcement, and table history.
OPTIMIZE vs ZORDER vs VACUUM: what does each do?
OPTIMIZE: compacts small files into larger Parquet files. ZORDER: co-locates related rows by column value for faster filtering (run WITH OPTIMIZE). VACUUM: deletes old unreferenced data files (default 7-day retention).
Managed table vs external table in Unity Catalog
Managed: Databricks manages data AND metadata. DROP TABLE = data gone. External: Databricks manages metadata only; data stays in your storage. DROP TABLE = data stays. Use external when data must outlive the table.
Delta Lake time travel: 3 ways to use it
1. Query: SELECT * FROM t VERSION AS OF 5 or TIMESTAMP AS OF 'date'. 2. Restore: RESTORE TABLE t TO VERSION AS OF 3. 3. History: DESCRIBE HISTORY t to see all commits, operations, users.
What is Liquid Clustering and how does it differ from partitioning?
Liquid Clustering: flexible data layout using CLUSTER BY (col). No rigid directories, handles high cardinality, columns can change without rewriting. Partitioning: fixed directory structure, bad for high-cardinality, creates small-file problem.
What is schema enforcement vs schema evolution in Delta Lake?
Schema Enforcement (default ON): rejects writes that don't match the table schema, preventing silent data corruption. Schema Evolution: allows new columns on write via option('mergeSchema','true'), adding columns instead of rejecting the write.
What does Predictive Optimization do in Unity Catalog?
Automatically runs OPTIMIZE and VACUUM on managed Unity Catalog tables based on usage patterns. Eliminates manual scheduling of table maintenance. Enabled by default on Unity Catalog managed tables.
Study Advisor
_delta_log = ACID + time travel; VACUUM = removes unreferenced files (7-day default); Managed table drop = data deleted; OPTIMIZE ZORDER BY = run together; Liquid Clustering = no rigid partitions.