Core Concepts
The Lakehouse Architecture
Combines the best of data lakes (low-cost storage, flexibility) and data warehouses (ACID transactions, performance, governance). Built on Delta Lake + Unity Catalog. One platform for ETL, BI, ML, and AI workloads.
Delta Lake as the Foundation
Open-source storage layer that brings ACID transactions, scalable metadata handling, and unified batch/streaming to cloud object storage (S3, ADLS, GCS). All Databricks tables are Delta tables by default.
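A minimal sketch of what "Delta by default" means in practice: a plain CREATE TABLE with no USING clause produces a Delta table, which DESCRIBE DETAIL confirms by reporting the table format. The catalog, schema, and table names below are illustrative.

```sql
-- Illustrative names; no USING clause needed because Delta is the default format
CREATE TABLE main.demo.sales_bronze (
  order_id BIGINT,
  amount   DOUBLE,
  order_ts TIMESTAMP
);

-- DESCRIBE DETAIL reports format = delta plus the table's storage location
DESCRIBE DETAIL main.demo.sales_bronze;
```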
Unity Catalog
Unified governance layer for all data and AI assets: tables, volumes, models, functions. Provides fine-grained access control, data lineage, and auditing across all workspaces attached to the metastore. Three-level namespace: catalog.schema.table.
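A short sketch of the three-level namespace and fine-grained access control together; the catalog, schema, table, and group names are illustrative assumptions.

```sql
-- Fully qualified three-level name: catalog.schema.table
SELECT * FROM main.finance.transactions LIMIT 10;

-- Privileges are granted at any level of the hierarchy
GRANT USE CATALOG ON CATALOG main            TO `data_analysts`;
GRANT USE SCHEMA  ON SCHEMA  main.finance    TO `data_analysts`;
GRANT SELECT      ON TABLE   main.finance.transactions TO `data_analysts`;
```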
Databricks Data Intelligence Platform
Integrates Lakeflow (orchestration), Delta Lake (storage), Unity Catalog (governance), and Databricks Mosaic AI (ML/AI) into a single platform. Exam focuses on the data engineering components.
Compute Types
| Cluster Type | Use Case | Key Characteristics | Cost Model |
|---|---|---|---|
| All-Purpose Cluster | Interactive development, notebooks, ad hoc analysis | Manual start/stop; persistent; multi-user; supports multiple languages | Higher cost; charged while running |
| Job Cluster | Scheduled/automated jobs and pipelines | Auto-created at job start, terminated at end; isolated per run | Lower cost; charged only during job run |
| Serverless Compute | SQL analytics, notebooks (preview), Lakeflow Jobs | No cluster management; instant start; scales automatically | Per-query/per-second billing; no idle cost |
| High-Concurrency | Multiple simultaneous SQL users | Optimized for concurrent queries; auto-scaling; supports table ACLs | Shared resource; cost-efficient for BI |
| Single Node | Dev/test, small datasets, ML model training | Spark runs locally on the driver; no worker nodes; ideal for pandas/sklearn workflows | Minimal cost; no worker nodes |
Compute Selection
All-Purpose vs Job Cluster
All-purpose = interactive development and shared use (costs money while idle). Job cluster = automated pipelines (created fresh per run, cheaper, isolated). Best practice: develop on all-purpose, productionize on job clusters.
Serverless Advantages
No cluster configuration needed, instant startup (no warm-up time), automatic scaling, pay only for compute used (per-second billing). Ideal for variable/unpredictable workloads and reducing idle costs.
Databricks Runtime (DBR)
Pre-configured environment with Apache Spark, Delta Lake, and optimized libraries. DBR version pins Spark version + optimizations. ML Runtime adds ML libraries (MLflow, scikit-learn, TensorFlow). Choose DBR version intentionally for reproducibility.
Auto-Scaling
Clusters automatically add/remove worker nodes based on workload demand. Min and max workers configured. Reduces cost during low-demand periods. Not recommended for streaming workloads (scale-down can disrupt streams).
Workspace Components
Databricks Workspace
Web-based UI for notebooks, clusters, jobs, data explorer, and repos. Organized by workspace (one per deployment). Supports multiple users with role-based access.
DBFS (Databricks File System)
Distributed filesystem abstraction over cloud object storage. dbfs:/ prefix. Deprecated for most use cases in favor of Unity Catalog Volumes. Still used for legacy workflows and cluster libraries.
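For contrast with dbfs:/ paths, a hedged sketch of reading a file from a Unity Catalog Volume path; the volume and file names are assumptions, and the read_files options shown depend on the file format.

```sql
-- Recommended modern pattern: files live in a Unity Catalog Volume, not dbfs:/
-- (catalog, schema, volume, and file names are illustrative)
SELECT *
FROM read_files(
  '/Volumes/main/raw/landing/orders.csv',
  format => 'csv',
  header => true
);
```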
Git Folders (formerly Repos)
Connect Databricks workspace to Git providers (GitHub, GitLab, Azure DevOps). Enable version control for notebooks and code. Supports branching, commits, PRs directly from workspace UI.
Delta Lake Architecture
Transaction Log (_delta_log)
The heart of Delta Lake. Every write to a Delta table creates a new JSON commit entry in _delta_log/. Each entry records exactly what changed (add/remove files, schema changes). This log enables ACID transactions, time travel, and concurrent reads/writes.
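One way to see the transaction log at work (table name illustrative): every row returned by DESCRIBE HISTORY corresponds to one commit entry in _delta_log/.

```sql
-- Each history row maps to one JSON commit in the table's _delta_log/ directory.
-- Typical columns: version, timestamp, operation (WRITE, MERGE, OPTIMIZE, ...),
-- and operationMetrics (files added/removed, rows written).
DESCRIBE HISTORY main.finance.transactions;
```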
ACID Transactions
Atomicity (all or nothing), Consistency (data always valid), Isolation (concurrent writes don't interfere), Durability (committed data persists). Delta Lake provides ACID guarantees that plain Parquet/CSV cannot.
Parquet + Transaction Log = Delta
Delta tables store data as Parquet files in cloud storage, with the _delta_log tracking all changes. Reading a Delta table = reading the latest snapshot (current live Parquet files as recorded by the log).
Schema Enforcement
Delta Lake rejects writes that don't match the table schema by default (schema enforcement = ON). Prevents corrupt or malformed data from silently entering your table. Raises an AnalysisException on schema mismatch.
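A sketch of enforcement versus an explicit SQL escape hatch, with illustrative names: an insert that references a column the table does not have is rejected, while ALTER TABLE ... ADD COLUMNS evolves the schema first so the insert succeeds.

```sql
-- Illustrative two-column Delta table
CREATE TABLE main.demo.events (event_id BIGINT, event_type STRING);

-- Schema enforcement: this insert names a column the table doesn't have and fails
-- INSERT INTO main.demo.events (event_id, event_type, country) VALUES (1, 'click', 'DE');

-- Evolve the schema explicitly, then the same data can be written
ALTER TABLE main.demo.events ADD COLUMNS (country STRING);
INSERT INTO main.demo.events VALUES (1, 'click', 'DE');
```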
Delta Lake Key Features
| Feature | What It Does | Key Command/Config |
|---|---|---|
| Time Travel | Query historical versions of a table | VERSION AS OF n or TIMESTAMP AS OF 'date' |
| MERGE INTO | Upsert: insert new + update/delete existing | MERGE INTO target USING source ON condition |
| Schema Evolution | Allow new columns to be added on write | option("mergeSchema", "true") |
| Table Clones | Copy table structure with or without data | CREATE TABLE clone SHALLOW/DEEP CLONE source |
| Change Data Feed | Track row-level changes (insert/update/delete) | delta.enableChangeDataFeed = true |
| Constraints | Enforce data quality rules on write | ALTER TABLE ADD CONSTRAINT name CHECK (expr) |
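Of the commands in the table above, MERGE INTO is the one most worth memorizing. A minimal upsert sketch, assuming illustrative target and source tables that share a customer_id key:

```sql
-- Upsert: update rows that match on the key, insert rows that don't
MERGE INTO main.sales.customers AS t
USING main.sales.customers_updates AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *;
```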
Managed vs External Tables
Managed Tables
Databricks manages both metadata AND data files. Dropping the table drops the data. Stored in Unity Catalog's managed storage location. Default for most use cases; simpler lifecycle management.
External Tables
Databricks manages only metadata; data files live in a user-specified location (e.g., s3://my-bucket/path). Dropping the table does NOT drop the data. Use when data must persist beyond table lifecycle or is shared with other systems.
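A side-by-side sketch with illustrative names; the external table's LOCATION assumes an external location and storage credential have already been configured in Unity Catalog.

```sql
-- Managed table: Unity Catalog controls the storage location; DROP TABLE removes the data
CREATE TABLE main.sales.orders_managed (order_id BIGINT, amount DOUBLE);

-- External table: data lives at a path you control; DROP TABLE removes only the metadata
CREATE TABLE main.sales.orders_external (order_id BIGINT, amount DOUBLE)
LOCATION 's3://my-bucket/path/orders';
```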
OPTIMIZE, ZORDER & VACUUM
OPTIMIZE
Compacts small Parquet files into larger files (default ~1GB each). Reduces the "small file problem" that degrades query performance. Run periodically on frequently written tables: OPTIMIZE table_name.
ZORDER
Co-locates related data in the same files to minimize data scanned per query. Best for high-cardinality filter columns (e.g., user_id, date). ZORDER is not a standalone command; it runs as a clause of OPTIMIZE: OPTIMIZE table ZORDER BY (col).
VACUUM
Removes old data files no longer referenced by the transaction log. Default retention = 7 days. VACUUM table RETAIN 168 HOURS. WARNING: Running VACUUM with retention < 7 days breaks time travel for vacuumed versions. Do not use RETAIN 0 HOURS in production.
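A typical maintenance sequence as a sketch; table and column names are illustrative.

```sql
-- Compact small files and co-locate rows by the common filter column
OPTIMIZE main.sales.orders ZORDER BY (customer_id);

-- Confirm the OPTIMIZE commit in the table history
DESCRIBE HISTORY main.sales.orders;

-- Remove data files unreferenced by the log and older than the 7-day default retention
VACUUM main.sales.orders RETAIN 168 HOURS;
```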
Liquid Clustering
Liquid Clustering (replaces partitioning + ZORDER)
Next-generation data layout optimization. Automatically clusters data based on specified columns without rigid partition directories. More flexible than traditional partitioning: clustering columns can be changed without rewriting the table.
Liquid Clustering vs Partitioning
Traditional partitioning creates a fixed directory structure (risky with high-cardinality columns, which produce too many small partitions). Liquid Clustering = no rigid directories, handles high-cardinality columns, supports incremental clustering. Enable with: CLUSTER BY (col1, col2) at table creation.
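A sketch of enabling Liquid Clustering at creation and changing the clustering key later; names are illustrative.

```sql
-- Liquid Clustering declared at table creation
CREATE TABLE main.sales.events (
  event_id   BIGINT,
  user_id    BIGINT,
  event_date DATE
)
CLUSTER BY (user_id, event_date);

-- Clustering columns can be changed later without rewriting the table
ALTER TABLE main.sales.events CLUSTER BY (user_id);

-- OPTIMIZE incrementally clusters newly written data
OPTIMIZE main.sales.events;
```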
Predictive Optimization
Databricks automatically runs OPTIMIZE and VACUUM on Unity Catalog managed tables based on usage patterns, so no manual scheduling is needed. Enabled by default on Unity Catalog. Removes the operational burden of table maintenance.
Time Travel Operations
| Operation | Syntax | Use Case |
|---|---|---|
| Query by version | SELECT * FROM table VERSION AS OF 5 | Audit a specific write |
| Query by timestamp | SELECT * FROM table TIMESTAMP AS OF '2024-01-01' | Reproduce historical state |
| Restore table | RESTORE TABLE table TO VERSION AS OF 3 | Roll back bad writes |
| View history | DESCRIBE HISTORY table | See all commits and operations |
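Putting the table above together as one sketch, with an illustrative table name:

```sql
-- Query historical snapshots by version number or by timestamp
SELECT * FROM main.sales.orders VERSION AS OF 5;
SELECT * FROM main.sales.orders TIMESTAMP AS OF '2024-01-01';

-- Roll back a bad write, then confirm the RESTORE commit in the history
RESTORE TABLE main.sales.orders TO VERSION AS OF 3;
DESCRIBE HISTORY main.sales.orders;
```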
Practice Quiz: 10 Questions
1. A data engineer needs to run ad hoc SQL queries interactively throughout the day on Delta tables. Multiple analysts will share the compute resource simultaneously. Which cluster type is MOST appropriate?
2. What does the Delta Lake transaction log (_delta_log) enable?
3. A team wants to query a Delta table as it existed two weeks ago for audit purposes. Which command is correct?
4. What is the key difference between a managed table and an external table in Unity Catalog?
5. A data engineer wants to improve query performance on a Delta table that is frequently filtered by customer_id. Which operation should they run?
6. What does running VACUUM on a Delta table do?
7. Which of the following BEST describes Liquid Clustering compared to traditional Hive-style partitioning?
8. A data engineer is developing a new ETL pipeline. They want to iterate quickly with interactive notebooks, then schedule the pipeline to run nightly. What is the BEST compute strategy?
9. What does schema enforcement do in Delta Lake?
10. A team enables Predictive Optimization on their Unity Catalog managed tables. What does this mean operationally?
Memory Hooks: 6 Cards
Lakehouse = Lake + Warehouse
Raw cheap storage (lake) + ACID transactions + governance (warehouse) = Lakehouse. Delta Lake is the layer that makes the difference.
Transaction Log = Table's Diary
_delta_log"Every Delta table write creates a JSON commit in _delta_log/. This log enables ACID, time travel, and concurrent access โ it IS Delta Lake's brain.
Time Travel Syntax
VERSION AS OF n for specific commit. TIMESTAMP AS OF 'date' for point-in-time. RESTORE TABLE to roll back. DESCRIBE HISTORY to see all versions.
OPTIMIZE + ZORDER = File Gym
OPTIMIZE compacts small files into fewer, larger Parquet files. ZORDER co-locates related rows so less data is scanned per query. Run together: OPTIMIZE table ZORDER BY (col).
Liquid Clustering vs Partitioning
Partitioning = fixed directory structure (bad for high cardinality). Liquid Clustering = flexible, changeable, no small-file risk. Enable with CLUSTER BY.
Unity Catalog 3-Level Namespace
catalog.schema.table: every object lives in this hierarchy. Unity Catalog governs all: tables, volumes, models, functions, with lineage and fine-grained access control.
Flashcards
What are the main compute types and when to use each?
All-Purpose: interactive dev (idle cost). Job Cluster: automated pipelines (cheapest for prod). Serverless: instant start, no config, pay-per-use. High-Concurrency: multiple simultaneous SQL users. Single-Node: dev/test, Spark runs locally on the driver with no workers.
What is the Delta Lake transaction log and what does it enable?
_delta_log/ directory with JSON commit files for every write. Enables: ACID transactions, time travel (query historical versions), concurrent reads/writes, schema enforcement, and table history.
OPTIMIZE vs ZORDER vs VACUUM: what does each do?
OPTIMIZE: compacts small files into larger Parquet files. ZORDER: co-locates related rows by column value for faster filtering (run WITH OPTIMIZE). VACUUM: deletes old unreferenced data files (default 7-day retention).
Managed table vs external table in Unity Catalog
Managed: Databricks manages data AND metadata. DROP TABLE = data gone. External: Databricks manages metadata only; data stays in your storage. DROP TABLE = data stays. Use external when data must outlive the table.
Delta Lake time travel: 3 ways to use it
1. Query: SELECT * FROM t VERSION AS OF 5 or TIMESTAMP AS OF 'date'. 2. Restore: RESTORE TABLE t TO VERSION AS OF 3. 3. History: DESCRIBE HISTORY t to see all commits, operations, users.
What is Liquid Clustering and how does it differ from partitioning?
Liquid Clustering: flexible data layout using CLUSTER BY (col). No rigid directories, handles high cardinality, columns can change without rewriting. Partitioning: fixed directory structure, bad for high-cardinality, creates small-file problem.
What is schema enforcement vs schema evolution in Delta Lake?
Schema Enforcement (default ON): rejects writes that don't match the table schema, preventing silent data corruption. Schema Evolution: allows new columns on write via option('mergeSchema','true'), adding columns instead of rejecting the write.
What does Predictive Optimization do in Unity Catalog?
Automatically runs OPTIMIZE and VACUUM on managed Unity Catalog tables based on usage patterns. Eliminates manual scheduling of table maintenance. Enabled by default on Unity Catalog managed tables.
Study Advisor
_delta_log = ACID + time travel; VACUUM = removes unreferenced files (7-day default); Managed table drop = data deleted; OPTIMIZE ZORDER BY = run together; Liquid Clustering = no rigid partitions.