Lakeflow Jobs is Databricks' native pipeline orchestration system. The DEA exam also covers CI/CD using Git Folders and Databricks Asset Bundles (DABs) for promoting code across environments. These are increasingly important for production data engineering.
Lakeflow Jobs (formerly Databricks Jobs)
Databricks-native workflow orchestration. Define multi-task pipelines using a DAG (Directed Acyclic Graph) of tasks with dependencies. Supports notebooks, SQL queries, Python scripts, DLT pipelines, and more. Replaces external schedulers like Airflow for many Databricks-native workflows.
Why Lakeflow Jobs Over Manual Scheduling
Automated retries, dependency tracking, alerting, run history, and cluster lifecycle management (job clusters auto-create and terminate). One platform for pipeline orchestration and execution, so no separate orchestration tool is needed for pure Databricks workflows.
Databricks Asset Bundles (DABs)
Declarative configuration files (YAML) for packaging and deploying Databricks workspace assets (Jobs, DLT pipelines, notebooks, permissions). Enables Infrastructure-as-Code for Databricks. Supports environment-specific configuration (dev/test/prod targets with variable overrides).
CI/CD in Databricks
Full software development lifecycle within Databricks: code in Git Folders → commit and push → PR review → deploy via DABs. Databricks CLI validates and deploys bundles. Enables reproducible, version-controlled, environment-promoted data pipelines.
Job = Collection of Tasks with Dependencies
A Lakeflow Job defines tasks, their configurations, dependencies (which task must complete before another starts), compute resources, and scheduling. Tasks form a DAG, visualized in the Jobs UI as a task graph.
Task Dependencies
Set depends_on to define execution order. Tasks with no dependencies run first (in parallel if there are several). Downstream tasks start only after all upstream tasks complete successfully. If an upstream task fails, downstream tasks are skipped by default, or run anyway depending on their Run if configuration.
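A minimal sketch of how this looks in a job definition (YAML in the style used inside a bundle's resources files); the task keys and notebook paths are hypothetical:

```yaml
# bronze_orders and bronze_customers have no dependencies, so they run in parallel;
# silver_join starts only after both complete successfully.
tasks:
  - task_key: bronze_orders
    notebook_task:
      notebook_path: ../src/ingest_orders
  - task_key: bronze_customers
    notebook_task:
      notebook_path: ../src/ingest_customers
  - task_key: silver_join
    depends_on:
      - task_key: bronze_orders
      - task_key: bronze_customers
    notebook_task:
      notebook_path: ../src/join_orders_customers
```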
Job Clusters vs All-Purpose Clusters
Job clusters (recommended for production): created fresh for each job run and terminated when done, so there is no idle cost, execution is isolated, and runs are repeatable. All-purpose clusters: shared and always running; convenient for development but expensive for scheduled jobs. Always use job clusters in production.
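A sketch of a job cluster defined once in a job and referenced by a task; the node type and Databricks Runtime version are placeholder values:

```yaml
job_clusters:
  - job_cluster_key: etl_cluster
    new_cluster:
      spark_version: 15.4.x-scala2.12
      node_type_id: i3.xlarge      # cloud-specific, e.g. Standard_DS3_v2 on Azure
      num_workers: 2
tasks:
  - task_key: ingest
    job_cluster_key: etl_cluster   # runs on the ephemeral job cluster above
    notebook_task:
      notebook_path: ../src/ingest
```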
Alerts and Notifications
Configure email or webhook alerts on job start, success, failure, or SLA miss. Set up at the job level or task level. Critical for monitoring production pipelines. Integrate with PagerDuty, Slack, or email via webhook.
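A small sketch of job-level email alerts with placeholder addresses; webhook destinations (Slack, PagerDuty) are configured similarly by referencing notification destination IDs:

```yaml
# Notify the on-call address on failure; add on_start/on_success lists as needed.
email_notifications:
  on_failure:
    - data-eng-oncall@example.com
```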
| Trigger Type | Description | Use Case |
|---|---|---|
| Scheduled (time-based) | Cron expression or UI schedule (e.g., every hour, daily at 8AM) | Time-predictable workloads; daily ETL runs |
| File Arrival | Job triggers when new files land in a specified cloud storage path | Event-driven ingestion pipelines |
| Table Update | Job triggers when a Unity Catalog table's data is updated | Data-driven dependencies between pipelines |
| Manual | Triggered via UI or API on demand | Testing, ad hoc runs, debugging |
| Continuous | Runs continuously with minimal gap between runs | Near-real-time processing without full streaming |
Time-Based vs Data-Driven Triggers
Time-based (scheduled cron): predictable and simple, but may run when no new data exists. Data-driven (file arrival or table update): more efficient, since the job only runs when there is actually new data to process. Use data-driven triggers when upstream data arrival is unpredictable.
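For reference, a time-based schedule sketch using a Quartz cron expression (daily at 8 AM UTC); the pause_status field is optional:

```yaml
schedule:
  quartz_cron_expression: "0 0 8 * * ?"   # seconds minutes hours day month day-of-week
  timezone_id: "UTC"
  pause_status: UNPAUSED
```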
File Arrival Trigger
Monitors a specified cloud storage path for new files. Job runs automatically when files arrive. Eliminates the need to poll or schedule unnecessarily. Ideal for event-driven ingestion workflows where files are pushed by upstream systems.
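A sketch of a file arrival trigger, assuming a hypothetical Unity Catalog volume path as the monitored location:

```yaml
# The job runs automatically when new files land under this path.
trigger:
  file_arrival:
    url: "/Volumes/dev_catalog/raw/landing/orders/"
```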
Table Update Trigger
Monitors a Unity Catalog table for data changes. Job runs when the monitored table receives new data. Enables pipeline chaining within Databricks without external orchestration: downstream jobs react when upstream jobs write to the monitored table.
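A hedged sketch of a table update trigger; the field names follow the Jobs API's table_update trigger and the table name is a placeholder:

```yaml
# The job runs when new data is committed to the listed table(s).
trigger:
  table_update:
    table_names:
      - dev_catalog.silver.orders
```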
| Task Type | What It Runs | When to Use |
|---|---|---|
| Notebook | Databricks notebook (.py, .sql, .scala, .r) | Interactive-style code, exploration, prototyping |
| SQL Query | SQL statement or saved query | Pure SQL transformations, analytics |
| Dashboard | Refreshes a Databricks dashboard | Scheduled report updates |
| Pipeline (DLT) | Delta Live Tables pipeline | DLT-based ingestion or transformation pipelines |
| Python Script | .py file from Git Folder or Volumes | Modular Python code, libraries |
| dbt | dbt project task | dbt model runs within Databricks |
| Spark Submit | spark-submit style job | Legacy Spark applications |
Retries
Configure automatic retries on task failure. Set max_retries and min_retry_interval_millis. Useful for transient failures (network timeouts, temporary API errors). After all retries are exhausted, the task is marked as failed and downstream tasks react accordingly.
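A sketch of per-task retry settings, assuming a hypothetical task that calls a flaky external API; two retries with a one-minute wait between attempts:

```yaml
tasks:
  - task_key: call_external_api
    max_retries: 2
    min_retry_interval_millis: 60000   # wait 60s between attempts
    retry_on_timeout: true
    notebook_task:
      notebook_path: ../src/call_api
```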
Branching (If/Else Conditions)
Conditional task execution based on the outcome of previous tasks. A Run if condition controls when a task runs relative to its dependencies: all succeeded (the default), at least one failed, all done, and so on. Example: run a cleanup task only if the main task failed. Enables error-handling branches in the DAG.
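A sketch of an error-handling branch using run_if, with hypothetical task keys and notebook paths:

```yaml
# cleanup_on_failure runs only when at least one of its dependencies failed.
# Other run_if values include ALL_SUCCESS (default), ALL_DONE, ALL_FAILED, etc.
tasks:
  - task_key: main_load
    notebook_task:
      notebook_path: ../src/load
  - task_key: cleanup_on_failure
    depends_on:
      - task_key: main_load
    run_if: AT_LEAST_ONE_FAILED
    notebook_task:
      notebook_path: ../src/cleanup
```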
Looping (For Each Task)
Iterate over a list of inputs and run a sub-task for each item in parallel or sequentially. Example: process a list of tables or files, running the same notebook for each. Defined with a for_each task type wrapping a sub-task configuration.
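A sketch of a For each task that runs the same notebook once per table name, at most two iterations at a time; the table names, paths, and table_name parameter are hypothetical:

```yaml
tasks:
  - task_key: process_tables
    for_each_task:
      inputs: '["orders", "customers", "payments"]'
      concurrency: 2                       # run up to 2 iterations in parallel
      task:
        task_key: process_one_table
        notebook_task:
          notebook_path: ../src/process_table
          base_parameters:
            table_name: "{{input}}"        # current iteration value
```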
Run Job Task
A task that triggers another complete Lakeflow Job. Enables modular pipeline composition: break complex workflows into smaller, reusable jobs and orchestrate them from a parent job.
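A minimal sketch; the job ID is a placeholder:

```yaml
tasks:
  - task_key: run_silver_job
    run_job_task:
      job_id: 123456789   # ID of the child job to trigger
```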
DAG Visualization in Jobs UI
The Lakeflow Jobs UI displays tasks and their dependencies as a visual DAG. Each task shows its status (running, succeeded, failed, skipped). Click a task to see its logs, run duration, and cluster details. Use the DAG view to identify upstream blockers causing downstream failures.
Run History and Monitoring
The Jobs UI run history shows past job runs with duration, status, and task-level details. Compare current run duration against historical baseline to detect performance degradation. Filter by status (failed runs) to investigate recurring failures.
Git Folders (formerly Databricks Repos)
Connect the Databricks workspace directly to a Git provider (GitHub, GitLab, Bitbucket, Azure DevOps). Notebooks and Python files in Git Folders are version-controlled. Enables branching, committing, pushing, and pull request workflows directly from the Databricks UI.
Branching Workflow in Git Folders
Create a new branch for each feature/fix: git checkout -b feature/new-pipeline. Make changes in notebooks. Commit via the Git Folders UI (add commit message). Push to remote. Create a pull request in GitHub/GitLab for code review. Merge to main after approval.
Committing and Pushing from Workspace
In the Databricks workspace, open the Git Folder → click the Git status indicator → stage changes → add commit message → commit and push. Changes are immediately reflected in the remote Git repository. Supports collaborative development with conflict detection.
Databricks Asset Bundles (DABs)
YAML-based declarative configuration for Databricks workspace assets. Define Jobs, DLT Pipelines, cluster configurations, permissions, and variable overrides in databricks.yml. Version-controlled in Git. Deploy the same codebase across dev/test/prod using target-specific overrides.
Bundle Structure
databricks.yml (root config), resources/ folder (job definitions, pipeline configs), src/ folder (notebooks, Python files). Bundle defines targets (dev, staging, prod) each with environment-specific variable values (cluster sizes, catalog names, schedule settings).
Variable Overrides per Environment
Define variables in the bundle that take different values per target. Example: catalog_name = dev_catalog in dev, prod_catalog in prod. The same pipeline code runs in all environments; only configuration differs. Eliminates environment-specific code branches.
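A sketch of a databricks.yml with one variable overridden per target; the bundle name, workspace hosts, and catalog names are placeholders. Resources can then reference the variable as ${var.catalog_name}:

```yaml
bundle:
  name: sales_pipelines

variables:
  catalog_name:
    description: Target catalog for all tables
    default: dev_catalog

include:
  - resources/*.yml          # job and pipeline definitions

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-dev.example.net
  prod:
    mode: production
    workspace:
      host: https://adb-prod.example.net
    variables:
      catalog_name: prod_catalog   # override for prod
```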
Deploying Bundles
databricks bundle validate checks for configuration errors without deploying. databricks bundle deploy --target dev deploys resources to the dev workspace. databricks bundle run job_name --target prod runs a specific job in prod. databricks bundle destroy removes deployed resources.
Databricks CLI
Command-line tool for interacting with Databricks workspaces. Bundle commands require the newer standalone Databricks CLI (installed via Homebrew, WinGet, or the install script); the legacy pip-installed databricks-cli package does not support bundles. Authenticate with a personal access token or OAuth. Key commands for DEA: databricks bundle validate, databricks bundle deploy, databricks bundle run, databricks jobs run-now.
CLI in CI/CD Pipelines
In GitHub Actions, GitLab CI, or Azure DevOps: databricks bundle deploy --target $TARGET on merge to main. Automates promotion: code merged to dev branch → deploys to dev workspace; merged to main → deploys to prod. Bundles + CLI = repeatable, automated deployment pipeline.
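A sketch of a GitHub Actions workflow that promotes to prod on merge to main; the databricks/setup-cli action and the secret names are assumptions about the repository setup:

```yaml
name: deploy-bundle
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main   # installs the standalone Databricks CLI
      - name: Validate and deploy bundle
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        run: |
          databricks bundle validate
          databricks bundle deploy --target prod
```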
Quick recap: assets are defined in databricks.yml with variable overrides per target (dev/test/prod); same code, different configs per environment. databricks bundle validate checks the config without deploying; databricks bundle deploy --target X promotes to an environment; databricks bundle run job_name executes a job; databricks bundle destroy removes deployed resources. Used in CI/CD pipelines for automated promotion: validate → deploy → run.