Lakeflow Jobs is Databricks' native pipeline orchestration system. The DEA exam also covers CI/CD using Git Folders and Databricks Asset Bundles (DABs) for promoting code across environments. These are increasingly important for production data engineering.
Lakeflow Jobs (formerly Databricks Jobs)
Databricks-native workflow orchestration. Define multi-task pipelines using a DAG (Directed Acyclic Graph) of tasks with dependencies. Supports notebooks, SQL queries, Python scripts, DLT pipelines, and more. Replaces external schedulers like Airflow for many Databricks-native workflows.
Why Lakeflow Jobs Over Manual Scheduling
Automated retries, dependency tracking, alerting, run history, and cluster lifecycle management (job clusters auto-create and terminate). One platform for pipeline orchestration and execution, so no separate orchestration tool is needed for pure Databricks workflows.
Databricks Asset Bundles (DABs)
Declarative configuration files (YAML) for packaging and deploying Databricks workspace assets (Jobs, DLT pipelines, notebooks, permissions). Enables Infrastructure-as-Code for Databricks. Supports environment-specific configuration (dev/test/prod targets with variable overrides).
CI/CD in Databricks
Full software development lifecycle within Databricks: code in Git Folders → commit and push → PR review → deploy via DABs. Databricks CLI validates and deploys bundles. Enables reproducible, version-controlled, environment-promoted data pipelines.
Job = Collection of Tasks with Dependencies
A Lakeflow Job defines tasks, their configurations, dependencies (which task must complete before another starts), compute resources, and scheduling. Tasks form a DAG, visualized in the Jobs UI as a task graph.
Task Dependencies
Set depends_on to define execution order. Tasks with no dependencies run first (in parallel if there are several). Downstream tasks start only after all upstream tasks complete successfully. If an upstream task fails, downstream tasks are skipped by default, or run anyway depending on their Run if configuration.
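A minimal sketch of how this looks in a job definition (YAML in the style used inside a bundle's resources files); the task keys and notebook paths are hypothetical:

```yaml
# bronze_orders and bronze_customers have no dependencies, so they run in parallel;
# silver_join starts only after both complete successfully.
tasks:
  - task_key: bronze_orders
    notebook_task:
      notebook_path: ../src/ingest_orders
  - task_key: bronze_customers
    notebook_task:
      notebook_path: ../src/ingest_customers
  - task_key: silver_join
    depends_on:
      - task_key: bronze_orders
      - task_key: bronze_customers
    notebook_task:
      notebook_path: ../src/join_orders_customers
```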
Job Clusters vs All-Purpose Clusters
Job clusters (recommended for production): created fresh for each job run and terminated when done, so there is no idle cost, execution is isolated, and runs are repeatable. All-purpose clusters: shared and always running; convenient for development but expensive for scheduled jobs. Always use job clusters in production.
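A sketch of a job cluster defined once in a job and referenced by a task; the node type and Databricks Runtime version are placeholder values:

```yaml
job_clusters:
  - job_cluster_key: etl_cluster
    new_cluster:
      spark_version: 15.4.x-scala2.12
      node_type_id: i3.xlarge      # cloud-specific, e.g. Standard_DS3_v2 on Azure
      num_workers: 2
tasks:
  - task_key: ingest
    job_cluster_key: etl_cluster   # runs on the ephemeral job cluster above
    notebook_task:
      notebook_path: ../src/ingest
```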
Alerts and Notifications
Configure email or webhook alerts on job start, success, failure, or SLA miss. Set up at the job level or task level. Critical for monitoring production pipelines. Integrate with PagerDuty, Slack, or email via webhook.
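A small sketch of job-level email alerts with placeholder addresses; webhook destinations (Slack, PagerDuty) are configured similarly by referencing notification destination IDs:

```yaml
# Notify the on-call address on failure; add on_start/on_success lists as needed.
email_notifications:
  on_failure:
    - data-eng-oncall@example.com
```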
| Trigger Type | Description | Use Case |
|---|---|---|
| Scheduled (time-based) | Cron expression or UI schedule (e.g., every hour, daily at 8AM) | Time-predictable workloads; daily ETL runs |
| File Arrival | Job triggers when new files land in a specified cloud storage path | Event-driven ingestion pipelines |
| Table Update | Job triggers when a Unity Catalog table's data is updated | Data-driven dependencies between pipelines |
| Manual | Triggered via UI or API on demand | Testing, ad hoc runs, debugging |
| Continuous | Runs continuously with minimal gap between runs | Near-real-time processing without full streaming |
Time-Based vs Data-Driven Triggers
Time-based (scheduled cron): predictable and simple, but may run when no new data exists. Data-driven (file arrival or table update): more efficient, since the job only runs when there is actually new data to process. Use data-driven triggers when upstream data arrival is unpredictable.
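For reference, a time-based schedule sketch using a Quartz cron expression (daily at 8 AM UTC); the pause_status field is optional:

```yaml
schedule:
  quartz_cron_expression: "0 0 8 * * ?"   # seconds minutes hours day month day-of-week
  timezone_id: "UTC"
  pause_status: UNPAUSED
```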
File Arrival Trigger
Monitors a specified cloud storage path for new files. Job runs automatically when files arrive. Eliminates the need to poll or schedule unnecessarily. Ideal for event-driven ingestion workflows where files are pushed by upstream systems.
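A sketch of a file arrival trigger, assuming a hypothetical Unity Catalog volume path as the monitored location:

```yaml
# The job runs automatically when new files land under this path.
trigger:
  file_arrival:
    url: "/Volumes/dev_catalog/raw/landing/orders/"
```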
Table Update Trigger
Monitors a Unity Catalog table for data changes. Job runs when the monitored table receives new data. Enables pipeline chaining within Databricks without external orchestration: downstream jobs react when upstream jobs write to the monitored table.
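A hedged sketch of a table update trigger; the field names follow the Jobs API's table_update trigger and the table name is a placeholder:

```yaml
# The job runs when new data is committed to the listed table(s).
trigger:
  table_update:
    table_names:
      - dev_catalog.silver.orders
```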
| Task Type | What It Runs | When to Use |
|---|---|---|
| Notebook | Databricks notebook (.py, .sql, .scala, .r) | Interactive-style code, exploration, prototyping |
| SQL Query | SQL statement or saved query | Pure SQL transformations, analytics |
| Dashboard | Refreshes a Databricks dashboard | Scheduled report updates |
| Pipeline (DLT) | Delta Live Tables pipeline | DLT-based ingestion or transformation pipelines |
| Python Script | .py file from Git Folder or Volumes | Modular Python code, libraries |
| dbt | dbt project task | dbt model runs within Databricks |
| Spark Submit | spark-submit style job | Legacy Spark applications |
Retries
Configure automatic retries on task failure. Set max_retries and min_retry_interval_millis. Useful for transient failures (network timeouts, temporary API errors). After all retries are exhausted, the task is marked as failed and downstream tasks react accordingly.
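A sketch of per-task retry settings, assuming a hypothetical task that calls a flaky external API; two retries with a one-minute wait between attempts:

```yaml
tasks:
  - task_key: call_external_api
    max_retries: 2
    min_retry_interval_millis: 60000   # wait 60s between attempts
    retry_on_timeout: true
    notebook_task:
      notebook_path: ../src/call_api
```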
Branching (If/Else Conditions)
Conditional task execution based on the outcome of previous tasks. A Run if condition controls when a task runs relative to its dependencies: all succeeded (the default), at least one failed, all done, and so on. Example: run a cleanup task only if the main task failed. Enables error-handling branches in the DAG.
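A sketch of an error-handling branch using run_if, with hypothetical task keys and notebook paths:

```yaml
# cleanup_on_failure runs only when at least one of its dependencies failed.
# Other run_if values include ALL_SUCCESS (default), ALL_DONE, ALL_FAILED, etc.
tasks:
  - task_key: main_load
    notebook_task:
      notebook_path: ../src/load
  - task_key: cleanup_on_failure
    depends_on:
      - task_key: main_load
    run_if: AT_LEAST_ONE_FAILED
    notebook_task:
      notebook_path: ../src/cleanup
```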
Looping (For Each Task)
Iterate over a list of inputs and run a sub-task for each item in parallel or sequentially. Example: process a list of tables or files, running the same notebook for each. Defined with a for_each task type wrapping a sub-task configuration.
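A sketch of a For each task that runs the same notebook once per table name, at most two iterations at a time; the table names, paths, and table_name parameter are hypothetical:

```yaml
tasks:
  - task_key: process_tables
    for_each_task:
      inputs: '["orders", "customers", "payments"]'
      concurrency: 2                       # run up to 2 iterations in parallel
      task:
        task_key: process_one_table
        notebook_task:
          notebook_path: ../src/process_table
          base_parameters:
            table_name: "{{input}}"        # current iteration value
```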
Run Job Task
A task that triggers another complete Lakeflow Job. Enables modular pipeline composition: break complex workflows into smaller, reusable jobs and orchestrate them from a parent job.
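A minimal sketch; the job ID is a placeholder:

```yaml
tasks:
  - task_key: run_silver_job
    run_job_task:
      job_id: 123456789   # ID of the child job to trigger
```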
DAG Visualization in Jobs UI
The Lakeflow Jobs UI displays tasks and their dependencies as a visual DAG. Each task shows its status (running, succeeded, failed, skipped). Click a task to see its logs, run duration, and cluster details. Use the DAG view to identify upstream blockers causing downstream failures.
Run History and Monitoring
The Jobs UI run history shows past job runs with duration, status, and task-level details. Compare current run duration against historical baseline to detect performance degradation. Filter by status (failed runs) to investigate recurring failures.
Git Folders (formerly Databricks Repos)
Connect the Databricks workspace directly to a Git provider (GitHub, GitLab, Bitbucket, Azure DevOps). Notebooks and Python files in Git Folders are version-controlled. Enables branching, committing, pushing, and pull request workflows directly from the Databricks UI.
Branching Workflow in Git Folders
Create a new branch for each feature/fix: git checkout -b feature/new-pipeline. Make changes in notebooks. Commit via the Git Folders UI (add commit message). Push to remote. Create a pull request in GitHub/GitLab for code review. Merge to main after approval.
Committing and Pushing from Workspace
In the Databricks workspace, open the Git Folder → click the Git status indicator → stage changes → add commit message → commit and push. Changes are immediately reflected in the remote Git repository. Supports collaborative development with conflict detection.
Databricks Asset Bundles (DABs)
YAML-based declarative configuration for Databricks workspace assets. Define Jobs, DLT Pipelines, cluster configurations, permissions, and variable overrides in databricks.yml. Version-controlled in Git. Deploy the same codebase across dev/test/prod using target-specific overrides.
Bundle Structure
databricks.yml (root config), resources/ folder (job definitions, pipeline configs), src/ folder (notebooks, Python files). Bundle defines targets (dev, staging, prod) each with environment-specific variable values (cluster sizes, catalog names, schedule settings).
Variable Overrides per Environment
Define variables in the bundle that take different values per target. Example: catalog_name = dev_catalog in dev, prod_catalog in prod. The same pipeline code runs in all environments; only configuration differs. Eliminates environment-specific code branches.
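A sketch of a databricks.yml with one variable overridden per target; the bundle name, workspace hosts, and catalog names are placeholders. Resources can then reference the variable as ${var.catalog_name}:

```yaml
bundle:
  name: sales_pipelines

variables:
  catalog_name:
    description: Target catalog for all tables
    default: dev_catalog

include:
  - resources/*.yml          # job and pipeline definitions

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-dev.example.net
  prod:
    mode: production
    workspace:
      host: https://adb-prod.example.net
    variables:
      catalog_name: prod_catalog   # override for prod
```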
Deploying Bundles
databricks bundle validate checks for configuration errors without deploying. databricks bundle deploy --target dev deploys resources to the dev workspace. databricks bundle run job_name --target prod runs a specific job in prod. databricks bundle destroy removes deployed resources.
Databricks CLI
Command-line tool for interacting with Databricks workspaces. Bundle commands require the newer standalone Databricks CLI (installed via Homebrew, WinGet, or the install script); the legacy pip-installed databricks-cli package does not support bundles. Authenticate with a personal access token or OAuth. Key commands for DEA: databricks bundle validate, databricks bundle deploy, databricks bundle run, databricks jobs run-now.
CLI in CI/CD Pipelines
In GitHub Actions, GitLab CI, or Azure DevOps: databricks bundle deploy --target $TARGET on merge to main. Automates promotion: code merged to dev branch → deploys to dev workspace; merged to main → deploys to prod. Bundles + CLI = repeatable, automated deployment pipeline.
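A sketch of a GitHub Actions workflow that promotes to prod on merge to main; the databricks/setup-cli action and the secret names are assumptions about the repository setup:

```yaml
name: deploy-bundle
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main   # installs the standalone Databricks CLI
      - name: Validate and deploy bundle
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        run: |
          databricks bundle validate
          databricks bundle deploy --target prod
```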
Quick recap: assets are defined in databricks.yml with variable overrides per target (dev/test/prod); same code, different configs per environment. databricks bundle validate checks the config without deploying; databricks bundle deploy --target X promotes to an environment; databricks bundle run job_name executes a job; databricks bundle destroy removes deployed resources. Used in CI/CD pipelines for automated promotion: validate → deploy → run.