FlashGenius
Databricks DEA Exam Prep · Topic 2 of 5

Data Ingestion & Loading

COPY INTO · Auto Loader · Lakeflow Connect · Schema Enforcement · Schema Evolution · Ingestion Selection

The DEA exam covers three primary ingestion patterns: batch loading with COPY INTO, incremental/streaming with Auto Loader, and enterprise source connectivity with Lakeflow Connect. Knowing when to use each is critical.

Three Core Ingestion Patterns

(1) COPY INTO: SQL-based incremental file loading from cloud storage. (2) Auto Loader: scalable file discovery with schema inference and evolution. (3) Lakeflow Connect: managed connectors for enterprise sources (databases, SaaS, ERP). Each serves different source types and scale requirements.

Ingestion Destination: Unity Catalog Tables

All modern Databricks ingestion patterns land data into Unity Catalog-governed Delta tables. This ensures lineage tracking, access control, and auditability from the moment data enters the lakehouse.

Medallion Architecture Context

Ingestion is the entry point to the Bronze layer. Raw data lands in Bronze with minimal transformation. Silver and Gold layers then clean, enrich, and aggregate. Auto Loader and COPY INTO are the primary Bronze ingestion tools.

Semi-Structured Data

JSON files with nested structures, arrays, and varying schemas are common in ingestion. Databricks provides from_json(), schema_of_json(), explode(), and get_json_object() to flatten and extract nested fields into typed Delta table columns.

COPY INTO

COPY INTO

SQL command that incrementally loads files from cloud object storage (S3, ADLS, GCS) into a Unity Catalog Delta table. Idempotent: it tracks which files have already been loaded and skips them on subsequent runs. Syntax: COPY INTO target_table FROM 'cloud_path' FILEFORMAT = <format> FORMAT_OPTIONS (...)
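
A minimal sketch of a scheduled CSV load, wrapped in spark.sql() so it runs from a notebook cell; the catalog, table, and bucket names are hypothetical, and the target Delta table is assumed to already exist.

# Hypothetical hourly batch load; re-running it skips files already ingested.
spark.sql("""
    COPY INTO main.bronze.orders
    FROM 's3://example-bucket/raw/orders/'
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
""")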

COPY INTO Key Options

FILEFORMAT: CSV, JSON, PARQUET, AVRO, ORC, TEXT, BINARYFILE. FORMAT_OPTIONS for CSV: header, delimiter, inferSchema. COPY_OPTIONS('mergeSchema' = 'true') for schema evolution. PATTERN to filter specific files by glob pattern.
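
A sketch combining PATTERN filtering with schema evolution via COPY_OPTIONS; the table name, bucket, and glob are illustrative.

# Hypothetical JSON load restricted to one year of files, allowing new columns.
spark.sql("""
    COPY INTO main.bronze.events
    FROM 's3://example-bucket/raw/events/'
    FILEFORMAT = JSON
    PATTERN = '2024/*/*.json'
    COPY_OPTIONS ('mergeSchema' = 'true')
""")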

COPY INTO Idempotency

Maintains an internal file tracking table. Files already loaded are not re-loaded on subsequent runs, so it is safe to re-run on failure. Only processes new files added since the last successful run. This makes it ideal for incremental batch ingestion on a schedule.

COPY INTO Limitations

Best for moderate file volumes (thousands, not millions of files). It is a batch command, not a streaming source. For very large file volumes or near-real-time streaming needs, use Auto Loader instead.

COPY INTO vs Auto Loader
Feature | COPY INTO | Auto Loader
Interface | SQL (COPY INTO) | Python/Scala (readStream)
File discovery | Directory listing | Directory listing OR file notification (event-based)
Scale | Thousands of files | Millions+ of files
Schema inference | Basic | Advanced: infers and evolves schema automatically
Streaming | No (batch/incremental) | Yes (Structured Streaming)
Checkpointing | Built-in file tracking | Explicit checkpoint location required
Best for | Scheduled batch ingestion | Continuous/near-real-time; large file volumes
Auto Loader

Auto Loader (cloudFiles)

Databricks-native file ingestion using Structured Streaming. Uses the cloudFiles format source. Automatically discovers new files, infers schema, and handles schema evolution. spark.readStream.format("cloudFiles").option("cloudFiles.format", "json").load("path")
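
A minimal reader sketch, assuming a Databricks notebook where spark is available; the bucket and schema-location paths are hypothetical.

# Incrementally discover and read new JSON files with Auto Loader.
raw_events = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://example-bucket/_schemas/events")  # persists the inferred schema
    .load("s3://example-bucket/raw/events/")
)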

Auto Loader: Directory Listing vs File Notification

Directory listing (default): Auto Loader lists the directory periodically. Works everywhere, no cloud setup needed, but less efficient at scale. File notification: uses cloud event notifications (SNS+SQS, Azure Event Grid) to instantly detect new files. Highly scalable, near-zero latency, requires cloud setup.
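
Switching to file notification mode is one option on the same reader (a sketch; the cloud account must allow Databricks to create or use the notification resources):

# Event-based file discovery instead of periodic directory listing.
notified_events = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.schemaLocation", "s3://example-bucket/_schemas/events")
    .load("s3://example-bucket/raw/events/")
)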

Auto Loader Schema Inference & Evolution

Auto Loader infers schema on first run and stores it in a schema location (cloudFiles.schemaLocation). When new columns appear in source data, Auto Loader can evolve the schema automatically (cloudFiles.schemaEvolutionMode = "addNewColumns"). New columns are added to the Delta table without breaking the pipeline.
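
A sketch of the same reader with evolution enabled; the paths are hypothetical, the option values are the documented ones.

# Add new source columns to the table schema instead of rejecting them.
evolving_events = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://example-bucket/_schemas/events")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("s3://example-bucket/raw/events/")
)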

Auto Loader Checkpointing

Required for fault tolerance. The checkpoint location stores progress information: which files have been processed and the current stream offsets. If the stream restarts, it resumes from the checkpoint without reprocessing. Set on the writeStream via .option("checkpointLocation", "path")
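
Completing the pipeline on the write side, reusing the raw_events reader from the Auto Loader sketch above; the checkpoint path and target table are hypothetical.

# Write to a Unity Catalog table; the checkpoint lets the stream resume safely.
(
    raw_events.writeStream
    .option("checkpointLocation", "s3://example-bucket/_checkpoints/events")
    .trigger(availableNow=True)   # process everything available, then stop
    .toTable("main.bronze.events")
)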

Lakeflow Connect

Lakeflow Connect

Databricks-managed ingestion service for enterprise data sources. Replaces manual JDBC pipelines with managed, monitored connectors. Delivers data directly into Unity Catalog-governed Delta tables. Two types: Standard connectors and Managed connectors.

Standard Connectors

Pre-built connectors for common enterprise sources (Salesforce, ServiceNow, SQL Server, MySQL, PostgreSQL, etc.). Configured through the Databricks UI or API. Handle schema mapping, incremental loading, and error recovery automatically.

Managed Connectors

Fully managed by Databricks: Databricks handles the infrastructure, scaling, and reliability. Higher level of abstraction than standard connectors. Ideal for enterprise-scale, production-grade ingestion from SaaS sources.

Lakeflow Connect vs JDBC/ODBC in Notebooks

JDBC/ODBC in notebooks: flexible, code-based, good for one-off loads or custom logic. Lakeflow Connect: managed service, no connectivity code, built-in monitoring, better for recurring production ingestion. Use JDBC/ODBC when a connector doesn't exist; use Lakeflow Connect for supported sources.

JDBC/ODBC Ingestion

JDBC/ODBC in Notebooks

Read directly from relational databases using spark.read.format("jdbc").option("url", ...).option("dbtable", ...).option("user", ...).option("password", ...).load(). Works with any JDBC-compatible source. Orchestrate and schedule with Lakeflow Jobs. Partition the JDBC read for parallelism: option("partitionColumn", "id").option("numPartitions", 8), together with matching lowerBound and upperBound values.
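
A sketch of a partitioned JDBC read landing in a Bronze table; the connection URL, secret scope, and table names are hypothetical, and credentials are assumed to be stored in Databricks secrets.

# Parallel JDBC read split into 8 partitions on a numeric key column.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", dbutils.secrets.get("jdbc-creds", "user"))
    .option("password", dbutils.secrets.get("jdbc-creds", "password"))
    .option("partitionColumn", "id")
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "8")
    .load()
)
orders.write.mode("append").saveAsTable("main.bronze.orders")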

Ingestion Method Selection
Scenario | Best Method | Why
Batch CSV files dropped to S3 hourly | COPY INTO | SQL-based, idempotent, scheduled batch
Continuous JSON event stream, millions of files/day | Auto Loader (file notification) | Scales to millions of files, near-real-time
Salesforce CRM data into Delta | Lakeflow Connect (standard) | Managed connector, no custom code
Legacy Oracle database → Delta (no connector) | JDBC in notebook + Lakeflow Job | Custom JDBC when no managed connector exists
Real-time database CDC (change data capture) | Lakeflow Connect (managed) | Managed CDC with reliability guarantees
Semi-structured nested JSON, schema may change | Auto Loader + schema evolution | Handles schema changes automatically
One-time historical backfill, known schema | COPY INTO | Simple, SQL-based, idempotent re-runs safe
Schema Enforcement & Evolution

Schema Enforcement (Auto Loader)

By default, Auto Loader enforces the inferred schema: files with unexpected columns cause the stream to fail. This protects the Delta table from unexpected schema changes. cloudFiles.schemaEvolutionMode = "failOnNewColumns" (default behavior).

Schema Evolution (Auto Loader)

Set cloudFiles.schemaEvolutionMode = "addNewColumns" to automatically add new columns to the target Delta table when they appear in source files. The pipeline restarts once to apply the schema change, then continues. New columns are nullable.

Schema Inference Location

Auto Loader stores the inferred schema at cloudFiles.schemaLocation. This must be set for schema inference to work across restarts. Typically set to a path within the checkpoint directory: .option("cloudFiles.schemaLocation", checkpointPath + "/schema")
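
A small sketch of the typical layout, with the schema location nested under the checkpoint path (both paths hypothetical):

checkpoint_path = "s3://example-bucket/_checkpoints/events"
schema_path = checkpoint_path + "/schema"   # passed to .option("cloudFiles.schemaLocation", ...) on the reader
# checkpoint_path itself is passed to .option("checkpointLocation", ...) on the writeStream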

Semi-Structured & Unstructured Data

Nested JSON Handling

from_json(col, schema): parse a JSON string column into a struct. schema_of_json(sample): infer a schema from a sample JSON string. col.field: access nested struct fields. explode(array_col): convert array elements into separate rows.
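
A sketch flattening a hypothetical raw_df with a JSON string column named payload; the DDL schema string is illustrative.

from pyspark.sql.functions import from_json, explode, col

# Hypothetical schema for the payload, written as a DDL string.
order_schema = "order_id BIGINT, items ARRAY<STRUCT<sku: STRING, qty: INT>>"

flattened = (
    raw_df
    .withColumn("data", from_json(col("payload"), order_schema))   # JSON string -> struct
    .withColumn("item", explode(col("data.items")))                # one row per array element
    .select(col("data.order_id"), col("item.sku"), col("item.qty"))
)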

Ingesting Unstructured Data

Databricks supports BINARYFILE format in COPY INTO for raw binary files (images, PDFs). Use Unity Catalog Volumes to store unstructured files alongside structured tables. AI/ML pipelines can then process these files from Volumes.
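
A sketch of a binary-file load; the Volume path is illustrative and the target table is assumed to exist with the binary-file columns (path, modificationTime, length, content).

# Load raw documents (e.g., PDFs) as binary content rows.
spark.sql("""
    COPY INTO main.bronze.raw_documents
    FROM '/Volumes/main/landing/documents/'
    FILEFORMAT = BINARYFILE
""")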

📋 COPY INTO = SQL Idempotent Batch
"COPY INTO skips what it's already seen"
SQL command. Tracks loaded files internally. Re-runnable safely. Best for scheduled batch ingestion of thousands of files. Not for streaming.
🔄 Auto Loader = Scalable Streaming Files
"Auto Loader never misses a file arrival"
Uses cloudFiles format. Two modes: directory listing (simple) vs file notification (fast/scalable). Requires checkpoint. Handles millions of files and schema evolution.
🔌 Lakeflow Connect = Enterprise Source Plug-In
"Lakeflow Connect is the adapter rack for enterprise systems"
Managed connectors for Salesforce, ServiceNow, databases. Standard = pre-built UI configuration. Managed = fully Databricks-operated. No custom code for supported sources.
📊 Schema Enforcement vs Evolution
"Enforce = Lock the door; Evolve = Add a new door"
Auto Loader default = enforce (reject new columns). Set schemaEvolutionMode = addNewColumns to evolve (add new columns automatically). Schema stored at schemaLocation.
🗂️ Ingestion Selection Rule
"Files? COPY INTO or Auto Loader. Enterprise? Lakeflow Connect. Custom? JDBC"
Batch files → COPY INTO. Streaming/high-volume files → Auto Loader. SaaS/DB → Lakeflow Connect. Unsupported source → JDBC in notebook.
🏗️ Medallion = Bronze First
"All ingestion lands in Bronze, raw and untouched"
COPY INTO and Auto Loader deliver raw data to Bronze Delta tables. Minimal transformation at ingestion. Clean in Silver. Aggregate in Gold. Ingestion = the Bronze entry point.