The DEA exam covers three primary ingestion patterns: batch loading with COPY INTO, incremental/streaming with Auto Loader, and enterprise source connectivity with Lakeflow Connect. Knowing when to use each is critical.
Three Core Ingestion Patterns
(1) COPY INTO: SQL-based incremental file loading from cloud storage. (2) Auto Loader: scalable file discovery with schema inference and evolution. (3) Lakeflow Connect: managed connectors for enterprise sources (databases, SaaS, ERP). Each serves different source types and scale requirements.
Ingestion Destination: Unity Catalog Tables
All modern Databricks ingestion patterns land data into Unity Catalog-governed Delta tables. This ensures lineage tracking, access control, and auditability from the moment data enters the lakehouse.
Medallion Architecture Context
Ingestion is the entry point to the Bronze layer. Raw data lands in Bronze with minimal transformation. Silver and Gold layers then clean, enrich, and aggregate. Auto Loader and COPY INTO are the primary Bronze ingestion tools.
Semi-Structured Data
JSON files with nested structures, arrays, and varying schemas are common in ingestion. Databricks provides from_json(), schema_of_json(), explode(), and get_json_object() to flatten and extract nested fields into typed Delta table columns.
COPY INTO
SQL command that incrementally loads files from cloud object storage (S3, ADLS, GCS) into a Unity Catalog Delta table. Idempotent: it tracks which files have already been loaded and skips them on subsequent runs. Syntax: COPY INTO target_table FROM 'cloud_path' FILEFORMAT = <format> FORMAT_OPTIONS (...)
COPY INTO Key Options
FILEFORMAT: CSV, JSON, PARQUET, AVRO, ORC, TEXT, BINARYFILE. FORMAT_OPTIONS for CSV: header, delimiter, inferSchema. COPY_OPTIONS('mergeSchema' = 'true') for schema evolution. PATTERN to filter specific files by glob pattern.
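As a sketch, a COPY INTO statement combining these options might look like the following, run from a Python notebook via spark.sql; the table name, bucket path, and pattern are illustrative placeholders:

```python
# Sketch: hourly CSV drops loaded into a Unity Catalog table.
# The table name, path, and pattern are illustrative placeholders.
spark.sql("""
    COPY INTO main.bronze.orders_raw
    FROM 's3://example-bucket/landing/orders/'
    FILEFORMAT = CSV
    PATTERN = '*.csv'
    FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
    COPY_OPTIONS ('mergeSchema' = 'true')
""")
```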
COPY INTO Idempotency
Tracks previously loaded files as part of the target table's state. Files already loaded are not re-loaded on subsequent runs, so the command is safe to re-run after a failure. Only new files added since the last successful run are processed. This makes it ideal for incremental batch ingestion on a schedule.
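Because of this tracking, re-running the same statement is harmless. A minimal sketch, reusing the placeholder table and path from above:

```python
# Re-running the same COPY INTO is safe: files already loaded are skipped,
# so a retry after a failure only picks up newly arrived files.
result = spark.sql("""
    COPY INTO main.bronze.orders_raw
    FROM 's3://example-bucket/landing/orders/'
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true')
""")
result.show()  # the returned metrics report zero inserted rows when nothing new arrived
```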
COPY INTO Limitations
Best for moderate file volumes (thousands, not millions of files). It is a batch command, not a streaming source. For very large file volumes or near-real-time needs, use Auto Loader instead.
| Feature | COPY INTO | Auto Loader |
|---|---|---|
| Interface | SQL (COPY INTO) | Python/Scala (readStream) |
| File discovery | Directory listing | Directory listing OR file notification (event-based) |
| Scale | Thousands of files | Millions+ of files |
| Schema inference | Basic | Advanced: infers and evolves schema automatically |
| Streaming | No (batch/incremental) | Yes (Structured Streaming) |
| Checkpointing | Built-in file tracking | Explicit checkpoint location required |
| Best for | Scheduled batch ingestion | Continuous/near-real-time; large file volumes |
Auto Loader (cloudFiles)
Databricks-native file ingestion using Structured Streaming. Uses the cloudFiles format source. Automatically discovers new files, infers schema, and handles schema evolution. spark.readStream.format("cloudFiles").option("cloudFiles.format", "json").load("path")
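A minimal Auto Loader sketch, assuming a hypothetical landing path, checkpoint Volume, and Unity Catalog target table:

```python
# Minimal Auto Loader sketch; paths and table name are placeholders.
checkpoint_path = "/Volumes/main/bronze/checkpoints/events"

events = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", f"{checkpoint_path}/schema")
    .load("s3://example-bucket/landing/events/")
)

(
    events.writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)   # process everything new, then stop (batch-like run)
    .toTable("main.bronze.events_raw")
)
```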
Auto Loader: Directory Listing vs File Notification
Directory listing (default): Auto Loader lists the directory periodically. Works everywhere with no cloud setup needed, but less efficient at scale. File notification: uses cloud event notifications (SNS+SQS, Azure Event Grid) to detect new files as they arrive. Highly scalable, near-zero latency, requires cloud setup.
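Switching the sketch above to file notification mode is a single option change; the cloud-side queue and permissions setup is assumed to already be in place:

```python
# File notification mode: Auto Loader subscribes to cloud events instead of
# listing the directory (requires event/queue setup and permissions in the cloud account).
events = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.schemaLocation", "/Volumes/main/bronze/checkpoints/events/schema")
    .load("s3://example-bucket/landing/events/")
)
```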
Auto Loader Schema Inference & Evolution
Auto Loader infers the schema on first run and stores it in a schema location (cloudFiles.schemaLocation). When new columns appear in source data, Auto Loader can evolve the schema automatically (cloudFiles.schemaEvolutionMode = "addNewColumns"): the stored schema is updated and the new columns are added to the Delta table after the stream restarts.
Auto Loader Checkpointing
Required for fault tolerance. The checkpoint location stores progress information: which files have been processed and the current offsets. If the stream restarts, it resumes from the checkpoint without reprocessing. Set via .option("checkpointLocation", "path") on the writeStream.
Lakeflow Connect
Databricks-managed ingestion service for enterprise data sources. Replaces manual JDBC pipelines with managed, monitored connectors. Delivers data directly into Unity Catalog-governed Delta tables. Two types: Standard connectors and Managed connectors.
Standard Connectors
Connectors for cloud object storage and message buses (for example, Auto Loader and COPY INTO over S3/ADLS/GCS, plus Kafka and Kinesis). You author and manage the ingestion pipeline yourself, which gives more control over transformations and scheduling.
Managed Connectors
Fully managed by Databricks for enterprise applications and databases (Salesforce, ServiceNow, Workday, SQL Server, etc.). Configured through the Databricks UI or API; Databricks handles the infrastructure, scaling, incremental loading, schema mapping, and error recovery. Ideal for enterprise-scale, production-grade ingestion from SaaS and database sources.
Lakeflow Connect vs JDBC/ODBC in Notebooks
JDBC/ODBC in notebooks: flexible, code-based, good for one-off loads or custom logic. Lakeflow Connect: managed service with no connection code to maintain and built-in monitoring, better for recurring production ingestion. Use JDBC/ODBC when a connector doesn't exist; use Lakeflow Connect for supported sources.
JDBC/ODBC in Notebooks
Read directly from relational databases using spark.read.format("jdbc").option("url", ...).option("dbtable", ...).option("user", ...).option("password", ...).load(). Works with any JDBC-compatible source. Orchestrate and schedule with Lakeflow Jobs. Partition the JDBC read for parallelism: option("partitionColumn", "id").option("numPartitions", 8)
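A sketch of a partitioned JDBC read landing into Bronze; the host, source table, secret scope, and partition bounds are placeholders, with credentials pulled from a secret scope rather than hard-coded:

```python
# Sketch of a parallel JDBC read; connection details are placeholders.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", dbutils.secrets.get("example-scope", "db-user"))
    .option("password", dbutils.secrets.get("example-scope", "db-password"))
    .option("partitionColumn", "id")   # numeric column to split the read on
    .option("lowerBound", "1")         # min/max values of partitionColumn
    .option("upperBound", "1000000")
    .option("numPartitions", "8")      # 8 parallel connections to the source
    .load()
)

orders.write.mode("overwrite").saveAsTable("main.bronze.orders_jdbc")
```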
| Scenario | Best Method | Why |
|---|---|---|
| Batch CSV files dropped to S3 hourly | COPY INTO | SQL-based, idempotent, scheduled batch |
| Continuous JSON event stream, millions of files/day | Auto Loader (file notification) | Scales to millions of files, near-real-time |
| Salesforce CRM data into Delta | Lakeflow Connect (managed) | Managed connector, no custom code |
| Legacy Oracle database to Delta (no connector) | JDBC in notebook + Lakeflow Job | Custom JDBC when no managed connector exists |
| Real-time database CDC (change data capture) | Lakeflow Connect (managed) | Managed CDC with reliability guarantees |
| Semi-structured nested JSON, schema may change | Auto Loader + schema evolution | Handles schema changes automatically |
| One-time historical backfill, known schema | COPY INTO | Simple, SQL-based, idempotent re-runs safe |
Schema Enforcement (Auto Loader)
Delta tables enforce their schema on write, so unexpected columns cannot silently change the target table. Setting cloudFiles.schemaEvolutionMode = "failOnNewColumns" makes Auto Loader fail the stream when files contain new columns without updating the stored schema; the stream will not restart successfully until the schema is updated or the mode is changed.
Schema Evolution (Auto Loader)
Set cloudFiles.schemaEvolutionMode = "addNewColumns" (the default when Auto Loader infers the schema) to automatically add new columns to the target Delta table when they appear in source files. The stream stops once so the schema change can be applied, then continues after a restart. New columns are nullable.
Schema Inference Location
Auto Loader stores the inferred schema at cloudFiles.schemaLocation. This must be set for schema inference to work across restarts. Typically set to a path within the checkpoint directory: .option("cloudFiles.schemaLocation", checkpointPath + "/schema")
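Putting the schema options together, a sketch with placeholder paths: schema location kept under the checkpoint directory, evolution enabled, and the Delta sink allowed to merge new columns:

```python
# Sketch: schema inference, evolution, and checkpointing together; paths are placeholders.
checkpoint_path = "/Volumes/main/bronze/checkpoints/clicks"

clicks = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", f"{checkpoint_path}/schema")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("s3://example-bucket/landing/clicks/")
)

(
    clicks.writeStream
    .option("checkpointLocation", checkpoint_path)
    .option("mergeSchema", "true")   # let the Delta table accept the newly added columns
    .toTable("main.bronze.clicks_raw")
)
```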
Nested JSON Handling
from_json(col, schema): parse a JSON string column into a struct. schema_of_json(sample): infer a schema from a sample JSON string. col.field: access nested struct fields. explode(array_col): convert array elements into separate rows.
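A sketch of flattening a hypothetical order payload; the table, column, and field names are invented for illustration:

```python
from pyspark.sql.functions import col, explode, from_json, schema_of_json

# Hypothetical payload: {"id": 1, "user": {"name": "a"}, "items": [{"sku": "x", "qty": 2}]}
raw = spark.table("main.bronze.orders_raw")   # placeholder table with a JSON string column `payload`

sample = '{"id": 1, "user": {"name": "a"}, "items": [{"sku": "x", "qty": 2}]}'
schema = schema_of_json(sample)               # infer a schema from one representative document

flat = (
    raw.withColumn("json", from_json(col("payload"), schema))   # JSON string -> struct
       .withColumn("item", explode(col("json.items")))          # one row per array element
       .select(
           col("json.id").alias("order_id"),
           col("json.user.name").alias("user_name"),
           col("item.sku"),
           col("item.qty"),
       )
)
```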
Ingesting Unstructured Data
Databricks supports BINARYFILE format in COPY INTO for raw binary files (images, PDFs). Use Unity Catalog Volumes to store unstructured files alongside structured tables. AI/ML pipelines can then process these files from Volumes.
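One possible sketch, assuming an existing (possibly empty) Delta target table and a placeholder Volume path:

```python
# Sketch: ingest raw PDFs from a Unity Catalog Volume with COPY INTO.
# Volume path and table name are placeholders; an empty target table plus
# mergeSchema lets COPY INTO supply the binary-file columns.
spark.sql("CREATE TABLE IF NOT EXISTS main.bronze.documents_raw")
spark.sql("""
    COPY INTO main.bronze.documents_raw
    FROM '/Volumes/main/landing/documents/'
    FILEFORMAT = BINARYFILE
    PATTERN = '*.pdf'
    COPY_OPTIONS ('mergeSchema' = 'true')
""")
# Resulting columns: path, modificationTime, length, content (the raw bytes)
```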
Auto Loader Quick Recap
cloudFiles format. Two discovery modes: directory listing (simple) and file notification (fast, scalable). Requires a checkpoint location. Handles millions of files and schema evolution: set cloudFiles.schemaEvolutionMode = "addNewColumns" to add new columns automatically; the schema is stored at cloudFiles.schemaLocation.