Understanding the Key Factors Driving Up Costs on the Lakehouse

If you’re using Databricks, you’re leveraging the power of Apache Spark to process massive datasets in a unified lakehouse platform. But while Spark is incredibly powerful, its distributed nature can lead to unexpected costs. A small change in your code or cluster configuration can drastically increase your bill.

So, what makes two nearly identical Databricks jobs or notebooks end up with very different price tags? Let’s break down the hidden drivers of your Databricks bill.


1. Cluster Sizing and Configuration

This is the most direct lever for cost control. The size, type, and utilization of your clusters directly determine your hourly bill.

  • Under-provisioning vs. Over-provisioning: An undersized cluster runs jobs slowly (and may spill to disk), so the lower hourly rate is eaten up by a much longer runtime. An oversized cluster sits on idle resources, which is pure wasted spend. The goal is to find the sweet spot. You can learn more about Cluster configurations in the documentation.
  • Instance Types: Databricks allows you to choose from various cloud instance types (e.g., standard, memory-optimized, compute-optimized). Using a general-purpose instance for a memory-intensive task (like sorting a large dataset) can be less efficient than using a memory-optimized one, leading to longer runtime and higher costs.
  • Photon vs. OSS Spark: Databricks’ Photon engine is designed for high-performance data processing. A job running on a Photon-enabled cluster will often execute much faster than on a standard Spark cluster, potentially reducing the total cluster runtime and overall cost, even if the per-hour cost is slightly higher. Read more about the Photon engine.

Takeaway: A job that runs for 10 minutes on a large, expensive cluster can cost more than a job that runs for 30 minutes on a smaller, cheaper one. The key is to optimize for total job runtime × cluster cost per hour.
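To make this trade-off concrete, here is a quick back-of-the-envelope calculation. The hourly rates are made-up placeholders; real DBU and VM prices vary by cloud, region, and instance type.

Python

# Hypothetical hourly rates, for illustration only.
small_cluster_rate = 4.0    # USD per hour (assumed)
large_cluster_rate = 16.0   # USD per hour (assumed)

small_cluster_runtime_h = 30 / 60   # the 30-minute job on the small cluster
large_cluster_runtime_h = 10 / 60   # the 10-minute job on the large cluster

print(small_cluster_runtime_h * small_cluster_rate)   # 2.0 USD
print(large_cluster_runtime_h * large_cluster_rate)   # ~2.67 USD

In this (assumed) scenario, the slower job on the smaller cluster is the cheaper one, even though it takes three times as long.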


2. Data Skew and Shuffling

Databricks (Spark) distributes data and computation across workers. Data shuffling—the process of redistributing data across the cluster for operations like JOIN, GROUP BY, and DISTINCT—is one of the most expensive operations.

  • Data Skew: This is a major cost driver. If one partition of your data is much larger than others (e.g., a few user IDs are responsible for a huge percentage of events), one worker will be overloaded while others sit idle. This “skewed” worker becomes a bottleneck, forcing the entire cluster to wait and leading to longer runtime and higher costs.
  • Wide vs. Narrow Transformations: Operations like filter() and select() are narrow transformations that don’t require shuffling. Operations like groupBy(), join(), and orderBy() are wide transformations that do. Minimizing unnecessary wide transformations can significantly reduce shuffle costs.

Contrasting Example: Handling Data Skew

Let’s imagine a dataset of web events where 90% of the traffic comes from a single “bot” user (user_id = 'bot_user'), while the rest of the traffic is from millions of other users.

High-Cost Query (Inefficient GROUP BY on a Skewed Key):

Python

# The DataFrame 'events_df' is not optimized for the skewed 'user_id'
events_df.groupBy("user_id").count().show()

Expected Outcome: The groupBy operation triggers a shuffle. Since almost all the data for bot_user lands on a single worker, that worker is overloaded with data and compute while the other workers finish quickly and sit idle. The whole stage waits on this straggler, leading to a long runtime and driving up costs.

Low-Cost Query (Mitigating Skew with a Two-Stage Aggregation):

Python

from pyspark.sql.functions import col, when, rand, floor, lit
from pyspark.sql.functions import sum as spark_sum

# Add a random salt (0-9) to the skewed key only. This is a common technique
# to 'salt' the key so its rows spread across many partitions instead of
# piling up on a single worker.
salted_events_df = events_df.withColumn(
    "salt",
    when(col("user_id") == "bot_user", floor(rand() * 10)).otherwise(lit(0))
)

# Stage 1: partial aggregation on (user_id, salt) parallelizes the bot_user rows
partial_agg = salted_events_df.groupBy("user_id", "salt").count()

# Stage 2: final aggregation on the original key combines the partial counts
final_agg = partial_agg.groupBy("user_id").agg(spark_sum("count").alias("count"))
final_agg.show()

Expected Outcome: By “salting” the skewed key, the data for bot_user is distributed across multiple partitions and workers during the first aggregation stage. This parallelizes the work, preventing a bottleneck. The job finishes much faster, saving money on cluster runtime. For more information on handling skew, refer to the Databricks best practices for joins.


3. I/O and Data Format Efficiency

The amount of data read from and written to storage (e.g., S3, ADLS, GCS) is a key cost factor.

  • File Format: Processing data stored in inefficient formats like JSON or CSV is much more expensive than using columnar formats like Delta Lake or Parquet. Columnar formats allow Spark to push down predicates (filters) and read only the necessary columns and rows.
  • Predicate Pushdown: Similar to BigQuery’s column pruning, Spark can optimize queries by only reading the necessary data from storage. This works best with formats like Delta Lake and Parquet, especially when combined with partitioning and Z-ordering.
  • Small Files: Having a huge number of tiny files (a “small file problem”) can overwhelm the cluster’s driver, leading to massive metadata overhead and slower query times. Compacting small files into larger ones, as sketched below, can dramatically improve I/O efficiency.
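For Delta tables, compaction is a one-liner. Here is a minimal sketch that reuses the (hypothetical) Delta path from the example further below; OPTIMIZE rewrites many small files into fewer large ones, and the optional ZORDER BY clause co-locates rows by a commonly filtered column.

Python

# Compact small files and co-locate rows by a frequently filtered column.
# The path and column name are illustrative.
spark.sql("""
    OPTIMIZE delta.`/data/delta_web_logs/events_by_day`
    ZORDER BY (user_id)
""")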

Contrasting Example: Reading Unoptimized vs. Optimized Data

Let’s compare reading a 1 TB dataset from cloud storage in two different formats.

High-Cost Query (Reading a Flat, Unpartitioned CSV):

Python

from pyspark.sql.functions import col

# The data is a single 1 TB CSV file in a folder
df_csv = spark.read.format("csv").option("header", "true").load("/data/web_logs/logs.csv")
df_csv.filter(col("event_timestamp") >= "2025-06-20").select("user_id", "event_name").show()

Expected Outcome: Spark must read the entire 1 TB file from storage into memory to apply the filter. This is a full scan, taking a long time and costing a lot in I/O and compute time.

Low-Cost Query (Reading a Delta Lake Table with Partitioning):

Python

from pyspark.sql.functions import col

# The data is a Delta table partitioned by day on `event_timestamp`
df_delta = spark.read.format("delta").load("/data/delta_web_logs/events_by_day")
df_delta.filter(col("event_timestamp") >= "2025-06-20").select("user_id", "event_name").show()

Expected Outcome: The Databricks runtime’s optimizer will inspect the table’s metadata and see that it is partitioned. It will only read the files from the partitions that match the filter condition (>= 2025-06-20). It also performs column pruning, reading only the user_id and event_name columns from those files. This could reduce the data scanned from 1 TB to just a few GBs. Learn more about Delta Lake best practices.


4. Views: Understanding the Cost of Abstraction

In Databricks, views are a crucial tool for code reusability and data governance, but it’s vital to understand their impact on cost.

  • Standard Views: These are “logical views” or “virtual tables.” When you query a standard view, Databricks does not perform any computation or store any data. It simply expands the view’s definition to the underlying query and then executes that query. This means that a standard view does not save on compute costs; the underlying query is re-executed every time you query the view.
  • Materialized Views / Streaming Tables: These are different. They are physical tables that store the pre-computed results of a query, and they can be refreshed automatically (incrementally where possible) as the source data changes. Queries against a materialized view or streaming table read from the pre-computed result, not the raw data, leading to much faster queries and significant cost savings. This is perfect for dashboards or reports that require fresh data with minimal query latency.

Takeaway: Use standard views for logical abstraction and security, but rely on materialized views or streaming tables for cost savings on frequently accessed, expensive queries. Learn how to create and use views and materialized views.
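As a rough sketch of the difference, here are the two flavors side by side. The table and view names are hypothetical, and materialized views require a Unity Catalog-enabled workspace on supported compute (such as a SQL warehouse or a Delta Live Tables pipeline).

Python

# Standard view: only the query text is stored. Every query against the view
# re-runs the full aggregation over the raw table.
spark.sql("""
    CREATE OR REPLACE VIEW analytics.daily_sales_v AS
    SELECT order_date, SUM(amount) AS total_sales
    FROM raw.orders
    GROUP BY order_date
""")

# Materialized view: the result is pre-computed and stored, so dashboard
# queries read a small result set instead of re-scanning raw.orders.
spark.sql("""
    CREATE MATERIALIZED VIEW analytics.daily_sales_mv AS
    SELECT order_date, SUM(amount) AS total_sales
    FROM raw.orders
    GROUP BY order_date
""")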


5. Cluster Management and Lifecycle

How you manage your clusters has a direct impact on your bill.

  • Autoscaling and Autotermination: These are your best friends. Autoscaling allows a cluster to grow or shrink based on workload demands, preventing over-provisioning. Autotermination automatically shuts down an idle cluster after a specified period (e.g., 10 minutes), preventing you from paying for compute that isn’t being used.
  • Job Clusters vs. All-Purpose Clusters: All-purpose clusters are great for interactive development. For production workloads, use job clusters, which are automatically provisioned for a specific job and terminated once it’s complete. You only pay for what you use, and job clusters are often cheaper per DBU (Databricks Unit) than all-purpose clusters; a minimal job-cluster spec is sketched below. You can read more about Databricks clusters.
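To make this concrete, here is roughly what a job cluster definition could look like in a Jobs API payload. All values are placeholders; the runtime version, instance type, and worker counts depend on your cloud and workload.

Python

# Illustrative job cluster spec (all values are placeholders).
new_cluster = {
    "spark_version": "15.4.x-scala2.12",                 # assumed LTS runtime
    "node_type_id": "i3.xlarge",                         # assumed AWS instance type
    "autoscale": {"min_workers": 2, "max_workers": 8},   # grow and shrink with demand
}

# All-purpose clusters should additionally set an auto-termination window
# (for example, autotermination_minutes = 10) so idle compute shuts itself down.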

6. Repair and Rerun Failed Tasks

When a multi-task job fails, re-running the entire workflow from the beginning is a huge waste of compute resources. Databricks offers a “Repair and Rerun” feature to address this.

  • How it works: This feature allows you to re-run only the failed and skipped tasks in a multi-task job, along with their dependent tasks. Tasks that already completed successfully are skipped.
  • Cost Savings: By avoiding the re-computation of successful tasks, you save significant time and compute costs. This is a critical feature for building robust and cost-efficient data and ML pipelines.

You can learn more about this feature in the official Databricks documentation on Troubleshoot and repair job failures and a Databricks blog post on How to Save Time and Money on Data and ML Workflows With “Repair and Rerun”.
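If you trigger repairs programmatically rather than from the UI, the Jobs API exposes a repair endpoint. Below is a minimal sketch against the REST API; the workspace URL, token, run ID, and task keys are all placeholders.

Python

import requests

host = "https://<your-workspace>.cloud.databricks.com"   # placeholder workspace URL
token = "<personal-access-token>"                         # placeholder token

# Re-run only the named failed tasks (and their dependents) for an existing job run.
resp = requests.post(
    f"{host}/api/2.1/jobs/runs/repair",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "run_id": 123456,                                 # placeholder run ID
        "rerun_tasks": ["load_silver", "build_gold"],     # placeholder task keys
    },
)
resp.raise_for_status()
print(resp.json())   # identifies the repair run that was started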


7. Code Optimization and Lazy Evaluation

Spark uses lazy evaluation, meaning it doesn’t execute any transformations until an action (like count(), show(), or write()) is called. The Catalyst Optimizer builds a logical and then a physical plan before execution. An inefficient transformation early in the chain can create a bottleneck later on.
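A cheap way to spot expensive steps before you pay for them is to print the query plan; thanks to lazy evaluation, explain() triggers no computation. The table name below is hypothetical.

Python

# Nothing executes here: explain() only prints the physical plan.
# "Exchange" operators in the plan mark shuffles, the costly steps to watch for.
events = spark.read.table("web_logs")          # hypothetical table
counts = events.groupBy("user_id").count()     # wide transformation -> shuffle
counts.explain()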

Contrasting Example: Caching and Repeated Computations

Let’s say you have a DataFrame (transactions_df) that requires a complex join and aggregation, and you need to use the result of this computation multiple times in your notebook.

High-Cost Query (Repeated Computation):

Python

# Join a large transactions table with a products table
enriched_transactions_df = transactions_df.join(products_df, on="product_id")

# First computation: total sales by region
enriched_transactions_df.groupBy("region").sum("sales").show()

# Second computation: average sales by product category
enriched_transactions_df.groupBy("product_category").avg("sales").show()

Expected Outcome: Because of lazy evaluation, the logic behind enriched_transactions_df (the join) is executed from scratch for each action: the first show() computes it, and the second show() computes it all over again. The cluster performs the expensive join twice, wasting compute cycles and increasing runtime.

Low-Cost Query (Using cache()):

Python

# Join a large transactions table with a products table and cache the result
enriched_transactions_df = transactions_df.join(products_df, on="product_id").cache()
enriched_transactions_df.count() # Trigger caching with a simple action

# First computation: total sales by region (uses cached DataFrame)
enriched_transactions_df.groupBy("region").sum("sales").show()

# Second computation: average sales by product category (uses cached DataFrame)
enriched_transactions_df.groupBy("product_category").avg("sales").show()

Expected Outcome: By calling .cache(), you tell Spark to persist the DataFrame in memory (or on disk if memory is insufficient) after the first computation. Subsequent actions will read from the cache, avoiding re-computation and saving significant time and cost. You can learn more about caching data in the Databricks documentation.
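One follow-up worth remembering: cached data occupies cluster memory, so release it once you no longer need it.

Python

# Free the cached blocks once the DataFrame is no longer needed.
enriched_transactions_df.unpersist()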


The Bottom Line: Optimize for Efficiency

Databricks cost optimization is about maximizing the efficiency of your distributed compute. By focusing on smart cluster sizing, efficient data formats, minimizing shuffling, leveraging Databricks’ built-in management features, and orchestrating jobs efficiently, you can significantly reduce your cloud bill.

What other optimization techniques have you used to save money on Databricks? Share your insights in the comments!