Stop Paying for Distributed Frameworks You Don’t Need
Every week, I come across teams running massive PySpark clusters to process datasets that could easily fit on a single machine. The result? Bloated AWS bills and a false sense of “future-proofing” their data pipelines.
Teams used to outgrow single-machine frameworks. Now, they’re overspending on distributed ones. The main justification for choosing PySpark is usually:
“Let’s build for scale now, whether we need it or not.”
The bigger data rarely arrives, but their monthly cloud bill, like clockwork, always does.
The Hidden Cost of Overengineering
A modest PySpark cluster of 128 GB RAM and 64 cores will set you back about $2.72 per compute hour (four × EC2 c6i.4xlarge @ $0.68 each).
A single machine with the same specs costs around $2.46 per hour (EC2 c6gd.16xlarge @ $2.4576).
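Using the on-demand prices above, the raw compute delta is easy to put in monthly terms. A rough sketch (the 8-hours-a-day, 22-days-a-month usage pattern is an illustrative assumption, not a benchmark):

```python
# Monthly cost comparison from the on-demand rates quoted above:
# 4x c6i.4xlarge ($0.68/hr each) vs 1x c6gd.16xlarge ($2.4576/hr).
# Assumed usage: 8 hours/day, 22 working days/month.
cluster_rate = 4 * 0.68      # $/hour for the 4-node Spark cluster
single_rate = 2.4576         # $/hour for the single large instance
hours_per_month = 8 * 22

cluster_monthly = cluster_rate * hours_per_month
single_monthly = single_rate * hours_per_month

print(f"Cluster: ${cluster_monthly:,.2f}/month")
print(f"Single:  ${single_monthly:,.2f}/month")
print(f"Delta:   ${cluster_monthly - single_monthly:,.2f}/month")
```

The sticker-price gap is small; the real cost difference comes from the overheads listed below, which the cluster pays on every job and the single machine doesn’t.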
At a glance, the cluster seems only slightly more expensive — but that’s before you factor in:
- Cluster overhead: coordination between the Spark driver and executors, shuffle management, and network I/O across nodes.
- Serialization overhead: constant conversion of data between the JVM, Python, and network layers, which burns CPU and adds latency.
- Distributed system headaches: failed executors, partition skew, missing shuffle files, dependency mismatches, long shuffle stages, and debugging across multiple nodes.
These costs apply primarily to on-demand or ephemeral clusters (e.g., EMR, Databricks, Snowflake, Dataiku). If you’re running a dedicated cluster 24/7, the math is even worse: unless that cluster is operating near full utilization, you’re effectively paying to heat a data center, and you should consider migrating away immediately.
Although Spark’s Tungsten project alleviates some of this overhead, it all adds up to significant financial and cognitive drag for workloads that would run perfectly well on a single machine.
How We Got Here
Back in 2019, data engineers had two main choices:
- Pandas: Single-threaded, in-memory, fast for small datasets but limited by your machine’s RAM.
- PySpark: Distributed, scalable, and capable of handling massive datasets across clusters.
The rule of thumb was simple:
- If your data didn’t fit in memory, use PySpark.
- If it did, still use PySpark, because someday it might not.
This logic made sense at the time. Pandas couldn’t utilize multiple cores or spill gracefully beyond RAM limits, so engineers defaulted to PySpark for both performance and peace of mind. As a result, PySpark became the de facto standard for most production-scale data work.
The New Middle Ground: DuckDB and Polars
We now have DuckDB and Polars, tools that bridge the once-wide gap between Pandas and PySpark. Both can:
- Use multiple cores efficiently
- Process data larger than memory through streaming or out-of-core execution
- Scale to multi-terabyte workloads on a single machine
- Run blazingly fast without the overhead of a distributed cluster
In other words, the decision to use PySpark is no longer driven by hardware limitations; it’s driven by financial ones.
DuckDB shines for SQL-style analytics and lightweight OLAP workloads, while Polars excels at Pythonic data frame transformations with parallel execution. Polars’ syntax is quite similar to PySpark’s, so if you’re comfortable with PySpark, you’ll be comfortable with Polars. Together, they’ve reshaped what single-machine data processing can do and made PySpark unnecessary for most batch workloads under 32 TB.
The Modern Data Framework Spectrum
Here’s a general guide to help you decide which tool to use.

A few assumptions and implications come with these guidelines:
Micro data can be handled effortlessly by Pandas, Polars, or DuckDB.
While DuckDB and Polars are generally faster, the difference at this scale is negligible. The best choice usually depends on whichever syntax or workflow the analyst or developer is most comfortable with.
Medium to large data is defined here as anything up to roughly 32 TB, a memory threshold set by the largest EC2 instance currently available (u7in-32tb.224xlarge). This isn’t a magic number or a rule of thumb; it’s the physical ceiling of vertical scaling in today’s public cloud. Beyond that limit you simply can’t add more memory to a single machine; you have to scale horizontally with multiple nodes.

A distributed cluster with equivalent memory and CPU capacity costs roughly the same per compute hour, but a single large instance avoids coordination overhead and network latency. Technically, DuckDB and Polars can spill to EBS or local NVMe storage to process data larger than memory, but once datasets grow large relative to RAM, I/O throughput becomes the dominant bottleneck. At that point a cluster begins to make both financial and operational sense, because it can parallelize compute, disk I/O, and network I/O across many machines. A 32 TB RAM machine is the theoretical limit for DuckDB and Polars at this time, but I/O pressure will usually be the reason to move to a cluster well below that threshold.
Performance also depends on data composition and transformation complexity:
- Wide tables (hundreds of columns) increase per-row memory and disk access costs.
- Nested or semi-structured data (like JSON or nested Parquet) adds CPU overhead for parsing and projection.
- Expensive transformations (joins, window functions, global sorts) multiply intermediate data volumes and I/O load.
- Conversely, columnar formats, partitioning, and predicate pushdown can dramatically extend how far a single machine scales efficiently.
Once your working dataset or intermediate state exceeds what the largest single machine can physically handle, a cluster becomes the only realistic option.
Streaming workloads are still best handled by PySpark.
Takeaway
The golden age of “big data” made distributed frameworks like PySpark the default choice for almost every team. But today, hardware has caught up and software has evolved. Most workloads no longer need a cluster; they just need the right single-machine tool.
If your data fits comfortably under 32 TB and you’re not hitting I/O issues, you’re still within the limits of what a single EC2 instance can handle. Tools like DuckDB and Polars can now process terabytes efficiently on one node, without the cost or complexity of distributed computing.
Clusters still have their place: true big data (beyond 32 TB), streaming workloads, or massively parallel compute. But they’re the exception, not the rule.
The new rule of thumb is simple:
“Scale out only when you’ve truly hit the limits of a single machine.”
Everything else?
Keep it simple. Keep it local. Shrink your bill at the end of the month.