Magnet Push-Based Shuffle Service

The paper “Magnet: Push-based Shuffle Service for Large-scale Data Processing” is from LinkedIn.

The useful part is this:

Spark shuffle becomes slow at large scale because reducers read billions of tiny shuffle blocks from disk/network. Magnet fixes that by converting tiny fragmented shuffle blocks into larger merged blocks before reducers fetch them.

Core Problem

The central problem is not:

“Spark is slow”
“Network is slow”

The real problem is:

Tiny random reads.

In normal Spark shuffle, every map task writes data for every reduce partition. Later, each reducer fetches its block from many mapper outputs.

At LinkedIn scale:

Average shuffle block size was only tens of KB.
They had tens of billions of shuffle blocks daily.
Small random reads on HDDs are IOPS-bound.
Shuffle data is usually read only once in random order.
Normal caching does not help much.

Key Formula

number_of_shuffle_blocks = M * R
average_block_size = total_shuffle_data / (M * R)

Where:

M = number of mappers
R = number of reducers

As data grows, teams usually increase mappers and reducers to keep task size manageable.

That makes this value explode:

M * R

As M * R grows, the average shuffle block size becomes smaller.

So horizontal scaling can silently create a shuffle bottleneck.

Main Trick: Push-Merge Shuffle

The main trick in Magnet is push-merge shuffle.

After a mapper creates its normal shuffle file, it does not wait for reducers to pull tiny blocks later.

Instead, it pushes shuffle blocks to remote Magnet shuffle services.

Those services merge blocks belonging to the same reduce partition into larger files.

Later, reducers fetch fewer, larger, mostly sequential blocks instead of many tiny random blocks.

Design Decision: Best-effort Optimization

The important design decision is that Magnet is best-effort, not mandatory.

The original unmerged shuffle blocks are still kept.

If any of these fail:

Pushing fails.
Merging fails.
Fetching the merged block fails.

Then the reducer falls back to the original blocks.

This makes the optimization non-fatal.

This is a very reusable distributed-system pattern:

Add a fast path, but keep the old path as fallback until the fast path is proven safe.

Metadata Tracking

The second important trick is metadata.

Magnet tracks:

Which mapper blocks were merged.
Where merged files are stored.
Which original blocks are behind each merged file.

The shuffle service keeps metadata like:

bitmap
currentMapId
file offset

The purpose of each field:

bitmap       -> prevents duplicate writes
currentMapId -> avoids concurrent corruption while appending
offset       -> allows recovery if a partial write corrupts the merged file

Keep the Shuffle Service Dumb and Cheap

The third trick is to keep the shuffle service simple.

Magnet does not sort data inside the shuffle service.

It does not buffer many shuffle blocks in shuffle-service memory.

Sorting stays inside Spark executors.

Buffering also happens in executors, not in the central shuffle service.

The shuffle service mainly accepts pushed blocks and appends them to files.

This avoids turning the shuffle service into a CPU or memory bottleneck.

Exploit Locality Where Useful

The fourth trick is to exploit locality where useful.

In on-prem clusters where compute and disk are on the same machines, Magnet can place merged shuffle files near future reducer tasks.

That reduces network fetches.

With Spark dynamic allocation, Magnet inverts the normal logic.

Normally, Spark may choose shuffle services based on currently running executors.

Magnet instead chooses good shuffle-service locations first and later launches executors there.

Cloud-style Setups

In cloud-style setups, compute and storage are often separated.

In that case, locality is less useful.

There, Magnet changes the policy.

Instead of optimizing for locality, it can optimize for load balancing by choosing less-loaded shuffle services.

The same core idea still helps because larger chunks use network/storage bandwidth better than many tiny fragmented reads.

Bound Stragglers

The fifth trick is bounding stragglers.

Magnet does not wait forever for all push/merge operations to finish.

The Spark driver can set an upper wait limit.

After that, it tells Magnet services to stop merging and starts the reduce stage anyway.

Some blocks may remain unmerged, but the job still gets most of the benefit without letting the optimization itself become the bottleneck.

Handle Skew Separately

The sixth trick is handling skew by skipping huge partitions.

If one reduce partition is extremely large, merging it may create a massive merged file, possibly tens or hundreds of GB.

Magnet can skip blocks larger than a threshold and leave them to Spark’s adaptive skew handling.

The important lesson is:

Do not apply the same optimization blindly to normal and skewed data.

Overlap Work

The seventh trick is overlap.

Magnet pushes shuffle blocks in background threads so map-task execution and block pushing can overlap.

On the reduce side, it avoids fetching one giant merged file as a single unit.

It slices merged files into MB-sized pieces so fetching and reduce processing can still overlap.

This keeps pipeline parallelism alive.

Counterintuitive Lesson

The counterintuitive lesson is:

Writing data twice can still be faster.

Magnet writes:

original shuffle output
merged shuffle output

Normally, this sounds bad.

But the argument is:

Many tiny random reads are worse than extra writes.

Writes benefit from OS page cache and disk buffers.

Shuffle reads are random and usually one-time.

So the extra write cost is acceptable if it removes a much larger random-read cost.

Results

The result was large for small blocks.

In one benchmark:

Data transferred: 150 GB
Block size: 10 KB

Vanilla Spark: around 4 hours
Magnet: a little over 5 minutes

In production-style workloads, Magnet reduced total task runtime by approximately:

Workload 1: 69%
Workload 2: 66%
Combined workloads: 70%

It did not significantly help the small workload, but it also did not cause meaningful regression.

Reusable Engineering Lessons

When debugging Spark shuffle, first check shuffle block size, not only total shuffled bytes.

A job can have reasonable total data but still be slow because it creates too many tiny mapper -> reducer blocks.

Reducing reducer count is not always the correct fix because bigger reduce tasks may hurt memory, CPU, skew, and scheduling.

Convert many tiny random reads into fewer large sequential reads wherever possible.

Make performance optimizations best-effort with fallback, especially in distributed systems.

Keep shared infrastructure services lightweight. Push CPU/memory-heavy work to workers/executors where capacity is larger.

Treat skewed data separately. One optimization should not be forced onto both normal partitions and huge partitions.

Bound the waiting time of background optimization work. An optimization that waits forever becomes part of the critical path.

Preserve metadata visibility in the main engine. If Spark still knows where merged and unmerged data lives, it can schedule better and recover better.

Sometimes extra writes are cheaper than tiny random reads, especially on HDD-heavy systems.

One-line Summary

Magnet makes Spark shuffle faster by pushing mapper output to shuffle services early, merging tiny per-reducer blocks into larger files, then letting reducers read larger/local/fewer blocks, while keeping the old unmerged path as fallback.