The invisible engineering behind Lambda’s network

A special thanks to the engineers who shared their story with me and have helped bring this blog post to life: Ravi Nagayach, Prashant Singh, Kshitij Gupta, and the entire Lambda networking team. These are folks doing the invisible engineering that keeps AWS running.
Most infrastructure improvements at AWS happen invisibly. Engineering teams spend years incrementally rebuilding systems that millions of customers depend on, while those systems continue running at full scale without disruption. Marc Olson described this as converting a propeller aircraft to a jet while it’s in flight. One mistake and the plane goes down. But get it right… and no one notices.
This is the work that will never make headlines or get a blog post (at least not when things go as planned). Work like optimizing iptables rules, working around kernel lock contention, or rewriting packet headers. Where success is silent. The reward is knowing what you’ve worked on is better today than it was a week ago, and that the next team won’t run into the same constraints you just removed.
I’ve been thinking about this a lot lately. There are big launches like S3 Files, which solve very visible customer problems, and then there is the work that’s just as impressive that happens quietly, over long periods of time, and just out of sight of our customers. Today, I want to share a Lambda story with you that’s spanned the better part of a decade, and that’s made things we thought impossible, such as running latency-sensitive workloads in a serverless function, well, possible. It’s the story of Lambda’s networking team, and how their subtle inventiveness has both transformed what’s possible with Lambda and impacted how and what we can build across AWS.
What is a network topology?
Before we get into the weeds, it helps to understand what a network topology is, because it’s the foundation for everything that follows in this blog. A network topology is the arrangement of devices, connections, and rules that determine how data moves between points in a system. Think of it as the plumbing. It defines which paths exist, how traffic gets routed, how isolation is enforced between tenants, and what happens when a packet needs to travel from point A to point B. In a cloud environment, this plumbing is software-defined—built from virtual devices, tunnels, routing rules, and packet filters rather than physical cables and switches.
When you’re running a single application on a single machine, the topology is trivial. But when you’re running millions of lightweight virtual machines on shared hardware, each needing its own isolated network path, its own security boundaries, and the ability to connect to a customer’s private network, the topology becomes one of the most consequential design decisions that you make. Every device you add, every rule you create, every tunnel you establish has a cost in latency, CPU, and memory. And those costs multiply with density. Get the topology right and builders just see fast, reliable connectivity.
For Lambda, this is where our story begins. With a network topology that served non-VPC functions well, but one that imposed a real cost on functions connecting to a customer’s VPC.
The VPC cold start problem
A Lambda cold start happens when Lambda has to create a new microVM to handle an invoke, because there is no warm execution environment already available to take on the work. Creating the execution environment includes allocating the microVM, downloading the customer’s code, starting the language runtime, and running the customer’s initialization code, all before the invoke payload ever reaches a customer’s handler. A VPC cold start is all of that plus the additional network setup required for the microVM to reach resources inside a customer’s private network. This overhead is why VPC cold starts have historically been slower than non-VPC cold starts.
When Lambda migrated to Firecracker microVMs in 2019, cold start overhead dropped from over ten seconds to under a second. Throughout the year, the team continued to chip away at the remaining latency with targeted fixes. However, setting up the Generic Network Virtualization Encapsulation (Geneve) tunnel that routes a Lambda function’s traffic to the correct customer VPC, along with DHCP, was still consuming 300 milliseconds. For some workloads, that was a manageable tradeoff, but for builders designing responsive applications, it was a real barrier. And the team’s experiments showed it would get worse with density.
The team had been tracking cold start metrics across both VPC and non-VPC configurations, and at higher microVM densities they observed tail latencies growing from hundreds of milliseconds to seconds. The root cause wasn’t obvious, so they instrumented the full path and ran a series of experiments, varying concurrency, density, and the mix of create and destroy operations. What they found was that the dominant contributor was tunnel creation itself. Every packet traveling through a Geneve tunnel carries a Virtual Network Identifier (VNI), and that VNI has to be set when the tunnel is created. In Lambda’s case, the VNI wasn’t available until function initialization, and Linux offered no way to update it after the tunnel was created.
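To make that constraint concrete, here’s roughly where the VNI lives in every encapsulated packet, following the header layout in RFC 8926. This is a simplified sketch (assuming a little-endian bitfield layout), not code from the Lambda data plane:

```c
#include <stdint.h>

/* Simplified Geneve header (RFC 8926). The 24-bit VNI sits in the second
 * 32-bit word of every encapsulated packet; with the stock kernel driver
 * its value is fixed when the tunnel device is created. */
struct geneve_hdr {
    uint8_t  opt_len:6, ver:2;   /* options length, version            */
    uint8_t  rsvd1:6, critical:1, oam:1;
    uint16_t proto_type;         /* 0x6558 when carrying Ethernet      */
    uint8_t  vni[3];             /* Virtual Network Identifier         */
    uint8_t  rsvd2;
    /* variable-length options follow */
};
```

With the stock driver, that vni field is baked in at device creation time, which is exactly the coupling the team needed to break.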
Writing a custom kernel driver was on the table, but carrying Lambda-specific kernel patches indefinitely wasn’t a trade-off the team was willing to make. The real choice was between the Data Plane Development Kit (DPDK) and extended Berkeley Packet Filter (eBPF). eBPF was the less traveled path, but projects such as Cilium were proving its utility at scale. The team would be among the first in Lambda to use it in production, and there were real questions about whether it would hold up at scale and pass the security reviews that came with it. But it offered lower overhead than DPDK, and more importantly, it put the team in control of their own infrastructure. So they built a proof of concept.
Tunnels were created with dummy VNIs during pooling. When a function initialized and the real VNI became available, an eBPF program mapped the dummy VNI to the real VNI, rewriting the Geneve header on egress and reversing it on ingress. Geneve tunnel latency dropped from 150 milliseconds to 200 microseconds. Expensive tunnel creation moved off the hot path entirely.
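Here’s a minimal sketch of what the egress half of that rewrite could look like as a tc/eBPF program. The map name, the attach point, the assumption of IPv4 without IP options, and the skipped outer UDP checksum fix-up are my simplifications for illustration; this is not Lambda’s production code:

```c
// geneve_vni_rewrite.c -- illustrative tc/eBPF sketch, not Lambda's code.
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define GENEVE_PORT 6081

/* dummy VNI -> real VNI, populated from user space at function init */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 8192);
    __type(key, __u32);
    __type(value, __u32);
} vni_map SEC(".maps");

SEC("tc")
int rewrite_vni_egress(struct __sk_buff *skb)
{
    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_OK;

    struct iphdr *ip = (void *)(eth + 1);             /* assumes no IP options */
    if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_UDP)
        return TC_ACT_OK;

    struct udphdr *udp = (void *)(ip + 1);
    if ((void *)(udp + 1) > data_end || udp->dest != bpf_htons(GENEVE_PORT))
        return TC_ACT_OK;

    /* Geneve header: 4 bytes of flags/protocol, then the 24-bit VNI. */
    __u8 *gnv = (__u8 *)(udp + 1);
    if ((void *)(gnv + 8) > data_end)
        return TC_ACT_OK;

    __u32 dummy = ((__u32)gnv[4] << 16) | ((__u32)gnv[5] << 8) | gnv[6];
    __u32 *real = bpf_map_lookup_elem(&vni_map, &dummy);
    if (!real)
        return TC_ACT_OK;

    /* Overwrite the dummy VNI in place. A production version would also
     * repair the outer UDP checksum; omitted here for brevity. */
    __u8 vni[3] = { (__u8)(*real >> 16), (__u8)(*real >> 8), (__u8)*real };
    bpf_skb_store_bytes(skb, (__u32)(gnv + 4 - (__u8 *)data), vni, sizeof(vni), 0);
    return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";
```

The ingress direction would do the inverse lookup, mapping the real VNI back to the dummy value the pre-created tunnel device expects.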
With this solution, the team had also removed a fundamental blocker for packing more microVMs onto each worker, and reduced a source of CPU heat during bursts of cold starts, which improved the platform’s ability to absorb traffic spikes and handle scenarios like availability zone evacuations.

With Geneve tunnel latency down from 150 milliseconds to 200 microseconds, the platform overhead for VPC cold starts was no longer the bottleneck. DHCP remained open then, and still does today, as a multi-phase effort the team is working through. But the headroom this work created was significant, and it would become the foundation for SnapStart.
Reimagining our network topology (out of necessity)
Lambda SnapStart presented a new set of challenges for our engineers. Instead of initializing each function one at a time from scratch, SnapStart takes a snapshot of an already initialized execution environment and clones it to serve multiple invocations concurrently. Because the initialization work happens once at snapshot time and not on every invocation, cold start times dropped dramatically, particularly for Java workloads where initialization overhead had always been highest. But the team had a new problem to solve: each clone needed its own isolated network namespace with separate tap, bridge, veth, and tunnel devices, ready before the VM started. The original design created these on demand; SnapStart needed them pre-created and ready to attach.
Each host had capacity for up to 2,500 microVMs. When SnapStart launched, both topologies ran on the same hosts, with the 2,500 slots split between them: 200 allocated to the new snapshot topology and 2,300 to on-demand workloads. The 200 cap was a deliberate trade-off. Snapshot networks required twice as many Linux network devices per VM, and the cost to create and destroy them grew with density; every additional device carried a penalty. Full fleet adoption wasn’t expected immediately, and the team figured they had a year of runway, so they chose to launch with a lower cap and come back to the scaling problem later.
Shipping with a split topology and a cap of 200 was the right call for launch, but Lambda was moving toward snapshot-based VMs for all workloads, and two topologies running side-by-side indefinitely was a tax that they were unwilling to pay. The team needed to converge them and scale from 200 to 2,500 snapshot networks per host.
One bottleneck at a time
When the team started scale testing the snapshot topology, the first issue they ran into was network creation itself. Creating Linux network devices (tap, veth, namespaces) got slower as density increased, and running destroys alongside creates made everything stall.
Every time a new device was created, Linux had to traverse its existing device lists, so the cost of creating the N+1 network grew with N. At their target density of 4,000 networks (up from 2,500 across both topologies), with Lambda’s constant VM turnover, the overhead never stopped accumulating. The best solution, it turned out, was to stop creating networks on demand altogether. Instead of paying the cost during function invocation, the team moved all of it to worker initialization, pre-creating all 4,000 networks before the worker ever started a request. On the surface, spending three minutes creating networks before a worker can do anything useful sounds shaky, but Lambda workers cycle infrequently compared to microVMs, which changes the math entirely. As Ravi put it, “absorbing the cost once at boot rather than paying it continuously during operation” was the right call, and the CPU drain during function execution disappeared. Colm MacCárthaigh calls this constant work—systems that do the same amount of work regardless of load, like a coffee urn that keeps hundreds of cups warm whether three people show up or three hundred. The worker always pays the same boot cost. It was one layer, but there were more.
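To give a feel for what that constant work looks like, here’s a stripped-down sketch of boot-time namespace pooling, using the same unshare-and-bind-mount trick that the ip netns add command performs. The pool size, naming scheme, and minimal error handling are illustrative, not Lambda’s implementation:

```c
// pool_netns.c -- illustrative sketch of pre-creating a pool of network
// namespaces at worker boot (not Lambda's actual code). Assumes root
// privileges and that /var/run/netns already exists.
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

#define POOL_SIZE 4000

int main(void)
{
    /* Keep a handle to the worker's root network namespace so we can
     * return to it after creating each slot. */
    int root_ns = open("/proc/self/ns/net", O_RDONLY);
    if (root_ns < 0) return 1;

    for (int slot = 0; slot < POOL_SIZE; slot++) {
        char path[64];
        snprintf(path, sizeof(path), "/var/run/netns/slot-%d", slot);

        /* Create the pin file, switch into a brand-new netns, and pin it
         * by bind-mounting /proc/self/ns/net onto the file. */
        int fd = open(path, O_RDONLY | O_CREAT | O_EXCL, 0444);
        if (fd < 0) return 1;
        close(fd);

        if (unshare(CLONE_NEWNET) < 0) return 1;
        if (mount("/proc/self/ns/net", path, "none", MS_BIND, NULL) < 0) return 1;

        /* Hop back to the root namespace before creating the next slot. */
        if (setns(root_ns, CLONE_NEWNET) < 0) return 1;
    }

    /* The pool now exists. Devices, eBPF programs, and per-slot rules can
     * be attached before the worker accepts its first invoke. */
    return 0;
}
```

Everything above runs once, before the worker takes traffic; the invoke path never pays for it.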
The NAT implementation was another source of pain. The original system used iptables for stateful Network Address Translation. Packets underwent double NAT, once in the VM’s network namespace and again on the eth0 interface. At high densities, with thousands of VMs processing traffic simultaneously, the kernel had to maintain and query connection tables for every packet. The contention added significant latency. The team replaced stateful NAT with stateless packet mangling using eBPF, rewriting headers based on predetermined mappings instead of tracking connection state. NAT setup latency dropped by 100x.
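The shape of that change is easy to sketch: because each slot’s address mapping is known ahead of time, an eBPF program can rewrite headers from a lookup table rather than consulting conntrack. The map layout and the single-direction, IPv4-only handling below are my own simplifications, not the production program:

```c
// stateless_snat.c -- hypothetical sketch of stateless NAT in tc/eBPF
// (illustrative only). Rewrites IPv4 source addresses from a fixed map.
#include <stddef.h>
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Predetermined per-slot mapping: VM-side address -> address on the wire. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 4096);
    __type(key, __u32);
    __type(value, __u32);
} snat_map SEC(".maps");

SEC("tc")
int stateless_snat_egress(struct __sk_buff *skb)
{
    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_OK;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return TC_ACT_OK;

    __u32 old_saddr = ip->saddr;
    __u32 *new_saddr = bpf_map_lookup_elem(&snat_map, &old_saddr);
    if (!new_saddr)
        return TC_ACT_OK;

    /* Rewrite the source address and patch the IPv4 header checksum.
     * No connection state is created or consulted. L4 checksums would
     * need the same fix-up; omitted for brevity. */
    __u32 ip_off = (__u32)((__u8 *)ip - (__u8 *)data);
    bpf_l3_csum_replace(skb, ip_off + offsetof(struct iphdr, check),
                        old_saddr, *new_saddr, sizeof(__u32));
    bpf_skb_store_bytes(skb, ip_off + offsetof(struct iphdr, saddr),
                        new_saddr, sizeof(__u32), 0);
    return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";
```

Because the mappings are fixed per slot, there is no connection table to grow, lock, or scan under load.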
And then there were iptables rules, which do a lot of heavy lifting, from routing to NAT to filtering, but at their core they are a set of rules the kernel evaluates in sequence for every packet, deciding what is allowed and where it goes. The configuration had grown to over 125,000 rules in the root network namespace. This wasn’t accumulated cruft or a discipline problem, it was a density problem. Each VM slot required roughly 30 rules, organized across its own chains and jumps for management and data traffic; multiply that by 4,000 slots, add the fixed rules that applied globally, and you arrive at 125,000. Because every packet had to traverse the rules in sequence, a packet for slot 0 processed quickly, while a packet for slot 4,000 walked through thousands of additional rules, adding up to a millisecond of connection setup latency from rule traversal alone. The team moved the 30 slot-specific rules into each individual network namespace, reducing the root namespace from 125,000+ rules to just 144 static, slot-agnostic rules. The performance skew between slots disappeared.

Network pooling eliminated the CPU drain. Stateless NAT removed the conntrack table bottleneck. Simplifying iptables fixed the performance skew. Still, network creation was slower than it needed to be.
The culprit was the Routing Netlink (RTNL) lock, Linux’s way of ensuring that only one thing can modify the network configuration at a time. It’s a necessary guardrail, but at scale it becomes a bottleneck. When the team tried to create thousands of network devices and namespaces in parallel during worker boot, operations queued behind the lock. What should have taken seconds stretched to minutes. It’s a bit like when a car breaks down on a bridge in Amsterdam (a city that is not designed for cars). First the car behind it gets stuck, then the car behind that one, then a tram, and on and on until the entire city is gridlocked. That’s why I ride my bike.
For Lambda, the fix was to rethink the order of operations. Pool network namespaces first, create veth pairs inside the namespace before moving them to root, and batch eBPF program attachments for all veth devices in a single operation instead of one at a time. The queuing disappeared.
Invisible engineering
Lambda now runs a single, unified network topology supporting both traditional and snapshot-based workloads. This is what years of invisible engineering look like in practice.

The team scaled from 200 to 4,000 snapshot networks per worker, a 20x increase in capacity, with benchmarks showing potential for even more. All 4,000 networks are created in three minutes during worker initialization, with no background CPU drain during invokes. The iptables simplification eliminated performance variation between network slots. Every packet now traverses the same 144 rules regardless of slot assignment. And the combined optimizations lowered CPU usage by 1%. At Lambda’s scale, each percent translates to significant infrastructure savings.
When the team building Aurora DSQL needed scalable Firecracker-based networking with the right security and performance characteristics, they reached out to Lambda’s networking team. Rather than have DSQL rebuild everything from scratch, the Lambda team encapsulated the full networking stack into a service that DSQL could install and run on their own workers. The service handles device management, firewall rules, NAT translation, and the security hygiene required to safely reuse a network after release. DSQL requests a network when it needs one for a VM and releases it when done. Lambda owns the service and vends new versions, and every optimization they make flows to DSQL automatically. It saved the DSQL team months of engineering effort and gave them Lambda-grade networking density from day one.
This is the job
Most of what we build at AWS, nobody will ever see. A customer deploys a Lambda function that connects to their VPC and it starts in milliseconds. They don’t think about the Geneve tunnels underneath, or the iptables rules, or the kernel mutex that had to be worked around to make that possible. They shouldn’t have to.
This particular effort took the better part of a decade, and it didn’t come with a product launch or a press release. The team converged two network topologies into one, eliminated bottlenecks at every layer of the stack, and scaled capacity by 20x. When they were done, Lambda functions started faster and ran more efficiently. And most customers never noticed the change. But the demand for faster cold starts hasn’t slowed down. If anything, it’s accelerated as new workloads push Lambda in directions we couldn’t have anticipated five years ago.
The engineers who did this work knew that going in. Optimizing iptables rules and working around kernel lock contention doesn’t make headlines. But there is a professional pride that comes from doing the “thing” properly even when nobody’s watching. Pride in the unseen systems that stay up through the night. In clean deployments. In rollbacks that go unnoticed. In the research. In listening to the community and working collaboratively on changes. Or knowing the system is better today than it was yesterday, and that the next team who works on it won’t hit the constraints you just removed.
This is what defines the best builders and the best teams. They do the work not because someone is going to write about it, but because it’s the right thing to do. Aristotle called this “Arete”, the relentless and lifelong pursuit of excellence. And when I look at what these networking engineers have delivered, quietly and incrementally, I see that commitment everywhere.
Now, go build!


