GPU Racks and NVSwitches
Internal communication of a MoE model
If you follow the AI world at all, you have almost certainly come across the term MoE (Mixture of Experts). The idea was first introduced in 1991 by Geoffrey Hinton and his colleagues, and it now underpins many of today’s largest models.
MoE models are sparse: of all the weights in the model, only those belonging to a small subset of experts are active during each forward pass that produces a token. To put it in numbers: a model might have 256 experts and activate only 16, 32, or 64 of them when generating a single token. The selection is handled by a router, which is trained alongside the model and sits in front of every expert layer in the architecture.
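The routing step can be sketched in a few lines of Python. This is a minimal illustration, not how any particular model does it: the function name `top_k_route`, the random logits, and the choice to apply softmax only over the selected experts are all assumptions made for the example, and real routers differ in how they score and normalize.

```python
import math
import random

def top_k_route(logits, k):
    """Return (expert_ids, weights) for the k highest-scoring experts."""
    # Indices of the k largest router logits, best first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over only the selected logits so the weights sum to 1.
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]

NUM_EXPERTS = 256  # illustrative, matching the numbers in the text
TOP_K = 16         # illustrative

rng = random.Random(0)
# Stand-in for the router output for one token. A trained router would
# compute these logits from the token's hidden state.
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

expert_ids, weights = top_k_route(router_logits, TOP_K)
print(f"active experts: {len(expert_ids)} of {NUM_EXPERTS}")
print(f"fraction of experts used: {TOP_K / NUM_EXPERTS:.4f}")
```

With 16 of 256 experts selected, only 6.25% of the experts do any work for this token; the rest of the expert weights sit untouched.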
Unlike dense models, where every weight participates in every operation, MoE models constantly make routing decisions, sending each token to different experts, which can be distributed across many GPUs in racks across a data center. That routing is what makes MoE efficient. It’s also what makes it physically complicated.
Problem: All-to-All Communication
In a dense model, every token passes through the same weights on the same GPUs: predictable, sequential, easy to plan for. MoE breaks that pattern entirely. The router evaluates each token and dispatches it to a specific expert, and those experts are distributed across many GPUs. At any given moment, every GPU might need to send data to every other GPU.
This is the all-to-all communication problem. Think of it like a hospital triage system: instead of every patient seeing the same general doctor, a nurse instantly routes each patient to a specific specialist elsewhere in the building. With hundreds of patients being routed at once, every ward needs fast, direct access to every other ward. If the hallways are slow, specialists sit idle waiting, and the whole system stalls.
That communication bottleneck, translated to hardware, shapes everything about how modern AI infrastructure is physically built.
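To see why the pattern is called all-to-all, here is a small simulation that counts how many token copies each GPU sends to each other GPU. Everything in it is an assumption for illustration: the GPU and expert counts are made up, experts are placed on GPUs in contiguous blocks, and a uniform random choice stands in for a trained router.

```python
import random

NUM_GPUS = 8           # illustrative
EXPERTS_PER_GPU = 4    # illustrative, 32 experts in total
TOKENS_PER_GPU = 1000  # tokens that start out on each GPU
TOP_K = 2              # experts chosen per token

def build_traffic_matrix(seed=0):
    """traffic[src][dst] = token copies sent from GPU src to GPU dst."""
    rng = random.Random(seed)
    num_experts = NUM_GPUS * EXPERTS_PER_GPU
    traffic = [[0] * NUM_GPUS for _ in range(NUM_GPUS)]
    for src in range(NUM_GPUS):
        for _ in range(TOKENS_PER_GPU):
            # Uniform random choice stands in for a trained router.
            for expert in rng.sample(range(num_experts), TOP_K):
                dst = expert // EXPERTS_PER_GPU  # GPU hosting this expert
                traffic[src][dst] += 1
    return traffic

traffic = build_traffic_matrix()
total = sum(map(sum, traffic))
local = sum(traffic[g][g] for g in range(NUM_GPUS))
busy_pairs = sum(
    1
    for s in range(NUM_GPUS)
    for d in range(NUM_GPUS)
    if s != d and traffic[s][d] > 0
)
print(f"GPU pairs exchanging tokens: {busy_pairs} of {NUM_GPUS * (NUM_GPUS - 1)}")
print(f"token copies leaving their GPU: {(total - local) / total:.1%}")
```

Under these assumptions every ordered pair of GPUs ends up exchanging tokens, and most token copies leave the GPU they started on, which is exactly the traffic pattern the interconnect has to absorb.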
Solution: Racks, NVLink, and NVSwitch
The industry’s answer is to cluster GPUs into racks: large physical cabinets that house dozens of GPUs along with their networking, power delivery, and cooling. NVIDIA’s Blackwell-based GB200 NVL72, for instance, packs 72 GPUs into a single rack.
Within a rack, two technologies work together to create what’s called a scale-up network:
NVLink is NVIDIA’s proprietary high-speed interconnect, used for GPU-to-GPU traffic in place of slower standard connections like PCIe. It lets GPUs exchange data fast enough that they effectively behave as one large, unified machine rather than dozens of separate chips.
NVSwitch solves the wiring problem that appears once you’re dealing with dozens of GPUs. Connecting every GPU directly to every other GPU would require thousands of point-to-point links, which is physically impractical at rack scale. Instead, every GPU connects to a central NVSwitch fabric. Any GPU can reach any other in just two hops (GPU to switch, switch to GPU), at full bandwidth, regardless of which pair is communicating.
Together, NVLink and NVSwitch turn a rack into something close to a single giant GPU: fast, tightly connected, and with little communication penalty for routing data anywhere within it.
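The wiring argument for a switch fabric comes down to simple counting. The sketch below compares a direct full mesh with a switched design; the helper names are hypothetical, and `links_per_gpu=1` is a deliberate simplification, since real racks run several NVLink links per GPU across multiple switch chips.

```python
def full_mesh_links(num_gpus):
    """Point-to-point links needed to connect every GPU pair directly."""
    return num_gpus * (num_gpus - 1) // 2

def switched_links(num_gpus, links_per_gpu=1):
    """Links needed when each GPU only connects to the switch fabric."""
    return num_gpus * links_per_gpu

for n in (8, 72):
    print(f"{n} GPUs: full mesh = {full_mesh_links(n)}, via switch = {switched_links(n)}")
```

For 72 GPUs the full mesh needs 2,556 links, while the switched design grows only linearly with the number of GPUs.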
Crossing Racks
This speed has a hard boundary. The moment a model is too large to fit inside a single rack, data has to travel over the scale-out network, the data-center-wide fabric of InfiniBand or Ethernet that connects separate racks. That inter-rack communication can be roughly 8× slower than the intra-rack NVSwitch fabric.
For MoE models, this is particularly damaging. If an expert layer spills across two racks, a large fraction of tokens gets forced onto the slow network on every single forward pass. What was a routing efficiency gain quickly becomes a communication bottleneck.
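A rough back-of-the-envelope model shows how quickly the penalty adds up. It assumes uniform routing, tokens spread across racks in the same proportion as experts, and a flat 8× cost for any inter-rack transfer (the rough figure quoted above). It ignores latency, congestion, and any overlap of communication with compute, so treat the numbers as indicative only; the function names are hypothetical.

```python
SLOWDOWN = 8.0  # inter-rack vs intra-rack cost, the rough figure from the text

def cross_rack_fraction(rack_shares):
    """Fraction of dispatches that cross racks under uniform routing.

    rack_shares[i] is the share of experts (and, by assumption, of
    tokens) living in rack i. The shares must sum to 1.
    """
    if abs(sum(rack_shares) - 1.0) > 1e-9:
        raise ValueError("rack shares must sum to 1")
    # A token in rack i stays local with probability rack_shares[i].
    return 1.0 - sum(s * s for s in rack_shares)

def relative_transfer_cost(rack_shares, slowdown=SLOWDOWN):
    """Average per-dispatch transfer cost, with an intra-rack transfer = 1."""
    remote = cross_rack_fraction(rack_shares)
    return (1.0 - remote) + remote * slowdown

for shares in ([1.0], [0.5, 0.5], [0.25, 0.25, 0.25, 0.25]):
    remote = cross_rack_fraction(shares)
    cost = relative_transfer_cost(shares)
    print(f"{len(shares)} rack(s): {remote:.0%} remote, {cost:.2f}x cost")
```

Under these assumptions, splitting an expert layer evenly across two racks sends half of all dispatches over the slow network and raises the average transfer cost to 4.5× the single-rack case; four racks push it to 6.25×.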
Why Not Build One Giant Rack?
If crossing racks is 8× slower, the obvious question is: why not just keep scaling the rack until everything fits? The answer is physics.
Signal degradation: High-speed copper cables can only run so far before the signal deteriorates.
Connector density: Switches have a finite number of ports. Beyond a certain scale, the wiring physically doesn’t fit.
Power, heat, and weight: The electricity, cooling, and structural load required become unmanageable well before you reach the GPU counts that modern models need.
MoE models perform optimally only when a complete expert layer fits within a single rack-scale NVSwitch domain. That boundary, the edge of the fast fabric, is what drives the relentless push toward denser, more powerful racks. Every generation of hardware is trying to push that boundary outward, fitting more compute into the same tight, fast space.





