Part 10
Completed

Two-Expert Router

What I Built

Implemented a Mixture of Experts layer with a learned routing mechanism, load balancing, and expert capacity management. Started with a two-expert system to understand routing dynamics, then scaled to multiple experts with top-k routing.
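
A minimal sketch of the routing step described above, not the project's actual code. It assumes a linear router (one scoring vector per expert), softmax gating, and gate weights renormalized over the selected experts. The names `route_top_k`, `moe_forward`, and `router_weights`, and the toy experts, are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(token, router_weights, k=1):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts.

    router_weights is a hypothetical [num_experts][dim] matrix:
    one learned scoring vector per expert.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]

def moe_forward(token, router_weights, experts, k=1):
    """Gate-weighted sum of the selected experts' outputs.

    Experts that are not selected are never called (conditional computation).
    """
    out = [0.0] * len(token)
    for idx, gate in route_top_k(token, router_weights, k):
        expert_out = experts[idx](token)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out

# Two toy experts standing in for feed-forward networks.
experts = [lambda x: [2.0 * v for v in x], lambda x: [-v for v in x]]
router_weights = [[1.0, 0.0], [0.0, 1.0]]

y = moe_forward([3.0, 1.0], router_weights, experts, k=1)
print(y)  # expert 0 wins the routing, so the output is [6.0, 2.0]
```

With `k=1` only one expert runs per token. With `k=2` in a two-expert system both experts run and the layer reduces to a soft mixture.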

Key Concepts

Mixture of Experts, Learned Routing, Load Balancing, Expert Capacity, Top-K Routing, Conditional Computation

Architecture

1. Router Network
2. Expert Networks
3. Load Balancer
4. Capacity Manager
5. Gating Mechanism

Results

The two-expert system achieves 90% of the dense model's performance with only 50% of the parameters active. Load balancing prevents expert collapse.
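
The source does not say which load-balancing loss was used. As an illustration, here is a sketch of one common formulation: the number of experts times the sum, over experts, of (fraction of tokens routed to that expert) times (mean router probability for that expert). The function name and the sample probabilities are hypothetical.

```python
def load_balancing_loss(router_probs):
    """Auxiliary loss that is lowest when tokens are spread evenly.

    router_probs: one list of per-expert probabilities for each token.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])

    # f[i]: fraction of tokens whose top-1 choice is expert i.
    counts = [0] * num_experts
    for probs in router_probs:
        counts[max(range(num_experts), key=lambda i: probs[i])] += 1
    f = [c / num_tokens for c in counts]

    # p[i]: mean router probability assigned to expert i.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]

    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

balanced = [[0.9, 0.1], [0.1, 0.9]]   # each expert gets one token
collapsed = [[0.9, 0.1], [0.9, 0.1]]  # every token goes to expert 0

loss_balanced = load_balancing_loss(balanced)
loss_collapsed = load_balancing_loss(collapsed)
```

Under this formulation an even split gives a loss of about 1.0, and the value grows as routing concentrates on one expert, which is the collapse the penalty is meant to discourage.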

Key Learnings

  • Routing is the critical component: bad routing destroys model quality
  • Load balancing losses are essential for training stability
  • Expert specialization emerges naturally with proper routing

Challenges

  • Preventing expert collapse (all tokens routed to one expert)
  • Balancing load while maintaining routing quality
  • Managing expert capacity constraints
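
The capacity constraint in the last item can be sketched as follows. This is a sketch under assumptions rather than the project's implementation: it assumes top-1 assignments, a per-expert capacity of ceil(capacity_factor × tokens / experts), and that overflow tokens are dropped. `dispatch_with_capacity` and `capacity_factor` are hypothetical names.

```python
import math

def dispatch_with_capacity(assignments, num_experts, capacity_factor=1.0):
    """Assign tokens to expert buckets, dropping tokens once a bucket is full.

    assignments: the chosen expert index for each token (top-1 routing).
    Returns (buckets, dropped), both holding token indices.
    """
    capacity = math.ceil(capacity_factor * len(assignments) / num_experts)
    buckets = [[] for _ in range(num_experts)]
    dropped = []
    for token_idx, expert_idx in enumerate(assignments):
        if len(buckets[expert_idx]) < capacity:
            buckets[expert_idx].append(token_idx)
        else:
            dropped.append(token_idx)  # expert is full: token overflows
    return buckets, dropped

# Four tokens, three of which prefer expert 0; capacity is 2 per expert.
buckets, dropped = dispatch_with_capacity([0, 0, 0, 1], num_experts=2)
print(buckets, dropped)  # [[0, 1], [3]] [2]
```

Raising `capacity_factor` to 1.5 in this example gives each expert room for three tokens, so nothing is dropped, at the cost of more padded compute when the load is uneven.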