What is Mixture-of-Recursions (MoR), and why is it different from the much-talked-about Mixture-of-Experts (MoE)?

Where MoE scales width by activating only a subset of large expert modules per token, MoR scales depth by letting each token decide how many times it should re-enter a shared Transformer block. In other words: MoE distributes computation across experts, MoR recycles computation across recursions.

The result is a model that can “think harder” only where needed: trivial tokens get shallow processing, complex ones receive more passes. This token-wise routing, combined with smart key–value caching, yields smaller models with better perplexity, higher throughput, and lower latency, with up to ~2× improvements reported in recent studies.

Unlike static deep stacks, MoR offers dynamic depth per token. Unlike MoE, it avoids ballooning parameter counts, making it especially appealing for edge deployments, enterprise inference at scale, and multimodal tasks where compute budgets are tight. Early experiments in both language and vision confirm its generality.

Open questions remain: Will the gains hold at tens of billions of parameters? How stable will routing be in real production pipelines? But strategically, MoR reframes the race: not how many parameters we can afford, but how much useful depth per token per joule.

#AIResearch #MixtureOfRecursions #AdaptiveComputation #TransformerArchitecture #AI #FutureOfIntelligence #Banking

https://lnkd.in/dS6XYjxX
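To make the core idea concrete, here is a minimal PyTorch sketch of token-wise recursion routing over a single shared layer. It assumes a simple sigmoid router and a fixed exit threshold; the class name MoRBlock, the threshold, and the router design are illustrative simplifications, not the published implementation.

```python
import torch
import torch.nn as nn


class MoRBlock(nn.Module):
    """Token-wise recursion over one shared Transformer layer.

    Illustrative sketch only: the router, exit threshold, and KV-cache
    handling are simplified assumptions, not the exact MoR formulation.
    """

    def __init__(self, d_model=512, n_heads=8, max_recursions=4, exit_threshold=0.5):
        super().__init__()
        # One shared layer reused across recursion steps: depth is recycled,
        # so the parameter count does not grow with the number of passes.
        self.shared_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Lightweight per-token router: probability that a token needs another pass.
        self.router = nn.Linear(d_model, 1)
        self.max_recursions = max_recursions
        self.exit_threshold = exit_threshold

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        active = torch.ones(x.shape[:2], dtype=torch.bool, device=x.device)
        for _ in range(self.max_recursions):
            if not active.any():
                break  # every token has exited: no more depth needed
            # Apply the shared layer; a real implementation would gather only
            # active tokens and reuse the key-value cache across recursions.
            updated = self.shared_layer(x)
            # Only tokens routed "deeper" receive the update; exited tokens keep their state.
            x = torch.where(active.unsqueeze(-1), updated, x)
            # The router decides which tokens take another pass.
            continue_prob = torch.sigmoid(self.router(x)).squeeze(-1)
            active = active & (continue_prob > self.exit_threshold)
        return x


# Usage: trivial tokens exit early, harder ones re-enter the shared layer.
tokens = torch.randn(2, 16, 512)   # (batch, seq_len, d_model)
out = MoRBlock()(tokens)           # same shape, token-dependent effective depth
```

In this toy version all tokens still pass through the layer and are masked afterwards; the efficiency gains come from only computing the active tokens and sharing the KV cache, as described in the paper.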
Looks promising 🤗 thanks for sharing, José Manuel de la Chica
Really insightful perspective on Mixture-of-Recursions. The idea of assigning dynamic depth per token reframes efficiency: it’s no longer just about “more parameters,” but about optimizing useful compute exactly where it drives value.

From an enterprise and regulated-industry standpoint, this raises some key questions:
• How stable will routing be in critical production pipelines?
• What are the implications for model governance, where traceability and explainability are non-negotiable?

Beyond the technical breakthrough, the strategic discussion is how such architectures can scale in real deployments, particularly in financial services or edge scenarios, where inference cost and energy efficiency are decisive.
If these architectural improvements increase parallelization (and make more optimal use of hardware resources, enabling massive parallelism without massive associated costs), they will begin to enable the real game changer imho: the atomization of contexts into "smart" pipelines (dedicated threads in the model dynamically allocating pipeline agents/tasks) that can deliver quasi-deterministic outcomes with a high degree of confidence. Something like micro-services (really micro), automatically chained together along the best path, with proper context atomization so that very small but highly accurate "links" can join the chain and deliver a result.