What is Mixture-of-Recursions (MoR), and why is it different from the much-talked-about Mixture-of-Experts (MoE)?

Where MoE scales width by activating only a subset of large expert modules per token, MoR scales depth by letting each token decide how many times it should re-enter a shared Transformer block. In other words: MoE distributes computation across experts, MoR recycles computation across recursions.

The result is a model that can “think harder” only where needed: trivial tokens get shallow processing, complex ones receive more passes. This token-wise routing, combined with smart key–value caching, yields smaller models with better perplexity, higher throughput, and lower latency, with up to ~2× improvements reported in recent studies.

Unlike static deep stacks, MoR offers dynamic depth per token. Unlike MoE, it avoids ballooning parameter counts, making it especially appealing for edge deployments, enterprise inference at scale, and multimodal tasks where compute budgets are tight. Early experiments in both language and vision confirm its generality.

Open questions remain: Will the gains hold at tens of billions of parameters? How stable will routing be in real production pipelines? But strategically, MoR reframes the race: not how many parameters we can afford, but how much useful depth per token per joule.

#AIResearch #MixtureOfRecursions #AdaptiveComputation #TransformerArchitecture #AI #FutureOfIntelligence #Banking

https://lnkd.in/dS6XYjxX
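To make the core idea concrete, here is a minimal PyTorch sketch of token-wise recursion routing over a single shared layer. It assumes a simple sigmoid router and a fixed exit threshold; the class name MoRBlock, the threshold, and the router design are illustrative simplifications, not the published implementation.

```python
import torch
import torch.nn as nn


class MoRBlock(nn.Module):
    """Token-wise recursion over one shared Transformer layer.

    Illustrative sketch only: the router, exit threshold, and KV-cache
    handling are simplified assumptions, not the exact MoR formulation.
    """

    def __init__(self, d_model=512, n_heads=8, max_recursions=4, exit_threshold=0.5):
        super().__init__()
        # One shared layer reused across recursion steps: depth is recycled,
        # so the parameter count does not grow with the number of passes.
        self.shared_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Lightweight per-token router: probability that a token needs another pass.
        self.router = nn.Linear(d_model, 1)
        self.max_recursions = max_recursions
        self.exit_threshold = exit_threshold

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        active = torch.ones(x.shape[:2], dtype=torch.bool, device=x.device)
        for _ in range(self.max_recursions):
            if not active.any():
                break  # every token has exited: no more depth needed
            # Apply the shared layer; a real implementation would gather only
            # active tokens and reuse the key-value cache across recursions.
            updated = self.shared_layer(x)
            # Only tokens routed "deeper" receive the update; exited tokens keep their state.
            x = torch.where(active.unsqueeze(-1), updated, x)
            # The router decides which tokens take another pass.
            continue_prob = torch.sigmoid(self.router(x)).squeeze(-1)
            active = active & (continue_prob > self.exit_threshold)
        return x


# Usage: trivial tokens exit early, harder ones re-enter the shared layer.
tokens = torch.randn(2, 16, 512)   # (batch, seq_len, d_model)
out = MoRBlock()(tokens)           # same shape, token-dependent effective depth
```

In this toy version all tokens still pass through the layer and are masked afterwards; the efficiency gains come from only computing the active tokens and sharing the KV cache, as described in the paper.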
Looks promising 🤗 thanks for sharing, José Manuel de la Chica
Really insightful perspective on Mixture-of-Recursions. The idea of assigning dynamic depth per token reframes efficiency: it’s no longer just about “more parameters,” but about optimizing useful compute exactly where it drives value.

From an enterprise and regulated-industry standpoint, this raises some key questions:
• How stable will routing be in critical production pipelines?
• What are the implications for model governance, where traceability and explainability are non-negotiable?

Beyond the technical breakthrough, the strategic discussion is how such architectures can scale in real deployments, particularly in financial services or edge scenarios, where inference cost and energy efficiency are decisive.
If these architectural improvements increase parallelization (and make more optimal use of hardware resources, enabling massive parallelism without massive associated costs), they will begin to enable the real game changer imho: the atomization of contexts into "smart" pipelines (dedicated threads in the model dynamically allocating pipeline agents/tasks) that can deliver quasi-deterministic outcomes with a high degree of confidence. Something like micro-services (really micro), automatically chained together along the best path, with proper context atomization so that very small but highly accurate "links" can join the chain and deliver a result.