Modular

Modular · 2026-06-05T17:54:30.986Z

Inference routing has been one of the most active areas in LLM infrastructure research over the past year. Today we're publishing the final post in our three-part series on how Modular Cloud approaches it.

The core idea: instead of shipping a fixed set of routing algorithms, we built a five-stage plugin pipeline (Prepare → Filter → Score → Pick → Execute) where behaviors compose. Session-sticky plus cache-aware routing isn't a new implementation; it's five existing plugins and a config file. A new execution pattern like disaggregated prefill/decode doesn't require a new HTTP handler; it's a different Workflow and Executor on top of the same framework.

That composability is what lets us rapidly implement new routing optimizations as the field moves. And the field is moving fast: disaggregated prefill/decode, KV cache-aware placement, and multi-step execution coordination have all gone from research ideas to production requirements in under two years.

Part 3, written by Aayush Deshpande, Deep Dhillon, Alexandr Nikitin, and Michael Dunn-O'Connor, covers the decision layer in full: how the five stages work, how plugins share typed state without coupling, how the framework validates compositions at build time, and how the Selector / Workflow / Executor abstraction handles multi-pod execution. Full article: https://lnkd.in/edi5mTWy
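To make the five-stage idea concrete, here is a minimal Python sketch of how a Prepare → Filter → Score → Pick → Execute pipeline could compose session-sticky and cache-aware behavior from small plugins plus a config. It is not Modular's implementation: every name in it (Pod, Request, route, CONFIG, the plugin functions) and the scoring weights are hypothetical, and it uses a plain dict where the post describes typed shared state. Build-time validation and the Selector / Workflow / Executor layer are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    healthy: bool = True
    cached_prefix_tokens: int = 0  # prompt tokens already in this pod's KV cache


@dataclass
class Request:
    session_id: str
    prompt_tokens: int
    state: dict = field(default_factory=dict)  # scratch space shared by plugins


# Prepare: look up which pod served this session last time (if any).
def prepare_session(req, pods, sticky):
    req.state["sticky_pod"] = sticky.get(req.session_id)


# Filter: drop pods that cannot take traffic.
def filter_healthy(req, pods):
    return [p for p in pods if p.healthy]


# Score: fraction of the prompt already cached on the pod (0.0 to 1.0).
def score_cache(req, pod):
    return min(pod.cached_prefix_tokens / max(req.prompt_tokens, 1), 1.0)


# Score: bonus for the pod this session is already pinned to.
def score_sticky(req, pod):
    return 1.0 if pod.name == req.state.get("sticky_pod") else 0.0


# Pick: highest total score wins.
def pick_best(scored):
    return max(scored, key=lambda pair: pair[1])[0]


# Execute: record the placement; a real executor would forward the request.
def execute_single(req, pod, sticky):
    sticky[req.session_id] = pod.name
    return f"{req.session_id} -> {pod.name}"


# The "config file": which plugins run at each stage (all hypothetical).
CONFIG = {
    "prepare": [prepare_session],
    "filter": [filter_healthy],
    "score": [score_cache, score_sticky],
    "pick": pick_best,
    "execute": execute_single,
}


def route(req, pods, sticky, config):
    for plugin in config["prepare"]:
        plugin(req, pods, sticky)
    candidates = pods
    for plugin in config["filter"]:
        candidates = plugin(req, candidates)
    if not candidates:
        raise RuntimeError("no eligible pods")
    scored = [(p, sum(s(req, p) for s in config["score"])) for p in candidates]
    chosen = config["pick"](scored)
    return config["execute"](req, chosen, sticky)


pods = [Pod("a"), Pod("b", cached_prefix_tokens=800), Pod("c", healthy=False)]
sticky_table = {}
print(route(Request("s1", 1000), pods, sticky_table, CONFIG))
```

In this sketch, changing routing behavior means editing the lists in CONFIG rather than writing a new routing function, which is the composability property the post describes.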

Software Development

Enable AI to be used by anyone, anywhere

Discover all 373 employees

About us

The next-generation AI developer platform unifying the development and deployment of AI for the world.

Website: https://www.modular.com
Industry: Software Development
Company size: 51-200 employees
Headquarters: Everywhere
Type: Privately Held
Founded: 2022
Specialties: machinelearning, ai, software, tensorflow, pytorch, and hardware

Locations

Primary

Everywhere, US

Updates

Modular

26,774 followers
11h
Modular 26.4 is out. Today's release brings state-of-the-art Mixture-of-Experts serving to Modular Cloud, expands MAX support for the newest open-weight models, and takes another step toward Mojo 1.0. Modular Cloud now supports the latest frontier models, including MiniMax's M3, Z.ai's GLM 5.2, and Moonshot AI's Kimi 2.7. 26.4 also ships enhanced quantization and speculative decoding capabilities, extended Apple silicon GPU support, model bring-up via agent skills, and more. Dive into all the changes: https://lnkd.in/gUcWiigC

Modular: Modular 26.4: SOTA MoE Serving, Model Bringup via Agent Skills, Mojo Beta 2 and More modular.com

Modular

26,774 followers
1d
ModCon is back. August 18th. San Francisco. A full day of AI infrastructure talks, launches, and workshops, with speakers including Dylan Patel (SemiAnalysis), Jerry Liu (LlamaIndex), Paige Bailey (Google DeepMind), and Sid Sheth (d-Matrix). Limited spots. Early-bird pricing ends July 1st. Get your ticket: https://lnkd.in/gSUhTSFa

3 Comments

Modular

26,774 followers
2d
Z.ai open-sourced GLM 5.2 today, and Modular is a Day Zero launch partner. GLM 5.2 is their new flagship model, built for coding and long-horizon agentic work, with usable 1M-token context. It's available on Modular Cloud now. Serving long context well is a full-stack problem. As context grows, the KV cache grows with it, and attention and scheduling costs climb fast. The Modular stack optimizes the path from GPU kernels to serving, which lets us run frontier open models on Day 0 with the utilization and economics agent workloads need. Request access on Modular Cloud: https://lnkd.in/g7RMmiCA + learn more about GLM 5.2's benchmarks & architecture: https://z.ai/blog/glm-5.2
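As a rough illustration of why the KV cache grows with context, the sketch below computes per-sequence KV cache size for a standard attention layout, where each layer stores one key and one value vector per KV head per token. The model dimensions are hypothetical placeholders, not GLM 5.2's actual configuration, and sparse attention or cache compression schemes would change the numbers.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of the KV cache for one sequence under a standard attention layout.

    The leading factor of 2 counts one key tensor plus one value tensor per layer.
    bytes_per_value=2 assumes 16-bit storage (an assumption, not a given).
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape chosen only for illustration.
HYPOTHETICAL_MODEL = dict(num_layers=60, num_kv_heads=8, head_dim=128)

for tokens in (8_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(tokens, **HYPOTHETICAL_MODEL) / 2**30
    print(f"{tokens:>9,} tokens: {gib:.1f} GiB of KV cache per sequence")
```

The size is linear in sequence length, so a context 100 times longer needs 100 times the cache memory per sequence under these assumptions.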

3 Comments

Modular

26,774 followers
3d
Tim Davis built a retro platformer where the terrain is generated by SIMD kernel computation. Meet GPU Boost Adventure, a neon, CRT-flavored endless runner: https://boost.modular.com/ A Mojo program generates unique levels, 8 SIMD lanes at a time, that represent the thermal field of a GPU. Heat creates bigger gaps, vents, and choppier terrain (the die is throttling). Coolant adds boost pads and headroom. Two paths feed the game's fields: one precomputes the fields with SIMD, and the other compiles a Mojo kernel to WebAssembly and builds them live in your browser. Play it now at https://boost.modular.com/. Explore how it's made at https://lnkd.in/gTd8SHS2. Comment with your high score!
1 Comment

Modular

26,774 followers
6d
MiniMax released the M3 open weights today, and Modular is an official Day Zero launch partner. M3 is a hard model to serve well. The architecture combines block-sparse GQA attention with a lightweight indexer (MiniMax Sparse Attention, or MSA) and an MXFP8 mixture-of-experts. This setup is novel enough that running it efficiently requires custom kernels. MSA cuts per-token attention compute to roughly 1/20th of full attention, making a 1M-token context economical to serve, not just technically possible. MAX, our inference framework, owns the full stack from GPU kernels to serving. We built the kernels M3 needs and had it running same-day. That full-stack ownership is what lets us optimize cost, with higher batch utilization and better economics for long-running agent workloads. M3 supports text, image, and video input. Available on Modular Cloud now by request: https://lnkd.in/ek9-cYCS Read our full announcement: https://lnkd.in/eqWCh2d4

1 Comment

Modular

26,774 followers
1w
Running 400B+ parameter models for real-time patient conversations is a hard problem. Running them reliably across heterogeneous hardware as your infrastructure scales is harder. Hippocratic AI manages both. Their Polaris system orchestrates dozens of specialized models in parallel, contacting tens of thousands of patients per day, with error rates lower than human clinicians. When they evaluated MAX with NVIDIA B300 GPUs, the priorities were TTFT, tail latency, and hardware portability. MAX cleared all of them: sub-500ms mean TTFT, ~30% faster P99 end-to-end latency, and ~22% faster mean end-to-end latency at scale. Because MAX's performance comes from Mojo-native kernels rather than vendor-specific code, the same deployment extends to AMD and other accelerators without rebuilding the stack. Read the full blog post: https://lnkd.in/eZNADHwW

Modular: Hippocratic AI partners with Modular to power flexible, high-quality inference for real-time patient conversations modular.com

Modular

26,774 followers
1w
Mojo 1.0 Beta is out, and the community is already building on it. Last month: Decimo v0.10.0 with Rust benchmarks, the GNU Scientific Library ported to Mojo, a Kafka client, and more. Catch up with the latest community highlights: https://lnkd.in/e7aVqZu9

Modular: Modverse #55: Mojo 1.0 Beta, Community Mojo Libraries, and Real-Time Patient Conversations Powered by MAX modular.com

Modular

26,774 followers
1w Edited
MiniMax M3 is about to drop as open weights, and we've spent the week getting ready for it. A 1M-token context window, native multimodality, and an architecture that takes serious engineering to serve efficiently. That's the work our inference and kernel teams have been heads-down on, so that M3 runs fast and reliably the moment it's public, rather than weeks later. When the weights land in the next few days, you'll be able to run it on Modular. Congratulations to the MiniMax team on what's shaping up to be a landmark launch across state-of-the-art coding performance, agentic workflow support, and native multimodal reasoning. Read the full announcement of our day zero M3 support: https://lnkd.in/eqWCh2d4 #AI #Inference #MachineLearning #LLM #MiniMax
1 Comment

Modular

26,774 followers
1w
We just dropped a new version of Inkwell, our image gen storybook app. Explore founding stories from your favorite companies, powered by open models and Modular Cloud: https://lnkd.in/ecEYw2kv

Your favorite companies already have stories waiting for you: Stripe, OpenAI, Apple. Don't see your company's story? Type in your company's URL and watch it unfold.

Inkwell is built on Modular Cloud. Low-latency, high-quality image generation at scale. Comment or DM us to learn more about building with Modular Cloud.

A few of our favorite founder stories:
Four Researchers, One Question (Aravind Srinivas from Perplexity): https://lnkd.in/eN9CuRpV
Leaving the Nest (Reid Hoffman and Mustafa Suleyman from Inflection AI): https://lnkd.in/ebWGgWaB
The 2014 Spark (Yair Adato from Bria AI): https://lnkd.in/ef-q5Kb2
A Paper That Felt Like Magic (Victor Riparbelli from Synthesia): https://lnkd.in/epEy2qEk

If you're deploying image gen at scale, our team wants to hear what you're building: https://lnkd.in/eFnyMp3S
1 Comment

Modular

26,774 followers
1w

Modular: Why LLM Inference Needs a New Kind of Router - Part 3 modular.com

Funding

Modular 3 total rounds

Last Round

Series C Oct 24, 2025

US$ 250.0M

Investors

US Innovative Technology Fund + 3 Other investors

Employees at Modular

Denali Lumma

Fabio Riccardi

Himanshu Awasthi

Harish Patil
