Modular 26.4 is out. Today's release brings state-of-the-art Mixture-of-Experts serving to Modular Cloud, expands MAX support for the newest open-weight models, and takes another step toward Mojo 1.0. Modular Cloud now supports the latest frontier models, including MiniMax's M3, Z.ai's GLM 5.2, and Moonshot AI's Kimi 2.7. The release also ships enhanced quantization and speculative decoding capabilities, extended Apple silicon GPU support, model bring-up via agent skills, and more. Dive into all the changes: https://lnkd.in/gUcWiigC
About us
The next-generation AI developer platform unifying the development and deployment of AI for the world.
- Website: https://www.modular.com
- Industry: Software Development
- Company size: 51-200 employees
- Headquarters: Everywhere
- Type: Privately Held
- Founded: 2022
- Specialties: machinelearning, ai, software, tensorflow, pytorch, and hardware
Locations
- Primary: Everywhere, US
Updates
-
ModCon is back. August 18th. San Francisco. A full day of AI infrastructure talks, launches, and workshops, with speakers including Dylan Patel (SemiAnalysis), Jerry Liu (LlamaIndex), Paige Bailey (Google DeepMind), and Sid Sheth (d-Matrix). Limited spots. Early-bird pricing ends July 1st. Get your ticket: https://lnkd.in/gSUhTSFa
-
Z.ai open-sourced GLM 5.2 today, and Modular is a Day Zero launch partner. GLM 5.2 is their new flagship model, built for coding and long-horizon agentic work, with a usable 1M-token context. It's available on Modular Cloud now. Serving long context well is a full-stack problem. As context grows, the KV cache grows with it, and attention and scheduling costs climb fast. The Modular stack optimizes the path from GPU kernels to serving, which lets us run frontier open models on Day Zero with the utilization and economics agent workloads need. Request access on Modular Cloud: https://lnkd.in/g7RMmiCA Learn more about GLM 5.2's benchmarks and architecture: https://z.ai/blog/glm-5.2
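To make the "KV cache grows with context" point concrete, here is a rough back-of-the-envelope sketch. It assumes a plain grouped-query attention layout with an uncompressed cache, and the layer count, KV head count, head dimension, and 2-byte values are illustrative placeholders, not GLM 5.2's actual configuration.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for ONE sequence.

    Assumes standard attention that stores one key and one value vector
    per token, per layer, per KV head (hence the factor of 2). Sparse or
    compressed attention variants store less than this estimate.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model shape, chosen only to show the scaling.
SHAPE = dict(layers=60, kv_heads=8, head_dim=128)

for context in (8_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(context, **SHAPE) / 2**30
    print(f"{context:>9} tokens -> {gib:8.1f} GiB of KV cache per sequence")
```

Under these assumptions the cache grows linearly with context length, so a single long-context request can occupy memory that would otherwise hold many short ones, which is why batching and scheduling get harder as context grows.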
-
Tim Davis built a retro platformer where the terrain is generated by SIMD kernel computation. Meet GPU Boost Adventure, a neon, CRT-flavored endless runner: https://boost.modular.com/ A Mojo program generates unique levels, 8 SIMD lanes at a time, that represent the thermal field of a GPU. Heat creates bigger gaps, vents, and choppier terrain (the die is throttling). Coolant adds boost pads and headroom. Two paths feed the game's fields. One precomputes the fields with SIMD. The other compiles a Mojo kernel to WebAssembly and builds them live in your browser. Play it now at https://boost.modular.com/. Explore how it's made at https://lnkd.in/gTd8SHS2. Comment with your high score!
-
MiniMax released the M3 open weights today, and Modular is an official Day Zero launch partner. M3 is a hard model to serve well. The architecture is block-sparse GQA attention with a lightweight indexer (MiniMax Sparse Attention) and an MXFP8 mixture-of-experts. This setup is novel enough that running it efficiently requires custom kernels. MSA cuts per-token attention compute to roughly 1/20th of full attention, making a 1M-token context economical to serve, not just technically possible. MAX, our inference framework, owns the full stack from GPU kernels to serving. We built the kernels M3 needs and had it running same-day. That full stack ownership is what lets us optimize cost, with higher batch utilization and better economics for long-running agent workloads. M3 supports text, image, and video input. Available on Modular Cloud now by request: https://lnkd.in/ek9-cYCS Read our full announcement: https://lnkd.in/eqWCh2d4
-
Running 400B+ parameter models for real-time patient conversations is a hard problem. Running them reliably across heterogeneous hardware as your infrastructure scales is harder. Hippocratic AI manages both. Their Polaris system orchestrates dozens of specialized models in parallel, contacting tens of thousands of patients per day, with error rates lower than human clinicians. When they evaluated MAX with NVIDIA B300 GPUs, the priorities were TTFT, tail latency, and hardware portability. MAX cleared all of them: sub-500ms mean TTFT, ~30% faster P99 end-to-end latency, and ~22% faster mean end-to-end latency at scale. Because MAX's performance comes from Mojo-native kernels rather than vendor-specific code, the same deployment extends to AMD and other accelerators without rebuilding the stack. Read the full blog post: https://lnkd.in/eZNADHwW
-
Mojo 1.0 Beta is out, and the community is already building on it. Last month: Decimo v0.10.0 with Rust benchmarks, the GNU Scientific Library ported to Mojo, a Kafka client, and more. Catch up with the latest community highlights: https://lnkd.in/e7aVqZu9
-
MiniMax M3 is about to drop as open weights, and we've spent the week getting ready for it. A 1M-token context window, native multimodality, and an architecture that takes serious engineering to serve efficiently. That's the work our inference and kernel teams have been heads-down on, so that M3 runs fast and reliably the moment it's public, rather than weeks later. When the weights land in the next few days, you'll be able to run it on Modular. Congratulations to the MiniMax team on what's shaping up to be a landmark launch across state-of-the-art coding performance, agentic workflow support, and native multimodal reasoning. Read the full announcement of our day zero M3 support: https://lnkd.in/eqWCh2d4 #AI #Inference #MachineLearning #LLM #MiniMax
-
We just dropped a new version of Inkwell, our image gen storybook app. Explore founding stories from your favorite companies, powered by open models and Modular Cloud: https://lnkd.in/ecEYw2kv Your favorite companies already have stories waiting for you: Stripe, OpenAI, Apple. Don't see your company's story? Type in your company's URL and watch it unfold. Inkwell is built on Modular Cloud for low-latency, high-quality image generation at scale. Comment or DM us to learn more about building with Modular Cloud. A few of our favorite founder stories: "Four Researchers, One Question" (Aravind Srinivas from Perplexity): https://lnkd.in/eN9CuRpV; "Leaving the Nest" (Reid Hoffman and Mustafa Suleyman from Inflection AI): https://lnkd.in/ebWGgWaB; "The 2014 Spark" (Yair Adato from Bria AI): https://lnkd.in/ef-q5Kb2; "A Paper That Felt Like Magic" (Victor Riparbelli from Synthesia): https://lnkd.in/epEy2qEk If you're deploying image gen at scale, our team wants to hear what you're building: https://lnkd.in/eFnyMp3S
-
Inference routing has been one of the most active areas in LLM infrastructure research over the past year. Today we're publishing the final post in our three-part series on how Modular Cloud approaches it. The core idea: instead of shipping a fixed set of routing algorithms, we built a five-stage plugin pipeline (Prepare → Filter → Score → Pick → Execute) where behaviors compose. Session-sticky plus cache-aware routing isn't a new implementation; it's five existing plugins and a config file. A new execution pattern like disaggregated prefill/decode doesn't require a new HTTP handler; it's a different Workflow and Executor on top of the same framework. That composability is what lets us rapidly implement new routing optimizations as the field moves. And the field is moving fast: disaggregated prefill/decode, KV cache-aware placement, and multi-step execution coordination have all gone from research ideas to production requirements in under two years. Part 3, written by Aayush Deshpande, Deep Dhillon, Alexandr Nikitin, and Michael Dunn-O'Connor, covers the decision layer in full: how the five stages work, how plugins share typed state without coupling, how the framework validates compositions at build time, and how the Selector / Workflow / Executor abstraction handles multi-pod execution. Full article: https://lnkd.in/edi5mTWy
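A toy sketch of what a staged, composable routing pipeline like the one described above could look like. This is not Modular Cloud's implementation: every name here (`Pod`, `Ctx`, `CONFIG`, the plugin functions) is hypothetical, the "cache" is just a set of prompt prefixes, and build-time validation and multi-pod execution are left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Pod:
    name: str
    healthy: bool = True
    cached_prefixes: set = field(default_factory=set)


@dataclass
class Ctx:
    """Per-request state that plugins read and write (hypothetical)."""
    prompt: str
    session_id: str
    prefix: str = ""
    candidates: list = field(default_factory=list)
    scores: dict = field(default_factory=dict)
    chosen: Optional[Pod] = None


SESSIONS = {}  # session_id -> name of the pod that last served it


def prepare_prefix(ctx, pods):
    ctx.prefix = ctx.prompt[:8]  # stand-in for a real prefix hash
    ctx.candidates = list(pods)


def filter_healthy(ctx, pods):
    ctx.candidates = [p for p in ctx.candidates if p.healthy]


def score_cache_hit(ctx, pods):
    for p in ctx.candidates:
        if ctx.prefix in p.cached_prefixes:
            ctx.scores[p.name] = ctx.scores.get(p.name, 0.0) + 1.0


def score_sticky_session(ctx, pods):
    for p in ctx.candidates:
        if SESSIONS.get(ctx.session_id) == p.name:
            ctx.scores[p.name] = ctx.scores.get(p.name, 0.0) + 2.0


def pick_highest(ctx, pods):
    if not ctx.candidates:
        raise RuntimeError("no eligible pods")
    # Ties go to the first candidate in list order.
    ctx.chosen = max(ctx.candidates, key=lambda p: ctx.scores.get(p.name, 0.0))


def execute_on_pod(ctx, pods):
    # A real executor would forward the request; here we only record state.
    SESSIONS[ctx.session_id] = ctx.chosen.name
    ctx.chosen.cached_prefixes.add(ctx.prefix)


STAGES = ("prepare", "filter", "score", "pick", "execute")

# The "config file": behavior is chosen by listing plugins per stage.
CONFIG = {
    "prepare": [prepare_prefix],
    "filter": [filter_healthy],
    "score": [score_cache_hit, score_sticky_session],
    "pick": [pick_highest],
    "execute": [execute_on_pod],
}


def route(prompt, session_id, pods, config=CONFIG):
    ctx = Ctx(prompt=prompt, session_id=session_id)
    for stage in STAGES:
        for plugin in config[stage]:
            plugin(ctx, pods)
    return ctx
```

In this sketch, combining session-sticky and cache-aware routing is just a matter of listing both scoring plugins under `"score"`; dropping one from the config changes the behavior without touching `route` or any other plugin.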