Maxime Bonnesoeur’s Post

4mo

𝗟𝗟𝗠 𝗚𝘂𝗮𝗿𝗱, 𝗮 𝗾𝘂𝗶𝗰𝗸 𝘄𝗮𝘆 𝘁𝗼 𝗿𝗲𝗺𝗼𝘃𝗲 𝗣𝗜𝗜 𝗮𝗻𝗱 𝗽𝗼𝘁𝗲𝗻𝘁𝗶𝗮𝗹𝗹𝘆 𝗵𝗮𝗿𝗺𝗳𝘂𝗹 𝗽𝗿𝗼𝗺𝗽𝘁𝘀 𝗪𝗵𝗮𝘁 𝗽𝗿𝗼𝗯𝗹𝗲𝗺 𝗱𝗼𝗲𝘀 𝗶𝘁 𝘀𝗼𝗹𝘃𝗲 ? • Blocks prompt injection before the request hits the model. • Scrubs output for PII, secrets, toxicity, bias. • Gives structured JSON verdicts (𝘢𝘭𝘭𝘰𝘸, 𝘴𝘢𝘯𝘪𝘵𝘪𝘻𝘦, 𝘣𝘭𝘰𝘤𝘬). 𝗪𝗵𝘆 𝗜 𝗹𝗶𝗸𝗲 𝗶𝘁 😇 1️⃣ Uses a stack of small, task‑specific BERT classifiers → runs on commodity CPUs, or GPUs if you have them. 2️⃣ Benchmarks show ~200 ms average latency on an AWS m5.xlarge CPU and single‑digit milliseconds on a G5 GPU with ONNX. 3️⃣ Specialized prompt‑injection model scores 𝟬.𝟵𝟳 𝗙𝟭 vs 0.81 for a compact LLM baseline -> understand better perf yet fast inferences ). 4️⃣ Open‑source and pip‑installable; the maintainers push frequent version bumps and new scanners, so we can just `pip install -U` in CI . ---- 𝗦𝗼𝘂𝗿𝗰𝗲𝘀 • LLM Guard docs – Index page (protectai.github.io (https://lnkd.in/eBXvWGxq)) • LLM Guard docs – Prompt Injection benchmarks (protectai.github.io (https://lnkd.in/e-j8eR67)) • Protect AI blog – 𝘚𝘱𝘦𝘤𝘪𝘢𝘭𝘪𝘻𝘦𝘥 𝘔𝘰𝘥𝘦𝘭𝘴 𝘉𝘦𝘢𝘵 𝘚𝘪𝘯𝘨𝘭𝘦 𝘓𝘓𝘔𝘴 𝘧𝘰𝘳 𝘈𝘐 𝘚𝘦𝘤𝘶𝘳𝘪𝘵𝘺 (protectai.com (https://lnkd.in/eSBuVMVc)) • GitHub – protectai/llm‑guard (github.com (https://lnkd.in/eFKtYVwR)) Have you already wrapped your LLMs with guardrails? What gaps are you still seeing/framework do you recommend? #LLMSecurity #GenAI #AIEngineering #MLOps #OpenSource

6 Comments

Patrick Fleith

4mo

It makes a lot of sense to use "small" classifiers. Text classification will never die.

2 Reactions

Shoumik Gandre

3mo

Love this! Your post is my first look at the project, and it immediately brought me back to a past role where I worked with similar HAP filters. Input filtering is always the smooth sailing part. However, having a streaming output scanner that is good at flagging unwanted outputs is challenging. Excited to see how this package pushes the capabilities further. Are there any real-world use cases you're most proud of so far?

1 Reaction

Albert L Groenendijk

4mo

How is multilingual performance?

1 Reaction

See more comments

To view or add a comment, sign in

More Relevant Posts

Jesse Williams
2mo
Report this post
Just read a deep dive into building a production-grade ML inference platform on Intel hardware (by Kalin Daskalov), the exact the type of real-world implementation that shows how practitioners are solving actual MLOps challenges. The workflow he's built solid: 📦 Model Packaging with KitOps: Instead of struggling with multi-GB models in Git or mounting host paths, he packages models as OCI artifacts (ModelKits) and pushes them to Harbor 🎯 Storage Strategy: MinIO for blob storage backing Harbor, all pinned to the GPU node to minimize cross-node traffic, solving the "big model, slow network" problem ⚙️ GPU Management via DRA: Using Kubernetes Dynamic Resource Allocation with Intel resource drivers for proper GPU sharing, no more fighting with device plugins 🚀 Automated Deployment: Models packaged with kit pack and kit push commands, then served via vLLM with KitOps as init containers that unpack models before serving Key technical wins: - Power-efficient Intel N100 nodes (6W TDP!) running k3s 24/7 - Distributed storage with Longhorn for persistence - Complete GitOps workflow with Flux pulling from Harbor - OpenWebUI frontend → vLLM serving → KitOps model delivery By treating models as first-class OCI artifacts with KitOps, he achieves atomic versioning, efficient storage through deduplication, and seamless Kubernetes integration. Full post with architecture diagrams and code: https://lnkd.in/eQrSvz2T #MLOps #KitOps #Kubernetes #IntelAI #CNCF #ModelOps

Lab Notes: Intel AI Inference Platform MVP blog.sonda.red
Like Comment
To view or add a comment, sign in
Stonebit s.r.l.

80 followers
2mo
Report this post
Encryption is not what it seems. Data is safe at rest, safe in transit, but never safe in use. CPUs always see the truth. Every time you run a program, your data gets decrypted. Even the best secure enclaves cannot change this. The processor must see the real numbers to do math. That is the weak spot. But math has a wild answer. Fully Homomorphic Encryption (FHE) lets you compute on encrypted data. You can add, multiply, and process numbers without ever seeing the real thing. The math stays hidden. The answers come out right. This idea is not new. In 2009, FHE was a dream that crawled. Each step took forever. One simple operation took half an hour. That is a trillion times slower than normal. But the world did not stop. In 2025, FHE is a million times faster. It is still not instant, but it is real. You can try it today. Google’s HEIR project lets you play with FHE in Python. You can see the future with your own hands. The race is on. Zama, Cornami, and Belfort Labs are building the next wave. They use software, GPUs, FPGAs, and even custom chips. Each group is fighting to make FHE fast enough for the real world. This is not just about speed. It is about trust. It is about a world where your data stays private, even when it is being used. No more leaks. No more weak spots. Only math, done in the dark. The old rules are breaking. The future is being built now. The next leap in privacy is here.
Like Comment
To view or add a comment, sign in
Peter Ansah
1mo
Report this post
🚀 Blog Friday Usually, when we talk about program performance, we talk about Data Structures and Algorithms. "O(n) is better than O(n²)" What if I tell you that's not the full story? Modern CPUs are incredibly fast, but without understanding the execution model of programs, we leave a lot of gains on the table. In my latest post, I wrote about how accessing your program's data can impact performance. If you've ever wondered why some data structures are cache-friendly while others are not, this one for you. 👉 Read here: https://lnkd.in/eBG5QfnU #Performance #Programming #ComputerArchitecture #Coding #Technology #SoftwareEngineering

Wait Kills | pansah pansah.com

1 Comment
Like Comment
To view or add a comment, sign in
Ravi Siliveru
2mo
Report this post
🚀 Meet oLLM: Enabling 100K-Token Contexts on Just an 8 GB GPU Amid all the buzz about quantization, pruning, and multi-GPU setups, here’s a refreshing design pivot: oLLM, a lightweight Python library, pushes large-context LLM inference into the “affordable GPU” realm using SSD offload — no quantization required. 🎯 Why oLLM is a Game Changer Ultra-long context w/o VRAM blowup It streams layer weights and attention key/value caches from SSD to GPU, keeping VRAM within 8–10 GB even for 100K token contexts. No need for quantization compromises oLLM retains FP16/BF16 precision, trading off speed for accessibility rather than accuracy loss. Supports big models on modest hardware Examples include Llama-3 (1B / 3B / 8B), GPT-OSS 20B, Qwen3-Next-80B (sparse MoE). Transparent trade-offs = control You’ll pay in SSD I/O and latency. Throughput is modest (e.g. ~0.5 tokens/sec on Qwen-80B with 50K context), but for offline, large-document tasks (summarization, logs analysis, compliance) it’s extremely compelling. 🛠️ How It Works (in a Nutshell) Memory offload to SSD Weights and KV cache bypass host RAM and are stored on fast NVMe SSD, streamed or paged as needed. FlashAttention-2 & chunked MLPs These speedups reduce peak memory demands so the full attention matrix never fully materializes. DiskCache with efficient I/O Uses GPUDirect and fast I/O frameworks (e.g. KvikIO / cuFile) to reduce I/O latency overhead. ✅ Use Cases Where oLLM Shines 📄 Large-document summarization & compliance review Process books, legal documents, logs — whole context, not split windows. 🔎 Knowledge retrieval / RAG over huge corpora Full passage context rather than batching windows. 📊 Offline analytics / batch inference Where throughput is less critical than context fidelity. 💡 ML research & experimentation For anyone wanting to push context limits without needing server-grade hardware. 🔍 Key Limitations & Trade-Offs (Be Realistic) Latency & throughput are not production-grade Heavy I/O overhead means it's not ideal for real-time services. Storage load is heavy Long contexts demand massive SSD space (e.g., 100K tokens → tens to hundreds of GB of KV). Hardware dependencies You’ll need fast NVMe SSDs and GPUs with good I/O paths. Not a drop-in replacement For ultra-high throughput or low-latency production use, conventional multi-GPU pipelines still dominate. 📌 TL;DR (for the scroll-haters) 🚀 oLLM lets you run massive context LLMs offline on a single 8 GB GPU by offloading memory to SSD — no quantization needed. Great for deep dives, long content tasks, and pushing what’s possible on modest hardware. If you’re tinkering with long-context applications and don’t have access to server farms, this is one of the most interesting libraries to explore right now.
Like Comment
To view or add a comment, sign in
Ravichandran Paramasivam
2mo
Report this post
The GPU Memory Ladder (fast → slow) - What it is, how fast it is, and how to use it. GPUs run thousands of threads. Performance is won or lost by how many times you touch off-chip memory and how much you can reuse on-chip. Here's the memory ladder you should design for. 1) Registers: per thread, on-chip (fastest) What: Every thread's private variables live here. Typical size: Big pool per SM (e.g., ~64–256 KB 32-bit regs/SM), divided across threads. Latency: ~1–2 cycles (effectively "free"). Gotchas: If you use too many, the compiler spills to "local memory" (which is actually global VRAM) → hundreds of cycles. Tips: Keep register count reasonable, reuse values, and unroll just enough to add ILP without causing spills. 2) Shared Memory: per block, on-chip, programmer-managed SRAM What: A tiny scratchpad you explicitly load/store. Typical size: Tens to low-hundreds of KB per SM (e.g., ~64–228 KB configurable with L1 on recent GPUs). Latency: ~10–30 cycles when conflict-free (very fast and predictable). Gotchas: Bank conflicts serialize accesses (easy 2–8× slowdowns). Exclusive to a block (not visible to other blocks). Tips: Tile hot data into shared memory, add padding/skews to avoid bank conflicts, and double-buffer with async copies to overlap memory and math. 3) L1 Cache: per SM, hardware-managed What: Small cache in front of each SM's load/store units (often unified with texture/L1/shared pool). Typical size: Tens to low-hundreds of KB per SM (configurable with shared memory). Latency: ~20–60 cycles on hits. Gotchas: Not coherent across SMs, the L2 is the first point of global coherence. Rely on L1 only for your SM’s locality. Tips: Feed it coalesced loads/stores (threads t read addresses base + t) to form wide bursts; avoid random per-thread strides. 4) L2 Cache: chip-wide, coherence point What: Large shared cache for all SMs. global atomics/visibility resolve here. Typical size: Multi-MB to tens of MB (e.g., ~10–50+ MB on recent datacenter GPUs). Latency: ~100–200+ cycles on hits (architecture-dependent). Gotchas: If your working set thrashes L2, you'll pound HBM. Cross-SM communication must pass through L2 (use atomics/fences). Tips: Keep hot weights/params in L2 by reusing them across kernels; some platforms offer persisting L2 hints/windows for repeatedly reused data. 5) Global Memory = VRAM/HBM, off-chip DRAM (huge bandwidth, high latency) What: Main device memory (GDDR/HBM). Enormous bandwidth but far away. Typical bandwidth: Hundreds of GB/s up to multiple TB/s (HBM). Latency: Hundreds of cycles (commonly ~300–800+), easily the dominant stall. Gotchas: Small, scattered, or uncoalesced accesses turn TB/s into a trickle. Spills and "local memory" live here too. Tips: Coalesce aggressively, tile into shared memory, prefetch with async global→shared copies, and fuse kernels so data stays on-chip longer. 6) Host Memory (CPU DRAM via PCIe/NVLink): Latency: microseconds. Tips: fuse kernels so data stays on-chip longer.
6 Comments
Like Comment
To view or add a comment, sign in
Samriddhi Bhatnagar
1mo
Report this post
Excited to share a new blog post written by my colleague, Umesh chaudhary, on serving open-source LLMs using a multi-GPU host. The article details the hardware prerequisites, software installation (including the NVIDIA toolkit), and the process for deploying and testing the models. A valuable read for MLOps and Data Engineering professionals. Find the full article on Medium: https://lnkd.in/dFWNcH6a #LLM #OpenSourceAI #GPUComputing #MLOps #DataEngineering #Tech

Serving the Open Source LLMs On Multi-GPU Host Setup medium.com
Like Comment
To view or add a comment, sign in
Shivank Goel
2mo
Report this post
LLM inference infra (https://lnkd.in/gNDaQrzg) is evolving fast. What used to be “just scale GPUs” has become a game of routing, caching, and memory transfers at cluster scale. 1/ Routing A Radix Tree (compact prefix tree) is a Trie variant that merges single-child paths, reducing memory usage significantly compared to standard Tries. (Good explainer here: https://lnkd.in/ggxeGhBA). Radix Trees are now used to compute prefix-based KV cache hit rates. This enables KV-aware routing: sending a request to the worker most likely to already hold the right cache blocks, while still balancing load to avoid hotspots. Not every query needs a dedicated prefill step—if the decoder has a high prefix hit probability, the prompt is short, or prefill workers are saturated, it can be more efficient to prefill locally inside the decode worker. 2/ Independent Scaling Scaling is no longer just about QPS. A surge in long input sequences stresses prefills more than decoders, so the system must scale each tier independently. That only works if both Prefill and Decode workers are stateless with respect to KV ownership. Each decoder, on startup, allocates its KV blocks (reserved chunk of GPU / CPU memory that the worker sets aside exclusively for storing those KV tensors) and publishes their descriptors into a distributed store (e.g., etcd). Prefill workers lazily fetch these descriptors when they first interact with a decoder, cache them locally, and thereafter exchange only compact block IDs during handoff. This design makes both tiers elastic: you can add or remove prefill/decoder workers at runtime without disrupting active sessions. 3/ Caching Inference systems now look like CPUs in miniature, juggling a memory hierarchy: GPU caches (fast but small), CPU memory (larger but slower), and cross-device links (PCIe, NVLink, RDMA). Think of this as L1/L2/L3 caching logic, but applied to distributed GPU memory pools. 4/ Pub/Sub and Direct Reads Optimizations focus on reducing redundant transfers and speeding up prefill→decode KV cache handoffs. Communication between prefills and decoders increasingly relies on in-memory pub/sub for request coordination. But the heavy data movement is handled via RDMA (Remote Direct Memory Access). With RDMA, a prefill worker can directly read KV blocks from a decode worker’s GPU memory (or vice versa) without involving the remote CPU. That turns what would have been an RPC + copy into a zero-copy memory operation over the network fabric, cutting latency and freeing CPUs to do… nothing at all. The next generation of infra is as much about clever systems design as it is about FLOPs!!!
Like Comment
To view or add a comment, sign in
ScitiX

61 followers
1mo
Report this post
⚠️ 𝐓𝐡𝐞 𝐬𝐢𝐥𝐞𝐧𝐭 𝐤𝐢𝐥𝐥𝐞𝐫𝐬 𝐨𝐟 𝐆𝐏𝐔 𝐩𝐞𝐫𝐟𝐨𝐫𝐦𝐚𝐧𝐜𝐞. Ever seen utilization stuck at 40% even when training runs smoothly? It’s almost never the GPU’s fault — the real killer hides in your data pipeline. Here’s where performance silently disappears 👇 1️⃣ DataLoader: Slow I/O, tiny prefetch buffer, or too few workers. 2️⃣ Storage: Reading from slow network drives or millions of tiny files. 3️⃣ Parsing: CPU wasted unpacking large JSONs or complex data. 4️⃣ CPU→GPU Copy: The jam when tensors move into device memory. 5️⃣ Logging/Saving: Blocking writes that stall the whole loop. ⚡ Stop letting your hardware wait. Feed your GPU: 1️⃣ Prefetch and overlap I/O with compute 2️⃣ Use parallel DataLoader workers 3️⃣ Profile before optimizing Find us → ScitiX We share daily insights on #AIInfrastructure — turning GPU utilization into real throughput, from clusters to inference. #GPU #AIInfra #LLM #MLOps #PerformanceEngineering #ComputeStack #Scitix
Like Comment
To view or add a comment, sign in
Seif Bassem
1mo
Report this post
🚀 Self-Hosting LLMs on Kubernetes - Part 2: How LLMs and GPUs Work Before you can optimize LLM inference on Kubernetes, you need to understand what's actually happening under the hood. In Part 2 of my series, I break down the fundamentals: 🔹 How LLMs store knowledge in billions of parameters (weights) 🔹 Why GPUs outperform CPUs for inference 🔹 The token generation loop: Prefill vs. Decode stages 🔹 What the KV Cache is and why it's critical for performance 🔹 How matrix multiplication powers every prediction This isn't just theory, understanding these concepts even in a high-level is essential when you're making infrastructure decisions about: GPU selection and sizing, scaling strategies and cost optimization. If you're planning to self-host LLMs or just want to understand what happens when you hit "send" on a prompt, this post bridges the gap between AI concepts and infrastructure reality. 🔗 Read Part 2: https://lnkd.in/d4k_5rmf #AI #Kubernetes #LLM #GPUs #AIInfrastructure #MachineLearning #DevOps #AzureAI #vLLM

Self-Hosting LLMs on Kubernetes: How LLMs and GPUs Work? seifbassem.com
Like Comment
To view or add a comment, sign in