Optimizing ML Data Access with Alluxio
Preprocessing, Pretraining, & Inference at
Scale
Bin Fan
Founding Engineer, VP of Technology @ Alluxio
March 6, 2025
About Me
Bin Fan
○ Founding Engineer, VP of Technology @ Alluxio
○ Email: binfan@alluxio.com
○ LinkedIn: https://www.linkedin.com/in/bin-fan/
○ Previously worked at Google
○ PhD in CS at Carnegie Mellon University
Powered by Alluxio
[Customer logo wall spanning Telco & Media, E-commerce, Financial Services, Tech & Internet, and others, e.g. Zhihu]
Alluxio Data Platform
Accelerate data-intensive AI & Analytics workloads
Pretraining
DeepSeek: Redefining Open-Source LLMs
● Performance on par with SOTA models like GPT-4, at a fraction of the cost
● Disrupting the competitive landscape
○ Expanding accessibility to much broader audiences
○ Raising the bar for upcoming general-purpose LLMs
○ Opening more possibilities for LLMs with private-domain adaptation
● A key lesson: great LLMs can be built by small teams with extremely efficient resource utilization
Engineering/Resource Efficiency in Pre-training
[Diagram: training clusters in us-east-1 and us-west-1 each read through a Distributed Cache (Alluxio) in front of a shared Data Lake (all data); only hot data is cached, and data is retrieved on demand.]
● High and consistent I/O performance
→ I/O performance comparable to HPC storage
● Cloud agnostic
→ Easy to extend the production environment to multi-region / multi-cloud
● Transparent cache management
→ Avoids repeatedly preparing the same data, and the overhead of maintaining local storage
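The "only retrieve data on demand" behavior above amounts to a read-through cache. A minimal Python sketch of that pattern; the dict-backed `data_lake` is a hypothetical stand-in for remote storage such as S3, not an Alluxio API:

```python
# Read-through cache sketch: the first read of an object fetches it from
# the remote data lake; subsequent reads are served from the hot cache.

class ReadThroughCache:
    def __init__(self, data_lake):
        self.data_lake = data_lake   # backing store: path -> bytes
        self.cache = {}              # hot data cached locally
        self.misses = 0

    def read(self, path):
        if path not in self.cache:   # cold read: go to the data lake
            self.misses += 1
            self.cache[path] = self.data_lake[path]
        return self.cache[path]      # hot read: served from the cache

lake = {"s3://bucket/shard-0": b"training data"}
cache = ReadThroughCache(lake)
cache.read("s3://bucket/shard-0")    # miss: fetched from the lake
cache.read("s3://bucket/shard-0")    # hit: served locally
# cache.misses == 1
```

This is the property the bullets describe: repeated epochs over the same shards hit the cache, so the same data is never prepared twice.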
Inference
LLM Inference: Two Key Metrics
Throughput (System Perspective)
● Measured in tokens/sec
● Higher throughput → better resource utilization, lower system cost
Time to First Token (User Perspective)
● Measures the time from request submission to the first generated token
● < 100 ms → smooth user experience
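Both metrics can be measured from any streaming token iterator. A minimal Python sketch; `fake_generator` is a hypothetical stand-in for a real streaming LLM endpoint:

```python
import time

def measure_stream(token_iter):
    """Measure time-to-first-token (TTFT) and overall tokens/sec
    over any iterable that yields generated tokens."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_iter:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
        count += 1
    total = time.perf_counter() - start
    return ttft, (count / total if total > 0 else 0.0)

def fake_generator(n=50):
    # Stand-in for a streaming endpoint; yields tokens with a tiny delay.
    for i in range(n):
        time.sleep(0.001)
        yield f"tok{i}"

ttft, tps = measure_stream(fake_generator())
# ttft: seconds until the first token; tps: tokens/sec over the stream
```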
GPU Memory Capacity: The Primary Bottleneck
● VRAM is needed for both model weights and the KV cache
● Example: inference with a typical 13B model on an A100
● GPT-3 (175B) requires 350 GB of GPU RAM just to load model weights (175B params × 2 bytes in fp16)
● Longer context windows require a larger KV cache
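To make the KV-cache pressure concrete, here is a back-of-the-envelope sizing sketch. The 13B-class model shape used below (40 layers, hidden size 5120) is an assumption modeled on Llama-13B, not a figure from the slides:

```python
def kv_cache_bytes(n_layers, hidden_size, seq_len, bytes_per_elem=2):
    """Approximate KV-cache size for one sequence: K and V each store
    hidden_size elements per layer per token (bytes_per_elem=2 for fp16)."""
    per_token = 2 * n_layers * hidden_size * bytes_per_elem  # K + V
    return per_token * seq_len

# Assumed 13B-class shape (40 layers, hidden 5120), fp16, 4K context:
size = kv_cache_bytes(n_layers=40, hidden_size=5120, seq_len=4096)
print(size / 2**30)  # → 3.125 (GiB for a single sequence)
```

At ~3 GiB per 4K-context sequence, a modest batch of concurrent requests already consumes a large fraction of an A100's VRAM on top of the weights, which is why the KV cache, not compute, is often the bottleneck.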
KV Cache Offloading
● A critical optimization for speeding up Transformer inference
○ Significantly speeds up text generation by reusing previous context instead of recomputing attention for all tokens at each step
○ Example KV cache systems:
■ LMCache (vLLM Production Stack), Mooncake, etc.
● Experimenting with Alluxio as a tiered KV cache
○ Talk to me if you are interested in this
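The speedup from KV caching comes from turning quadratic K/V recomputation into linear work. A toy counting sketch of that argument (not a real attention implementation):

```python
def decode_steps(num_tokens, use_cache):
    """Count K/V computations needed to autoregressively generate
    num_tokens tokens, with and without a KV cache."""
    kv_computations = 0
    for t in range(1, num_tokens + 1):
        if use_cache:
            kv_computations += 1  # compute K/V for the newest token only
        else:
            kv_computations += t  # recompute K/V for all t tokens so far
    return kv_computations

# Generating 1024 tokens:
# without a cache: 1 + 2 + ... + 1024 = 524,800 K/V computations
# with a cache:    1024
```

Offloading moves this cache from scarce VRAM to cheaper tiers (CPU RAM, NVMe, remote storage) so long contexts and many concurrent sessions still get the reuse benefit.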
DeepSeek 3FS
DeepSeek 3FS: High-Performance Parallel Filesystem
● Newly open-sourced parallel filesystem by DeepSeek
○ Purpose-built for RDMA + NVMe hardware
○ Scalable metadata powered by FoundationDB
○ Achieves 40 GB/s per-node throughput (8 TB/s with 180 nodes)
● Optimized for high-throughput workloads
○ Focused on large-file read/write performance (not general-purpose use)
○ Recommends the FFRecord format for efficient small-file aggregation
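The small-file aggregation idea can be sketched as packing many small files into one blob plus an offset index, so many small random reads become one large sequential read. This is a minimal illustration of the concept, not the actual FFRecord format:

```python
def pack(files):
    """Pack {name: bytes} into one blob plus an offset index."""
    blob, index, offset = b"", {}, 0
    for name, data in files.items():
        index[name] = (offset, len(data))  # where this file lives in the blob
        blob += data
        offset += len(data)
    return blob, index

def read_one(blob, index, name):
    """Random-access a single small file out of the packed blob."""
    offset, length = index[name]
    return blob[offset:offset + length]

files = {"a.json": b'{"x":1}', "b.json": b'{"y":2}'}
blob, index = pack(files)
# read_one(blob, index, "b.json") == b'{"y":2}'
```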
Complementary Technologies
● 3FS: modern parallel filesystem (similar to GPFS, Lustre)
○ Optimized for I/O-intensive workloads with RDMA + NVMe
● Alluxio: distributed caching & access layer
○ Bridges compute and data lakes, accelerating I/O-heavy workloads
○ Achieves RDMA-comparable read speeds with intelligent caching
○ Provides namespace abstraction & indirection for S3, HDFS, GCP, and more → cloud-agnostic I/O
● Alluxio can integrate with 3FS, just as it does with S3 or HDFS
○ Enables high/mid/low tiered I/O solutions, letting applications balance performance and cost
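A hedged sketch of the tiered-I/O idea: probe the fastest tier first and fall back toward the data lake. The `Tier` class and names below are hypothetical stand-ins for illustration, not Alluxio or 3FS APIs:

```python
class Tier:
    """Hypothetical storage tier backed by a dict (a stand-in for
    3FS / Alluxio cache / S3 in a real deployment)."""
    def __init__(self, name, store):
        self.name, self.store = name, store
    def get(self, path):
        return self.store.get(path)  # None on miss

def tiered_read(path, tiers):
    # Probe tiers fastest-first; return the data and the tier that served it.
    for tier in tiers:
        data = tier.get(path)
        if data is not None:
            return data, tier.name
    raise FileNotFoundError(path)

tiers = [
    Tier("3fs", {}),                       # high tier: empty → miss
    Tier("alluxio", {"model.bin": b"w"}),  # mid tier: hit
    Tier("s3", {"model.bin": b"w"}),       # low tier: data lake of record
]
# tiered_read("model.bin", tiers) -> (b"w", "alluxio")
```

The design choice this illustrates: applications see one read path, while placement across tiers (and thus the performance/cost trade-off) is decided underneath.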
