-
Notifications
You must be signed in to change notification settings - Fork 0
UPSTREAM PR #17580: Add safetensors support #351
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
UPSTREAM PR #17580: Add safetensors support #351
Conversation
|
Explore the complete analysis inside the Version Insights Performance Analysis SummaryPR #351: Safetensors Support Implementation This PR introduces 1,663 lines of new code across 11 files to add safetensors format support. The implementation is incomplete and non-functional, with all model loading functions returning "not yet implemented" errors. No existing code paths are modified, resulting in zero performance impact on current operations. Key FindingsPerformance-Critical Areas Impact: The changes do not affect any performance-critical functions identified in the project summary. Core inference functions (llama_decode, llama_encode, llama_tokenize) remain unmodified. Model loading functions (llama_model_load_from_file, llama_init_from_model) are unchanged. Memory management (llama_memory_clear, llama_kv_cache operations) and batch processing (llama_batch_init, llama_decode) show no modifications. Tokens Per Second Impact: No impact on inference throughput. The tokenization and inference pipeline remains untouched. Functions responsible for token processing (llama_tokenize, llama_detokenize, llama_decode, llama_encode) show no changes in response time or throughput. The reference benchmark (ollama://smollm:135m on 12th Gen Intel i7-1255U) would maintain current tokens per second performance. Power Consumption Analysis: Analysis shows a 10.90% increase in estimated power consumption for build.bin.libllama.so (214,109 nJ vs 193,066 nJ baseline, +21,043 nJ absolute change). This increase is attributed to STL container operations showing throughput regressions:
Other binaries show minimal changes: llama-tts (+0.07%), llama-gguf-split (+0.03%), llama-quantize (+0.02%), with llama-run (-0.10%) and llama-cvector-generator (-0.08%) showing slight improvements. Code Implementation Analysis: The PR adds infrastructure for parsing safetensors files (llama-safetensors.cpp, 398 lines), HuggingFace config parsing (llama-hf-config.cpp, 220 lines), type conversion utilities (llama-safetensors-types.cpp, 157 lines), and tensor name mapping (llama-safetensors-loader.cpp, 271 lines). The model builder (llama-model-from-safetensors.cpp, 218 lines) defines an 8-step loading pipeline but implements only steps 1-3. Steps 4-8 (create_model_structure, allocate_tensors, load_tensor_data, init_vocabulary, finalize_model) return false with error messages. The implementation uses C-style FILE* operations for file I/O, nlohmann/json for parsing, and std::regex for tensor name mapping. Type conversion functions support F32, F16, BF16, I32, I16, I8 formats with element-wise loops for conversions. |
5efc8b7 to
f077805
Compare
So we can load these natively just like gguf Signed-off-by: Eric Curtin <[email protected]>
ff29a86 to
a963646
Compare
|
Explore the complete analysis inside the Version Insights Performance Analysis Summary - PR #351: Safetensors SupportOverviewPR #351 introduces safetensors model loading capability through 11 new files (2296 lines). This is an additive feature with no modifications to existing inference paths. The performance analysis reveals no impact on runtime inference performance, as the new code affects only the model loading phase. Key FindingsInference Performance ImpactNo impact on tokens per second. The safetensors loading path does not modify any inference-critical functions:
All inference operations use the same GGML backend and tensor structures regardless of whether the model was loaded from GGUF or safetensors format. Once loaded, model execution is identical. Model Loading PerformanceThe new safetensors loader exhibits different characteristics compared to GGUF: Type Conversion Operations:
For a 7B parameter model, type conversion adds approximately 10000-30000 ms to load time. This is a one-time cost during model initialization and does not affect subsequent inference. Tensor Name Mapping:
File I/O Pattern:
Power Consumption AnalysisBinary-level impact:
The safetensors loading functions are not included in the power consumption baseline as they represent new, optional code paths. When active, the loading phase will consume additional CPU cycles for type conversion and file I/O, but this is transient and does not affect steady-state inference power consumption. Implementation StatusIncomplete vocabulary loading: The Performance-Critical AreasModel Loading Module:
Memory Management Module:
Token Processing Module:
The implementation is architecturally sound as an isolated feature addition. The performance characteristics differ from GGUF loading but do not regress existing functionality. Inference performance remains unchanged as the new code operates exclusively in the model initialization phase. |
e4a4e1d to
d0b408b
Compare
Mirrored from ggml-org/llama.cpp#17580
So we can load these natively just like gguf