
Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17580

So we can load these natively just like gguf

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary

PR #351: Safetensors Support Implementation

This PR introduces 1,663 lines of new code across 11 files to add safetensors format support. The implementation is incomplete and non-functional, with all model loading functions returning "not yet implemented" errors. No existing code paths are modified, resulting in zero performance impact on current operations.

Key Findings

Performance-Critical Areas Impact:

The changes do not affect any performance-critical functions identified in the project summary. Core inference functions (llama_decode, llama_encode, llama_tokenize) remain unmodified. Model loading functions (llama_model_load_from_file, llama_init_from_model) are unchanged. Memory management (llama_memory_clear, llama_kv_cache operations) and batch processing (llama_batch_init, llama_decode) show no modifications.

Tokens Per Second Impact:

No impact on inference throughput. The tokenization and inference pipeline remains untouched. Functions responsible for token processing (llama_tokenize, llama_detokenize, llama_decode, llama_encode) show no changes in response time or throughput. The reference benchmark (ollama://smollm:135m on 12th Gen Intel i7-1255U) would maintain current tokens per second performance.

Power Consumption Analysis:

Analysis shows a 10.90% increase in estimated power consumption for build.bin.libllama.so (214,109 nJ vs 193,066 nJ baseline, +21,043 nJ absolute change). This increase is attributed to STL container operations showing throughput regressions:

  • std::vector<size_t>::empty(): +134 ns
  • std::back_inserter<std::vector>: +24 ns
  • std::vector::back(): +29 ns
  • std::vector<llm_symbol>::end(): +24 ns

Other binaries show minimal changes: llama-tts (+0.07%), llama-gguf-split (+0.03%), llama-quantize (+0.02%), with llama-run (-0.10%) and llama-cvector-generator (-0.08%) showing slight improvements.

Code Implementation Analysis:

The PR adds infrastructure for parsing safetensors files (llama-safetensors.cpp, 398 lines), HuggingFace config parsing (llama-hf-config.cpp, 220 lines), type conversion utilities (llama-safetensors-types.cpp, 157 lines), and tensor name mapping (llama-safetensors-loader.cpp, 271 lines). The model builder (llama-model-from-safetensors.cpp, 218 lines) defines an 8-step loading pipeline but implements only steps 1-3. Steps 4-8 (create_model_structure, allocate_tensors, load_tensor_data, init_vocabulary, finalize_model) return false with error messages.

The implementation uses C-style FILE* operations for file I/O, nlohmann/json for parsing, and std::regex for tensor name mapping. Type conversion functions support F32, F16, BF16, I32, I16, I8 formats with element-wise loops for conversions.
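Before any tensor data can be read, the parser has to handle the safetensors framing: the file starts with an 8-byte little-endian u64 giving the length of a JSON header (tensor name → dtype/shape/data_offsets), which the PR then parses with nlohmann/json. A minimal sketch of that first step, with an illustrative function name that is not the PR's actual symbol:

```cpp
#include <cstdint>

// Read the 8-byte little-endian header length that opens every
// safetensors file. Assembling byte-by-byte keeps the code portable
// regardless of host endianness.
static uint64_t read_header_len(const unsigned char * buf8) {
    uint64_t n = 0;
    for (int i = 7; i >= 0; --i) {
        n = (n << 8) | buf8[i];  // byte 0 is least significant
    }
    return n;
}
```

The `n` bytes that follow this prefix are the JSON header; everything after that is raw tensor data addressed by the `data_offsets` entries.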

@loci-dev force-pushed the main branch 2 times, most recently from 5efc8b7 to f077805 on November 29, 2025 at 09:08
So we can load these natively just like gguf

Signed-off-by: Eric Curtin <[email protected]>
@loci-dev force-pushed the upstream-PR17580-branch_ericcurtin-support-safetensors branch from ff29a86 to a963646 on November 29, 2025 at 13:37
@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #351: Safetensors Support

Overview

PR #351 introduces safetensors model loading capability through 11 new files (2,296 lines). This is an additive feature with no modifications to existing inference paths. The performance analysis reveals no impact on runtime inference performance, as the new code affects only the model loading phase.

Key Findings

Inference Performance Impact

No impact on tokens per second. The safetensors loading path does not modify any inference-critical functions:

  • llama_decode - unchanged
  • llama_encode - unchanged
  • llama_tokenize - unchanged

All inference operations use the same GGML backend and tensor structures regardless of whether the model was loaded from GGUF or safetensors format. Once loaded, model execution is identical.

Model Loading Performance

The new safetensors loader exhibits different characteristics compared to GGUF:

Type Conversion Operations:

  • F32→F16 conversion: element-wise loop processing 16M-2B elements per model
  • F64→F32 downcast: similar element-wise processing
  • Direct memcpy for matching types (F32→F32, F16→F16)

For a 7B parameter model, type conversion adds approximately 10,000-30,000 ms (10-30 seconds) to load time. This is a one-time cost during model initialization and does not affect subsequent inference.
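The two copy paths above can be sketched as follows. The function names are illustrative, not the PR's actual symbols: mismatched dtypes go through an element-wise loop, while matching dtypes take a single bulk memcpy.

```cpp
#include <cstring>
#include <cstddef>

// Element-wise path: each value is individually downcast, which is why
// conversion cost scales linearly with parameter count.
static void convert_f64_to_f32(const double * src, float * dst, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dst[i] = static_cast<float>(src[i]);  // per-element downcast
    }
}

// Fast path for matching types: one bulk memcpy, no per-element work.
static void copy_f32_to_f32(const float * src, float * dst, size_t n) {
    std::memcpy(dst, src, n * sizeof(float));
}
```

The same structure applies to the F32→F16 path, with the per-element downcast replaced by a float-to-half conversion.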

Tensor Name Mapping:

  • Uses std::regex for pattern matching on 300+ tensor names
  • Adds 300-500 ms overhead during load
  • String concatenation for name construction
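A hedged sketch of the regex-based mapping, with one illustrative rule translating a HuggingFace-style name to GGUF's "blk.N" naming scheme. The PR reportedly applies a table of such patterns to 300+ names, which is where the 300-500 ms of load-time overhead comes from:

```cpp
#include <regex>
#include <string>

// Map one HuggingFace-style tensor name to the GGUF-style convention.
// A real mapper iterates a table of (pattern, replacement) pairs; one
// pair is shown here for illustration.
static std::string map_tensor_name(const std::string & hf_name) {
    static const std::regex q_proj(
        R"(model\.layers\.(\d+)\.self_attn\.q_proj\.weight)");
    return std::regex_replace(hf_name, q_proj, "blk.$1.attn_q.weight");
}
```

Because `std::regex` compiles and matches at runtime with no caching across tensors beyond the static pattern objects, the per-name cost is much higher than a hand-written prefix match would be.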

File I/O Pattern:

  • Sequential fread operations with fseek for tensor data
  • Temporary buffer allocation per tensor (peak memory = model size + largest tensor)
  • No memory-mapped file support
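The I/O pattern above can be sketched as follows (an assumed shape of the loader, not the PR's exact code): seek to a tensor's byte range and fread it into a per-tensor scratch buffer. Because every tensor is staged through a temporary buffer and there is no mmap path, peak memory is roughly model size plus the largest single tensor.

```cpp
#include <cstdio>
#include <cstdint>
#include <vector>

// Read nbytes of tensor data starting at offset into a scratch buffer.
// Returns false on a failed seek or a short read.
static bool read_tensor_data(FILE * f, long offset, size_t nbytes,
                             std::vector<uint8_t> & scratch) {
    scratch.resize(nbytes);  // temporary per-tensor buffer
    if (std::fseek(f, offset, SEEK_SET) != 0) {
        return false;
    }
    return std::fread(scratch.data(), 1, nbytes, f) == nbytes;
}
```

A memory-mapped variant would avoid both the seek/read pairs and the per-tensor buffer, which is how the GGUF path keeps load-phase memory flat.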

Power Consumption Analysis

Binary-level impact:

  • libllama.so: No change (0.0%) - safetensors code is separate module
  • New binaries: No new executables added, only library code

The safetensors loading functions are not included in the power consumption baseline as they represent new, optional code paths. When active, the loading phase will consume additional CPU cycles for type conversion and file I/O, but this is transient and does not affect steady-state inference power consumption.

Implementation Status

Incomplete vocabulary loading: The init_vocabulary() function returns success without loading tokenizer data. Models load structurally but lack vocabulary for text processing. This affects the usability of safetensors-loaded models but does not impact performance metrics of existing GGUF-based workflows.

Performance-Critical Areas

Model Loading Module:

  • New alternative path for safetensors format
  • Existing GGUF path unchanged
  • No shared code between loaders

Memory Management Module:

  • Uses standard ggml_backend_alloc_ctx_tensors() allocation
  • Same backend buffer management as GGUF
  • No changes to KV cache or memory recurrent systems

Token Processing Module:

  • Vocabulary initialization incomplete
  • No changes to existing tokenization functions
  • Safetensors path does not affect active tokenization performance

The implementation is architecturally sound as an isolated feature addition. The performance characteristics differ from GGUF loading but do not regress existing functionality. Inference performance remains unchanged as the new code operates exclusively in the model initialization phase.

@loci-dev force-pushed the main branch 5 times, most recently from e4a4e1d to d0b408b on November 30, 2025 at 02:46