I propose refactoring main.cpp into a library (llama.cpp, compiled to llama.so/llama.a/whatever) and making main.cpp a simple driver program. A simple C API should be exposed to access the model, and then bindings can more easily be written for Python, node.js, or whatever other language. This would partially solve #82 and #162.
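To make the idea concrete, here is a rough sketch of what such a C header could look like. All of the names (`llama_context`, `llama_init_from_file`, `llama_eval`, and so on) are hypothetical and only illustrate the shape of the interface, not an existing API:

```c
/* llama.h -- hypothetical public C API for the proposed library */
#ifndef LLAMA_H
#define LLAMA_H

#ifdef __cplusplus
extern "C" {
#endif

/* Opaque handle to a loaded model and its evaluation state. */
typedef struct llama_context llama_context;

/* Load a ggml model file and return a ready-to-use context, or NULL on error. */
llama_context *llama_init_from_file(const char *path_model, int n_ctx, int n_threads);

/* Convert UTF-8 text into model tokens; returns the number of tokens written. */
int llama_tokenize(llama_context *ctx, const char *text, int *tokens, int max_tokens);

/* Run the model on n_tokens tokens, continuing from n_past previous tokens. */
int llama_eval(llama_context *ctx, const int *tokens, int n_tokens, int n_past);

/* Sample the next token from the logits produced by the last llama_eval call. */
int llama_sample_top_p_top_k(llama_context *ctx, int top_k, float top_p, float temp);

/* Free the context and all associated memory. */
void llama_free(llama_context *ctx);

#ifdef __cplusplus
}
#endif

#endif /* LLAMA_H */
```

With only an opaque pointer and plain C types crossing the boundary, the Python or node.js bindings would reduce to a thin FFI layer (ctypes/cffi, N-API, or similar) over the shared library.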
Edit: on that note, is it possible to do inference from two or more prompts on different threads? If so, serving multiple people would be possible without multiple copies of model weights in RAM.
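One way that could work, sketched as a variation on the header above: split the single opaque handle into shared, read-only weights and per-prompt state, so each thread gets its own small session while the big tensors are loaded once. Again, the `llama_model`/`llama_session` names are hypothetical:

```c
/* Hypothetical split between weights shared across threads and per-request state. */
typedef struct llama_model   llama_model;    /* weights: read-only after loading       */
typedef struct llama_session llama_session;  /* KV cache + logits: one per prompt/user */

llama_model   *llama_load_model(const char *path_model);
llama_session *llama_new_session(const llama_model *model, int n_ctx);
int            llama_session_eval(llama_session *s, const int *tokens, int n_tokens, int n_past);
void           llama_free_session(llama_session *s);
void           llama_free_model(llama_model *m);

/* Usage from a server: load the model once, then give each worker thread its own
 * session. The weights are never written after loading, so no locking is needed on
 * the model itself; only the per-session KV cache and scratch buffers are mutable.
 *
 *   llama_model *model = llama_load_model("model.bin");
 *   // thread A: llama_session *a = llama_new_session(model, 512);  evaluate prompt A
 *   // thread B: llama_session *b = llama_new_session(model, 512);  evaluate prompt B
 */
```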
For anyone wanting to do this, see an initial attempt in #77, and in particular this comment on ggerganov's preferred approach. Should be pretty straightforward I think.