Running LLMs on Android#
ExecuTorch’s LLM-specific runtime components provide an experimental Java interface around the core C++ LLM runtime, available through the executorch-android AAR.
Prerequisites#
Make sure you have your model (.pte) and tokenizer files ready, as described in the prerequisites section of the Running LLMs with C++ guide.
To add the executorch-android library to your app, see Using ExecuTorch on Android. The LLM runner classes are bundled inside the same AAR as the generic Module API.
Runtime API#
Once the executorch-android AAR is on your classpath, you can import the LLM runner classes from the org.pytorch.executorch.extension.llm package.
Importing#
import org.pytorch.executorch.extension.llm.LlmModule;
import org.pytorch.executorch.extension.llm.LlmModuleConfig;
import org.pytorch.executorch.extension.llm.LlmGenerationConfig;
import org.pytorch.executorch.extension.llm.LlmCallback;
LlmModule#
The LlmModule class provides a simple Java interface for loading a text-generation model, configuring its tokenizer, generating token streams, and stopping execution. It also supports multimodal models that accept image and audio inputs alongside a text prompt.
This API is experimental and subject to change.
Initialization#
Create an LlmModule by specifying paths to your serialized model (.pte) and tokenizer files. For text-only models, the simple constructor is enough:
LlmModule module = new LlmModule(
    "/data/local/tmp/llama-3.2-instruct.pte",
    "/data/local/tmp/tokenizer.model",
    /*temperature=*/0.8f);
For finer control (multimodal model type, BOS/EOS handling, supplementary data files, load mode), use LlmModuleConfig with the fluent builder:
LlmModuleConfig config = LlmModuleConfig.create()
    .modulePath("/data/local/tmp/llama-3.2-instruct.pte")
    .tokenizerPath("/data/local/tmp/tokenizer.model")
    .temperature(0.8f)
    .modelType(LlmModuleConfig.MODEL_TYPE_TEXT)
    .loadMode(LlmModuleConfig.LOAD_MODE_MMAP)
    .build();
LlmModule module = new LlmModule(config);
Available load modes are LOAD_MODE_FILE, LOAD_MODE_MMAP (default), LOAD_MODE_MMAP_USE_MLOCK, and LOAD_MODE_MMAP_USE_MLOCK_IGNORE_ERRORS. Available model types are MODEL_TYPE_TEXT, MODEL_TYPE_TEXT_VISION, and MODEL_TYPE_MULTIMODAL.
Construction itself is lightweight and does not load the program data immediately.
Loading#
Explicitly load the model before generation to avoid paying the load cost during your first generate call.
int status = module.load();
if (status != 0) {
    // Handle load failure (status is an ExecuTorch runtime error code).
}
If you skip this step, the model is loaded lazily on the first generate call.
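The status check above can be wrapped so that a failed load surfaces as an exception instead of a return code that is easy to ignore. This is a minimal sketch and not part of the ExecuTorch API: `ModelLoader` and `loadOrThrow` are hypothetical names, and the `IntSupplier` parameter stands in for `module::load` so the snippet compiles without the AAR.

```java
import java.util.function.IntSupplier;

/** Hypothetical helper; not part of executorch-android. */
final class ModelLoader {
    private ModelLoader() {}

    /**
     * Runs the supplied load call and throws if it returns a non-zero status.
     * In an app, pass module::load as the argument.
     */
    static void loadOrThrow(IntSupplier loadCall) {
        int status = loadCall.getAsInt();
        if (status != 0) {
            throw new IllegalStateException(
                "Model load failed with runtime error code 0x" + Integer.toHexString(status));
        }
    }
}
```

Calling `ModelLoader.loadOrThrow(module::load)` from a background thread at startup keeps the load cost out of the first generate call.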
Generating#
Generate tokens from a text prompt by passing an LlmCallback that receives each token as it is produced. The same callback also receives a JSON-encoded statistics string when generation completes.
LlmCallback callback = new LlmCallback() {
    @Override
    public void onResult(String token) {
        // Called once per generated token. Append to your UI buffer here.
        System.out.print(token);
    }

    @Override
    public void onStats(String statsJson) {
        // Called once when generation finishes. See extension/llm/runner/stats.h
        // for the field definitions.
        System.out.println("\n" + statsJson);
    }

    @Override
    public void onError(int errorCode, String message) {
        // Called if the runtime reports an error during generation.
    }
};
module.generate("Once upon a time", callback);
For full control over generation parameters, use LlmGenerationConfig:
LlmGenerationConfig genConfig = LlmGenerationConfig.create()
    .seqLen(2048)
    .temperature(0.8f)
    .echo(false)
    .build();
module.generate("Once upon a time", genConfig, callback);
LlmGenerationConfig exposes echo, maxNewTokens, seqLen, temperature, numBos, numEos, and warming. Defaults match the C++ GenerationConfig documented in Running LLMs with C++.
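In practice the callback usually accumulates tokens for the UI rather than printing them. The sketch below shows one way to do that, under stated assumptions: `TokenCollector` and its accessors are hypothetical names, the class deliberately does not declare `implements LlmCallback` so that it compiles without the AAR, and the stats lookup uses a plain regex instead of a JSON library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Collects streamed tokens and the final stats string.
 * In an app, declare this class as "implements LlmCallback" and add @Override
 * to the three callback methods; that is omitted here so the sketch compiles
 * without the executorch-android AAR.
 */
final class TokenCollector {
    private final StringBuilder text = new StringBuilder();
    private String statsJson;     // null until generation finishes
    private String errorMessage;  // null unless the runtime reported an error

    public synchronized void onResult(String token) {
        text.append(token);
    }

    public synchronized void onStats(String statsJson) {
        this.statsJson = statsJson;
    }

    public synchronized void onError(int errorCode, String message) {
        this.errorMessage = "error " + errorCode + ": " + message;
    }

    /** Returns everything generated so far. */
    synchronized String text() {
        return text.toString();
    }

    synchronized String errorMessage() {
        return errorMessage;
    }

    /**
     * Returns the non-negative integer stored under key in the stats JSON,
     * or -1 if stats have not arrived yet or the key is absent.
     */
    synchronized long statAsLong(String key) {
        if (statsJson == null) {
            return -1;
        }
        Pattern p = Pattern.compile("\"" + Pattern.quote(key) + "\"\\s*:\\s*(\\d+)");
        Matcher m = p.matcher(statsJson);
        return m.find() ? Long.parseLong(m.group(1)) : -1;
    }
}
```

The methods are synchronized because the callbacks run on the thread that called generate(), while the UI typically reads the text from a different thread. Take the key names passed to `statAsLong` from extension/llm/runner/stats.h; none are assumed here.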
Stopping Generation#
If you need to interrupt a long-running generation, call stop() from another thread (or from inside the onResult callback):
module.stop();
Because generate() runs synchronously on the calling thread and blocks until generation finishes, invoke it off the main thread (for example, on a HandlerThread or via a java.util.concurrent.Executor).
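The threading advice above can be sketched with a single-thread executor. Everything named here is an assumption made for illustration: `TextGenerator` is a hypothetical interface standing in for the blocking generate() and stop() calls, `GenerationWorker` is a hypothetical wrapper, and `FakeGenerator` exists only so the pattern can be exercised without a model.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

/** Hypothetical stand-in for the blocking generate() and stop() calls. */
interface TextGenerator {
    void generate(String prompt, Consumer<String> onToken); // blocks until finished
    void stop();                                            // may be called from any thread
}

/** Runs blocking generation on a single background thread. */
final class GenerationWorker {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final TextGenerator generator;

    GenerationWorker(TextGenerator generator) {
        this.generator = generator;
    }

    /** Queues a generation request and returns immediately. */
    Future<?> submit(String prompt, Consumer<String> onToken) {
        return executor.submit(() -> generator.generate(prompt, onToken));
    }

    /** Asks the in-flight generation to stop. */
    void cancel() {
        generator.stop();
    }

    /** Lets queued work finish, then releases the background thread. */
    void shutdown() {
        executor.shutdown();
    }
}

/** Model-free generator that emits numbered tokens until stopped; for testing only. */
final class FakeGenerator implements TextGenerator {
    private final AtomicBoolean stopped = new AtomicBoolean(false);

    @Override
    public void generate(String prompt, Consumer<String> onToken) {
        for (int i = 0; i < 1000 && !stopped.get(); i++) {
            onToken.accept("tok" + i);
        }
    }

    @Override
    public void stop() {
        stopped.set(true);
    }
}
```

In an app, a small adapter implementing `TextGenerator` would forward to module.generate() and module.stop(). A single-thread executor also serializes requests, so two prompts never run against the same module at once.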
Resetting#
To clear the prefilled tokens from the KV cache and reset the start position to 0, call:
module.resetContext();
This is the equivalent of reset() on the iOS runner and reset() on the C++ IRunner.
Multimodal Inputs#
For models declared as MODEL_TYPE_TEXT_VISION or MODEL_TYPE_MULTIMODAL, image and audio data are provided through dedicated prefill methods. After prefilling all modalities, call generate() with the text prompt to produce the response.
Images#
Raw uint8 pixel data in CHW order can be supplied as an int[], or as a direct ByteBuffer to avoid JNI array copies:
// As int[]
int[] pixels = ...; // length == channels * height * width
module.prefillImages(pixels, /*width=*/336, /*height=*/336, /*channels=*/3);

// As direct ByteBuffer (preferred for large images)
ByteBuffer buffer = ByteBuffer.allocateDirect(3 * 336 * 336);
buffer.put(rawBytes).rewind(); // rawBytes: byte[] of CHW pixel data, length 3 * 336 * 336
module.prefillImages(buffer, 336, 336, 3);
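Image sources on Android rarely hand out CHW data directly. For example, Bitmap.getPixels() fills an int[] with packed ARGB values in row-major order. The sketch below shows one way to unpack such an array into the CHW int[] layout described above. `ChwPixels` and `fromArgb` are hypothetical names, and the R, G, B channel order is an assumption that should be checked against the model's preprocessing.

```java
/** Hypothetical helper that converts packed ARGB pixels to planar CHW values. */
final class ChwPixels {
    private ChwPixels() {}

    /**
     * Converts row-major packed ARGB pixels into a CHW array of 0..255 values.
     * The alpha channel is dropped; output planes are ordered R, G, B.
     */
    static int[] fromArgb(int[] argb, int width, int height) {
        int plane = width * height;
        if (argb.length != plane) {
            throw new IllegalArgumentException(
                "expected " + plane + " pixels, got " + argb.length);
        }
        int[] chw = new int[3 * plane];
        for (int i = 0; i < plane; i++) {
            int p = argb[i];
            chw[i] = (p >> 16) & 0xFF;            // red plane
            chw[plane + i] = (p >> 8) & 0xFF;     // green plane
            chw[2 * plane + i] = p & 0xFF;        // blue plane
        }
        return chw;
    }
}
```

The result has length 3 * width * height and can be passed as the pixels argument with channels set to 3.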
Pre-normalized float pixel data is also supported, both as a float[] and as a direct ByteBuffer in native byte order:
float[] normalized = ...; // length == channels * height * width
module.prefillImages(normalized, 336, 336, 3);

ByteBuffer floatBuffer = ByteBuffer
    .allocateDirect(3 * 336 * 336 * Float.BYTES)
    .order(ByteOrder.nativeOrder());
// fill floatBuffer with normalized values, then:
module.prefillNormalizedImage(floatBuffer, 336, 336, 3);
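Filling the float buffer is left open above. One common approach is to scale each 0..255 value to 0..1 and then apply a per-channel mean and standard deviation. The following is a sketch under that assumption: `NormalizedImageBuffer` and `fromChw` are hypothetical names, and the mean and std values are model-specific inputs that the caller must supply.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper that builds a native-order direct float buffer from CHW pixels. */
final class NormalizedImageBuffer {
    private NormalizedImageBuffer() {}

    /**
     * Normalizes CHW pixel values (0..255) as (value / 255 - mean[c]) / std[c]
     * and writes them into a direct ByteBuffer in native byte order.
     * mean and std must have one entry per channel.
     */
    static ByteBuffer fromChw(int[] chw, float[] mean, float[] std) {
        int channels = mean.length;
        if (channels == 0 || std.length != channels || chw.length % channels != 0) {
            throw new IllegalArgumentException("channel count does not match pixel data");
        }
        int plane = chw.length / channels;
        ByteBuffer buffer = ByteBuffer
            .allocateDirect(chw.length * Float.BYTES)
            .order(ByteOrder.nativeOrder());
        for (int c = 0; c < channels; c++) {
            for (int i = 0; i < plane; i++) {
                float scaled = chw[c * plane + i] / 255f;
                buffer.putFloat((scaled - mean[c]) / std[c]);
            }
        }
        buffer.rewind(); // reset position so the consumer reads from the start
        return buffer;
    }
}
```

The returned buffer holds channels * height * width floats and matches the direct, native-order layout that the normalized prefill call above expects.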
Audio#
Preprocessed audio features (for example mel spectrograms produced by a Whisper preprocessor) can be supplied as byte[] or float[]:
module.prefillAudio(features, /*batchSize=*/1, /*nBins=*/128, /*nFrames=*/3000);
Raw audio samples can be supplied with prefillRawAudio:
module.prefillRawAudio(samples, /*batchSize=*/1, /*nChannels=*/1, /*nSamples=*/16000);
Generating with Multimodal Prefill#
After prefilling each modality, run generate() with the text prompt as usual:
module.prefillImages(pixels, 336, 336, 3);
module.generate("What's in this image?", callback);
For text-vision models, a convenience overload accepts the image and prompt together:
module.generate(
    pixels, /*width=*/336, /*height=*/336, /*channels=*/3,
    "What's in this image?",
    /*seqLen=*/768,
    callback,
    /*echo=*/false);
Demo#
See the Llama Android demo app in executorch-examples for an end-to-end project that wires LlmModule, LlmCallback, and a HandlerThread into a chat UI.