Breaking prompt caching

Hey @ezra, reading the Letta docs I’m struggling to find an answer, so posting here. Modifying memory blocks during a live conversation between agent and user (or agent-to-agent) should come at a latency/$$$ cost, since context inserts and evictions break prompt caching. What are Letta’s thoughts on navigating that trade-off?

Good question - this comes up often.

How Letta handles it:
Letta implements prompt caching, and a memory block edit only invalidates the cache for the turn in which it happens. Prompt caches match on a context prefix, and memory blocks sit near the top of the context (compiled into the system prompt), so an edit forces an uncached read of everything after that point - but the invalidation is temporary and localized, and the cache warms back up on the next turn with no edits.
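
A toy sketch of that prefix behavior (illustrative only, not Letta’s implementation):

```python
# Toy model of prefix-based prompt caching - illustrative, not Letta's code.
def cached_prefix_len(prev_ctx: list[str], new_ctx: list[str]) -> int:
    """How many leading context chunks can be served from cache."""
    n = 0
    for a, b in zip(prev_ctx, new_ctx):
        if a != b:
            break
        n += 1
    return n

turn_1 = ["<system>", "<memory: user likes Go>", "<msg 1>"]
turn_2 = ["<system>", "<memory: user likes Go>", "<msg 1>", "<msg 2>"]
turn_3 = ["<system>", "<memory: user likes Rust>", "<msg 1>", "<msg 2>", "<msg 3>"]

print(cached_prefix_len(turn_1, turn_2))  # 3 - whole old context reused (hit)
print(cached_prefix_len(turn_2, turn_3))  # 1 - edit near the top invalidates the rest
```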

The trade-off in practice:

  • Memory edit turn: cache miss, with higher latency and input cost for that turn
  • Subsequent turns: cache hits resume (assuming no further edits) - see the rough cost sketch below
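
For rough dollar intuition, here’s a back-of-envelope sketch. The rates are assumptions loosely modeled on typical provider pricing (cache writes billed at a premium over base input, cache reads at a steep discount) - plug in your model’s real numbers:

```python
# Back-of-envelope turn costs. All rates are ASSUMPTIONS modeled on typical
# provider pricing - substitute your model's actual numbers.
BASE_INPUT = 3.00 / 1_000_000        # $/input token (hypothetical)
CACHE_WRITE = 1.25 * BASE_INPUT      # cache miss: re-read + write new prefix
CACHE_READ = 0.10 * BASE_INPUT       # cache hit: discounted prefix read

def turn_cost(ctx_tokens: int, cached: bool) -> float:
    """Input-token cost of one turn with a warm or cold prefix cache."""
    return ctx_tokens * (CACHE_READ if cached else CACHE_WRITE)

ctx = 50_000                          # system prompt + memory blocks + history
miss = turn_cost(ctx, cached=False)   # turn with a memory edit
hit = turn_cost(ctx, cached=True)     # turn with no edits
ratio = (miss + 9 * hit) / (10 * hit) # one edit every 10 turns vs fully cached
print(f"edit turn ${miss:.4f}, cached turn ${hit:.4f}, blended ~{ratio:.1f}x baseline")
```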

Design philosophy:
The Letta/MemGPT view is that the value of learning and memory persistence outweighs the occasional cache miss. An agent that learns from interactions provides more value over time than one optimized purely for per-turn latency.

If you’re cost-sensitive:

  • Use sleeptime agents - they do memory updates asynchronously during idle periods, so the foreground agent maintains cache hits during active conversation
  • Batch memory updates - instruct agents to accumulate observations and write less frequently (see the sketch after this list)
  • Use archival memory for less critical info - archival writes don’t affect the context cache since archival isn’t in-context
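
And a sketch of the batching idea - the ObservationBuffer helper below is hypothetical, not a Letta API; the flush callback is where your real memory-block update would go:

```python
# Hypothetical batching helper - NOT part of the Letta SDK. The flush
# callback is where your real memory-block update would go.
from typing import Callable

class ObservationBuffer:
    """Accumulate observations and write them in one batch, so only one
    turn pays the prompt-cache miss instead of one miss per observation."""

    def __init__(self, flush: Callable[[str], None], max_items: int = 5):
        self._items: list[str] = []
        self._flush = flush
        self._max = max_items

    def add(self, observation: str) -> None:
        self._items.append(observation)
        if len(self._items) >= self._max:
            self.flush()

    def flush(self) -> None:
        if self._items:
            self._flush("\n".join(self._items))  # one write = one cache miss
            self._items.clear()

# Usage: replace the print with a real memory-block update call.
buf = ObservationBuffer(flush=lambda text: print(f"memory write:\n{text}"))
for obs in ["prefers Go", "works UTC+2", "uses Neovim", "on infra team", "likes terse replies"]:
    buf.add(obs)   # the fifth add triggers a single batched write
```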

Does this help clarify the trade-off?