Conversation

@VJHack (Collaborator) commented Jan 25, 2025

I propose an implementation for speculative fill-in-the-middle. We now take advantage of the time the server sits idle while a suggestion is displayed to fetch the next completion, assuming the user accepts the current one. This gives the illusion of near-zero-latency suggestions and a better user experience.

It works by calculating the prefix, suffix, and prompt as if the user had already accepted the current suggestion. Apart from the new prefix, suffix, and prompt, it is nearly identical to llama#fim(). After the new FIM completion is fetched, we cache it instead of displaying it, so when the user accepts the current suggestion, the next completion is a cache hit.
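To make the flow concrete, here is a minimal Python sketch of the idea. The actual plugin is Vimscript (llama#fim() / llama#speculate()), so every name here (SpeculativeFIM, fetch_fim, the cache key layout) is a hypothetical illustration, not the plugin's real API:

```python
# Illustrative sketch of speculative FIM: while a suggestion is on screen,
# pre-fetch the completion that would follow if the user accepts it.

class SpeculativeFIM:
    def __init__(self, fetch_fim):
        self.fetch_fim = fetch_fim   # callable(prefix, suffix, prompt) -> completion
        self.cache = {}              # (prefix, suffix, prompt) -> completion

    def on_suggestion_shown(self, prefix, prompt, suffix, suggestion):
        # Pretend the user accepts the current suggestion: the accepted text
        # is appended to the context, and we fetch the *next* completion now.
        lines = (prompt + suggestion).split("\n")
        new_prefix = prefix + "\n".join(lines[:-1]) + ("\n" if len(lines) > 1 else "")
        new_prompt = lines[-1]       # last (possibly partial) line becomes the prompt
        key = (new_prefix, suffix, new_prompt)
        if key not in self.cache:
            self.cache[key] = self.fetch_fim(new_prefix, suffix, new_prompt)

    def on_accept(self, prefix, prompt, suffix):
        # If the speculated coordinates match the real post-accept cursor
        # state, the next completion is served from the cache with no request.
        return self.cache.get((prefix, suffix, prompt))
```

The key correctness requirement (and the source of the bug discussed below) is that the speculated (prefix, suffix, prompt) must match exactly what the plugin computes after the user actually accepts.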

I'm aware that a lot of the code in llama#fim() is duplicated in llama#speculate(). In most cases I would say it's better to modularize, but here I think keeping two separate implementations is more readable, maintainable, and less messy.

I'm open to suggestions on modularizing the implementation. TAB, TAB, TAB!

Screen.Recording.2025-01-24.at.7.22.55.PM.mov

@VJHack VJHack requested a review from ggerganov January 25, 2025 02:06
@ggerganov (Member) commented:

I tested this for a bit, but it does not seem to work correctly. For example:

I just got a suggestion:

[screenshot: the suggestion as displayed in the editor]

The full completion here from the logs is:

{
    "content":"esult = 0;\n    int n = atoi(argv[1]);\n    for (int i = 2; i < n; i++) {\n       "
...
}

The speculative request again from the logs is:

{
    "input_suffix": "\n\n",
    "input_prefix": "#include <cstdio>\n\n// find the first n primes, read n from command line\nint main(int argc, char *argv[]) {\n    if (argc != 2) {\n        printf(\"Usage: %s <n>\n\", argv[0]);\n        return 1;\n    }\n    int r\n    int result = 0;\n    int n = atoi(argv[1]);\n    for (int i = 2; i < n; i++) {\n",
    "top_p": 0.99,
    "input_extra": [
        {
            "filename": "test.cpp",
            "time": [
                446730,
                1939646170
            ],
            "text": "#include <cstdio>\n\n// print hello\nint main(int argc, char *argv[]) {\n    printf(\"\");\n    return 0;\n}\n"
        }
    ],
    ...
    "prompt": "       ",
    "cache_prompt": true
}

This is correct. However, accepting the suggestion above does not lead to a cache hit:

[screenshot: the follow-up request missing the cache]

It sent this request instead:

{
    "input_suffix": "\n\n",
    "input_prefix": "#include <cstdio>\n\n// find the first n primes, read n from command line\nint main(int argc, char *argv[]) {\n    if (argc != 2) {\n        printf(\"Usage: %s <n>\n\", argv[0]);\n        return 1;\n    }\n    int result = 0;\n    int n = atoi(argv[1]);\n    for (int i = 2; i < n; i++) {\n       \n        if (is_prime(i)) {\n            result++;\n        }\n",
    ...
    "prompt": "    ",
    "cache_prompt": true
}

So something is not OK.

@ggerganov (Member) commented Jan 25, 2025

> I'm open to suggestions on modularizing the implementation.

In general, I think the implementation should be improved like this:

  • Server-communication logic
    • The client sends async requests.
    • Each received response is processed and stored in the cache.
    • Changes in the cache trigger display events.
  • Display logic
    • Upon a display event, get the local cursor state and search for it in the cache.
    • If there is a match, display it as a suggestion.
    • If there is not, do nothing.

Such an implementation will allow us to build more complex caching and prediction mechanisms, because rendering will be decoupled from server communication. Currently this is not the case, and there are many places where we can make a mistake.
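The proposed split can be sketched as follows. This is an illustrative Python outline under my own naming (CompletionCache, make_display, get_cursor_state), not the plugin's actual Vimscript code; the point is only that responses feed a cache, and display is a pure lookup driven by cache-change events:

```python
# Decoupled design: async server responses only populate a cache; a separate
# display step reacts to cache changes by looking up the current cursor state.

class CompletionCache:
    def __init__(self):
        self._entries = {}           # (prefix, suffix) -> completion
        self._listeners = []

    def on_change(self, callback):
        self._listeners.append(callback)

    def store(self, prefix, suffix, completion):
        # Called whenever an async server response arrives.
        self._entries[(prefix, suffix)] = completion
        for cb in self._listeners:   # trigger display events
            cb()

    def lookup(self, prefix, suffix):
        return self._entries.get((prefix, suffix))

def make_display(cache, get_cursor_state, show):
    # Display logic: on every cache-change event, re-read the local cursor
    # state and show a suggestion only on a cache hit; otherwise do nothing.
    def on_display_event():
        prefix, suffix = get_cursor_state()
        hit = cache.lookup(prefix, suffix)
        if hit is not None:
            show(hit)
    cache.on_change(on_display_event)
    return on_display_event
```

Because the display side never issues requests, speculative prefetching becomes just another producer writing into the cache, and a stale speculation is harmlessly ignored at lookup time.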

@VJHack VJHack added the help wanted Extra attention is needed label Feb 25, 2025
@VJHack VJHack marked this pull request as draft March 3, 2025 19:20
@ggerganov (Member) commented:

I think #53 should work. Will be testing it in the next few days. Any feedback is welcome.

@ggerganov (Member) commented:

Superseded by #53.

@ggerganov ggerganov closed this Mar 10, 2025