Conversation

@VJHack (Collaborator) commented Jan 25, 2025

I propose an implementation for speculative fill-in-the-middle. We now take advantage of the time the server sits idle while a suggestion is displayed to fetch the next completion, assuming the user accepts the current one. This gives the illusion of near-zero-latency suggestions and a better user experience.

It works by calculating the prefix, suffix, and prompt as if the user had already accepted the current suggestion. Apart from the new prefix, suffix, and prompt, it is nearly identical to llama#fim(). After the new FIM completion is fetched, we cache it instead of displaying it, so when the user accepts the current suggestion, the next completion is a cache hit.
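To make the flow concrete, here is a minimal Python sketch of the idea. The actual plugin is Vimscript (llama#fim() / llama#speculate()), so every name here (SpeculativeFIM, fetch_fim, the cache key layout) is a hypothetical illustration, not the plugin's real API:

```python
# Illustrative sketch of speculative FIM: while a suggestion is on screen,
# pre-fetch the completion that would follow if the user accepts it.

class SpeculativeFIM:
    def __init__(self, fetch_fim):
        self.fetch_fim = fetch_fim   # callable(prefix, suffix, prompt) -> completion
        self.cache = {}              # (prefix, suffix, prompt) -> completion

    def on_suggestion_shown(self, prefix, prompt, suffix, suggestion):
        # Pretend the user accepts the current suggestion: the accepted text
        # is appended to the context, and we fetch the *next* completion now.
        lines = (prompt + suggestion).split("\n")
        new_prefix = prefix + "\n".join(lines[:-1]) + ("\n" if len(lines) > 1 else "")
        new_prompt = lines[-1]       # last (possibly partial) line becomes the prompt
        key = (new_prefix, suffix, new_prompt)
        if key not in self.cache:
            self.cache[key] = self.fetch_fim(new_prefix, suffix, new_prompt)

    def on_accept(self, prefix, prompt, suffix):
        # If the speculated coordinates match the real post-accept cursor
        # state, the next completion is served from the cache with no request.
        return self.cache.get((prefix, suffix, prompt))
```

The key correctness requirement (and the source of the bug discussed below) is that the speculated (prefix, suffix, prompt) must match exactly what the plugin computes after the user actually accepts.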

I'm aware that a lot of the code in llama#fim() is duplicated in llama#speculate(). In most cases I would say it's better to modularize, but here I think keeping two separate implementations is more readable, maintainable, and less messy.

I'm open to suggestions on modularizing the implementation. TAB, TAB, TAB!

Screen.Recording.2025-01-24.at.7.22.55.PM.mov

@VJHack VJHack requested a review from ggerganov January 25, 2025 02:06
@ggerganov (Member) commented:

I tested this for a bit, but it does not seem to work correctly. For example:

I just got a suggestion:

[screenshot: the suggestion as displayed in the editor]

The full completion here from the logs is:

{
    "content":"esult = 0;\n    int n = atoi(argv[1]);\n    for (int i = 2; i < n; i++) {\n       "
...
}

The speculative request again from the logs is:

{
    "input_suffix": "\n\n",
    "input_prefix": "#include <cstdio>\n\n// find the first n primes, read n from command line\nint main(int argc, char *argv[]) {\n    if (argc != 2) {\n        printf(\"Usage: %s <n>\n\", argv[0]);\n        return 1;\n    }\n    int r\n    int result = 0;\n    int n = atoi(argv[1]);\n    for (int i = 2; i < n; i++) {\n",
    "top_p": 0.99,
    "input_extra": [
        {
            "filename": "test.cpp",
            "time": [
                446730,
                1939646170
            ],
            "text": "#include <cstdio>\n\n// print hello\nint main(int argc, char *argv[]) {\n    printf(\"\");\n    return 0;\n}\n"
        }
    ],
    ...
    "prompt": "       ",
    "cache_prompt": true
}

This is correct. However, accepting the suggestion above does not lead to a cache hit:

[screenshot: the follow-up request missing the cache]

It sent this request instead:

{
    "input_suffix": "\n\n",
    "input_prefix": "#include <cstdio>\n\n// find the first n primes, read n from command line\nint main(int argc, char *argv[]) {\n    if (argc != 2) {\n        printf(\"Usage: %s <n>\n\", argv[0]);\n        return 1;\n    }\n    int result = 0;\n    int n = atoi(argv[1]);\n    for (int i = 2; i < n; i++) {\n       \n        if (is_prime(i)) {\n            result++;\n        }\n",
    ...
    "prompt": "    ",
    "cache_prompt": true
}

So something is not OK.

@ggerganov (Member) commented Jan 25, 2025

> I'm open to suggestions on modularizing the implementation.

In general, I think the implementation should be improved like this:

  • Server-communication logic
    • The client sends async requests.
    • Each received response is processed and stored in the cache.
    • Changes in the cache trigger display events.
  • Display logic
    • Upon a display event, get the local cursor state and search for it in the cache.
    • If there is a match, display it as a suggestion.
    • If there is not, do nothing.

Such an implementation will allow us to build more complex caching and prediction mechanisms, because rendering will be decoupled from server communication. Currently this is not the case, and there are many places where we can make a mistake.
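The proposed split can be sketched as follows. This is an illustrative Python outline under my own naming (CompletionCache, make_display, get_cursor_state), not the plugin's actual Vimscript code; the point is only that responses feed a cache, and display is a pure lookup driven by cache-change events:

```python
# Decoupled design: async server responses only populate a cache; a separate
# display step reacts to cache changes by looking up the current cursor state.

class CompletionCache:
    def __init__(self):
        self._entries = {}           # (prefix, suffix) -> completion
        self._listeners = []

    def on_change(self, callback):
        self._listeners.append(callback)

    def store(self, prefix, suffix, completion):
        # Called whenever an async server response arrives.
        self._entries[(prefix, suffix)] = completion
        for cb in self._listeners:   # trigger display events
            cb()

    def lookup(self, prefix, suffix):
        return self._entries.get((prefix, suffix))

def make_display(cache, get_cursor_state, show):
    # Display logic: on every cache-change event, re-read the local cursor
    # state and show a suggestion only on a cache hit; otherwise do nothing.
    def on_display_event():
        prefix, suffix = get_cursor_state()
        hit = cache.lookup(prefix, suffix)
        if hit is not None:
            show(hit)
    cache.on_change(on_display_event)
    return on_display_event
```

Because the display side never issues requests, speculative prefetching becomes just another producer writing into the cache, and a stale speculation is harmlessly ignored at lookup time.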

@VJHack VJHack added the help wanted Extra attention is needed label Feb 25, 2025
@VJHack VJHack marked this pull request as draft March 3, 2025 19:20
@ggerganov (Member) commented:

I think #53 should work. Will be testing it in the next few days. Any feedback is welcome.

@ggerganov (Member) commented:

Superseded by #53.

@ggerganov ggerganov closed this Mar 10, 2025