"docker model run" optimization #623

@ericcurtin

Description

Some inference engines are faster to start than others. A typical example of one that is slow to start is vLLM. That is fine for data centres: they are happy to start a service, warm it up properly, and leave it running basically forever, so being optimal after startup is the focus. Today, when we do:

docker model run some_model

we only start the vLLM server once we have finished typing the first query and pressed Enter to send it. This is slow.

Instead, when the user runs:

docker model run some_model

we should start the "vllm serve" process immediately in the background, so that by the time the user sends the first query there is less waiting; the model may even be fully loaded at that point. RamaLama does this, FWIW (I implemented it there).
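
A minimal sketch of the proposed flow, not the actual model-runner code: launch the inference server as soon as "docker model run" is invoked, then only block on readiness when the first prompt is actually submitted. The command line ("vllm serve"), port, and health endpoint below are assumptions for illustration.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"os/exec"
	"time"
)

// startServer launches the inference engine in the background and returns
// immediately, so model loading overlaps with the user typing the first prompt.
func startServer(ctx context.Context, model string) (*exec.Cmd, error) {
	// Hypothetical invocation; flags depend on the engine's actual CLI.
	cmd := exec.CommandContext(ctx, "vllm", "serve", model, "--port", "8000")
	if err := cmd.Start(); err != nil {
		return nil, fmt.Errorf("starting vllm: %w", err)
	}
	return cmd, nil
}

// waitReady polls the server's health endpoint until it responds or the
// context expires. Called only when the first query is about to be sent.
func waitReady(ctx context.Context, baseURL string) error {
	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			resp, err := http.Get(baseURL + "/health")
			if err == nil {
				resp.Body.Close()
				if resp.StatusCode == http.StatusOK {
					return nil
				}
			}
		}
	}
}

func main() {
	ctx := context.Background()
	if _, err := startServer(ctx, "some_model"); err != nil {
		panic(err)
	}
	// ... interactive prompt loop runs here while the server loads ...
	if err := waitReady(ctx, "http://127.0.0.1:8000"); err != nil {
		panic(err)
	}
	fmt.Println("server ready; sending first query")
}
```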
