Some inference engines are faster to start than others; a typical example of one that is slow to start is vLLM. That is fine for data centres: they are happy to start a service, warm it up properly, and leave it running basically forever, so being optimal after startup is the focus. Today, when we do:
docker model run some_model
we only start the vLLM server once the user has finished typing their first query and pressed enter to send it. This is slow.
What we should do instead: when the user runs
docker model run some_model
we should start the "vllm serve" process immediately in the background. By the time the user sends the first query they will have to wait less; the model may even be fully loaded at that point. RamaLama does this, FWIW (I implemented it there).
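A minimal sketch of the idea, assuming a Go-based CLI; the port, the /health readiness endpoint, and the function names are illustrative assumptions, not the actual implementation:

package main

import (
	"context"
	"fmt"
	"net/http"
	"os/exec"
	"time"
)

// startVLLM launches "vllm serve" in the background as soon as the CLI
// is invoked, instead of waiting for the first user prompt.
func startVLLM(ctx context.Context, model string) (*exec.Cmd, error) {
	cmd := exec.CommandContext(ctx, "vllm", "serve", model)
	if err := cmd.Start(); err != nil { // Start is non-blocking
		return nil, err
	}
	return cmd, nil
}

// waitUntilReady polls a readiness URL (assumed here to be the server's
// health endpoint) until the server can accept requests.
func waitUntilReady(ctx context.Context, url string) error {
	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			resp, err := http.Get(url)
			if err == nil {
				resp.Body.Close()
				if resp.StatusCode == http.StatusOK {
					return nil
				}
			}
		}
	}
}

func main() {
	ctx := context.Background()
	// Start loading the model immediately; the user types their first
	// prompt while the weights load in parallel.
	if _, err := startVLLM(ctx, "some_model"); err != nil {
		panic(err)
	}
	// Read the user's first prompt here (omitted), then block only for
	// whatever load time remains before sending it.
	if err := waitUntilReady(ctx, "http://127.0.0.1:8000/health"); err != nil {
		panic(err)
	}
	fmt.Println("server is ready; sending first query")
}

The point of the design is simply to overlap model loading with the time the user spends typing, so the perceived wait on the first response shrinks to whatever load time is left over.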