Some inference engines are faster to start than others; a typical example of one that is slow to start is vLLM. That is fine for data centres: they are happy to start a service, warm it up properly, and leave it running basically forever, so being optimal after startup is the focus. Today, when we do:
docker model run some_model
we only start the vLLM server once the user has finished typing their first query and pressed enter to send it. This is slow.
What we should do instead: when the user runs
docker model run some_model
we should start the "vllm serve" process immediately in the background. By the time the user sends the first query they will have to wait less; the model may even be fully loaded at that point. RamaLama does this, FWIW (I implemented it there).
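A minimal sketch of the idea, assuming a Go-based CLI; the port, the /health readiness endpoint, and the function names are illustrative assumptions, not the actual implementation:

package main

import (
	"context"
	"fmt"
	"net/http"
	"os/exec"
	"time"
)

// startVLLM launches "vllm serve" in the background as soon as the CLI
// is invoked, instead of waiting for the first user prompt.
func startVLLM(ctx context.Context, model string) (*exec.Cmd, error) {
	cmd := exec.CommandContext(ctx, "vllm", "serve", model)
	if err := cmd.Start(); err != nil { // Start is non-blocking
		return nil, err
	}
	return cmd, nil
}

// waitUntilReady polls a readiness URL (assumed here to be the server's
// health endpoint) until the server can accept requests.
func waitUntilReady(ctx context.Context, url string) error {
	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			resp, err := http.Get(url)
			if err == nil {
				resp.Body.Close()
				if resp.StatusCode == http.StatusOK {
					return nil
				}
			}
		}
	}
}

func main() {
	ctx := context.Background()
	// Start loading the model immediately; the user types their first
	// prompt while the weights load in parallel.
	if _, err := startVLLM(ctx, "some_model"); err != nil {
		panic(err)
	}
	// Read the user's first prompt here (omitted), then block only for
	// whatever load time remains before sending it.
	if err := waitUntilReady(ctx, "http://127.0.0.1:8000/health"); err != nil {
		panic(err)
	}
	fmt.Println("server is ready; sending first query")
}

The point of the design is simply to overlap model loading with the time the user spends typing, so the perceived wait on the first response shrinks to whatever load time is left over.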