48 changes: 39 additions & 9 deletions docs/en/advanced_guides/accelerator_intro.md
@@ -1,15 +1,17 @@
# Accelerate Evaluation Inference with vLLM or LMDeploy
# Accelerate Evaluation Inference with vLLM, LMDeploy, SGLang, or OpenAI

## Background

During the OpenCompass evaluation process, the Huggingface transformers library is used for inference by default. While this is a very general solution, there are scenarios where more efficient inference methods are needed to speed up the process, such as leveraging VLLM or LMDeploy.
During the OpenCompass evaluation process, the Huggingface transformers library is used for inference by default. While this is a very general solution, there are scenarios where more efficient inference methods are needed to speed up the process, such as leveraging vLLM, LMDeploy, SGLang, or OpenAI-compatible APIs.

- [LMDeploy](https://github.com/InternLM/lmdeploy) is a toolkit designed for compressing, deploying, and serving large language models (LLMs), developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams.
- [vLLM](https://github.com/vllm-project/vllm) is a fast and user-friendly library for LLM inference and serving, featuring advanced serving throughput, efficient PagedAttention memory management, continuous batching of requests, fast model execution via CUDA/HIP graphs, quantization techniques (e.g., GPTQ, AWQ, SqueezeLLM, FP8 KV Cache), and optimized CUDA kernels.
- [SGLang](https://github.com/sgl-project/sglang) is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.
- **OpenAI-compatible APIs** let you run inference against any endpoint that implements the OpenAI API, whether official OpenAI models or self-hosted models exposing the same interface.

## Preparation for Acceleration

First, check whether the model you want to evaluate supports inference acceleration using vLLM or LMDeploy. Additionally, ensure you have installed vLLM or LMDeploy as per their official documentation. Below are the installation methods for reference:
First, check whether the model you want to evaluate supports inference acceleration using vLLM, LMDeploy, SGLang, or OpenAI-compatible APIs. Additionally, ensure you have installed the required backend following its official documentation. Below are the installation methods for reference:

### LMDeploy Installation Method

@@ -27,11 +29,27 @@
### vLLM Installation Method

Install vLLM using pip or from [source](https://vllm.readthedocs.io/en/latest/ge):

```bash
pip install vllm
```

## Accelerated Evaluation Using VLLM or LMDeploy
### SGLang Installation Method

Install SGLang using pip or from [source](https://github.com/sgl-project/sglang):

```bash
pip install sglang
```

### OpenAI API Setup

For OpenAI-compatible APIs, you only need to install the openai package:

```bash
pip install openai
```
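
The openai client reads its credentials from the environment. Below is a minimal setup sketch, assuming the standard `OPENAI_API_KEY` and `OPENAI_BASE_URL` variable names used by recent openai-python releases; a self-hosted backend may expect different ones:

```bash
# Assumed variable names; adjust to your endpoint.
export OPENAI_API_KEY="sk-your-key"
export OPENAI_BASE_URL="https://api.openai.com/v1"  # or a self-hosted URL
```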

## Accelerated Evaluation Using vLLM, LMDeploy, SGLang, or OpenAI

### Method 1: Using Command Line Parameters to Change the Inference Backend

OpenCompass offers one-click evaluation acceleration. During evaluation, it can automatically convert Huggingface transformer models to VLLM or LMDeploy models for use. Below is an example code for evaluating the GSM8k dataset using the default Huggingface version of the llama3-8b-instruct model:
OpenCompass offers one-click evaluation acceleration. During evaluation, it can automatically convert Huggingface transformers models to vLLM, LMDeploy, SGLang, or OpenAI models for use. Below is example code for evaluating the GSM8k dataset using the default Huggingface version of the llama3-8b-instruct model:

```python
# eval_gsm8k.py
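# NOTE: the full config is collapsed in this diff view. The lines below are a
# representative sketch rather than the exact file: dataset import paths and
# model fields vary across OpenCompass versions, so check the configs shipped
# with your installation.
from mmengine.config import read_base
from opencompass.models import HuggingFacewithChatTemplate

with read_base():
    # assumed GSM8k dataset config; the concrete filename may differ
    from .datasets.gsm8k.gsm8k_gen import gsm8k_datasets as datasets

models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr='llama-3-8b-instruct-hf',
        path='meta-llama/Meta-Llama-3-8B-Instruct',
        max_out_len=1024,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    )
]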
```

@@ -68,21 +86,33 @@
To evaluate the GSM8k dataset using the default Huggingface version of the llama3-8b-instruct model, run:

```bash
python run.py config/eval_gsm8k.py
```

To accelerate the evaluation using vLLM or LMDeploy, you can use the following script:
To accelerate the evaluation using vLLM, LMDeploy, SGLang, or OpenAI, you can use the following script:

**Using vLLM:**
```bash
python run.py config/eval_gsm8k.py -a vllm
```

**Using LMDeploy:**
```bash
python run.py config/eval_gsm8k.py -a lmdeploy
```

**Using SGLang:**
```bash
python run.py config/eval_gsm8k.py -a sglang
```

**Using OpenAI API:**
```bash
python run.py config/eval_gsm8k.py -a openai
```

Note: For the OpenAI backend, you may need to configure additional parameters in your model config, such as `openai_api_base` and `api_key`.
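
For illustration, an OpenAI-style model entry might look like the sketch below. The field names follow OpenCompass's `OpenAI` wrapper but may differ across versions, and the endpoint URL and key are placeholders:

```python
# Hypothetical model entry for an OpenAI-compatible endpoint; verify the
# field names against the OpenAI wrapper shipped with your OpenCompass version.
from opencompass.models import OpenAI

models = [
    dict(
        type=OpenAI,
        abbr='my-openai-model',
        path='gpt-4o-mini',  # model name exposed by the endpoint
        openai_api_base='https://api.openai.com/v1/chat/completions',
        key='YOUR_API_KEY',  # or read from the OPENAI_API_KEY environment variable
        query_per_second=4,
        max_out_len=1024,
        batch_size=8,
    )
]
```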

### Method 2: Accelerating Evaluation via a Deployed Inference Acceleration Service API

OpenCompass also supports accelerating evaluation by deploying vLLM or LMDeploy inference acceleration service APIs. Follow these steps:
OpenCompass also supports accelerating evaluation by deploying vLLM, LMDeploy, or SGLang inference acceleration service APIs. Follow these steps:

1. Install the openai package:

```bash
pip install openai
```

2. Deploy the vLLM, LMDeploy, or SGLang inference acceleration service API; for deployment details, refer to their official documentation. LMDeploy is used as an example below:

```bash
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct --model-name Meta-Llama-3-8B-Instruct --server-port 23333
```
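
Once the server is running, it can be sanity-checked with the openai client before launching the evaluation. A minimal sketch, assuming the server address and model name from the command above:

```python
from openai import OpenAI

# Local LMDeploy server from the previous step; it does not validate the key.
client = OpenAI(base_url='http://localhost:23333/v1', api_key='dummy')

resp = client.chat.completions.create(
    model='Meta-Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'What is 12 * 9?'}],
)
print(resp.choices[0].message.content)
```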
50 changes: 40 additions & 10 deletions docs/zh_cn/advanced_guides/accelerator_intro.md
@@ -1,15 +1,17 @@
# One-Click Evaluation Inference Acceleration with vLLM or LMDeploy
# One-Click Evaluation Inference Acceleration with vLLM, LMDeploy, SGLang, or OpenAI

## Background

During the OpenCompass evaluation process, the Huggingface transformers library is used for inference by default. While this is a very general solution, there are scenarios where more efficient inference methods are needed to speed up the process, such as leveraging VLLM or LMDeploy.
During the OpenCompass evaluation process, the Huggingface transformers library is used for inference by default. While this is a very general solution, there are scenarios where more efficient inference methods are needed to speed up the process, such as leveraging vLLM, LMDeploy, SGLang, or OpenAI-compatible APIs.

- [LMDeploy](https://github.com/InternLM/lmdeploy) is a toolkit designed for compressing, deploying, and serving large language models (LLMs), developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams.
- [vLLM](https://github.com/vllm-project/vllm) is a fast and user-friendly library for LLM inference and serving, featuring advanced serving throughput, efficient PagedAttention memory management, continuous batching of requests, fast model execution via CUDA/HIP graphs, quantization techniques (e.g., GPTQ, AWQ, SqueezeLLM, FP8 KV Cache), and optimized CUDA kernels.
- [SGLang](https://github.com/sgl-project/sglang) is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.
- **OpenAI-compatible APIs** let you run inference against any endpoint that implements the OpenAI API, whether official OpenAI models or self-hosted models exposing the same interface.

## Preparation for Acceleration

First, check whether the model you want to evaluate supports inference acceleration using vLLM or LMDeploy. Additionally, ensure you have installed vLLM or LMDeploy as per their official documentation. Below are the installation methods for reference:
First, check whether the model you want to evaluate supports inference acceleration using vLLM, LMDeploy, SGLang, or OpenAI-compatible APIs. Additionally, ensure you have installed the required backend following its official documentation. Below are the installation methods for reference:

### LMDeploy Installation Method

@@ -27,11 +29,27 @@
```bash
pip install lmdeploy
```

### vLLM Installation Method

Install vLLM using pip or from source:

```bash
pip install vllm
```

## Using VLLM or LMDeploy During Evaluation
### SGLang Installation Method

Install SGLang using pip or from [source](https://github.com/sgl-project/sglang):

```bash
pip install sglang
```

### OpenAI API Setup

For OpenAI-compatible APIs, you only need to install the openai package:

```bash
pip install openai
```

## Using vLLM, LMDeploy, SGLang, or OpenAI During Evaluation

### Method 1: Using Command Line Parameters to Change the Inference Backend

OpenCompass provides one-click evaluation acceleration: during evaluation, it can automatically convert Huggingface transformers models into VLLM or LMDeploy models for use. Below is sample code for evaluating the GSM8k dataset with the default Huggingface version of the llama3-8b-instruct model:
OpenCompass provides one-click evaluation acceleration: during evaluation, it can automatically convert Huggingface transformers models into vLLM, LMDeploy, SGLang, or OpenAI models for use. Below is sample code for evaluating the GSM8k dataset with the default Huggingface version of the llama3-8b-instruct model:

```python
# eval_gsm8k.py
# (full config collapsed in this diff view; it mirrors the example in the English file above)
```

@@ -68,29 +86,41 @@
To evaluate the GSM8k dataset with the default Huggingface version of the llama3-8b-instruct model, run:

```bash
python run.py config/eval_gsm8k.py
```

To accelerate the evaluation with vLLM or LMDeploy, use the following script:
To accelerate the evaluation with vLLM, LMDeploy, SGLang, or OpenAI, use the following script:

**Using vLLM:**
```bash
python run.py config/eval_gsm8k.py -a vllm
```


**Using LMDeploy:**
```bash
python run.py config/eval_gsm8k.py -a lmdeploy
```

**Using SGLang:**
```bash
python run.py config/eval_gsm8k.py -a sglang
```

**Using OpenAI API:**
```bash
python run.py config/eval_gsm8k.py -a openai
```

Note: For the OpenAI backend, you may need to configure additional parameters in your model config, such as `openai_api_base` and `api_key`.

### Method 2: Accelerating Evaluation via a Deployed Inference Acceleration Service API

OpenCompass also supports accelerating evaluation by deploying a vLLM or LMDeploy inference acceleration service API. The reference steps are as follows:
OpenCompass also supports accelerating evaluation by deploying a vLLM, LMDeploy, or SGLang inference acceleration service API. The reference steps are as follows:

1. Install the openai package:

```bash
pip install openai
```

2. Deploy the vLLM or LMDeploy inference acceleration service API; for deployment details, refer to their official documentation. LMDeploy is used as an example below:
2. Deploy the vLLM, LMDeploy, or SGLang inference acceleration service API; for deployment details, refer to their official documentation. LMDeploy is used as an example below:

```bash
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct --model-name Meta-Llama-3-8B-Instruct --server-port 23333
```
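
Before pointing OpenCompass at the server, the OpenAI-compatible endpoint can be verified directly. A quick check, assuming the port and model name from the command above:

```bash
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'
```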
4 changes: 2 additions & 2 deletions opencompass/cli/main.py
@@ -56,8 +56,8 @@ def parse_args():
default=False)
parser.add_argument(
'-a', '--accelerator',
help='Infer accelerator, support vllm and lmdeploy now.',
choices=['vllm', 'lmdeploy', None],
help='Infer accelerator backend. Supports: vllm, lmdeploy, sglang, openai.',
choices=['vllm', 'lmdeploy', 'sglang', 'openai', None],
default=None,
type=str)
parser.add_argument('-m',