48 changes: 39 additions & 9 deletions docs/en/advanced_guides/accelerator_intro.md
@@ -1,15 +1,17 @@
# Accelerate Evaluation Inference with vLLM or LMDeploy
# Accelerate Evaluation Inference with vLLM, LMDeploy, SGLang, or OpenAI

## Background

During the OpenCompass evaluation process, the Huggingface transformers library is used for inference by default. While this is a very general solution, there are scenarios where more efficient inference methods are needed to speed up the process, such as leveraging VLLM or LMDeploy.
During the OpenCompass evaluation process, the Huggingface transformers library is used for inference by default. While this is a very general solution, there are scenarios where more efficient inference methods are needed to speed up the process, such as leveraging vLLM, LMDeploy, SGLang, or OpenAI-compatible APIs.

- [LMDeploy](https://github.com/InternLM/lmdeploy) is a toolkit designed for compressing, deploying, and serving large language models (LLMs), developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams.
- [vLLM](https://github.com/vllm-project/vllm) is a fast and user-friendly library for LLM inference and serving, featuring advanced serving throughput, efficient PagedAttention memory management, continuous batching of requests, fast model execution via CUDA/HIP graphs, quantization techniques (e.g., GPTQ, AWQ, SqueezeLLM, FP8 KV Cache), and optimized CUDA kernels.
- [SGLang](https://github.com/sgl-project/sglang) is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.
- **OpenAI-compatible APIs** let you run inference against any endpoint that implements the OpenAI API, whether official OpenAI models or self-hosted models exposing the same interface.

## Preparation for Acceleration

First, check whether the model you want to evaluate supports inference acceleration using vLLM or LMDeploy. Additionally, ensure you have installed vLLM or LMDeploy as per their official documentation. Below are the installation methods for reference:
First, check whether the model you want to evaluate supports inference acceleration using vLLM, LMDeploy, SGLang, or OpenAI-compatible APIs. Additionally, ensure you have installed the required backend following its official documentation. Below are the installation methods for reference:

### LMDeploy Installation Method

@@ -27,11 +29,27 @@
### vLLM Installation Method

Install vLLM using pip or from [source](https://vllm.readthedocs.io/en/latest/ge):

```bash
pip install vllm
```

## Accelerated Evaluation Using VLLM or LMDeploy
### SGLang Installation Method

Install SGLang using pip or from [source](https://github.com/sgl-project/sglang):

```bash
pip install sglang
```

### OpenAI API Setup

For OpenAI-compatible APIs, you only need to install the openai package:

```bash
pip install openai
```
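
The openai client reads its credentials from the environment. Below is a minimal setup sketch, assuming the standard `OPENAI_API_KEY` and `OPENAI_BASE_URL` variable names used by recent openai-python releases; a self-hosted backend may expect different ones:

```bash
# Assumed variable names; adjust to your endpoint.
export OPENAI_API_KEY="sk-your-key"
export OPENAI_BASE_URL="https://api.openai.com/v1"  # or a self-hosted URL
```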

## Accelerated Evaluation Using vLLM, LMDeploy, SGLang, or OpenAI

### Method 1: Using Command Line Parameters to Change the Inference Backend

OpenCompass offers one-click evaluation acceleration. During evaluation, it can automatically convert Huggingface transformer models to VLLM or LMDeploy models for use. Below is an example code for evaluating the GSM8k dataset using the default Huggingface version of the llama3-8b-instruct model:
OpenCompass offers one-click evaluation acceleration. During evaluation, it can automatically convert Huggingface transformers models to vLLM, LMDeploy, SGLang, or OpenAI models for use. Below is example code for evaluating the GSM8k dataset using the default Huggingface version of the llama3-8b-instruct model:

```python
# eval_gsm8k.py
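# NOTE: the full config is collapsed in this diff view. The lines below are a
# representative sketch rather than the exact file: dataset import paths and
# model fields vary across OpenCompass versions, so check the configs shipped
# with your installation.
from mmengine.config import read_base
from opencompass.models import HuggingFacewithChatTemplate

with read_base():
    # assumed GSM8k dataset config; the concrete filename may differ
    from .datasets.gsm8k.gsm8k_gen import gsm8k_datasets as datasets

models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr='llama-3-8b-instruct-hf',
        path='meta-llama/Meta-Llama-3-8B-Instruct',
        max_out_len=1024,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    )
]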
```

@@ -68,21 +86,33 @@
To evaluate the GSM8k dataset using the default Huggingface version of the llama3-8b-instruct model, run:

```bash
python run.py config/eval_gsm8k.py
```

To accelerate the evaluation using vLLM or LMDeploy, you can use the following script:
To accelerate the evaluation using vLLM, LMDeploy, SGLang, or OpenAI, you can use the following script:

**Using vLLM:**
```bash
python run.py config/eval_gsm8k.py -a vllm
```

**Using LMDeploy:**
```bash
python run.py config/eval_gsm8k.py -a lmdeploy
```

**Using SGLang:**
```bash
python run.py config/eval_gsm8k.py -a sglang
```

**Using OpenAI API:**
```bash
python run.py config/eval_gsm8k.py -a openai
```

Note: For the OpenAI backend, you may need to configure additional parameters in your model config, such as `openai_api_base` and `api_key`.
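
For illustration, an OpenAI-style model entry might look like the sketch below. The field names follow OpenCompass's `OpenAI` wrapper but may differ across versions, and the endpoint URL and key are placeholders:

```python
# Hypothetical model entry for an OpenAI-compatible endpoint; verify the
# field names against the OpenAI wrapper shipped with your OpenCompass version.
from opencompass.models import OpenAI

models = [
    dict(
        type=OpenAI,
        abbr='my-openai-model',
        path='gpt-4o-mini',  # model name exposed by the endpoint
        openai_api_base='https://api.openai.com/v1/chat/completions',
        key='YOUR_API_KEY',  # or read from the OPENAI_API_KEY environment variable
        query_per_second=4,
        max_out_len=1024,
        batch_size=8,
    )
]
```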

### Method 2: Accelerating Evaluation via a Deployed Inference Acceleration Service API

OpenCompass also supports accelerating evaluation by deploying vLLM or LMDeploy inference acceleration service APIs. Follow these steps:
OpenCompass also supports accelerating evaluation by deploying vLLM, LMDeploy, or SGLang inference acceleration service APIs. Follow these steps:

1. Install the openai package:

```bash
pip install openai
```

2. Deploy the vLLM, LMDeploy, or SGLang inference acceleration service API; for deployment details, refer to their official documentation. LMDeploy is used as an example below:

```bash
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct --model-name Meta-Llama-3-8B-Instruct --server-port 23333
```
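
Once the server is running, it can be sanity-checked with the openai client before launching the evaluation. A minimal sketch, assuming the server address and model name from the command above:

```python
from openai import OpenAI

# Local LMDeploy server from the previous step; it does not validate the key.
client = OpenAI(base_url='http://localhost:23333/v1', api_key='dummy')

resp = client.chat.completions.create(
    model='Meta-Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'What is 12 * 9?'}],
)
print(resp.choices[0].message.content)
```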
50 changes: 40 additions & 10 deletions docs/zh_cn/advanced_guides/accelerator_intro.md
@@ -1,15 +1,17 @@
# One-Click Evaluation Inference Acceleration with vLLM or LMDeploy
# One-Click Evaluation Inference Acceleration with vLLM, LMDeploy, SGLang, or OpenAI

## Background

During the OpenCompass evaluation process, the Huggingface transformers library is used for inference by default. While this is a very general solution, there are scenarios where more efficient inference methods are needed to speed up the process, such as leveraging VLLM or LMDeploy.
During the OpenCompass evaluation process, the Huggingface transformers library is used for inference by default. While this is a very general solution, there are scenarios where more efficient inference methods are needed to speed up the process, such as leveraging vLLM, LMDeploy, SGLang, or OpenAI-compatible APIs.

- [LMDeploy](https://github.com/InternLM/lmdeploy) is a toolkit designed for compressing, deploying, and serving large language models (LLMs), developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams.
- [vLLM](https://github.com/vllm-project/vllm) is a fast and user-friendly library for LLM inference and serving, featuring advanced serving throughput, efficient PagedAttention memory management, continuous batching of requests, fast model execution via CUDA/HIP graphs, quantization techniques (e.g., GPTQ, AWQ, SqueezeLLM, FP8 KV Cache), and optimized CUDA kernels.
- [SGLang](https://github.com/sgl-project/sglang) is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.
- **OpenAI-compatible APIs** let you run inference against any endpoint that implements the OpenAI API, whether official OpenAI models or self-hosted models exposing the same interface.

## Preparation for Acceleration

First, check whether the model you want to evaluate supports inference acceleration using vLLM or LMDeploy. Additionally, ensure you have installed vLLM or LMDeploy as per their official documentation. Below are the installation methods for reference:
First, check whether the model you want to evaluate supports inference acceleration using vLLM, LMDeploy, SGLang, or OpenAI-compatible APIs. Additionally, ensure you have installed the required backend following its official documentation. Below are the installation methods for reference:

### LMDeploy Installation Method

@@ -27,11 +29,27 @@
```bash
pip install lmdeploy
```

### vLLM Installation Method

Install vLLM using pip or from source:

```bash
pip install vllm
```

## Using VLLM or LMDeploy During Evaluation
### SGLang Installation Method

Install SGLang using pip or from [source](https://github.com/sgl-project/sglang):

```bash
pip install sglang
```

### OpenAI API Setup

For OpenAI-compatible APIs, you only need to install the openai package:

```bash
pip install openai
```

## Using vLLM, LMDeploy, SGLang, or OpenAI During Evaluation

### Method 1: Using Command Line Parameters to Change the Inference Backend

OpenCompass provides one-click evaluation acceleration: during evaluation, it can automatically convert Huggingface transformers models into VLLM or LMDeploy models for use. Below is sample code for evaluating the GSM8k dataset with the default Huggingface version of the llama3-8b-instruct model:
OpenCompass provides one-click evaluation acceleration: during evaluation, it can automatically convert Huggingface transformers models into vLLM, LMDeploy, SGLang, or OpenAI models for use. Below is sample code for evaluating the GSM8k dataset with the default Huggingface version of the llama3-8b-instruct model:

```python
# eval_gsm8k.py
# (full config collapsed in this diff view; it mirrors the example in the English file above)
```

@@ -68,29 +86,41 @@
To evaluate the GSM8k dataset with the default Huggingface version of the llama3-8b-instruct model, run:

```bash
python run.py config/eval_gsm8k.py
```

To accelerate the evaluation with vLLM or LMDeploy, use the following script:
To accelerate the evaluation with vLLM, LMDeploy, SGLang, or OpenAI, use the following script:

**Using vLLM:**
```bash
python run.py config/eval_gsm8k.py -a vllm
```


**Using LMDeploy:**
```bash
python run.py config/eval_gsm8k.py -a lmdeploy
```

**Using SGLang:**
```bash
python run.py config/eval_gsm8k.py -a sglang
```

**Using OpenAI API:**
```bash
python run.py config/eval_gsm8k.py -a openai
```

Note: For the OpenAI backend, you may need to configure additional parameters in your model config, such as `openai_api_base` and `api_key`.

### Method 2: Accelerating Evaluation via a Deployed Inference Acceleration Service API

OpenCompass also supports accelerating evaluation by deploying a vLLM or LMDeploy inference acceleration service API. The reference steps are as follows:
OpenCompass also supports accelerating evaluation by deploying a vLLM, LMDeploy, or SGLang inference acceleration service API. The reference steps are as follows:

1. Install the openai package:

```bash
pip install openai
```

2. Deploy the vLLM or LMDeploy inference acceleration service API; for deployment details, refer to their official documentation. LMDeploy is used as an example below:
2. Deploy the vLLM, LMDeploy, or SGLang inference acceleration service API; for deployment details, refer to their official documentation. LMDeploy is used as an example below:

```bash
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct --model-name Meta-Llama-3-8B-Instruct --server-port 23333
```
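
Before pointing OpenCompass at the server, the OpenAI-compatible endpoint can be verified directly. A quick check, assuming the port and model name from the command above:

```bash
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'
```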
4 changes: 2 additions & 2 deletions opencompass/cli/main.py
@@ -56,8 +56,8 @@ def parse_args():
default=False)
parser.add_argument(
'-a', '--accelerator',
help='Infer accelerator, support vllm and lmdeploy now.',
choices=['vllm', 'lmdeploy', None],
help='Infer accelerator backend. Supports: vllm, lmdeploy, sglang, openai.',
choices=['vllm', 'lmdeploy', 'sglang', 'openai', None],
default=None,
type=str)
parser.add_argument('-m',