Merged
12 changes: 1 addition & 11 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,7 @@ ______________________________________________________________________
- \[2024/09\] LMDeploy PyTorchEngine achieves 1.3x faster on Llama3-8B inference by introducing CUDA graph
- \[2024/08\] LMDeploy is integrated into [modelscope/swift](https://github.com/modelscope/swift) as the default accelerator for VLMs inference
- \[2024/07\] Support Llama3.1 8B, 70B and its TOOLS CALLING
- \[2024/07\] Support [InternVL2](docs/en/multi_modal/internvl.md) full-series models, [InternLM-XComposer2.5](docs/en/multi_modal/xcomposer2d5.md) and [function call](docs/en/llm/api_server_tools.md) of InternLM2.5
- \[2024/07\] Support [InternVL2](docs/en/multi_modal/internvl.md) full-series models, InternLM-XComposer2.5 and [function call](docs/en/llm/api_server_tools.md) of InternLM2.5
- \[2024/06\] PyTorch engine support DeepSeek-V2 and several VLMs, such as CogVLM2, Mini-InternVL, LlaVA-Next
- \[2024/05\] Balance vision model when deploying VLMs with multiple GPUs
- \[2024/05\] Support 4-bits weight-only quantization and inference on VLMs, such as InternVL v1.5, LLaVa, InternLMXComposer2
Expand Down Expand Up @@ -128,20 +128,16 @@ LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by
<li>Llama3 (8B, 70B)</li>
<li>Llama3.1 (8B, 70B)</li>
<li>Llama3.2 (1B, 3B)</li>
<li>InternLM (7B - 20B)</li>
<li>InternLM2 (7B - 20B)</li>
<li>InternLM3 (8B)</li>
<li>InternLM2.5 (7B)</li>
<li>Qwen (1.8B - 72B)</li>
<li>Qwen1.5 (0.5B - 110B)</li>
<li>Qwen1.5 - MoE (0.5B - 72B)</li>
<li>Qwen2 (0.5B - 72B)</li>
<li>Qwen2-MoE (57BA14B)</li>
<li>Qwen2.5 (0.5B - 32B)</li>
<li>Qwen3, Qwen3-MoE</li>
<li>Qwen3-Next(80B)</li>
<li>Baichuan (7B)</li>
<li>Baichuan2 (7B-13B)</li>
<li>Code Llama (7B - 34B)</li>
<li>ChatGLM2 (6B)</li>
<li>GLM-4 (9B)</li>
Expand All @@ -156,7 +152,6 @@ LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by
<li>DeepSeek-V3.2 (685B)</li>
<li>Mixtral (8x7B, 8x22B)</li>
<li>Gemma (2B - 7B)</li>
<li>StarCoder2 (3B - 15B)</li>
<li>Phi-3-mini (3.8B)</li>
<li>Phi-3.5-mini (3.8B)</li>
<li>Phi-3.5-MoE (16x3.8B)</li>
Expand All @@ -171,9 +166,6 @@ LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by
<td>
<ul>
<li>LLaVA(1.5,1.6) (7B-34B)</li>
<li>InternLM-XComposer2 (7B, 4khd-7B)</li>
<li>InternLM-XComposer2.5 (7B)</li>
<li>Qwen-VL (7B)</li>
<li>Qwen2-VL (2B, 7B, 72B)</li>
<li>Qwen2.5-VL (3B, 7B, 72B)</li>
<li>Qwen3-VL (2B - 235B)</li>
Expand All @@ -190,7 +182,6 @@ LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by
<li>Intern-S1-mini (8.3B)</li>
<li>Intern-S1-Pro (1TB)</li>
<li>Intern-S2-Preview (35B-A3B)</li>
<li>Mono-InternVL (2B)</li>
<li>ChemVLM (8B-26B)</li>
<li>CogVLM-Chat (17B)</li>
<li>CogVLM2-Chat (19B)</li>
Expand All @@ -200,7 +191,6 @@ LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by
<li>Phi-3.5-vision (4.2B)</li>
<li>GLM-4V (9B)</li>
<li>GLM-4.1V-Thinking (9B)</li>
<li>Llama3.2-vision (11B, 90B)</li>
<li>Molmo (7B-D,72B)</li>
<li>Gemma3 (1B - 27B)</li>
<li>Llama4 (Scout, Maverick)</li>
Expand Down
12 changes: 1 addition & 11 deletions README_ja.md
Expand Up @@ -37,7 +37,7 @@ ______________________________________________________________________

- \[2024/08\] 🔥🔥 LMDeployは[modelscope/swift](https://github.com/modelscope/swift)に統合され、VLMs推論のデフォルトアクセラレータとなりました
- \[2024/07\] 🎉🎉 Llama3.1 8B、70Bおよびそのツールコールをサポート
- \[2024/07\] [InternVL2](https://huggingface.co/collections/OpenGVLab/internvl-20-667d3961ab5eb12c7ed1463e)全シリーズモデル、[InternLM-XComposer2.5](docs/en/multi_modal/xcomposer2d5.md)およびInternLM2.5の[ファンクションコール](docs/en/llm/api_server_tools.md)をサポート
- \[2024/07\] [InternVL2](https://huggingface.co/collections/OpenGVLab/internvl-20-667d3961ab5eb12c7ed1463e)全シリーズモデル、InternLM-XComposer2.5およびInternLM2.5の[ファンクションコール](docs/en/llm/api_server_tools.md)をサポート
- \[2024/06\] PyTorchエンジンはDeepSeek-V2およびいくつかのVLMs、例えばCogVLM2、Mini-InternVL、LlaVA-Nextをサポート
- \[2024/05\] 複数のGPUでVLMsをデプロイする際にビジョンモデルをバランスさせる
- \[2024/05\] InternVL v1.5、LLaVa、InternLMXComposer2などのVLMsで4ビットの重みのみの量子化と推論をサポート
Expand Down Expand Up @@ -115,20 +115,16 @@ LMDeploy TurboMindエンジンは卓越した推論能力を持ち、さまざ
<li>Llama3 (8B, 70B)</li>
<li>Llama3.1 (8B, 70B)</li>
<li>Llama3.2 (1B, 3B)</li>
<li>InternLM (7B - 20B)</li>
<li>InternLM2 (7B - 20B)</li>
<li>InternLM3 (8B)</li>
<li>InternLM2.5 (7B)</li>
<li>Qwen (1.8B - 72B)</li>
<li>Qwen1.5 (0.5B - 110B)</li>
<li>Qwen1.5 - MoE (0.5B - 72B)</li>
<li>Qwen2 (0.5B - 72B)</li>
<li>Qwen2-MoE (57BA14B)</li>
<li>Qwen2.5 (0.5B - 32B)</li>
<li>Qwen3, Qwen3-MoE</li>
<li>Qwen3-Next(80B)</li>
<li>Baichuan (7B)</li>
<li>Baichuan2 (7B-13B)</li>
<li>Code Llama (7B - 34B)</li>
<li>ChatGLM2 (6B)</li>
<li>GLM-4 (9B)</li>
Expand All @@ -143,7 +139,6 @@ LMDeploy TurboMindエンジンは卓越した推論能力を持ち、さまざ
<li>DeepSeek-V3.2 (685B)</li>
<li>Mixtral (8x7B, 8x22B)</li>
<li>Gemma (2B - 7B)</li>
<li>StarCoder2 (3B - 15B)</li>
<li>Phi-3-mini (3.8B)</li>
<li>Phi-3.5-mini (3.8B)</li>
<li>Phi-3.5-MoE (16x3.8B)</li>
Expand All @@ -158,9 +153,6 @@ LMDeploy TurboMindエンジンは卓越した推論能力を持ち、さまざ
<td>
<ul>
<li>LLaVA(1.5,1.6) (7B-34B)</li>
<li>InternLM-XComposer2 (7B, 4khd-7B)</li>
<li>InternLM-XComposer2.5 (7B)</li>
<li>Qwen-VL (7B)</li>
<li>Qwen2-VL (2B, 7B, 72B)</li>
<li>Qwen2.5-VL (3B, 7B, 72B)</li>
<li>Qwen3-VL (2B - 235B)</li>
Expand All @@ -174,7 +166,6 @@ LMDeploy TurboMindエンジンは卓越した推論能力を持ち、さまざ
<li>InternVL3.5 (1B-241BA28B)</li>
<li>Intern-S1 (241B)</li>
<li>Intern-S1-mini (8.3B)</li>
<li>Mono-InternVL (2B)</li>
<li>ChemVLM (8B-26B)</li>
<li>CogVLM-Chat (17B)</li>
<li>CogVLM2-Chat (19B)</li>
Expand All @@ -184,7 +175,6 @@ LMDeploy TurboMindエンジンは卓越した推論能力を持ち、さまざ
<li>Phi-3.5-vision (4.2B)</li>
<li>GLM-4V (9B)</li>
<li>GLM-4.1V-Thinking (9B)</li>
<li>Llama3.2-vision (11B, 90B)</li>
<li>Molmo (7B-D,72B)</li>
<li>Gemma3 (1B - 27B)</li>
<li>Llama4 (Scout, Maverick)</li>
Expand Down
12 changes: 1 addition & 11 deletions README_zh-CN.md
Expand Up @@ -52,7 +52,7 @@ ______________________________________________________________________
- \[2024/09\] 通过引入 CUDA Graph,LMDeploy PyTorchEngine 在 Llama3-8B 推理上实现了 1.3 倍的加速
- \[2024/08\] LMDeploy现已集成至 [modelscope/swift](https://github.com/modelscope/swift),成为 VLMs 推理的默认加速引擎
- \[2024/07\] 支持 Llama3.1 8B 和 70B 模型,以及工具调用功能
- \[2024/07\] 支持 [InternVL2](docs/zh_cn/multi_modal/internvl.md) 全系列模型,[InternLM-XComposer2.5](docs/zh_cn/multi_modal/xcomposer2d5.md) 模型和 InternLM2.5 的 [function call 功能](docs/zh_cn/llm/api_server_tools.md)
- \[2024/07\] 支持 [InternVL2](docs/zh_cn/multi_modal/internvl.md) 全系列模型,InternLM-XComposer2.5 模型和 InternLM2.5 的 [function call 功能](docs/zh_cn/llm/api_server_tools.md)
- \[2024/06\] PyTorch engine 支持了 DeepSeek-V2 和若干 VLM 模型推理, 比如 CogVLM2,Mini-InternVL,LlaVA-Next
- \[2024/05\] 在多 GPU 上部署 VLM 模型时,支持把视觉部分的模型均分到多卡上
- \[2024/05\] 支持InternVL v1.5, LLaVa, InternLMXComposer2 等 VLMs 模型的 4bit 权重量化和推理
Expand Down Expand Up @@ -130,20 +130,16 @@ LMDeploy TurboMind 引擎拥有卓越的推理能力,在各种规模的模型
<li>Llama3 (8B, 70B)</li>
<li>Llama3.1 (8B, 70B)</li>
<li>Llama3.2 (1B, 3B)</li>
<li>InternLM (7B - 20B)</li>
<li>InternLM2 (7B - 20B)</li>
<li>InternLM3 (8B)</li>
<li>InternLM2.5 (7B)</li>
<li>Qwen (1.8B - 72B)</li>
<li>Qwen1.5 (0.5B - 110B)</li>
<li>Qwen1.5 - MoE (0.5B - 72B)</li>
<li>Qwen2 (0.5B - 72B)</li>
<li>Qwen2-MoE (57BA14B)</li>
<li>Qwen2.5 (0.5B - 32B)</li>
<li>Qwen3, Qwen3-MoE</li>
<li>Qwen3-Next(80B)</li>
<li>Baichuan (7B)</li>
<li>Baichuan2 (7B-13B)</li>
<li>Code Llama (7B - 34B)</li>
<li>ChatGLM2 (6B)</li>
<li>GLM-4 (9B)</li>
Expand All @@ -158,7 +154,6 @@ LMDeploy TurboMind 引擎拥有卓越的推理能力,在各种规模的模型
<li>DeepSeek-V3.2 (685B)</li>
<li>Mixtral (8x7B, 8x22B)</li>
<li>Gemma (2B - 7B)</li>
<li>StarCoder2 (3B - 15B)</li>
<li>Phi-3-mini (3.8B)</li>
<li>Phi-3.5-mini (3.8B)</li>
<li>Phi-3.5-MoE (16x3.8B)</li>
Expand All @@ -173,9 +168,6 @@ LMDeploy TurboMind 引擎拥有卓越的推理能力,在各种规模的模型
<td>
<ul>
<li>LLaVA(1.5,1.6) (7B-34B)</li>
<li>InternLM-XComposer2 (7B, 4khd-7B)</li>
<li>InternLM-XComposer2.5 (7B)</li>
<li>Qwen-VL (7B)</li>
<li>Qwen2-VL (2B, 7B, 72B)</li>
<li>Qwen2.5-VL (3B, 7B, 72B)</li>
<li>Qwen3-VL (2B - 235B)</li>
Expand All @@ -192,7 +184,6 @@ LMDeploy TurboMind 引擎拥有卓越的推理能力,在各种规模的模型
<li>Intern-S1-mini (8.3B)</li>
<li>Intern-S1-Pro (1TB)</li>
<li>Intern-S2-Preview (35B-A3B)</li>
<li>Mono-InternVL (2B)</li>
<li>ChemVLM (8B-26B)</li>
<li>CogVLM-Chat (17B)</li>
<li>CogVLM2-Chat (19B)</li>
Expand All @@ -202,7 +193,6 @@ LMDeploy TurboMind 引擎拥有卓越的推理能力,在各种规模的模型
<li>Phi-3.5-vision (4.2B)</li>
<li>GLM-4V (9B)</li>
<li>GLM-4.1V-Thinking (9B)</li>
<li>Llama3.2-vision (11B, 90B)</li>
<li>Molmo (7B-D,72B)</li>
<li>Gemma3 (1B - 27B)</li>
<li>Llama4 (Scout, Maverick)</li>
Expand Down
4 changes: 0 additions & 4 deletions autotest/utils/get_run_config.py
Expand Up @@ -37,10 +37,6 @@ def get_model_name(model):
return 'internvl-internlm2'
if ('internlm2') in model_name:
return 'internlm2'
if ('internlm-xcomposer2d5') in model_name:
return 'internlm-xcomposer2d5'
if ('internlm-xcomposer2') in model_name:
return 'internlm-xcomposer2'
if ('glm-4') in model_name:
return 'glm4'
if len(model_name.split('-')) > 2 and '-'.join(model_name.split('-')[0:2]) in model_names:
Expand Down
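
A note on the `get_model_name` hunk above: the function resolves a family name through substring checks in a fixed order, so deleting a branch changes which later check a model path reaches. The sketch below is not the repository's code. `FAMILY_RULES` and `match_family` are hypothetical names, lower-casing the path is an assumption, and only the two rules visible in the hunk are reproduced.

```python
from typing import Optional

# Hypothetical rule table: (substring to look for, family name to return).
# Only the two checks visible in the hunk are reproduced here. Order matters
# because the first matching rule wins.
FAMILY_RULES = [
    ("internlm2", "internlm2"),
    ("glm-4", "glm4"),
]


def match_family(model: str) -> Optional[str]:
    """Return the family of the first matching rule, or None if nothing matches."""
    model_name = model.lower()  # assumption: matching is case-insensitive
    for needle, family in FAMILY_RULES:
        if needle in model_name:
            return family
    return None  # the real function continues with further checks at this point


print(match_family("internlm/internlm2-chat-7b"))
print(match_family("internlm/internlm-xcomposer2d5-7b"))
```

With the two `internlm-xcomposer2*` branches removed, a path such as `internlm/internlm-xcomposer2d5-7b` is not caught by the `internlm2` check either (it contains `internlm-`, not `internlm2`), so it falls through to whatever logic follows in the real function.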
18 changes: 9 additions & 9 deletions docs/en/inference/load_hf.md
Expand Up @@ -6,18 +6,18 @@ Starting from v0.1.0, Turbomind adds the ability to pre-process the model parame

Currently, Turbomind support loading three types of model:

1. A lmdeploy-quantized model hosted on huggingface.co, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit), [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit), etc.
2. Other LM models on huggingface.co like Qwen/Qwen-7B-Chat
1. A lmdeploy-quantized model hosted on huggingface.co, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit), etc.
2. Other LM models on huggingface.co like Qwen/Qwen2.5-7B-Instruct

## Usage

### 1) A lmdeploy-quantized model

For models quantized by `lmdeploy.lite` such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit), [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit), etc.
For models quantized by `lmdeploy.lite` such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit), etc.

```
repo_id=internlm/internlm-chat-20b-4bit
model_name=internlm-chat-20b
repo_id=lmdeploy/llama2-chat-70b-4bit
model_name=llama2-chat-70b
# or
# repo_id=/path/to/downloaded_model

Expand All @@ -30,13 +30,13 @@ lmdeploy serve api_server $repo_id --model-name $model_name --tp 1

### 2) Other LM models

For other LM models such as Qwen/Qwen-7B-Chat or baichuan-inc/Baichuan2-7B-Chat. LMDeploy supported models can be viewed through `lmdeploy list`.
For other LM models such as Qwen/Qwen2.5-7B-Instruct or internlm/internlm2-chat-7b. LMDeploy supported models can be viewed through `lmdeploy list`.

```
repo_id=Qwen/Qwen-7B-Chat
model_name=qwen-7b
repo_id=Qwen/Qwen2.5-7B-Instruct
model_name=qwen2.5-7b
# or
# repo_id=/path/to/Qwen-7B-Chat/local_path
# repo_id=/path/to/Qwen2.5-7B-Instruct/local_path

# Inference by TurboMind
lmdeploy chat $repo_id --model-name $model_name
Expand Down
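
The shell snippets in `load_hf.md` differ only in `repo_id` and `model_name`. As a hedged sketch, the same invocations can be assembled in Python. The helper names `chat_command` and `serve_command` are hypothetical; the subcommands and the `--model-name` and `--tp` flags are the ones that appear in the snippets above.

```python
import shlex
from typing import List


def chat_command(repo_id: str, model_name: str) -> List[str]:
    """Build the argv for `lmdeploy chat` as shown in the doc snippet."""
    return ["lmdeploy", "chat", repo_id, "--model-name", model_name]


def serve_command(repo_id: str, model_name: str, tp: int = 1) -> List[str]:
    """Build the argv for `lmdeploy serve api_server` with a tensor-parallel size."""
    return [
        "lmdeploy", "serve", "api_server", repo_id,
        "--model-name", model_name,
        "--tp", str(tp),
    ]


# repo_id may be a Hugging Face repo id or a local path, as in the doc.
chat_argv = chat_command("Qwen/Qwen2.5-7B-Instruct", "qwen2.5-7b")
serve_argv = serve_command("lmdeploy/llama2-chat-70b-4bit", "llama2-chat-70b")

# shlex.join renders the argv as a copy-pasteable shell line.
print(shlex.join(chat_argv))
print(shlex.join(serve_argv))
```

Keeping the command as a list makes it safe to hand to `subprocess.run(chat_argv)` without shell quoting concerns when `repo_id` is a local path containing spaces.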
2 changes: 1 addition & 1 deletion docs/en/llm/api_server.md
Expand Up @@ -187,7 +187,7 @@ curl http://{server_ip}:{server_port}/v1/models
curl http://{server_ip}:{server_port}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "internlm-chat-7b",
"model": "intern-s2-preview",
"messages": [{"role": "user", "content": "Hello! How are you?"}]
}'
```
Expand Down
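
For readers who prefer Python to `curl`, the request in the `api_server.md` hunk can be sketched with the standard library alone. This is an illustration under assumptions, not the project's client: `build_chat_request` is a hypothetical helper, and `http://127.0.0.1:23333` is a placeholder for `http://{server_ip}:{server_port}`. The `model` value should be one of the names returned by `/v1/models`.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_text: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-style chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder address; replace with the real server_ip and server_port.
req = build_chat_request("http://127.0.0.1:23333", "intern-s2-preview", "Hello! How are you?")
print(req.full_url)

# Sending requires a running server, so it is left commented out:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
```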
6 changes: 3 additions & 3 deletions docs/en/llm/api_server_anthropic.md
Expand Up @@ -29,7 +29,7 @@ curl http://{server_ip}:{server_port}/v1/messages \
-H "content-type: application/json" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "internlm-chat-7b",
"model": "intern-s2-preview",
"max_tokens": 128,
"messages": [{"role": "user", "content": "Hello from Anthropic client"}]
}'
Expand All @@ -42,7 +42,7 @@ curl http://{server_ip}:{server_port}/v1/messages \
-H "content-type: application/json" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "internlm-chat-7b",
"model": "intern-s2-preview",
"max_tokens": 128,
"messages": [{"role": "user", "content": "Find lmdeploy docs"}],
"tools": [{
Expand Down Expand Up @@ -78,7 +78,7 @@ curl http://{server_ip}:{server_port}/v1/messages/count_tokens \
-H "content-type: application/json" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "internlm-chat-7b",
"model": "intern-s2-preview",
"system": "You are a helpful assistant.",
"messages": [{"role": "user", "content": "Count these tokens"}]
}'
Expand Down
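
The Anthropic-style examples above share the `anthropic-version` header and differ in route and body: `/v1/messages` carries `max_tokens`, while `/v1/messages/count_tokens` in the example carries a `system` prompt and no `max_tokens`. A minimal Python sketch of both follows. The helper names are hypothetical, the base URL is a placeholder, and the requests are only built, not sent.

```python
import json
import urllib.request
from typing import Optional

ANTHROPIC_VERSION = "2023-06-01"  # value copied from the curl examples


def _post_json(url: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request with the headers used in the curl examples."""
    return urllib.request.Request(
        url=url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "content-type": "application/json",
            "anthropic-version": ANTHROPIC_VERSION,
        },
        method="POST",
    )


def build_messages_request(base_url: str, model: str, user_text: str,
                           max_tokens: int = 128) -> urllib.request.Request:
    """Request for /v1/messages; max_tokens is part of the body."""
    payload = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": user_text}],
    }
    return _post_json(f"{base_url}/v1/messages", payload)


def build_count_tokens_request(base_url: str, model: str, user_text: str,
                               system: Optional[str] = None) -> urllib.request.Request:
    """Request for /v1/messages/count_tokens; no max_tokens in the body."""
    payload = {"model": model, "messages": [{"role": "user", "content": user_text}]}
    if system is not None:
        payload["system"] = system
    return _post_json(f"{base_url}/v1/messages/count_tokens", payload)


base = "http://127.0.0.1:23333"  # placeholder for http://{server_ip}:{server_port}
msg_req = build_messages_request(base, "intern-s2-preview", "Hello from Anthropic client")
cnt_req = build_count_tokens_request(base, "intern-s2-preview", "Count these tokens",
                                     system="You are a helpful assistant.")
print(msg_req.full_url)
print(cnt_req.full_url)
```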
1 change: 0 additions & 1 deletion docs/en/multi_modal/index.rst
Expand Up @@ -8,7 +8,6 @@ Vision-Language Models
deepseek_vl2.md
llava.md
internvl.md
xcomposer2d5.md
cogvlm.md
minicpmv.md
phi3.md
Expand Down
1 change: 0 additions & 1 deletion docs/en/multi_modal/internvl.md
Expand Up @@ -9,7 +9,6 @@ LMDeploy supports the following InternVL series of models, which are detailed in
| InternVL2 | 4B | PyTorch |
| InternVL2 | 1B-2B, 8B-76B | TurboMind, PyTorch |
| InternVL2.5/2.5-MPO/3 | 1B-78B | TurboMind, PyTorch |
| Mono-InternVL | 2B | PyTorch |

The next chapter demonstrates how to deploy an InternVL model using LMDeploy, with [InternVL2-8B](https://huggingface.co/OpenGVLab/InternVL2-8B) as an example.

Expand Down
7 changes: 3 additions & 4 deletions docs/en/multi_modal/qwen2_vl.md
Expand Up @@ -2,10 +2,9 @@

LMDeploy supports the following Qwen-VL series of models, which are detailed in the table below:

| Model | Size | Supported Inference Engine |
| :----------: | :----: | :------------------------: |
| Qwen-VL-Chat | - | TurboMind |
| Qwen2-VL | 2B, 7B | PyTorch |
| Model | Size | Supported Inference Engine |
| :------: | :----: | :------------------------: |
| Qwen2-VL | 2B, 7B | PyTorch |

The next chapter demonstrates how to deploy an Qwen-VL model using LMDeploy, with [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) as an example.

Expand Down