Xinference是什么？2026年开源AI模型部署与推理平台详解

概述

Xorbits Inference (Xinference一个开源平台，用于简化各种AI模型（包括大语言模型、嵌入模型和多模态模型）的部署、推理和集成，支持云端和本地环境。) 是一个功能强大的开源平台，旨在简化和统一各类 AI 模型的部署、推理与管理流程。它支持在云端或本地环境中运行包括大语言模型、嵌入模型以及多模态模型在内的多种开源模型，并提供了丰富的接口和工具，助力开发者快速构建强大的 AI 应用。

Xorbits Inference (Xinference一个开源平台，用于简化各种AI模型（包括大语言模型、嵌入模型和多模态模型）的部署、推理和集成，支持云端和本地环境。) is a powerful open-source platform designed to simplify and unify the deployment, inference, and management workflows for various AI models. It supports running a wide range of open-source models, including large language models, embedding models, and multimodal models, in both cloud and local environments. With a rich set of interfaces and tools, it empowers developers to rapidly build robust AI applications.

核心资源:

官网: https://xorbits.cn/inference
GitHub: https://github.com/xorbitsai/inference/tree/main
官方文档: https://inference.readthedocs.io/zh-cn/latest/index.html

核心特性

简化的模型推理

Xinference一个开源平台，用于简化各种AI模型（包括大语言模型、嵌入模型和多模态模型）的部署、推理和集成，支持云端和本地环境。极大地简化了大语言模型、语音识别模型和多模态模型的部署流程。用户通常只需一个命令即可完成模型的部署工作。

Xinference一个开源平台，用于简化各种AI模型（包括大语言模型、嵌入模型和多模态模型）的部署、推理和集成，支持云端和本地环境。 significantly simplifies the deployment process for large language models, speech recognition models, and multimodal models. Users can often deploy a model with just a single command.

丰富的前沿模型

框架内置了众多前沿的中英文大语言模型，如 Baichuan、ChatGLM2 等，用户可以一键体验。内置模型列表还在持续快速更新中。

The framework comes pre-loaded with numerous cutting-edge Chinese and English large language models, such as Baichuan and ChatGLM2, allowing users to experience them with one click. The built-in model list is continuously and rapidly updated.

异构硬件支持

通过集成 GGML一种模型格式和推理库，支持在CPU和GPU上高效运行大型语言模型，通过量化技术优化内存使用和计算速度。，Xinference一个开源平台，用于简化各种AI模型（包括大语言模型、嵌入模型和多模态模型）的部署、推理和集成，支持云端和本地环境。能够同时利用 GPU 与 CPU 进行推理，有效降低延迟并提高吞吐量。

By integrating GGML一种模型格式和推理库，支持在CPU和GPU上高效运行大型语言模型，通过量化技术优化内存使用和计算速度。, Xinference一个开源平台，用于简化各种AI模型（包括大语言模型、嵌入模型和多模态模型）的部署、推理和集成，支持云端和本地环境。 can leverage both GPU and CPU for inference, effectively reducing latency and improving throughput.

多样化的接口调用

平台提供多种使用模型的接口，包括：

OpenAI 兼容的 RESTful APIXinference提供的OpenAI兼容接口，允许用户通过HTTP请求调用模型功能，包括聊天、生成和函数调用。（支持 Function Calling）
RPC
命令行工具
Web UI
这极大地方便了模型的管理与交互。

The platform offers multiple interfaces for model interaction, including:

OpenAI-compatible RESTful APIXinference提供的OpenAI兼容接口，允许用户通过HTTP请求调用模型功能，包括聊天、生成和函数调用。 (with Function Calling support)
RPC
Command-line tools
Web UI
This greatly facilitates model management and interaction.

集群计算与分布式协同

支持分布式部署Xinference的集群计算功能，通过supervisor和worker架构实现模型在不同机器间的调度，以充分利用集群资源。，通过内置的资源调度器，可以将不同大小的模型按需调度到不同的机器上，从而充分利用集群资源。

It supports distributed deployment. Through its built-in resource scheduler, models of different sizes can be scheduled to different machines on-demand, making full use of cluster resources.

开放的生态系统

能够与流行的第三方库无缝对接，包括 LangChain、LlamaIndex、Dify、FastGPT、RAGFlow、Chatbox 等。

It seamlessly integrates with popular third-party libraries, including LangChain, LlamaIndex, Dify, FastGPT, RAGFlow, Chatbox, and more.

模型支持概览

Xinference一个开源平台，用于简化各种AI模型（包括大语言模型、嵌入模型和多模态模型）的部署、推理和集成，支持云端和本地环境。提供了广泛的模型支持，涵盖当前 AI 应用的主要领域。

1.1 大语言模型支持

平台兼容支持所有主流的大语言模型。用户可以参考内置模型列表获取详细信息。

The platform is compatible with all mainstream large language models. Users can refer to the Built-in Models List for details.

1.2 嵌入模型

支持开源的词嵌入模型，例如 BAAI/bge-large-zh-v1.5。详细信息请参阅嵌入模型文档。

It supports open-source text embedding models, such as BAAI/bge-large-zh-v1.5. Please refer to the Embedding Models Documentation for details.

1.3 重排序模型

支持重排序模型，例如 BAAI/bge-reranker-large。相关文档链接：重排序模型。

It supports reranker models, such as BAAI/bge-reranker-large. Relevant documentation link: Rerank Models.

1.4 图像模型

Xinference一个开源平台，用于简化各种AI模型（包括大语言模型、嵌入模型和多模态模型）的部署、推理和集成，支持云端和本地环境。支持图像模型，可用于文生图、图生图等功能。目前内置了 Stable Diffusion 的各个版本。部署方式与文本模型类似，通过 Web GUI 界面启动即可。请注意，由于 SD 模型较大，部署前请确保服务器有 50GB 以上的可用空间。

Xinference一个开源平台，用于简化各种AI模型（包括大语言模型、嵌入模型和多模态模型）的部署、推理和集成，支持云端和本地环境。 supports image models for tasks like text-to-image and image-to-image generation. It currently includes various versions of Stable Diffusion. Deployment is similar to text models, initiated via the Web GUI. Note: Due to the large size of SD models, ensure the server has at least 50GB of free space before deployment.

1.5 语音模型

语音模型是 Xinference一个开源平台，用于简化各种AI模型（包括大语言模型、嵌入模型和多模态模型）的部署、推理和集成，支持云端和本地环境。新增的功能，支持语音转文字、语音翻译等。部署前需要先安装 ffmpeg 组件。以 Ubuntu 为例，安装命令如下：

sudo apt update && sudo apt install ffmpeg

Speech models are a recent addition to Xinference一个开源平台，用于简化各种AI模型（包括大语言模型、嵌入模型和多模态模型）的部署、推理和集成，支持云端和本地环境。, supporting speech-to-text, speech translation, etc. The ffmpeg component must be installed before deployment. For Ubuntu, the installation command is:

sudo apt update && sudo apt install ffmpeg

1.6 模型来源管理

默认从 HuggingFace 下载模型。如需从其他源（如 ModelScope）下载，可通过设置环境变量 XINFERENCE_MODEL_SRC 实现。例如，从 ModelScope 下载：

XINFERENCE_MODEL_SRC=modelscope xinference-local

By default, models are downloaded from HuggingFace. To download from other sources (e.g., ModelScope), set the environment variable XINFERENCE_MODEL_SRC. Example for ModelScope:

XINFERENCE_MODEL_SRC=modelscope xinference-local

1.7 模型生命周期管理

除了启动模型，Xinference一个开源平台，用于简化各种AI模型（包括大语言模型、嵌入模型和多模态模型）的部署、推理和集成，支持云端和本地环境。提供了完整的模型生命周期管理能力。

列出支持的模型类型:
```
xinference registrations -t LLM
```
列出运行中的模型:
```
xinference list
```

停止指定模型:

xinference terminate --model-uid "qwen2"

Beyond launching models, Xinference一个开源平台，用于简化各种AI模型（包括大语言模型、嵌入模型和多模态模型）的部署、推理和集成，支持云端和本地环境。 provides comprehensive model lifecycle management capabilities.

List supported model types:
```
xinference registrations -t LLM
```
List running models:
```
xinference list
```

Terminate a specific model:

xinference terminate --model-uid "qwen2"

安装指南

2.1 本地源码安装

首先，需要准备 Python 3.9 或更高版本的环境。建议使用 Conda 创建独立环境：

conda create --name xinference python=3.11
conda activate xinference

First, prepare a Python 3.9 or higher environment. It's recommended to use Conda to create an isolated environment:

conda create --name xinference python=3.11
conda activate xinference

然后，安装 Xinference一个开源平台，用于简化各种AI模型（包括大语言模型、嵌入模型和多模态模型）的部署、推理和集成，支持云端和本地环境。及其推理引擎后端。以下命令展示了不同后端的安装方式：

# 安装基础版本
pip install "xinference"
# 安装 GGML 后端支持
pip install "xinference[ggml]"
# 安装 PyTorch 后端支持
pip install "xinference[pytorch]"
# 安装所有推理后端（Transformers, vLLM, GGML等）
pip install "xinference[all]"
# 使用国内镜像加速安装特定后端
pip install "xinference[transformers]" -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install "xinference[vllm]" -i https://pypi.tuna.tsinghua.edu.cn/simple

Then, install Xinference一个开源平台，用于简化各种AI模型（包括大语言模型、嵌入模型和多模态模型）的部署、推理和集成，支持云端和本地环境。 and its inference engine backends. The following commands demonstrate installation for different backends:

# Install the base version
pip install "xinference"
# Install GGML backend support
pip install "xinference[ggml]"
# Install PyTorch backend support
pip install "xinference[pytorch]"
# Install all inference backends (Transformers, vLLM, GGML, etc.)
pip install "xinference[all]"
# Install specific backends using a domestic mirror for faster download
pip install "xinference[transformers]" -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install "xinference[vllm]" -i https://pypi.tuna.tsinghua.edu.cn/simple

安装完成后，建议验证 PyTorch 是否能正常识别 GPU：

python -c "import torch; print(torch.cuda.is_available())"

如果输出 True，则安装正常。

After installation, it's recommended to verify if PyTorch can recognize the GPU correctly:

python -c "import torch; print(torch.cuda.is_available())"

If the output is True, the installation is successful.

2.1.1 解决 `llama-cpp-python` 安装错误

如果通过 pip 源码编译安装 llama-cpp-python 失败，通常是由于系统缺少 cmake 或合适版本的 gcc。解决方案是直接下载官方预编译的 wheel 文件进行离线安装。

If installing llama-cpp-python via pip from source fails, it's often due to missing cmake or a suitable version of gcc on the system. The solution is to download the official pre-compiled wheel file for offline installation.

从 GitHub Releases 页面根据你的系统环境（Python版本、CUDA版本）下载对应的 .whl 文件。

使用 pip 安装下载的 wheel 文件：

pip install llama_cpp_python-<version>.whl

Download the corresponding .whl file from the GitHub Releases page based on your system environment (Python version, CUDA version).
Install the downloaded wheel file using pip:
pip install llama_cpp_python-<version>.whl

2.2 Docker 安装

对于希望使用容器化部署的用户，Xinference一个开源平台，用于简化各种AI模型（包括大语言模型、嵌入模型和多模态模型）的部署、推理和集成，支持云端和本地环境。提供了官方 Docker 镜像。

For users who prefer containerized deployment, Xinference一个开源平台，用于简化各种AI模型（包括大语言模型、嵌入模型和多模态模型）的部署、推理和集成，支持云端和本地环境。 provides official Docker images.

镜像来源:

Docker Hub: xprobe/xinference
阿里云镜像仓库（供国内用户使用）: registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:<tag>

可用标签:

nightly-main: 每日从主分支构建，可能不稳定。
v<release version>: 每个发布版本对应的稳定镜像。
latest: 指向最新的稳定发布版本。
对于仅需 CPU 的版本，添加 -cpu 后缀，如 nightly-main-cpu。

Available Tags:

nightly-main: Built daily from the main branch, may be unstable.
v<release version>: Stable image corresponding to each release version.
latest: Points to the latest stable release version.
For CPU-only versions, add the -cpu suffix, e.g., nightly-main-cpu.

启动命令示例（使用 GPU）:

docker run -p 9998:9997 --gpus all xprobe/xinference:latest xinference-local -H 0.0.0.0

关键参数说明:

--gpus all: 必须指定，以使用宿主机的 GPU。
-H 0.0.0.0: 必须指定，允许从容器外部访问服务。
-e: 可指定环境变量，如 -e XINFERENCE_MODEL_SRC=modelscope。

Example Startup Command (using GPU):

docker run -p 9998:9997 --gpus all xprobe/xinference:latest xinference-local -H 0.0.0.0

Key Parameter Notes:

--gpus all: Must be specified to use the host's GPU.
-H 0.0.0.0: Must be specified to allow access to the service from outside the container.
-e: Can be used to specify environment variables, e.g., -e XINFERENCE_MODEL_SRC=modelscope.

2.2.2 挂载模型目录

为了避免每次启动容器都重新下载模型，可以将宿主机的模型缓存目录挂载到容器内。

To avoid re-downloading models every time the container starts, you can mount the host machine's model cache directories into the container.

基本挂载示例:

docker run -v /host/model/path:/container/path -e XINFERENCE_HOME=/container/path -p 9997:9997 --gpus all xprobe/xinference:latest xinference-local -H 0.0.0.0

Basic Mount Example:

docker run -v /host/model/path:/container/path -e XINFERENCE_HOME=/container/path -p 9997:9997 --gpus all xprobe/xinference:latest xinference-local -H 0.0.0.0

完整挂载示例（包含 HuggingFace 和 ModelScope 缓存）:

docker run \
  -v ~/.xinference:/root/.xinference \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ~/.cache/modelscope:/root/.cache/modelscope \
  -p 9997:9997 \
  --gpus all \
  xprobe/xinference:latest \
  xinference-local -H 0.0.0.0

Complete Mount Example (including HuggingFace and ModelScope caches):

docker run \
  -v ~/.xinference:/root/.xinference \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ~/.cache/modelscope:/root/.cache/modelscope \
  -p 9997:9997 \
  --gpus all \
  xprobe/xinference:latest \
  xinference-local -H 0.0.0.0

启动与使用

3. 启动 Xinference一个开源平台，用于简化各种AI模型（包括大语言模型、嵌入模型和多模态模型）的部署、推理和集成，支持云端和本地环境。服务

使用以下命令启动本地服务，并允许外部访问：

xinference-local --host 0.0.0.0 --port 7861

服务启动后，默认的 Web UI 可以通过 http://<your_server_ip>:7861 访问。

Use the following command to start the local service and allow external access:

xinference-local --host 0.0.0.0 --port 7861

After the service starts, the default Web UI can be accessed at http://<your_server_ip>:7861.

3.1 模型下载与推理引擎

Xinference一个开源平台，用于简化各种AI模型（包括大语言模型、嵌入模型和多模态模型）的部署、推理和集成，支持云端和本地环境。会根据模型格式和运行环境自动选择最优的推理引擎。

Xinference一个开源平台，用于简化各种AI模型（包括大语言模型、嵌入模型和多模态模型）的部署、推理和集成，支持云端和本地环境。 automatically selects the optimal inference engine based on the model format and runtime environment.

vLLM一个高性能的LLM推理和服务库，为DeepSeek-OCR提供优化的推理能力，支持流式输出和批量处理。引擎: 适用于 PyTorch、GPTQ 或 AWQ 格式的模型，在 Linux 系统且拥有 CUDA 设备时，能提供高吞吐量推理。
Llama.cpp 引擎: 通过 llama-cpp-python 支持 GGUF 和 GGML一种模型格式和推理库，支持在CPU和GPU上高效运行大型语言模型，通过量化技术优化内存使用和计算速度。格式的模型。建议根据硬件手动安装以获得最佳加速：
- Apple M 系列: CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
- 英伟达显卡: CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
- AMD 显卡: CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
SGLang 引擎: 适用于需要高效执行复杂提示词链的场景，通过 RadixAttention 重用 KV 缓存来加速。安装: pip install 'xinference[sglang]'

vLLM一个高性能的LLM推理和服务库，为DeepSeek-OCR提供优化的推理能力，支持流式输出和批量处理。 Engine: Suitable for models in PyTorch, GPTQ, or AWQ formats. Provides high-throughput inference on Linux systems with CUDA devices.

Llama.cpp Engine: Supports GGUF and GGML一种模型格式和推理库，支持在CPU和GPU上高效运行大型语言模型，通过量化技术优化内存使用和计算速度。 format models via llama-cpp-python. Manual installation for specific hardware is recommended for optimal acceleration:

Apple M Series: CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

NVIDIA GPUs: CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

AMD GPUs: CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python

SGLang Engine: Suitable for scenarios requiring efficient execution of complex prompt chains. Accelerates by reusing KV cache via RadixAttention. Installation: pip install 'xinference[sglang]'

3.2 模型部署

通过

概述