How to Set Up the Vearch Vector Database

Latest recommended article published on 2025-06-04 15:49:38

阿湯哥

Tags: database

Article link: https://blog.csdn.net/ttyy1112/article/details/147479981


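Vearch is an open-source distributed vector search system, developed and open-sourced by JD.com, designed for large-scale vector similarity search. The steps below walk through setting it up.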

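I. Environment Preparation

System requirements

Linux (Ubuntu 18.04+ or CentOS 7+ recommended)
Docker and Docker Compose (for containerized deployment)
Go 1.13+ (only if building from source)
Python 3.6+ (for the client)

Hardware recommendations

CPU: 4 cores or more
Memory: 8 GB or more (scale with data volume)
Disk: SSD storage recommended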

II. Choosing an Installation Method

1. Quick start (Docker)

# Pull the latest image
docker pull vearch/vearch:latest

# Start a standalone instance
docker run -d -p 8817:8817 -p 9001:9001 vearch/vearch:latest all
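
After the container starts, it can take a moment before the published ports accept connections. The following is a minimal sketch of a readiness check, assuming only that ports 8817 and 9001 from the docker run command are mapped to the local machine. The helper names `port_is_open` and `wait_for_ports` are made up for this illustration and are not part of Vearch.

```python
import socket
import time


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_ports(host: str, ports: list, retries: int = 30, delay: float = 1.0) -> bool:
    """Poll until every port accepts connections, or give up after `retries` rounds."""
    for _ in range(retries):
        if all(port_is_open(host, p) for p in ports):
            return True
        time.sleep(delay)
    return False


# Example call (not executed here):
#   wait_for_ports("127.0.0.1", [8817, 9001])
```

A successful TCP connection only shows that something is listening on the port. It does not prove that the service behind it is healthy.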

2. Production deployment (cluster mode)

Download the release package

wget https://github.com/vearch/vearch/releases/download/v3.3.7/vearch-3.3.7.tar.gz
tar -zxvf vearch-3.3.7.tar.gz
cd vearch-3.3.7

Configure the cluster

Edit the config.toml file:

[global]
name = "vearch"
data = ["datas/"]
log = "logs/"
level = "info"
signkey = "vearch"

[monitor]
port = 9008

[router]
port = 9001
skip_auth = true

[master]
name = "m1"
address = "127.0.0.1"
port = 8817

Start the cluster

# Start the master node
./bin/vearch -conf config.toml master start

# Start the router node
./bin/vearch -conf config.toml router start

# Start the ps node (storage and compute node)
./bin/vearch -conf config.toml ps start

III. Basic Operations

1. Create a collection

import vearch

# Connect to the cluster
client = vearch.Client("http://localhost:8817")

# Define the schema
schema = {
    "name": "image_search",
    "partition_num": 1,
    "replica_num": 1,
    "engine": {
        "name": "gamma",
        "index_size": 10000,
        "retrieval_type": "IVFPQ",
        "retrieval_param": {
            "metric_type": "InnerProduct",
            "ncentroids": 256,
            "nsubvector": 32
        }
    },
    "properties": {
        "image_vec": {
            "type": "vector",
            "dimension": 128,
            "store_type": "MemoryOnly"
        },
        "image_id": {
            "type": "string",
            "index": True
        }
    }
}

# Create the collection
client.create_collection("test_db", schema)
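
Product quantization, the PQ part of IVFPQ, splits each vector into `nsubvector` equal-length pieces, so the vector dimension has to be a multiple of `nsubvector`. The schema above satisfies this (128 / 32 = 4). Below is a small sketch of a pre-flight check on a schema dictionary of the same shape. The function name `check_ivfpq_schema` is hypothetical, and whether Vearch reports this condition itself is not covered here.

```python
def check_ivfpq_schema(schema: dict) -> list:
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = []
    nsubvector = schema["engine"]["retrieval_param"]["nsubvector"]
    for name, prop in schema["properties"].items():
        if prop.get("type") != "vector":
            continue  # only vector fields are quantized
        dimension = prop["dimension"]
        if dimension % nsubvector != 0:
            problems.append(
                f"{name}: dimension {dimension} is not divisible by nsubvector {nsubvector}"
            )
    return problems


example = {
    "engine": {"retrieval_param": {"ncentroids": 256, "nsubvector": 32}},
    "properties": {
        "image_vec": {"type": "vector", "dimension": 128},
        "image_id": {"type": "string"},
    },
}
print(check_ivfpq_schema(example))
```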

2. Insert vector data

collection = client.collection("test_db", "image_search")

doc = {
    "image_id": "img001",
    "image_vec": [0.12, 0.23, ..., 0.98]  # 128-dimensional vector
}

collection.insert(doc)
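
The vector in the document above is elided. For a dry run without a real embedding model, a placeholder document can be generated as in the sketch below. The vectors are normalized to unit length because, for unit vectors, the inner product equals cosine similarity, which matches the InnerProduct metric chosen in the schema. The helpers `random_unit_vector` and `make_doc` are illustrative names only.

```python
import math
import random


def random_unit_vector(dim: int, rng: random.Random) -> list:
    """Draw a random direction and scale it to length 1."""
    values = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]


def make_doc(image_id: str, dim: int = 128, seed: int = 0) -> dict:
    """Build a document with the same two fields as the schema above."""
    rng = random.Random(seed)
    return {"image_id": image_id, "image_vec": random_unit_vector(dim, rng)}


doc = make_doc("img001")
print(len(doc["image_vec"]))
```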

3. Vector search

query_vec = [0.11, 0.22, ..., 0.99]  # query vector

results = collection.search(
    vector=query_vec,
    vector_field="image_vec",
    topk=10,
    params={"nprobe": 20}
)

for result in results:
    print(result["image_id"], result["_score"])
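
IVFPQ is an approximate index, so its results can differ from an exhaustive search. On a small sample, a brute-force baseline is a simple way to judge how much accuracy a given nprobe costs. The sketch below is pure Python and independent of any Vearch client. The result shape (`image_id` and `_score`) mirrors the loop above, and the function names are made up for this example.

```python
import heapq


def inner_product(a: list, b: list) -> float:
    return sum(x * y for x, y in zip(a, b))


def exact_topk(query: list, docs: list, vector_field: str = "image_vec", topk: int = 10) -> list:
    """Score every document against the query and keep the topk highest scores."""
    scored = [
        {"image_id": d["image_id"], "_score": inner_product(query, d[vector_field])}
        for d in docs
    ]
    return heapq.nlargest(topk, scored, key=lambda r: r["_score"])


def recall(approx_ids: list, exact_ids: list) -> float:
    """Fraction of the exact top-k that the approximate search also returned."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact) if exact else 1.0


sample_docs = [
    {"image_id": "a", "image_vec": [1.0, 0.0]},
    {"image_id": "b", "image_vec": [0.0, 1.0]},
    {"image_id": "c", "image_vec": [0.5, 0.5]},
]
for hit in exact_topk([1.0, 0.0], sample_docs, topk=2):
    print(hit["image_id"], hit["_score"])
```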

IV. Performance Tuning

1. Index parameters

"retrieval_param": {
    "metric_type": "L2",  # distance metric
    "ncentroids": 512,    # number of cluster centroids
    "nsubvector": 64      # number of sub-vectors
}
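
To get a feel for what nsubvector means for memory, here is a rough back-of-the-envelope sketch. It assumes float32 raw vectors and 8-bit PQ codes (one byte per sub-vector), which is a common default for product quantization but is an assumption here. It ignores centroid tables, document IDs, and other overhead, so treat the numbers as lower bounds.

```python
def raw_vector_bytes(num_vectors: int, dimension: int, bytes_per_value: int = 4) -> int:
    """Size of the uncompressed vectors (float32 by default)."""
    return num_vectors * dimension * bytes_per_value


def pq_code_bytes(num_vectors: int, nsubvector: int, bits_per_code: int = 8) -> int:
    """Size of the PQ codes alone: one code per sub-vector per vector."""
    return num_vectors * nsubvector * bits_per_code // 8


n = 1_000_000
print(raw_vector_bytes(n, 128))   # raw 128-dimensional float32 vectors
print(pq_code_bytes(n, 64))       # codes with nsubvector = 64
print(pq_code_bytes(n, 32))       # codes with nsubvector = 32
```

A larger nsubvector keeps more detail per vector at the cost of larger codes, so it trades memory for accuracy.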

2. Resource allocation

[ps]
# Ports used by each ps node
rpc_port = 9002
heartbeat_port = 9003
raft_port = 9004
recover_port = 9005
data_port = 9006

3. Cache configuration

[engine]
max_size = 1000000  # maximum number of cached documents

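V. Scaling the Cluster

Adding PS nodes

Deploy the same package on the new machine
Set the master address in config.toml
Start the process with the ps role:

./bin/vearch -conf config.toml ps start

Data sharding

schema = {
    "partition_num": 4,  # number of partitions (shards)
    "replica_num": 2     # number of replicas
}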
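
With 4 partitions and 2 replicas, the cluster has to host 4 × 2 = 8 partition copies in total. The sketch below does this arithmetic for a given number of PS nodes. It assumes copies are spread evenly and that replicas of the same partition should sit on different nodes to be useful, which is a general replication principle rather than a statement about Vearch's actual placement logic. The function name `replica_layout` is hypothetical.

```python
import math


def replica_layout(partition_num: int, replica_num: int, ps_nodes: int) -> dict:
    """Summarize how many partition copies exist and how they spread over PS nodes."""
    total = partition_num * replica_num
    return {
        "total_partition_copies": total,
        "max_copies_per_node": math.ceil(total / ps_nodes),
        # Replicas only add fault tolerance if they can land on distinct nodes.
        "enough_nodes_for_replicas": ps_nodes >= replica_num,
    }


print(replica_layout(partition_num=4, replica_num=2, ps_nodes=3))
```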

VI. Monitoring and Maintenance

Built-in monitoring endpoints

http://<master_ip>:8817/_cluster/stats
http://<router_ip>:9001/_cluster/health
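
These endpoints can be polled from a script. The sketch below builds the two URLs listed above and fetches them with the standard library. It assumes the endpoints answer plain GET requests with a JSON body and that no authentication is required (as with skip_auth = true in the configuration). Both points are assumptions, and the helper names are made up for this example.

```python
import json
import urllib.request


def monitor_urls(master_ip: str, router_ip: str,
                 master_port: int = 8817, router_port: int = 9001) -> dict:
    """Build the monitoring URLs listed above for the given hosts."""
    return {
        "stats": f"http://{master_ip}:{master_port}/_cluster/stats",
        "health": f"http://{router_ip}:{router_port}/_cluster/health",
    }


def fetch_json(url: str, timeout: float = 3.0):
    """GET a URL and decode the body as JSON (assumed response format)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example call against a running cluster (not executed here):
#   for name, url in monitor_urls("127.0.0.1", "127.0.0.1").items():
#       print(name, fetch_json(url))
```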

Viewing logs

tail -f logs/vearch.log

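VII. Troubleshooting

Startup fails: check for port conflicts and make sure ports such as 8817 and 9001 are free
Poor search performance: tune the nprobe parameter and the index type
Out of memory: lower max_size or add machine resources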
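
On the nprobe point: an IVF index groups vectors into ncentroids clusters and, at query time, scans only the nprobe clusters closest to the query. Assuming roughly balanced clusters, the share of the data examined per query is about nprobe / ncentroids. The sketch below makes that trade-off concrete with the values used earlier in this article. The function names are illustrative.

```python
def scanned_fraction(nprobe: int, ncentroids: int) -> float:
    """Approximate share of vectors scanned per query, assuming balanced clusters."""
    if not 1 <= nprobe <= ncentroids:
        raise ValueError("nprobe must be between 1 and ncentroids")
    return nprobe / ncentroids


def scanned_vectors(num_vectors: int, nprobe: int, ncentroids: int) -> int:
    """Approximate number of candidate vectors compared against the query."""
    return int(num_vectors * scanned_fraction(nprobe, ncentroids))


print(scanned_fraction(20, 256))   # nprobe = 20 with ncentroids = 256
print(scanned_fraction(20, 512))   # same nprobe after raising ncentroids to 512
print(scanned_vectors(1_000_000, 20, 256))
```

Raising nprobe improves recall but scans more data. Raising ncentroids without also raising nprobe has the opposite effect.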

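Vearch offers a rich set of APIs and configuration options that can be adjusted to fit the workload. For production, deploy at least three master nodes for high availability, and size the number of PS nodes according to the data volume.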
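
The advice about three master nodes follows from majority-quorum arithmetic, assuming the master group relies on a majority-based consensus protocol (an assumption for this sketch): a group of n nodes stays available as long as more than half of them are up.

```python
def quorum_size(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can fail while a majority is still reachable."""
    return nodes - quorum_size(nodes)


for n in (1, 2, 3, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Under this model, one or two masters tolerate no failure at all, three tolerate one, and five tolerate two, which is why odd group sizes starting at three are the usual choice.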