# Knowledge Processing Pipeline

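## Documentation Maintenance Rules

0. **Update the docs first, then the code.**
1. **Layer the docs and link to code.** Format: `module/file.py:function_name`
2. **Keep snapshots concise and logs separate.** Decision rationale is recorded in `knowhub/docs/decisions.md`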

---

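## Overview

Knowledge submitted via `POST /api/knowledge` passes through an automated processing pipeline that performs deduplication and tool-association analysis. Once it completes the pipeline, it becomes retrievable.

Implementation: `knowhub/server.py:KnowledgeProcessor`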

---

## State Transitions

```
pending → processing → dedup_passed → analyzing → approved
                ↓                         ↓
             rejected                  approved
                                      (analysis skipped)
```

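| Status | Meaning |
|---|---|
| `pending` | Waiting to be processed |
| `processing` | Deduplication in progress (optimistic lock; reset automatically on timeout) |
| `dedup_passed` | Passed deduplication; waiting for tool-association analysis |
| `analyzing` | Tool-association analysis in progress |
| `approved` | Approved and retrievable |
| `rejected` | Judged redundant (`duplicate`/`subset`) |
| `checked` | Manually verified (toggles approved ↔ checked) |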
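
The diagram and table above can be read as a small transition table. The sketch below is one possible encoding, not the code in `knowhub/server.py`: the names `ALLOWED_TRANSITIONS` and `can_transition` are hypothetical, and the `dedup_passed → approved` edge is an assumption about how the "(analysis skipped)" branch should be read. Any transitions performed by a forced reset are not modeled.

```python
# Hypothetical transition table derived from the diagram and status table.
ALLOWED_TRANSITIONS = {
    "pending": {"processing"},
    # "pending" here models the automatic reset after a processing timeout.
    "processing": {"dedup_passed", "rejected", "pending"},
    # Assumption: the "(analysis skipped)" branch goes dedup_passed -> approved.
    "dedup_passed": {"analyzing", "approved"},
    "analyzing": {"approved"},
    # approved and checked can be toggled by manual verification.
    "approved": {"checked"},
    "checked": {"approved"},
    # No outgoing transition is documented for rejected entries.
    "rejected": set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True if moving from `current` to `target` is listed above."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```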

---

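## Stage 1: Deduplication

Implementation: `knowhub/server.py:KnowledgeProcessor._process_one`

```
New knowledge (status=pending)
  ↓
Reuse the embedding generated at ingestion (no second embedding call)
  ↓
Vector recall of the top-10 similar knowledge entries (filter: approved/checked)
  ↓
Similarity pre-filter (threshold 0.75)
  ↓ no candidates → straight to dedup_passed
LLM relation judgment (see below)
  ↓
final_decision=rejected → old knowledge helpful+1
final_decision=approved → write relationships in both directions → dedup_passed
```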
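
The recall and pre-filter steps in the flow above can be sketched as follows. This is a minimal illustration, not the logic of `_process_one`: `Candidate`, `prefilter`, and `next_step` are hypothetical names, and treating the threshold as inclusive (`>=`) is an assumption, since the document only states the value 0.75.

```python
from typing import NamedTuple

SIMILARITY_THRESHOLD = 0.75  # threshold stated in the flow
TOP_K = 10                   # number of recalled candidates stated in the flow


class Candidate(NamedTuple):
    """A recalled knowledge entry (hypothetical shape)."""
    knowledge_id: str
    similarity: float


def prefilter(candidates: list[Candidate],
              threshold: float = SIMILARITY_THRESHOLD) -> list[Candidate]:
    """Keep the TOP_K most similar candidates that reach the threshold."""
    top = sorted(candidates, key=lambda c: c.similarity, reverse=True)[:TOP_K]
    # Assumption: a similarity exactly equal to the threshold still counts.
    return [c for c in top if c.similarity >= threshold]


def next_step(candidates: list[Candidate]) -> str:
    """Choose the next step: skip the LLM when no candidate survives."""
    return "llm_relation_check" if prefilter(candidates) else "dedup_passed"
```

Skipping the LLM call when the pre-filter leaves no candidates keeps the common case (genuinely new knowledge) cheap.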

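### LLM Relation Judgment

`google/gemini-2.5-flash-lite` judges the relation between the new knowledge and each candidate.

**Relation types** (an open set; the LLM may define its own):

| Type | Meaning | Handling |
|---|---|---|
| `duplicate` | task and content are semantically identical | rejected; old knowledge helpful+1 |
| `subset` | the new knowledge is fully covered by the old knowledge | rejected |
| `superset` | the new knowledge is more complete than the old knowledge | both approved |
| `conflict` | contradictory conclusions for the same task | both approved |
| `complement` | different angles on the same task | both approved |
| `none` | the tasks differ semantically, or there is no substantive relation | approved; no relation written |

Relations are written in both directions: when A is a superset of B, A records `{type: "superset", target: "B"}` and B records `{type: "subset", target: "A"}`.

Prompt implementation: `knowhub/kb_manage_prompts.py`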
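
The bidirectional write described in this section can be sketched as a lookup of the inverse relation type. The names `INVERSE_TYPE` and `relation_pair` are hypothetical. Only the `superset`/`subset` pairing comes from the text; mirroring `conflict` and `complement` onto themselves, and recording an LLM-defined type unchanged on both sides, are assumptions.

```python
# Inverse relation types. Only superset <-> subset is stated in the text;
# the symmetric entries are assumptions.
INVERSE_TYPE = {
    "superset": "subset",
    "subset": "superset",
    "conflict": "conflict",
    "complement": "complement",
}


def relation_pair(new_id: str, old_id: str, rel_type: str) -> list[tuple[str, dict]]:
    """Return (owner_id, relation_record) pairs to write for one judgment."""
    if rel_type == "none":
        return []  # "none" is approved without writing any relation
    # Assumption: a custom LLM-defined type is mirrored as-is.
    inverse = INVERSE_TYPE.get(rel_type, rel_type)
    return [
        (new_id, {"type": rel_type, "target": old_id}),
        (old_id, {"type": inverse, "target": new_id}),
    ]
```

For example, `relation_pair("A", "B", "superset")` yields one record owned by A with type `superset` and one owned by B with type `subset`.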

---

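## Stage 2: Tool-Association Analysis

Implementation: `knowhub/server.py:KnowledgeProcessor._analyze_tool_relation`

```
Knowledge in dedup_passed
  ↓
LLM analyzes the knowledge content → identifies the tools it mentions
  ↓
Match against existing tools in tool_table
  ↓
Update the bidirectional links knowledge.tools[] and tool_table.*_knowledge[]
  ↓
status → approved
```

The analysis uses the `qwen3.5-plus` model.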
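
The matching and linking steps can be sketched with plain dictionaries standing in for the database. This is an illustration only: `link_tools` is a hypothetical helper, exact name matching is an assumption, and because the document writes the tool-side columns only as `*_knowledge[]`, the column name is passed in as a parameter instead of being guessed.

```python
def link_tools(knowledge: dict, mentioned: list[str],
               tool_table: dict[str, dict], column: str) -> list[str]:
    """Link a knowledge entry to the mentioned tools that already exist.

    `column` names the tool-side list to update; it is a placeholder for one
    of the `*_knowledge` columns, whose real names are not given here.
    """
    # Assumption: tools are matched by exact name against tool_table keys.
    matched = [name for name in mentioned if name in tool_table]
    for name in matched:
        # Knowledge side of the bidirectional link.
        if name not in knowledge["tools"]:
            knowledge["tools"].append(name)
        # Tool side of the bidirectional link.
        ids = tool_table[name].setdefault(column, [])
        if knowledge["id"] not in ids:
            ids.append(knowledge["id"])
    knowledge["status"] = "approved"
    return matched
```

The membership checks make the update idempotent, so re-running the analysis on the same entry does not duplicate links.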

---

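## Concurrency Control

- `process_pending()` uses an asyncio.Lock to prevent concurrent runs
- Optimistic lock: the processing state is locked via the updated_at timestamp; after a timeout (60 seconds) it is automatically reset to pending
- `POST /api/knowledge/process?force=true` force-resets entries stuck in a state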
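
The first two mechanisms above can be sketched together. This is a simplified model, assuming entries are dictionaries with `status` and `updated_at` (epoch seconds) fields; `reset_stale` is a hypothetical helper, and the body of `process_pending` here is a stand-in for the real one.

```python
import asyncio
import time

PROCESSING_TIMEOUT_S = 60  # timeout stated above

_lock = asyncio.Lock()


def reset_stale(rows: list[dict], now: float,
                timeout: float = PROCESSING_TIMEOUT_S) -> int:
    """Reset entries stuck in `processing` for longer than `timeout` seconds."""
    reset = 0
    for row in rows:
        if row["status"] == "processing" and now - row["updated_at"] > timeout:
            row["status"] = "pending"
            row["updated_at"] = now
            reset += 1
    return reset


async def process_pending(rows: list[dict]) -> int:
    """Serialize runs with the lock, then recover timed-out entries."""
    async with _lock:
        count = reset_stale(rows, time.time())
        # The deduplication and analysis stages would run here.
        return count
```

The asyncio.Lock only serializes coroutines within one event loop; the timestamp-based reset is what recovers entries left in `processing` by a run that crashed or stalled.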

---

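## Triggers

- Triggered automatically after a successful `POST /api/knowledge` (asynchronous background task)
- Triggered automatically after a successful `POST /api/extract`
- Triggered manually via `POST /api/knowledge/process`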
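
A manual trigger can be issued with any HTTP client. The sketch below only builds the request using the standard library; `build_process_request` is a hypothetical helper, and the base URL is an assumption that depends on where the server is deployed.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_process_request(base_url: str, force: bool = False) -> Request:
    """Build the POST request that manually triggers processing."""
    # force=true asks the server to reset entries stuck in a state.
    query = "?" + urlencode({"force": "true"}) if force else ""
    return Request(f"{base_url}/api/knowledge/process{query}", method="POST")
```

Passing the result to `urllib.request.urlopen` would send it, for example `urlopen(build_process_request("http://localhost:8000", force=True))`, where the host and port are placeholders.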