# text_embedding_api - Text Similarity via a Remote API

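## Overview

A simplified text-similarity module that calls a remote, GPU-accelerated API. **Caching has been removed**, because the API is fast enough that a local cache adds little.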

## Three Calculation Modes

```python
from lib.text_embedding_api import (
    compare_phrases,           # 1. single pair
    compare_phrases_batch,     # 2. batch of independent pairs
    compare_phrases_cartesian  # 3. Cartesian product
)
```

### 1. Single Pair

```python
result = compare_phrases("深度学习", "神经网络")
print(result['相似度'])  # similarity score, e.g. 0.8500
print(result['说明'])    # explanation, e.g. "基于向量模型计算的语义相似度为 高 (0.85)" (semantic similarity rated high)
```

### 2. Batch Pairwise

Use case: you have N independent pairs of texts and need a separate similarity score for each pair.

```python
pairs = [
    ("深度学习", "神经网络"),
    ("机器学习", "人工智能"),
    ("Python编程", "Python开发")
]

results = compare_phrases_batch(pairs)
for (a, b), result in zip(pairs, results):
    print(f"{a} vs {b}: {result['相似度']:.4f}")
```

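### 3. Cartesian Product ⭐

Use case: you need the similarity of every possible combination between two groups of texts (M×N).

#### Option A: return a nested list (with explanations)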

```python
phrases_a = ["深度学习", "机器学习"]
phrases_b = ["神经网络", "人工智能", "Python"]

results = compare_phrases_cartesian(phrases_a, phrases_b)

# Access results as results[row in phrases_a][column in phrases_b]
print(results[0][0]['相似度'])  # 深度学习 vs 神经网络
print(results[1][2]['说明'])    # 机器学习 vs Python
```

#### Option B: return a NumPy matrix (scores only, faster)

```python
matrix = compare_phrases_cartesian(phrases_a, phrases_b, return_matrix=True)

print(matrix.shape)  # (2, 3)
print(matrix[0, 1])  # 深度学习 vs 人工智能
print(matrix[1, 0])  # 机器学习 vs 神经网络
```
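
Once scores arrive as a matrix (the `return_matrix=True` form), a common follow-up is picking the best few matches per row. The sketch below is illustrative only: `top_k_matches` is a hypothetical helper that is not part of `text_embedding_api`, and the scores in `demo_matrix` are made up rather than real API output.

```python
import numpy as np


def top_k_matches(matrix, phrases_b, k=2):
    """Return, for each row, the k best (phrase, score) pairs, best first."""
    matrix = np.asarray(matrix, dtype=float)
    k = min(k, matrix.shape[1])
    # argsort is ascending, so reverse each row and keep the first k columns.
    order = np.argsort(matrix, axis=1)[:, ::-1][:, :k]
    return [
        [(phrases_b[j], float(matrix[i, j])) for j in row]
        for i, row in enumerate(order)
    ]


# Made-up scores standing in for a real response
# (rows follow phrases_a, columns follow phrases_b).
demo_matrix = np.array([
    [0.85, 0.72, 0.31],
    [0.78, 0.88, 0.40],
])
demo_b = ["神经网络", "人工智能", "Python"]

top = top_k_matches(demo_matrix, demo_b, k=2)
for row in top:
    print(row)
```

With a real response, `demo_matrix` would be replaced by the value returned from `compare_phrases_cartesian(..., return_matrix=True)`.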

## Performance Comparison

| Scenario | Data size | Approx. latency |
|------|--------|------|
| **Single pair** | 1 pair | ~30ms |
| **Batch pairwise** | 100 pairs | ~200ms |
| **Cartesian product** | 10×100=1000 | ~500ms |

## API Health Check

```python
from lib.text_embedding_api import get_api_health

health = get_api_health()
print(health['status'])              # "ok"
print(health['gpu_available'])       # True
print(health['max_cartesian_texts']) # maximum number of texts allowed
```
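
When an input set is larger than the limit reported by the health check, one option is to split the request and stitch the partial matrices together. The sketch below rests on an assumption the module does not document here: that `max_cartesian_texts` caps the total number of texts (both lists combined) in one request. `cartesian_in_chunks` is a hypothetical helper, and `fake_compare` is a stand-in so the example runs without network access.

```python
import numpy as np


def cartesian_in_chunks(phrases_a, phrases_b, max_texts, compare_fn):
    """Split phrases_b so each call sends at most max_texts texts in total."""
    chunk_size = max_texts - len(phrases_a)
    if chunk_size < 1:
        raise ValueError("phrases_a alone reaches the limit; split it as well")
    blocks = [
        np.asarray(compare_fn(phrases_a, phrases_b[start:start + chunk_size]))
        for start in range(0, len(phrases_b), chunk_size)
    ]
    if not blocks:
        return np.empty((len(phrases_a), 0))
    # Each block has len(phrases_a) rows, so they join column-wise.
    return np.hstack(blocks)


calls = []


def fake_compare(a, b):
    """Stand-in for the real API call: 1.0 for identical texts, else 0.0."""
    calls.append(len(b))
    return np.array([[1.0 if x == y else 0.0 for y in b] for x in a])


result = cartesian_in_chunks(
    ["a", "b"], ["a", "b", "c", "d", "e"], max_texts=4, compare_fn=fake_compare
)
print(result.shape, calls)
```

In real use, `compare_fn` could be `lambda a, b: compare_phrases_cartesian(a, b, return_matrix=True)` and `max_texts` could come from `get_api_health()['max_cartesian_texts']`, provided the limit means what is assumed here.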

## Business Integration Examples

### Scenario 1: match one feature against all persona features (1 vs N)

```python
from lib.text_embedding_api import compare_phrases_cartesian

feature = "宿命感"
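# 100 placeholder persona features: 人设1 ... 人设100
persona_features = [f"人设{i}" for i in range(1, 101)]

# A single API call returns all 100 similarity scores
matrix = compare_phrases_cartesian([feature], persona_features, return_matrix=True)
scores = matrix[0]  # take the first (and only) row

for i, score in enumerate(scores):
    if score > 0.7:  # handle only high-similarity matches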
        print(f"{feature} → {persona_features[i]}: {score:.4f}")
```

**Performance**: ~0.2 s (vs. ~10 s when calling pair by pair)

### Scenario 2: match multiple features in bulk (M vs N)

```python
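from lib.text_embedding_api import compare_phrases_cartesian

# Placeholder inputs: 特征1 ... 特征10 and 人设1 ... 人设100
features = [f"特征{i}" for i in range(1, 11)]
persona_features = [f"人设{i}" for i in range(1, 101)]

# A single API call returns all 10×100 = 1000 similarity scores
matrix = compare_phrases_cartesian(features, persona_features, return_matrix=True)

# Process the results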
for i, feature in enumerate(features):
    for j, persona in enumerate(persona_features):
        score = matrix[i, j]
        if score > 0.7:
            print(f"{feature} → {persona}: {score:.4f}")
```

**Performance**: ~0.5 s (vs. ~100 s when calling pair by pair)
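
For larger matrices, the nested loop over scores can be replaced with a vectorized filter. This is a minimal sketch, assuming the score matrix is a 2-D NumPy array; `matches_above` is a hypothetical helper, and the scores in `demo` are invented for illustration.

```python
import numpy as np


def matches_above(matrix, features, personas, threshold=0.7):
    """Return (feature, persona, score) for every score above threshold."""
    matrix = np.asarray(matrix, dtype=float)
    # argwhere yields one (row, column) index pair per qualifying cell.
    hits = np.argwhere(matrix > threshold)
    results = [(features[i], personas[j], float(matrix[i, j])) for i, j in hits]
    # Highest similarity first.
    results.sort(key=lambda item: item[2], reverse=True)
    return results


# Made-up scores standing in for a real API response.
demo = np.array([
    [0.91, 0.42, 0.75],
    [0.10, 0.83, 0.69],
])
found = matches_above(demo, ["特征1", "特征2"], ["人设1", "人设2", "人设3"])
for feature, persona, score in found:
    print(f"{feature} → {persona}: {score:.4f}")
```

The filter runs locally on the returned matrix, so it adds no API calls.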

## Compatibility with text_embedding.py

The `compare_phrases()` interface is fully compatible:

```python
# Old code
from lib.text_embedding import compare_phrases

# New code (drop-in replacement)
from lib.text_embedding_api import compare_phrases

# Usage is exactly the same
result = compare_phrases("测试1", "测试2")
```

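**Differences**:
- ✅ Faster (GPU-accelerated)
- ✅ No local memory footprint for the model (nothing is loaded locally)
- ✅ Adds the Cartesian-product mode
- ❌ Requires a network connection
- ❌ No caching (the API is fast enough that it is not needed)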
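
Because every call now depends on the network, callers may want to tolerate transient failures. The sketch below shows a generic retry wrapper with exponential backoff. It is an assumption-laden illustration: `call_with_retry` is a hypothetical helper, and which exceptions `text_embedding_api` actually raises is not documented here (with `requests`, `requests.exceptions.RequestException` is a plausible choice). `flaky_compare` simulates a failing call so the example runs offline.

```python
import time


def call_with_retry(fn, *args, retries=3, delay=0.5,
                    exceptions=(Exception,), **kwargs):
    """Call fn, retrying on the given exceptions with exponential backoff."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except exceptions:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            time.sleep(delay * (2 ** attempt))


attempts = {"count": 0}


def flaky_compare(a, b):
    """Stand-in for compare_phrases: fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated network failure")
    return {"相似度": 0.85}


result = call_with_retry(
    flaky_compare, "深度学习", "神经网络",
    retries=3, delay=0, exceptions=(ConnectionError,),
)
print(result["相似度"], attempts["count"])
```

Passing a narrow `exceptions` tuple keeps programming errors from being retried silently.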

## Dependencies

```bash
pip install requests numpy
```

## Testing

```bash
python3 lib/text_embedding_api.py
```

## API Configuration

Default API address: `http://61.48.133.26:8187`

To change it, configure the client in code:

```python
from lib.text_embedding_api import SimilarityAPIClient

client = SimilarityAPIClient(
    base_url="http://your-api-server:8187",
    timeout=120
)
```
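
To avoid hard-coding the server address, the client arguments could be read from environment variables. This is only a sketch: the variable names `SIMILARITY_API_URL` and `SIMILARITY_API_TIMEOUT` and the helper `resolve_api_settings` are hypothetical, and the fallback timeout of 120 is simply the value used in the example above, not a documented default.

```python
import os

DEFAULT_BASE_URL = "http://61.48.133.26:8187"


def resolve_api_settings(environ=None):
    """Build client keyword arguments from environment variables."""
    env = os.environ if environ is None else environ
    # Strip a trailing slash so paths can be appended consistently.
    base_url = env.get("SIMILARITY_API_URL", DEFAULT_BASE_URL).rstrip("/")
    timeout = int(env.get("SIMILARITY_API_TIMEOUT", "120"))
    return {"base_url": base_url, "timeout": timeout}


# A plain dict stands in for os.environ so the example is reproducible.
settings = resolve_api_settings({"SIMILARITY_API_URL": "http://your-api-server:8187/"})
print(settings)

# The result could then be passed to the client:
# client = SimilarityAPIClient(**settings)
```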

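## Summary

**Three interfaces, no cache, focused on performance:**

1. `compare_phrases(a, b)` - single pair
2. `compare_phrases_batch([(a,b),...])` - batch pairwise
3. `compare_phrases_cartesian([...], [...])` - Cartesian product ⭐

**Recommendation**: prefer the Cartesian-product interface for bulk data; it is the fastest of the three.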