A simplified text-similarity module backed by a remote GPU-accelerated API. The local cache has been removed because the API is fast enough on its own.
```python
from lib.text_embedding_api import (
    compare_phrases,            # 1. single pair
    compare_phrases_batch,      # 2. batch of pairs
    compare_phrases_cartesian,  # 3. Cartesian product
)
```

Single-pair computation:

```python
result = compare_phrases("深度学习", "神经网络")
print(result['相似度'])  # 0.8500
print(result['说明'])    # e.g. "基于向量模型计算的语义相似度为 高 (0.85)"
```
Batch of pairs: use when you have N independent text pairs, each needing its own similarity score.

```python
pairs = [
    ("深度学习", "神经网络"),
    ("机器学习", "人工智能"),
    ("Python编程", "Python开发"),
]
results = compare_phrases_batch(pairs)
for (a, b), result in zip(pairs, results):
    print(f"{a} vs {b}: {result['相似度']:.4f}")
```
Cartesian product: use when you need every combination between two groups of texts (M×N scores).

```python
phrases_a = ["深度学习", "机器学习"]
phrases_b = ["神经网络", "人工智能", "Python"]

results = compare_phrases_cartesian(phrases_a, phrases_b)

# Access individual results
print(results[0][0]['相似度'])  # 深度学习 vs 神经网络
print(results[1][2]['说明'])    # 机器学习 vs Python
```

Pass `return_matrix=True` to get the scores back as an (M, N) matrix instead of nested results:

```python
matrix = compare_phrases_cartesian(phrases_a, phrases_b, return_matrix=True)
print(matrix.shape)   # (2, 3)
print(matrix[0, 1])   # 深度学习 vs 人工智能
print(matrix[1, 0])   # 机器学习 vs 神经网络
```
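The indexing above suggests the matrix behaves like a NumPy array (NumPy is listed as a dependency further down). Assuming that holds, the best match for each phrase in `phrases_a` falls out of a single `argmax`; a minimal sketch continuing the example:

```python
import numpy as np

scores = np.asarray(compare_phrases_cartesian(phrases_a, phrases_b, return_matrix=True))
best = scores.argmax(axis=1)  # column index of the best match for each row
for i, j in enumerate(best):
    print(f"{phrases_a[i]} → {phrases_b[j]}: {scores[i, j]:.4f}")
```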
Reference timings:

| Scenario | Data size | Time |
|---|---|---|
| Single pair | 1 pair | ~30 ms |
| Batch of pairs | 100 pairs | ~200 ms |
| Cartesian product | 10×100 = 1000 scores | ~500 ms |
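These numbers are indicative and depend on the network and the API server. A quick way to measure them in your own environment is to time a call directly; the pair list below is placeholder data:

```python
import time
from lib.text_embedding_api import compare_phrases_batch

pairs = [("深度学习", "神经网络")] * 100  # placeholder: 100 identical pairs

start = time.perf_counter()
compare_phrases_batch(pairs)
print(f"100 pairs took {time.perf_counter() - start:.3f} s")
```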
Checking API health:

```python
from lib.text_embedding_api import get_api_health

health = get_api_health()
print(health['status'])               # "ok"
print(health['gpu_available'])        # True
print(health['max_cartesian_texts'])  # upper limit on texts per Cartesian call
```
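One way to respect `max_cartesian_texts` is to split a long candidate list into chunks and stitch the partial score matrices back together. The sketch below assumes the limit counts the texts of both input lists in a single request and that `return_matrix=True` returns NumPy-compatible arrays; adjust to the server's actual semantics:

```python
import numpy as np
from lib.text_embedding_api import compare_phrases_cartesian, get_api_health

def cartesian_in_chunks(phrases_a, phrases_b):
    """Call the Cartesian interface in pieces that respect the API's text limit."""
    limit = get_api_health()['max_cartesian_texts']
    chunk_size = max(1, limit - len(phrases_a))  # assumption: the limit covers both lists together
    parts = []
    for start in range(0, len(phrases_b), chunk_size):
        chunk = phrases_b[start:start + chunk_size]
        parts.append(np.asarray(compare_phrases_cartesian(phrases_a, chunk, return_matrix=True)))
    return np.hstack(parts)  # full (M, N) score matrix
```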
Example: match a single feature against 100 persona descriptions in one call.

```python
from lib.text_embedding_api import compare_phrases_cartesian

feature = "宿命感"
persona_features = [f"人设{i}" for i in range(1, 101)]  # placeholder: 100 persona descriptions

# One API call returns all 100 similarities
matrix = compare_phrases_cartesian([feature], persona_features, return_matrix=True)
scores = matrix[0]  # first (and only) row

for i, score in enumerate(scores):
    if score > 0.7:  # keep only high-similarity matches
        print(f"{feature} → {persona_features[i]}: {score:.4f}")
```

Performance: ~0.2 s (vs. ~10 s when calling the single-pair interface in a loop).
features = ["特征1", "特征2", ..., "特征10"]
persona_features = ["人设1", "人设2", ..., "人设100"]
# 一次API调用获取10×100=1000个相似度
matrix = compare_phrases_cartesian(features, persona_features, return_matrix=True)
# 处理结果
for i, feature in enumerate(features):
for j, persona in enumerate(persona_features):
score = matrix[i, j]
if score > 0.7:
print(f"{feature} → {persona}: {score:.4f}")
性能: ~0.5秒(vs 逐对调用 ~100秒)
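When only the strongest matches per feature matter, NumPy's `argsort` replaces the nested Python loop; a sketch assuming the returned matrix is NumPy-compatible (`top_k` is an arbitrary choice):

```python
import numpy as np

top_k = 3
scores = np.asarray(matrix)
for i, feature in enumerate(features):
    best = scores[i].argsort()[::-1][:top_k]  # indices of the top-k personas for this feature
    for j in best:
        print(f"{feature} → {persona_features[j]}: {scores[i, j]:.4f}")
```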
`compare_phrases()` is fully compatible with the old interface:

```python
# Old code
from lib.text_embedding import compare_phrases

# New code (drop-in replacement)
from lib.text_embedding_api import compare_phrases

# Usage is identical
result = compare_phrases("测试1", "测试2")
```
Difference: the API version has no local cache and relies on the remote GPU service (see the note at the top), so it only needs lightweight client dependencies:

```bash
pip install requests numpy
```

The module can also be run directly:

```bash
python3 lib/text_embedding_api.py
```
Default API endpoint: `http://61.48.133.26:8187`. To point at a different server, construct the client yourself:

```python
from lib.text_embedding_api import SimilarityAPIClient

client = SimilarityAPIClient(
    base_url="http://your-api-server:8187",
    timeout=120,
)
```
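If the endpoint differs between environments, it can be read from an environment variable before constructing the client; the variable name `SIMILARITY_API_URL` below is purely illustrative:

```python
import os
from lib.text_embedding_api import SimilarityAPIClient

# SIMILARITY_API_URL is a hypothetical name; use whatever fits your deployment
base_url = os.environ.get("SIMILARITY_API_URL", "http://61.48.133.26:8187")
client = SimilarityAPIClient(base_url=base_url, timeout=120)
```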
Three interfaces, no cache, focused on performance:

- `compare_phrases(a, b)` - single pair
- `compare_phrases_batch([(a, b), ...])` - batch of pairs
- `compare_phrases_cartesian([...], [...])` - Cartesian product

⭐ Recommendation: prefer the Cartesian interface for batch workloads; it gives the best performance.
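Putting the pieces together, a minimal end-to-end script might check the service health, issue one Cartesian call, and print the strong matches; the phrase lists below are placeholders reused from the examples above:

```python
from lib.text_embedding_api import compare_phrases_cartesian, get_api_health

features = ["深度学习"]                          # placeholder query texts
candidates = ["神经网络", "人工智能", "Python"]  # placeholder candidate texts

health = get_api_health()
if health['status'] == 'ok':
    matrix = compare_phrases_cartesian(features, candidates, return_matrix=True)
    for i, feature in enumerate(features):
        for j, candidate in enumerate(candidates):
            if matrix[i, j] > 0.7:
                print(f"{feature} → {candidate}: {matrix[i, j]:.4f}")
```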