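Cache Path Configuration

Overview

The project manages cache paths centrally. All cached data is stored under ~/cache/ by default, and the location is configured through the lib/config.py module.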

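Directory Structure

~/cache/                        # Cache root (default: ~/cache, configurable)
├── text_embedding/            # Embedding-similarity computation cache
├── semantic_similarity/       # Semantic-similarity computation cache
└── data/                      # Data cache (crawlers, analysis, etc.)
    ├── search/                # Search result cache
    ├── detail/                # Detail data cache
    └── tools_list/            # Tool list cache

data/                          # Non-cache data (project data, configuration, etc.)
├── 阿里多多酱/                # Account-related data
├── data_1117/                # Data for a specific date
└── ...                       # Other non-cache files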

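Usage

1. Use the default configuration (recommended)

By default, all cache files are stored in ~/cache/ under the user's home directory, and no configuration is required:

from lib.text_embedding import compare_phrases

# Computation cache: ~/cache/text_embedding/
result = compare_phrases("深度学习", "神经网络")

# Data cache: ~/cache/data/search/ (run this command in a shell)
python script/search/ai_search.py --query "深度学习"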

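2. Set the cache root in code

Call set_cache_root() at the start of the program to set the global cache root. All caches, both computation and data, then use the new path:

from lib.config import set_cache_root
from lib.text_embedding import compare_phrases

# Set the cache root
set_cache_root("/custom/cache")

# Computation cache: /custom/cache/text_embedding/
result = compare_phrases("深度学习", "神经网络")

# Data cache: /custom/cache/data/search/
# Crawler scripts also use the new path when they run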

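3. Set the cache root with an environment variable

Set the environment variable before running the program, and all caches use the new path:

# Linux/Mac
export CACHE_ROOT=/custom/cache
python your_script.py
# Computation cache -> /custom/cache/text_embedding/
# Data cache -> /custom/cache/data/search/

# Windows (cmd.exe)
set CACHE_ROOT=C:\custom\cache
python your_script.py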
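When a script is launched from another Python program rather than from a shell, the same variable can be passed to the child process only. The sketch below is a minimal illustration using the standard library; run_with_cache_root is a hypothetical helper, not part of this project, and it assumes the launched script reads CACHE_ROOT as described above.

```python
import os
import subprocess
import sys


def run_with_cache_root(cache_root, argv):
    """Run a command with CACHE_ROOT set for that child process only.

    Hypothetical helper: the parent's own environment is left untouched.
    """
    env = dict(os.environ, CACHE_ROOT=cache_root)
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=True)


# Demo: a child interpreter that just echoes the variable it received.
child = [sys.executable, "-c", "import os; print(os.environ['CACHE_ROOT'])"]
completed = run_with_cache_root("/custom/cache", child)
print(completed.stdout.strip())
```

In real use, argv would be something like [sys.executable, "your_script.py"].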

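4. Specify a cache directory for a single call

To use a different cache directory for one specific call only:

Computation cache:

from lib.text_embedding import compare_phrases

# Use a dedicated cache directory for this call only
result = compare_phrases(
    "深度学习",
    "神经网络",
    cache_dir="/tmp/my_custom_cache"
)

Data cache:

# Specify the directory with a command-line argument
python script/search/ai_search.py --query "test" --results-dir /custom/output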

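Configuration Priority

Computation cache priority

1. Function argument cache_dir - highest priority
2. Calling set_cache_root() in code - medium priority
3. Environment variable CACHE_ROOT - lower priority
4. Default ~/cache - lowest priority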
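The lookup order above can be sketched as a small resolver. This is not the code in lib/config.py, which is not shown in this document; set_cache_root_sketch and resolve_cache_dir are hypothetical names, and the sketch assumes that a per-call cache_dir is used as the final directory rather than as a new root.

```python
import os
from pathlib import Path

# Value set in code; None means "not set". Hypothetical module-level state.
_cache_root_override = None


def set_cache_root_sketch(path):
    """Record a cache root chosen in code (stand-in for set_cache_root)."""
    global _cache_root_override
    _cache_root_override = path


def resolve_cache_dir(subdir, cache_dir=None, environ=None):
    """Return the cache directory for `subdir` using the documented order."""
    environ = os.environ if environ is None else environ
    if cache_dir is not None:              # 1. per-call argument
        return Path(cache_dir)
    if _cache_root_override is not None:   # 2. value set in code
        root = Path(_cache_root_override)
    elif environ.get("CACHE_ROOT"):        # 3. environment variable
        root = Path(environ["CACHE_ROOT"])
    else:                                  # 4. built-in default
        root = Path.home() / "cache"
    return root / subdir


env = {"CACHE_ROOT": "/from/env"}
print(resolve_cache_dir("text_embedding", environ={}))    # default root
print(resolve_cache_dir("text_embedding", environ=env))   # environment variable
set_cache_root_sketch("/from/code")
print(resolve_cache_dir("text_embedding", environ=env))   # code beats environment
print(resolve_cache_dir("text_embedding", cache_dir="/tmp/one_off", environ=env))
```

Passing environ explicitly keeps the function easy to test without modifying the real process environment.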

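Data cache priority

1. Command-line argument --results-dir - highest priority
2. Calling set_cache_root() in code - medium priority (changes where ~/cache/data/ is located)
3. Environment variable CACHE_ROOT - lower priority (changes where ~/cache/data/ is located)
4. Default ~/cache/data/ - lowest priority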
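For a command-line script, the first, third, and fourth levels of this order might look like the sketch below. It is an illustration only: parse_results_dir and default_results_dir are hypothetical, the actual argument handling in script/search/ai_search.py is not shown in this document, and the set_cache_root() level is left out for brevity.

```python
import argparse
import os
from pathlib import Path


def default_results_dir(subdir, environ=None):
    """Fall back to <cache root>/data/<subdir>, honoring CACHE_ROOT if set."""
    environ = os.environ if environ is None else environ
    root = environ.get("CACHE_ROOT") or str(Path.home() / "cache")
    return Path(root) / "data" / subdir


def parse_results_dir(argv, subdir="search", environ=None):
    """Return --results-dir when given, otherwise the default data cache path."""
    parser = argparse.ArgumentParser(description="hypothetical search script")
    parser.add_argument("--query", required=True)
    parser.add_argument("--results-dir", default=None)
    args = parser.parse_args(argv)
    if args.results_dir is not None:
        return Path(args.results_dir)
    return default_results_dir(subdir, environ)


print(parse_results_dir(["--query", "test", "--results-dir", "/custom/output"]))
print(parse_results_dir(["--query", "test"], environ={"CACHE_ROOT": "/custom/cache"}))
```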

Modules Involved

Computation cache (cache/)

lib/text_embedding.py - embedding-similarity cache (cache/text_embedding/)
lib/semantic_similarity.py - semantic-similarity cache (cache/semantic_similarity/)
lib/hybrid_similarity.py - hybrid-similarity cache
script/analysis/analyze_model_comparison.py - model comparison analysis
script/analysis/test_all_models.py - model tests

Data cache (cache/data/)

script/search/ - search result cache (cache/data/search/)
- ai_search.py, custom_search.py, douyin_search.py, xiaohongshu_search.py
script/detail/ - detail data cache (cache/data/detail/)
- xiaohongshu_detail.py
script/get_tools_list.py - tool list cache (cache/data/tools_list/)
script/search_recommendations/ - search recommendation cache (cache/data/search_recommendations/)
script/search_tagwords/ - search tag-word cache (cache/data/search_tagwords/)

Non-cache data (data/)

Account-related data (data/阿里多多酱/, data/账号/)
Date-specific data (data/data_1117/, data/data_1118/, etc.)
Analysis scripts (data/*.py)
Analysis results (data/*.xlsx, data/*.json)
Documents (data/*.md)

Example Code

Example 1: Use the default configuration

from lib.text_embedding import compare_phrases

result = compare_phrases("如何更换花呗绑定银行卡", "花呗更改绑定银行卡")
print(f"Similarity: {result['相似度']:.3f}")
# Cache location: ~/cache/text_embedding/

Example 2: Set the global cache root

from lib.config import set_cache_root, get_cache_root
from lib.text_embedding import compare_phrases
from lib.semantic_similarity import compare_phrases as compare_phrases_semantic
import asyncio

# Set the global cache root
set_cache_root("/path/to/custom/cache")

print(f"Current cache root: {get_cache_root()}")
# Output: Current cache root: /path/to/custom/cache

# All modules now use the new cache path
result1 = compare_phrases("深度学习", "神经网络")
# Cache location: /path/to/custom/cache/text_embedding/

result2 = asyncio.run(compare_phrases_semantic("深度学习", "神经网络"))
# Cache location: /path/to/custom/cache/semantic_similarity/

Example 3: Use the environment variable

# Set the environment variable before running the script
# export CACHE_ROOT=/Users/semsevens/Desktop/workspace/daily/1113/how_1120_v3/cache

from lib.config import get_cache_root
from lib.text_embedding import compare_phrases

print(f"Cache root: {get_cache_root()}")
# Output: Cache root: /Users/semsevens/Desktop/workspace/daily/1113/how_1120_v3/cache

result = compare_phrases("测试", "示例")
# Cache location: /Users/semsevens/Desktop/workspace/daily/1113/how_1120_v3/cache/text_embedding/

Example 4: Configure the hybrid similarity module

from lib.hybrid_similarity import compare_phrases
from lib.config import set_cache_root
import asyncio

# Option 1: use the global configuration
set_cache_root("/custom/cache")
result = asyncio.run(compare_phrases("深度学习", "神经网络"))
# Embedding model cache: /custom/cache/text_embedding/
# Semantic model cache: /custom/cache/semantic_similarity/

# Option 2: specify each cache directory separately
result = asyncio.run(compare_phrases(
    "深度学习",
    "神经网络",
    cache_dir_embedding="/path/to/embedding/cache",
    cache_dir_semantic="/path/to/semantic/cache"
))

Example 5: Use in a script

# script/my_analysis.py
import sys
from pathlib import Path

# Add the project root to the import path
project_root = Path(__file__).parent.parent
sys.path.insert(0, str(project_root))

from lib.config import set_cache_root, get_cache_dir

# Set the cache root
set_cache_root("/Users/semsevens/Desktop/workspace/daily/1113/how_1120_v3/cache")

# Get the cache directory for a specific module
text_embedding_cache = get_cache_dir("text_embedding")
semantic_similarity_cache = get_cache_dir("semantic_similarity")

print(f"Embedding model cache: {text_embedding_cache}")
print(f"Semantic model cache: {semantic_similarity_cache}")

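API Reference

lib.config module

Cache paths

`get_cache_root() -> str`

Returns the current cache root directory.

`set_cache_root(path: str) -> None`

Sets the cache root directory.

Parameters:

path: cache root path (absolute or relative)

`get_cache_dir(subdir: str) -> str`

Returns the cache directory for a specific submodule.

Parameters:

subdir: subdirectory name, such as "text_embedding" or "semantic_similarity"

Returns:

The path of that subdirectory under the current cache root, for example /custom/cache/text_embedding when the cache root is /custom/cache.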

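Data paths

`get_data_root() -> str`

Returns the current data root directory.

`set_data_root(path: str) -> None`

Sets the data root directory.

Parameters:

path: data root path (absolute or relative)

`get_data_dir(subdir: str = "") -> str`

Returns the data directory for a specific submodule.

Parameters:

subdir: subdirectory name, such as "search", "detail", or "tools_list". An empty string returns the data root

Returns:

The path of the requested data directory.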

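Notes

Path format: both absolute and relative paths are supported; relative paths are resolved against the current working directory
Automatic creation: cache directories are created automatically on first write
Thread safety: the configuration module is thread-safe and can be used in multithreaded environments
Environment variable priority: when both the environment variable and code configuration are set, the code configuration takes precedence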
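Two of the notes above, creation on first write and thread safety, can be illustrated with a lock-guarded root and a mkdir call that tolerates concurrent callers. The class below is a hypothetical sketch under those assumptions, not the implementation in lib/config.py.

```python
import tempfile
import threading
from pathlib import Path


class CacheConfigSketch:
    """Hypothetical holder for a cache root that is safe to share across threads."""

    def __init__(self, root):
        self._lock = threading.Lock()
        self._root = Path(root)

    def set_root(self, path):
        with self._lock:
            self._root = Path(path)

    def get_root(self):
        with self._lock:
            return self._root

    def write(self, subdir, name, text):
        target_dir = self.get_root() / subdir
        # Created on first write; exist_ok makes concurrent creation harmless.
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / name
        target.write_text(text, encoding="utf-8")
        return target


with tempfile.TemporaryDirectory() as tmp:
    config = CacheConfigSketch(tmp)
    missing_before = not (Path(tmp) / "text_embedding").exists()
    threads = [
        threading.Thread(target=config.write, args=("text_embedding", f"{i}.json", "{}"))
        for i in range(8)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    written = sorted(p.name for p in (Path(tmp) / "text_embedding").iterdir())

print(missing_before, len(written))
```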

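Migration Guide

If you previously used a hard-coded cache path (such as /Users/semsevens/Desktop/workspace/daily/1113/how_1120_v3/cache), you can now use one of the following options:

Option 1: Set the environment variable (recommended)

export CACHE_ROOT=/Users/semsevens/Desktop/workspace/daily/1113/how_1120_v3/cache

Then run your scripts as usual; no code changes are needed.

Option 2: Set it at the top of your code

Add the following at the top of your script:

from lib.config import set_cache_root

set_cache_root("/Users/semsevens/Desktop/workspace/daily/1113/how_1120_v3/cache")

Option 3: Use a path relative to the project

To keep the cache path relative to the project directory:

from lib.config import set_cache_root
from pathlib import Path

project_root = Path(__file__).parent.parent
cache_path = project_root / "cache"
set_cache_root(str(cache_path))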

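FAQ

Q: Can different modules use different cache roots?

A: Not currently. All modules share one cache root, but you can pass the cache_dir argument to use a different path for an individual call.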

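Q: Are old cache files migrated automatically after the cache path changes?

A: No. Move the cache files to the new directory manually, or let the program regenerate the cache.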
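A manual move can also be scripted. The following is a minimal sketch using the standard library; migrate_cache is a hypothetical helper that is not part of this project, and it skips any entry whose name already exists at the destination instead of merging or overwriting.

```python
import shutil
import tempfile
from pathlib import Path


def migrate_cache(old_root, new_root):
    """Move each top-level entry of old_root into new_root, skipping name clashes."""
    old_root, new_root = Path(old_root), Path(new_root)
    new_root.mkdir(parents=True, exist_ok=True)
    moved = []
    for entry in sorted(old_root.iterdir()):
        destination = new_root / entry.name
        if destination.exists():
            continue  # leave clashes for manual review
        shutil.move(str(entry), str(destination))
        moved.append(entry.name)
    return moved


with tempfile.TemporaryDirectory() as tmp:
    old, new = Path(tmp) / "old_cache", Path(tmp) / "new_cache"
    (old / "text_embedding").mkdir(parents=True)
    (old / "text_embedding" / "a.json").write_text("{}", encoding="utf-8")
    (old / "semantic_similarity").mkdir()
    moved = migrate_cache(old, new)
    file_arrived = (new / "text_embedding" / "a.json").is_file()
    old_is_empty = not any(old.iterdir())

print(moved, file_arrived, old_is_empty)
```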

Q: How do I clear the cache?

A: Delete the contents of the cache directory. With the default cache root: rm -rf ~/cache/text_embedding/* or rm -rf ~/cache/semantic_similarity/*
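Where rm is unavailable (for example on Windows), the same cleanup can be done from Python. This sketch uses a hypothetical clear_cache_dir helper; unlike the shell glob above, it also removes hidden entries, and it keeps the directory itself in place.

```python
import shutil
import tempfile
from pathlib import Path


def clear_cache_dir(cache_dir):
    """Delete everything inside cache_dir but keep the directory itself."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return 0
    removed = 0
    for entry in cache_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)   # real subdirectory: remove recursively
        else:
            entry.unlink()         # regular file or symlink
        removed += 1
    return removed


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "text_embedding"
    (cache / "nested").mkdir(parents=True)
    (cache / "nested" / "x.json").write_text("{}", encoding="utf-8")
    (cache / "y.json").write_text("{}", encoding="utf-8")
    removed_count = clear_cache_dir(cache)
    still_there = cache.is_dir()
    remaining = list(cache.iterdir())

print(removed_count, still_there, remaining)
```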

Q: What if cache files take up too much space?

A: Clean up old cache files periodically, or point the cache at a temporary directory (such as /tmp/cache).
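Periodic cleanup could be done by file age. The sketch below removes files whose modification time is older than a given number of days; prune_old_files is a hypothetical helper, and the 30-day threshold is only an example value.

```python
import os
import tempfile
import time
from pathlib import Path


def prune_old_files(cache_dir, max_age_days, now=None):
    """Delete files under cache_dir last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    # Materialize the listing first so deletions do not disturb the walk.
    for path in list(Path(cache_dir).rglob("*")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)


with tempfile.TemporaryDirectory() as tmp:
    old_file = Path(tmp) / "old.json"
    new_file = Path(tmp) / "fresh.json"
    old_file.write_text("{}", encoding="utf-8")
    new_file.write_text("{}", encoding="utf-8")
    forty_days_ago = time.time() - 40 * 86400
    os.utime(old_file, (forty_days_ago, forty_days_ago))  # backdate the old file
    pruned = prune_old_files(tmp, max_age_days=30)
    fresh_survives = new_file.exists()

print(pruned, fresh_survives)
```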

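Q: What is the difference between the cache/ and data/ directories?

A: They differ in whether their contents can be regenerated:

cache/: all cached data that can be regenerated
- cache/text_embedding/ - computation cache
- cache/semantic_similarity/ - computation cache
- cache/data/ - data cache (crawlers, tool lists, etc.)
data/: project data that cannot be regenerated
- account data, date-specific analysis results, documents, etc.

Q: Why is cache/data/ also called a cache?

A: Because the data collected by crawlers (search, detail, tools_list) can be fetched again by rerunning the scripts, it is effectively regenerable cache data. Keeping it under cache/ makes it easier to manage and clean up.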
