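# Cache Path Configuration

## Overview

This project manages cache paths in one place. By default, all cached data is stored under `~/cache/`, and the location is configured through the `lib/config.py` module.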

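## Directory Structure

```
~/cache/                        # cache root (default: ~/cache, configurable)
├── text_embedding/            # embedding-similarity computation cache
├── semantic_similarity/       # semantic-similarity computation cache
└── data/                      # data cache (crawlers, analysis, etc.)
    ├── search/                # search result cache
    ├── detail/                # detail data cache
    └── tools_list/            # tool list cache

data/                          # non-cache data (project data, configuration, etc.)
├── 阿里多多酱/                # account-specific data
├── data_1117/                # date-specific data
└── ...                       # other non-cache files
```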

## Usage

### 1. Use the default configuration (recommended)

By default, all cache files are stored under `~/cache/` in the user's home directory; **no configuration is needed**:

```python
from lib.text_embedding import compare_phrases

# Computation cache: ~/cache/text_embedding/
result = compare_phrases("深度学习", "神经网络")
```

```bash
# Data cache: ~/cache/data/search/
python script/search/ai_search.py --query "深度学习"
```

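### 2. Set the cache root in code

Call `set_cache_root()` at program start to set the global cache root. Every cache (both the computation cache and the data cache) then uses the new path:

```python
from lib.config import set_cache_root
from lib.text_embedding import compare_phrases

# Set the cache root
set_cache_root("/custom/cache")

# Computation cache: /custom/cache/text_embedding/
result = compare_phrases("深度学习", "神经网络")

# Data cache: /custom/cache/data/search/
# Crawler scripts run after this call also use the new path
```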

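### 3. Set the cache root through an environment variable

Set the `CACHE_ROOT` environment variable before running the program, and every cache uses the new path:

```bash
# Linux/Mac
export CACHE_ROOT=/custom/cache
python your_script.py
# Computation cache -> /custom/cache/text_embedding/
# Data cache -> /custom/cache/data/search/

# Windows (cmd.exe)
set CACHE_ROOT=C:\custom\cache
python your_script.py
```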
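When a script is launched from Python rather than from a shell, the variable can be set for that child process only. The following is a minimal sketch using the standard `subprocess` module; the helper name `run_with_cache_root` is hypothetical, and the child command here merely echoes the variable instead of running a real project script.

```python
import os
import subprocess
import sys


def run_with_cache_root(args, cache_root):
    """Run a child process with CACHE_ROOT set only for that process."""
    # Copy the parent's environment and override a single variable.
    env = {**os.environ, "CACHE_ROOT": str(cache_root)}
    return subprocess.run(args, env=env, capture_output=True, text=True, check=True)


# Demo: the child prints the CACHE_ROOT value it received.
child = [sys.executable, "-c", "import os; print(os.environ['CACHE_ROOT'])"]
completed = run_with_cache_root(child, "/custom/cache")
print(completed.stdout.strip())
```

The parent's own environment is left untouched, so other scripts started later still see the original setting.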

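### 4. Specify a cache directory for a single call

To override the cache directory for one specific call only:

**Computation cache:**
```python
from lib.text_embedding import compare_phrases

# Use a dedicated cache directory for this call only
result = compare_phrases(
    "深度学习",
    "神经网络",
    cache_dir="/tmp/my_custom_cache"
)
```

**Data cache:**
```bash
# Specify the directory with a command-line argument
python script/search/ai_search.py --query "test" --results-dir /custom/output
```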

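## Configuration Priority

### Computation cache priority

1. **Function parameter `cache_dir`** - highest priority
2. **Calling `set_cache_root()` in code** - medium priority
3. **Environment variable `CACHE_ROOT`** - lower priority
4. **Default value `~/cache`** - lowest priority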
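The order above can be sketched as a small resolver. This is only an illustration of the documented priority, not the actual code in `lib/config.py`: the module-level variable `_cache_root_override` and the helper `resolve_cache_dir` are hypothetical, and the two functions named after `set_cache_root()` and `get_cache_root()` are stand-ins that mimic the behaviour described here.

```python
import os
from pathlib import Path

# Hypothetical module state; None means set_cache_root() was never called.
_cache_root_override = None


def set_cache_root(path):
    """Stand-in for lib.config.set_cache_root (assumed behaviour)."""
    global _cache_root_override
    _cache_root_override = str(path)


def get_cache_root():
    """Stand-in for lib.config.get_cache_root: code > environment > default."""
    if _cache_root_override is not None:
        return _cache_root_override
    env_value = os.environ.get("CACHE_ROOT")
    if env_value:
        return env_value
    return str(Path.home() / "cache")


def resolve_cache_dir(subdir, cache_dir=None):
    """Hypothetical helper: an explicit cache_dir beats every other setting."""
    if cache_dir is not None:
        return str(cache_dir)
    return os.path.join(get_cache_root(), subdir)


# A per-call cache_dir wins regardless of the global configuration.
print(resolve_cache_dir("text_embedding", cache_dir="/tmp/my_custom_cache"))
```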

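### Data cache priority

1. **Command-line argument `--results-dir`** - highest priority
2. **Calling `set_cache_root()` in code** - medium priority (moves the data cache to `<cache root>/data/`)
3. **Environment variable `CACHE_ROOT`** - lower priority (moves the data cache to `<cache root>/data/`)
4. **Default value `~/cache/data/`** - lowest priority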
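For a command-line script, the fallback from `--results-dir` to the cache root might look like the sketch below. It is written under assumptions: the helper names `default_results_dir` and `parse_args` are hypothetical, the `set_cache_root()` level is left out for brevity, and the real scripts under `script/search/` may parse their arguments differently.

```python
import argparse
import os
from pathlib import Path


def default_results_dir(subdir):
    """Hypothetical helper: <cache root>/data/<subdir>, honouring CACHE_ROOT."""
    root = os.environ.get("CACHE_ROOT") or str(Path.home() / "cache")
    return os.path.join(root, "data", subdir)


def parse_args(argv=None):
    """Parse arguments; --results-dir overrides the cache-derived default."""
    parser = argparse.ArgumentParser(description="Search script sketch")
    parser.add_argument("--query", required=True, help="search query")
    parser.add_argument("--results-dir", default=None,
                        help="output directory (defaults to the data cache)")
    args = parser.parse_args(argv)
    if args.results_dir is None:
        # No explicit directory given: fall back to the data cache location.
        args.results_dir = default_results_dir("search")
    return args


args = parse_args(["--query", "test", "--results-dir", "/custom/output"])
print(args.results_dir)
```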

## Modules Involved

### Computation cache (`cache/`)

- **lib/text_embedding.py** - embedding-similarity cache (`cache/text_embedding/`)
- **lib/semantic_similarity.py** - semantic-similarity cache (`cache/semantic_similarity/`)
- **lib/hybrid_similarity.py** - hybrid-similarity cache
- **script/analysis/analyze_model_comparison.py** - model comparison analysis
- **script/analysis/test_all_models.py** - model testing

### Data cache (`cache/data/`)

- **script/search/** - search result cache (`cache/data/search/`)
  - ai_search.py, custom_search.py, douyin_search.py, xiaohongshu_search.py
- **script/detail/** - detail data cache (`cache/data/detail/`)
  - xiaohongshu_detail.py
- **script/get_tools_list.py** - tool list cache (`cache/data/tools_list/`)
- **script/search_recommendations/** - search recommendation cache (`cache/data/search_recommendations/`)
- **script/search_tagwords/** - search tag-word cache (`cache/data/search_tagwords/`)

### Non-cache data (`data/`)

- Account-specific data (`data/阿里多多酱/`, `data/账号/`)
- Date-specific data (`data/data_1117/`, `data/data_1118/`, etc.)
- Analysis scripts (`data/*.py`)
- Analysis results (`data/*.xlsx`, `data/*.json`)
- Documents (`data/*.md`)

## Example Code

### Example 1: Use the default configuration

```python
from lib.text_embedding import compare_phrases

result = compare_phrases("如何更换花呗绑定银行卡", "花呗更改绑定银行卡")
print(f"Similarity: {result['相似度']:.3f}")
# Cache location: ~/cache/text_embedding/
```

### Example 2: Set the global cache root

```python
import asyncio

from lib.config import set_cache_root, get_cache_root
from lib.text_embedding import compare_phrases
from lib.semantic_similarity import compare_phrases as compare_phrases_semantic

# Set the global cache root
set_cache_root("/path/to/custom/cache")

print(f"Current cache root: {get_cache_root()}")
# Prints the path set above: /path/to/custom/cache

# Every module now uses the new cache path
result1 = compare_phrases("深度学习", "神经网络")
# Cache location: /path/to/custom/cache/text_embedding/

result2 = asyncio.run(compare_phrases_semantic("深度学习", "神经网络"))
# Cache location: /path/to/custom/cache/semantic_similarity/
```

### Example 3: Use the environment variable

```python
# Set the environment variable before running the script:
# export CACHE_ROOT=/Users/semsevens/Desktop/workspace/daily/1113/how_1120_v3/cache

from lib.config import get_cache_root
from lib.text_embedding import compare_phrases

print(f"Cache root: {get_cache_root()}")
# Prints the CACHE_ROOT value: /Users/semsevens/Desktop/workspace/daily/1113/how_1120_v3/cache

result = compare_phrases("测试", "示例")
# Cache location: /Users/semsevens/Desktop/workspace/daily/1113/how_1120_v3/cache/text_embedding/
```

### Example 4: Configure the hybrid similarity module

```python
import asyncio

from lib.hybrid_similarity import compare_phrases
from lib.config import set_cache_root

# Option 1: use the global configuration
set_cache_root("/custom/cache")
result = asyncio.run(compare_phrases("深度学习", "神经网络"))
# Embedding model cache: /custom/cache/text_embedding/
# Semantic model cache: /custom/cache/semantic_similarity/

# Option 2: specify each cache directory separately
result = asyncio.run(compare_phrases(
    "深度学习",
    "神经网络",
    cache_dir_embedding="/path/to/embedding/cache",
    cache_dir_semantic="/path/to/semantic/cache"
))
```

### Example 5: Use in a script

```python
# script/my_analysis.py
import sys
from pathlib import Path

# Add the project root to the import path
project_root = Path(__file__).parent.parent
sys.path.insert(0, str(project_root))

from lib.config import set_cache_root, get_cache_dir

# Set the cache root
set_cache_root("/Users/semsevens/Desktop/workspace/daily/1113/how_1120_v3/cache")

# Get the cache directory of a specific module
text_embedding_cache = get_cache_dir("text_embedding")
semantic_similarity_cache = get_cache_dir("semantic_similarity")

print(f"Embedding model cache: {text_embedding_cache}")
print(f"Semantic model cache: {semantic_similarity_cache}")
```

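## API Reference

### lib.config module

#### Cache paths

##### `get_cache_root() -> str`
Returns the current cache root directory.

##### `set_cache_root(path: str) -> None`
Sets the cache root directory.

**Parameters:**
- `path`: Path of the cache root directory (absolute or relative)

##### `get_cache_dir(subdir: str) -> str`
Returns the cache directory of a specific submodule.

**Parameters:**
- `subdir`: Subdirectory name, e.g. `"text_embedding"`, `"semantic_similarity"`

**Returns:**
- The full path of the cache directory

#### Data paths

##### `get_data_root() -> str`
Returns the current data root directory.

##### `set_data_root(path: str) -> None`
Sets the data root directory.

**Parameters:**
- `path`: Path of the data root directory (absolute or relative)

##### `get_data_dir(subdir: str = "") -> str`
Returns the data directory of a specific submodule.

**Parameters:**
- `subdir`: Subdirectory name, e.g. `"search"`, `"detail"`, `"tools_list"`. If it is an empty string, the data root directory is returned

**Returns:**
- The full path of the data directory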

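## Notes

1. **Path format**: Both absolute and relative paths are supported; relative paths are resolved against the current working directory.
2. **Automatic creation**: Cache directories are created automatically on first write.
3. **Thread safety**: The configuration module is thread-safe and can be used in multithreaded programs.
4. **Environment variable priority**: If both the environment variable and the in-code setting are present, the in-code setting takes precedence.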
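Notes 1 to 3 can be illustrated together. The sketch below is one plausible way to get these properties and is not taken from `lib/config.py`: the lock, the module state, and `write_cache_entry` are hypothetical, and resolving relative paths at the moment `set_cache_root()` is called is an assumption.

```python
import os
import tempfile
import threading

_lock = threading.Lock()   # guards the module-level setting below
_cache_root = None         # hypothetical state; None means "not set in code"


def set_cache_root(path):
    """Stand-in setter: relative paths resolve against the current working directory."""
    global _cache_root
    with _lock:
        _cache_root = os.path.abspath(path)


def get_cache_dir(subdir):
    """Stand-in getter: code setting, then CACHE_ROOT, then ~/cache."""
    with _lock:
        root = _cache_root
    if root is None:
        root = os.environ.get("CACHE_ROOT") or os.path.expanduser("~/cache")
    return os.path.join(root, subdir)


def write_cache_entry(subdir, name, payload):
    """Create the cache directory on first write, then store one entry."""
    directory = get_cache_dir(subdir)
    # exist_ok=True makes this safe when another thread created it first.
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, name)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(payload)
    return path


def demo():
    """Write eight entries from eight threads into a temporary cache root."""
    with tempfile.TemporaryDirectory() as tmp:
        set_cache_root(tmp)
        threads = [
            threading.Thread(target=write_cache_entry,
                             args=("text_embedding", f"{i}.txt", "x"))
            for i in range(8)
        ]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        return sorted(os.listdir(get_cache_dir("text_embedding")))


print(demo())
```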

## Migration Guide

If you previously used a hard-coded cache path (such as `/Users/semsevens/Desktop/workspace/daily/1113/how_1120_v3/cache`), you can now do one of the following:

### Option 1: Set the environment variable (recommended)

```bash
export CACHE_ROOT=/Users/semsevens/Desktop/workspace/daily/1113/how_1120_v3/cache
```

Then run your script as usual; no code changes are needed.

### Option 2: Set it at the top of your code

Add the following at the top of your script:

```python
from lib.config import set_cache_root

set_cache_root("/Users/semsevens/Desktop/workspace/daily/1113/how_1120_v3/cache")
```

### Option 3: Use a path relative to the project

If you want the cache path to be located relative to the project directory:

```python
from lib.config import set_cache_root
from pathlib import Path

project_root = Path(__file__).parent.parent
cache_path = project_root / "cache"
set_cache_root(str(cache_path))
```

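## FAQ

**Q: Can different modules use different cache roots?**

A: Not currently. All modules share a single cache root, but you can pass the `cache_dir` parameter to use a different path for an individual call.

**Q: After I change the cache path, are old cache files migrated automatically?**

A: No. Move the cache files to the new directory manually, or let the program regenerate the cache.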
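A manual move can be scripted with the standard library. The helper below is a hypothetical sketch (`migrate_cache` is not part of this project): it moves each top-level entry of the old cache root into the new one and skips any name that already exists there, so nothing in the new location is overwritten.

```python
import shutil
import tempfile
from pathlib import Path


def migrate_cache(old_root, new_root):
    """Move top-level entries from old_root to new_root; return the moved names."""
    old_root, new_root = Path(old_root), Path(new_root)
    new_root.mkdir(parents=True, exist_ok=True)
    moved = []
    for entry in sorted(old_root.iterdir()):
        target = new_root / entry.name
        if target.exists():
            continue  # keep whatever is already in the new location
        shutil.move(str(entry), str(target))
        moved.append(entry.name)
    return moved


def demo():
    """Migrate a throwaway cache tree inside a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        old_root = Path(tmp) / "old_cache"
        new_root = Path(tmp) / "new_cache"
        (old_root / "text_embedding").mkdir(parents=True)
        (old_root / "text_embedding" / "entry.json").write_text("{}")
        moved = migrate_cache(old_root, new_root)
        return moved, (new_root / "text_embedding" / "entry.json").exists()


print(demo())
```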

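**Q: How do I clear the cache?**

A: Delete the contents of the relevant cache directory. With the default cache root: `rm -rf ~/cache/text_embedding/*` or `rm -rf ~/cache/semantic_similarity/*`. Adjust the path if you changed the cache root.

**Q: What if the cache takes up too much disk space?**

A: Periodically remove old cache files, or point the cache at a temporary directory (e.g. `/tmp/cache`).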
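Periodic cleanup could be done by file age. The following is a minimal sketch that assumes the last-modified time is a good enough signal for "old"; `prune_cache` is a hypothetical helper, and the 30-day threshold in the demo is arbitrary.

```python
import os
import tempfile
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def prune_cache(cache_dir, max_age_days):
    """Delete files under cache_dir not modified within max_age_days."""
    cutoff = time.time() - max_age_days * SECONDS_PER_DAY
    removed = []
    # Materialise the listing first so deletions do not disturb the traversal.
    for path in list(Path(cache_dir).rglob("*")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed


def demo():
    """Create one fresh and one 40-day-old file, then prune at 30 days."""
    with tempfile.TemporaryDirectory() as tmp:
        fresh = Path(tmp) / "fresh.json"
        stale = Path(tmp) / "stale.json"
        fresh.write_text("{}")
        stale.write_text("{}")
        forty_days_ago = time.time() - 40 * SECONDS_PER_DAY
        os.utime(stale, (forty_days_ago, forty_days_ago))
        removed = prune_cache(tmp, max_age_days=30)
        return [p.name for p in removed], fresh.exists()


print(demo())
```

Against a real cache, the call would be something like `prune_cache(os.path.expanduser("~/cache/text_embedding"), max_age_days=30)`.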

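**Q: What is the difference between the cache/ and data/ directories?**

A:
- **cache/**: All cached data that can be regenerated
  - `cache/text_embedding/` - computation cache
  - `cache/semantic_similarity/` - computation cache
  - `cache/data/` - data cache (crawlers, tool lists, etc.)
- **data/**: Project data that cannot be regenerated
  - Account data, date-specific analysis results, documents, etc.

**Q: Why is cache/data/ also called a cache?**

A: Because the data collected by the crawlers (search, detail, tools_list) can all be fetched again by rerunning the scripts, it is in effect regenerable cache data. Keeping it under cache/ makes it easier to manage and clean up.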