# Xiaohongshu Search Module

## Quick Start

### Python API (recommended)

```python
from script.search import search_xiaohongshu
from script.detail import get_xiaohongshu_detail

# Search for notes
data = search_xiaohongshu("产品测试")

# Fetch the details of each note
for note in data['notes']:
    note_id = note['channel_content_id']
    detail = get_xiaohongshu_detail(note_id)
    print(f"{detail['title']}")
    print(f"{detail['body_text']}")  # full body text
```

### Command-line tools

```bash
# Search
python script/search/xiaohongshu_search.py --keyword "产品测试"

# Details
python script/detail/xiaohongshu_detail.py --note-id "6915588b00000000040143b5"
```

---

## API Reference

### 1. Search endpoint

#### Function signature

```python
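def search_xiaohongshu(
    keyword: str,                 # required: search keyword
    content_type: str = "不限",   # optional: "不限" (any), "视频" (video), "图文" (image and text)
    sort_type: str = "综合",      # optional: "综合" (general), "最新" (latest), "最多点赞" (most liked), "最多评论" (most commented)
    publish_time: str = "不限",   # optional: "不限" (any time), "一天内" (past day), "一周内" (past week), "半年内" (past six months)
    page: int = 1,                # optional: page number (cursors are handled automatically)
    force: bool = False,          # optional: bypass the cache and request the API again
) -> dict: ...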
```

#### Return value

```python
{
  "search_params": {      # search parameters
    "keyword": "产品测试",
    "content_type": "视频",
    "sort_type": "最新",
    "publish_time": "一周内",
    "cursor": "",
    "page": 1,
    "timestamp": "20251113_133258"
  },
  "has_more": True,       # whether more results are available
  "next_cursor": "...",   # cursor for the next page (used internally)
  "notes": [...]          # list of notes
}
```

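#### Note fields

| Field | Type | Description |
|------|------|------|
| channel_content_id | string/null | Note ID |
| link | string/null | Note URL |
| title | string/null | Title |
| desc | string/null | Summary (as returned by the search endpoint) |
| body_text | null | Full body text (not returned by the search endpoint; call the detail endpoint) |
| channel_account_name | string/null | Author name |
| channel_account_id | string/null | Author ID |
| like_count | number/null | Number of likes |
| comment_count | number/null | Number of comments |
| collect_count | number/null | Number of saves |
| shared_count | number/null | Number of shares |
| images | array | List of image URLs (an empty array means no images) |
| video | null | Video URL (not returned by the search endpoint; call the detail endpoint) |
| content_type | string/null | Content type (video/note) |

**Note**: Missing fields are always represented as `null` (`None` in Python), never as an empty string or 0.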
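
Because a missing count is `null` rather than 0, aggregate code has to decide how to treat it. The following is a minimal sketch of one option, averaging only over notes whose count is known; `average_known` is a hypothetical helper and not part of this module.

```python
from statistics import mean
from typing import Optional


def average_known(notes: list, field: str) -> Optional[float]:
    """Average a numeric field over the notes where it is present (not None)."""
    values = [n[field] for n in notes if n.get(field) is not None]
    return mean(values) if values else None


sample = [
    {"title": "a", "like_count": 10},
    {"title": "b", "like_count": None},  # count unknown, which is not the same as zero
    {"title": "c", "like_count": 0},     # a genuine zero
]
print(average_known(sample, "like_count"))
```

Skipping unknown values keeps a note with no data from dragging the average down; substituting 0 is the simpler alternative when a rough total is enough.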

### 2. Detail endpoint

#### Function signature

```python
def get_xiaohongshu_detail(
    note_id: str,           # required: note ID
    force: bool = False,    # optional: bypass the cache and request the API again
) -> dict: ...
```

#### Return value

```python
{
  "channel_content_id": "68d62e4500000000130085fc",
  "link": "https://www.xiaohongshu.com/explore/...",
  "comment_count": None,
  "images": ["http://res.cybertogether.net/..."],
  "like_count": 14,
  "body_text": "Full body text...",
  "title": "Note title",
  "collect_count": 6,
  "channel_account_id": "664954500000000007006ac0",
  "channel_account_name": "Author name",
  "content_type": "video",  # derived from the video field: "video" if a video is present, otherwise "normal"
  "video": "http://sns-video-hw.xhscdn.com/...",
  "publish_timestamp": 1758877418000,
  "publish_time": "2025-09-26 17:03:38"
}
```

#### Field reference

| Field | Type | Description |
|------|------|------|
| channel_content_id | string/null | Note ID |
| link | string/null | Note URL |
| title | string/null | Title |
| body_text | string/null | Full body text |
| channel_account_name | string/null | Author name |
| channel_account_id | string/null | Author ID |
| like_count | number/null | Number of likes |
| comment_count | number/null | Number of comments |
| collect_count | number/null | Number of saves |
| images | array | List of image URLs (deduplicated) |
| video | string/null | Video URL |
| content_type | string | Content type ("video" or "normal", derived from the video field) |
| publish_timestamp | number/null | Publish timestamp (milliseconds) |
| publish_time | string/null | Publish time (format: YYYY-MM-DD HH:MM:SS) |
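
The relationship between `publish_timestamp` and `publish_time` can be sketched as follows. The UTC+8 offset is an assumption inferred from the sample values in the return value above, and `format_publish_time` is a hypothetical helper, not a function exported by this module.

```python
from datetime import datetime, timedelta, timezone

# Assumption: publish_time is rendered in UTC+8 (inferred from the sample data).
UTC_PLUS_8 = timezone(timedelta(hours=8))


def format_publish_time(publish_timestamp_ms):
    """Convert a millisecond timestamp to 'YYYY-MM-DD HH:MM:SS', or None if missing."""
    if publish_timestamp_ms is None:
        return None
    moment = datetime.fromtimestamp(publish_timestamp_ms / 1000, tz=UTC_PLUS_8)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


print(format_publish_time(1758877418000))  # → 2025-09-26 17:03:38
```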

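**Note**:
- Missing fields are always represented as `null`
- Images are deduplicated automatically, preserving their original order
- `content_type` is derived automatically: "video" if `video` is present, otherwise "normal"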
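
The two derived fields described in the note can be sketched as below. Both function names are hypothetical and the module's actual cleaning code may differ; this only illustrates the stated behaviour.

```python
def dedupe_keep_order(urls):
    """Drop repeated URLs while keeping the first occurrence of each in place."""
    # dict preserves insertion order, so the first occurrence of each URL wins
    return list(dict.fromkeys(urls))


def infer_content_type(video):
    """Return "video" when a video URL is present, otherwise "normal"."""
    return "video" if video else "normal"


print(dedupe_keep_order(["a.jpg", "b.jpg", "a.jpg", "c.jpg"]))
print(infer_content_type(None))
```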

#### Usage example

```python
from script.detail import get_xiaohongshu_detail

# Fetch the details of a note
detail = get_xiaohongshu_detail("68d62e4500000000130085fc")

print(f"Title: {detail['title']}")
print(f"Body: {detail['body_text']}")
print(f"Video: {detail['video']}")
print(f"Type: {detail['content_type']}")
print(f"Published: {detail['publish_time']}")
```

---

## Usage Examples

### 1. Basic search

```python
from script.search import search_xiaohongshu

data = search_xiaohongshu("产品测试")

print(f"Found {len(data['notes'])} notes")
for note in data['notes']:
    print(f"- {note['title']} ({note['like_count']} likes)")
```

### 2. Search with parameters

```python
data = search_xiaohongshu(
    keyword="产品测试",
    content_type="视频",
    sort_type="最新",
    publish_time="一周内"
)
```

### 3. Paging (handled automatically)

```python
# Specify the page number directly; the cursor is handled automatically
page1 = search_xiaohongshu("产品测试", page=1)
page2 = search_xiaohongshu("产品测试", page=2)
page3 = search_xiaohongshu("产品测试", page=3)
```

### 4. Force refresh

```python
# Ignore the cache and request the API again
data = search_xiaohongshu("产品测试", force=True)
```

### 5. Batch search

```python
keywords = ["产品测试", "软件测试", "性能测试"]

for keyword in keywords:
    data = search_xiaohongshu(keyword)
    print(f"{keyword}: {len(data['notes'])} notes")
```

### 6. Data analysis

```python
from script.search import search_xiaohongshu

def analyze_topic(keyword):
    """Summarize how much engagement a topic is getting."""
    data = search_xiaohongshu(
        keyword=keyword,
        sort_type="最新",
        publish_time="一周内"
    )

    notes = data['notes']
    # like_count may be None, so count a missing value as 0 when summing
    total_likes = sum(n['like_count'] or 0 for n in notes)
    avg_likes = total_likes / len(notes) if notes else 0

    print(f"Keyword: {keyword}")
    print(f"Notes: {len(notes)}")
    print(f"Total likes: {total_likes}")
    print(f"Average likes: {avg_likes:.1f}")

analyze_topic("产品测试")
```

### 7. Search + details (full body text)

```python
from script.search import search_xiaohongshu
from script.detail import get_xiaohongshu_detail

# Search for notes
data = search_xiaohongshu("产品测试", publish_time="一周内")

# Fetch full details for the first 3 notes
for note in data['notes'][:3]:
    note_id = note['channel_content_id']

    # Fetch the details
    detail = get_xiaohongshu_detail(note_id)

    # desc and body_text may be None, so fall back to an empty string before slicing
    print(f"\nTitle: {note['title']}")
    print(f"Summary: {(note['desc'] or '')[:50]}...")
    print(f"Full body: {(detail['body_text'] or '')[:100]}...")
    print(f"Likes: {note['like_count']}")
```

---

## Command-Line Usage

### Search endpoint

#### Basic search

```bash
python script/search/xiaohongshu_search.py --keyword "产品测试"
```

#### Search with parameters

```bash
python script/search/xiaohongshu_search.py \
  --keyword "产品测试" \
  --content-type "视频" \
  --sort-type "最新" \
  --publish-time "一周内"
```

#### Force refresh

```bash
python script/search/xiaohongshu_search.py --keyword "产品测试" --force
```

#### Disable caching

```bash
python script/search/xiaohongshu_search.py --keyword "产品测试" --no-cache
```

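#### All parameters

| Parameter | Default | Description |
|------|--------|------|
| --keyword | required | Search keyword |
| --content-type | "不限" | Content type: `不限` (any), `视频` (video), `图文` (image and text) |
| --sort-type | "综合" | Sort order: `综合` (general), `最新` (latest), `最多点赞` (most liked), `最多评论` (most commented) |
| --publish-time | "不限" | Publish time: `不限` (any time), `一天内` (past day), `一周内` (past week), `半年内` (past six months) |
| --page | 1 | Page number |
| --cursor | "" | Paging cursor |
| --force | False | Force refresh |
| --no-cache | False | Disable caching |
| --results-dir | data/search | Output directory |
| --timeout | 30 | Timeout (seconds) |
| --max-retries | 5 | Maximum number of retries |
| --retry-delay | 2 | Retry delay (seconds) |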

### Detail endpoint

#### Basic usage

```bash
python script/detail/xiaohongshu_detail.py --note-id "6915588b00000000040143b5"
```

#### Force refresh

```bash
python script/detail/xiaohongshu_detail.py --note-id "6915588b00000000040143b5" --force
```

#### All parameters

| Parameter | Default | Description |
|------|--------|------|
| --note-id | required | Note ID |
| --force | False | Force refresh |
| --no-cache | False | Disable caching |
| --results-dir | data/detail | Output directory |
| --timeout | 30 | Timeout (seconds) |
| --max-retries | 5 | Maximum number of retries |
| --retry-delay | 2 | Retry delay (seconds) |

---

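## Core Features

### 1. Automatic caching (enabled by default)

Calls with identical search parameters are served from the cache automatically:

```python
# First call: requests the API
data1 = search_xiaohongshu("产品测试")

# Second call: served from the cache
data2 = search_xiaohongshu("产品测试")  # returns immediately

# Force refresh
data3 = search_xiaohongshu("产品测试", force=True)
```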

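### 2. Automatic retries (up to 5 retries on failure)

- Timeout errors: retried automatically
- Connection errors: retried automatically
- 5xx server errors: retried automatically
- 4xx client errors: not retried

Exponential backoff: 2 s → 4 s → 8 s → 16 s → 32 s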
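
The retry rules and backoff schedule above could be sketched roughly as follows. This is an illustration under stated assumptions, not the module's implementation: `HTTPStatusError`, `is_retryable`, and `call_with_retry` are hypothetical names, and the sketch assumes one initial attempt followed by up to five retries.

```python
import time


class HTTPStatusError(Exception):
    """Hypothetical error type that carries an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def is_retryable(exc):
    """Timeouts, connection errors, and 5xx responses are retried; 4xx are not."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return True
    if isinstance(exc, HTTPStatusError):
        return 500 <= exc.status < 600
    return False


def call_with_retry(func, max_retries=5, retry_delay=2, sleep=time.sleep):
    """Call func(), retrying retryable failures with exponential backoff."""
    attempt = 0
    while True:
        try:
            return func()
        except Exception as exc:
            if attempt >= max_retries or not is_retryable(exc):
                raise
            sleep(retry_delay * 2 ** attempt)  # 2, 4, 8, 16, 32 seconds
            attempt += 1
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.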

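### 3. Automatic saving (handled in the background)

Search results are saved automatically to `data/search/xiaohongshu_search/`.

Directory layout:
```
data/search/xiaohongshu_search/
└── {keyword}/
    ├── raw/                           # raw data
    │   └── {timestamp}_page{page}_{params}.json
    └── clean/                         # cleaned data
        └── {timestamp}_page{page}_{params}.json
```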

File name examples:
- Default parameters: `20251113_133315_page1_不限_综合_不限.json`
- Custom parameters: `20251113_133258_page1_视频_最新_一周内.json`
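
The naming pattern in these examples can be reproduced with a short sketch. `build_result_filename` and `current_timestamp` are hypothetical names, and the parameter order (content type, sort type, publish time) is inferred from the two examples above rather than from the module's source.

```python
from datetime import datetime


def current_timestamp():
    """Timestamp in the YYYYMMDD_HHMMSS form used by the example file names."""
    return datetime.now().strftime("%Y%m%d_%H%M%S")


def build_result_filename(timestamp, page, content_type="不限",
                          sort_type="综合", publish_time="不限"):
    """Assemble {timestamp}_page{page}_{params}.json from the search parameters."""
    return f"{timestamp}_page{page}_{content_type}_{sort_type}_{publish_time}.json"


print(build_result_filename("20251113_133258", 1, "视频", "最新", "一周内"))
# → 20251113_133258_page1_视频_最新_一周内.json
```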

### 4. Automatic paging (cursor handled internally)

```python
# No need to manage the cursor manually
page1 = search_xiaohongshu("产品测试", page=1)
page2 = search_xiaohongshu("产品测试", page=2)  # uses the cursor from page 1 automatically
page3 = search_xiaohongshu("产品测试", page=3)  # uses the cursor from page 2 automatically
```

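### 5. Automatic keyword sanitization

Special characters in the keyword are replaced automatically so that it can be used safely as a directory name: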

```python
# Special characters are cleaned automatically
search_xiaohongshu("测试/产品:问题?")
# → directory name: 测试_产品_问题_
```
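
One way such sanitization might work is sketched below. The set of replaced characters is an assumption (characters that are commonly invalid in file names), and `sanitize_keyword` is a hypothetical name; only the input/output pair comes from the example above.

```python
import re

# Assumption: these are the characters treated as unsafe in directory names.
_UNSAFE_CHARS = re.compile(r'[\\/:*?"<>|]')


def sanitize_keyword(keyword):
    """Replace each unsafe character with an underscore."""
    return _UNSAFE_CHARS.sub("_", keyword)


print(sanitize_keyword("测试/产品:问题?"))  # → 测试_产品_问题_
```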

---

## Data Format

### Clean data (recommended)

```json
{
  "search_params": {
    "keyword": "产品测试",
    "content_type": "视频",
    "sort_type": "最新",
    "publish_time": "一周内",
    "cursor": "",
    "page": 1,
    "timestamp": "20251113_133258"
  },
  "has_more": true,
  "next_cursor": "2@2fl1kgnh0gdx2oarsbpxc@...",
  "notes": [
    {
      "channel_content_id": "6915588b00000000040143b5",
      "link": "https://www.xiaohongshu.com/explore/6915588b00000000040143b5",
      "title": "Note title",
      "desc": "Note summary...",
      "body_text": null,
      "channel_account_name": "Author name",
      "channel_account_id": "5b1e2c0811be10762dee6859",
      "like_count": 2,
      "comment_count": 0,
      "collect_count": 1,
      "shared_count": 0,
      "images": ["https://..."],
      "video": null,
      "content_type": "video"
    }
  ]
}
```

### Raw data

The complete API response, including all metadata and nested structures.

---

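## Caveats

### About desc and body_text

- **desc**: the summary returned by the search endpoint (truncated)
- **body_text**: the full body text (empty in search results; call the detail function `get_xiaohongshu_detail` to get it)

### About video

- The search endpoint does not return the video URL
- Call the detail endpoint to get it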

### Rate limiting

- Leave 1-2 seconds between searches
- Avoid sending a large number of requests in a short period
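
A batch loop can respect this spacing with a small throttle. The sketch below is illustrative only: `search_many` is a hypothetical helper, the 1.5-second default is simply the midpoint of the suggested range, and the search function is passed in so the snippet stays self-contained.

```python
import time


def search_many(keywords, search, min_interval=1.5,
                sleep=time.sleep, clock=time.monotonic):
    """Run search(keyword) for each keyword, at least min_interval seconds apart."""
    results = {}
    last_call = None
    for keyword in keywords:
        if last_call is not None:
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)  # pause only for the time still remaining
        last_call = clock()
        results[keyword] = search(keyword)
    return results
```

In practice it could be called as `search_many(["产品测试", "软件测试"], search_xiaohongshu)`. Cached results return immediately, so the throttle mainly matters for calls that reach the API.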

---

## FAQ

### Q: How do I get the full body text?

A: The search endpoint returns only a summary. To get the full body text, call the detail endpoint:

```python
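from script.search import search_xiaohongshu
from script.detail import get_xiaohongshu_detail

# 1. Search first to get the list of notes
data = search_xiaohongshu("产品测试")

# 2. Call the detail endpoint for the notes you are interested in
note_id = data['notes'][0]['channel_content_id']
detail = get_xiaohongshu_detail(note_id)  # detail['body_text'] holds the full body text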
```

### Q: How do I clear the cache?

A:
- Option 1: manually delete the `data/search/xiaohongshu_search/{keyword}/` directory
- Option 2: pass `force=True` to bypass the cache and fetch fresh data
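
Option 1 can also be scripted. The sketch below assumes the directory layout documented in this README; `clear_keyword_cache` is a hypothetical helper, and the directory name passed in should be the sanitized form of the keyword.

```python
import shutil
from pathlib import Path


def clear_keyword_cache(keyword_dir, base_dir="data/search/xiaohongshu_search"):
    """Delete the cache directory for one keyword; return True if it existed."""
    target = Path(base_dir) / keyword_dir
    if target.is_dir():
        shutil.rmtree(target)  # removes both the raw/ and clean/ subdirectories
        return True
    return False
```

For example, `clear_keyword_cache("产品测试")` would remove every saved result for that keyword.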

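### Q: How can I tell whether the cache was used?

A: Check the console output:
- Cache hit: `✓ 使用缓存数据: ...` ("using cached data")
- API request: `正在搜索关键词: ... (尝试 1/3)` ("searching for keyword ... (attempt 1/3)")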

### Q: Where is the cursor when paging?

A: The cursor is handled automatically; you do not need to manage it yourself:

```python
# ✅ Recommended: specify the page number directly
page2 = search_xiaohongshu("产品测试", page=2)

# ❌ Not needed: passing a cursor manually
# page2 = search_xiaohongshu("产品测试", cursor="...")
```

---

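## Technical Details

### Internal defaults

- **Timeout**: 30 seconds
- **Maximum retries**: 5
- **Retry delay**: 2 seconds (doubles after each retry)
- **Caching**: enabled by default
- **Output directory**: `data/search`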

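### Caching

- The cache key is built from the search parameters (keyword + content_type + sort_type + publish_time + cursor)
- For identical parameters, the most recent cache file is returned
- Cache files are ordered by modification time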
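
The "most recent file wins" lookup might look like the sketch below. It is only an approximation under stated assumptions: it matches cache files by the file name pattern documented earlier (page plus parameters) rather than by cursor, and `find_latest_cache` is a hypothetical name.

```python
from pathlib import Path


def find_latest_cache(clean_dir, page=1, content_type="不限",
                      sort_type="综合", publish_time="不限"):
    """Return the most recently modified cache file for these parameters, or None."""
    suffix = f"_page{page}_{content_type}_{sort_type}_{publish_time}.json"
    candidates = [p for p in Path(clean_dir).glob("*.json") if p.name.endswith(suffix)]
    # Order by modification time and keep the newest; None when nothing matches
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```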

### How automatic paging works

```python
# When page=2, the following happens automatically:
# 1. Read the cached result for page=1
# 2. Extract next_cursor
# 3. Request page=2 with that cursor
```
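
The three steps above can be sketched as a small function. `resolve_cursor` and the `load_cached_page` callable are hypothetical, and raising an error when the previous page is missing is an assumption of this sketch; the module itself may fetch the missing page instead.

```python
def resolve_cursor(page, load_cached_page):
    """Return the cursor needed to request `page`.

    load_cached_page(n) is expected to return the cached result dict for
    page n, or None when that page has not been cached.
    """
    if page <= 1:
        return ""  # the first page needs no cursor
    previous = load_cached_page(page - 1)
    if previous is None:
        raise LookupError(f"page {page - 1} is not cached; fetch it first")
    if not previous.get("has_more"):
        raise LookupError(f"page {page - 1} reports no further results")
    return previous["next_cursor"]


# Usage with an in-memory stand-in for the cache
cache = {1: {"has_more": True, "next_cursor": "abc"}}
print(resolve_cursor(2, cache.get))
```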