yangxiaohui 1da7d3647e feat: implement the Xiaohongshu search and detail modules		2 weeks ago

Xiaohongshu Detail Module

Quick Start

Python API (recommended)

from script.detail import get_xiaohongshu_detail

# Fetch the note details
detail = get_xiaohongshu_detail("68d62e4500000000130085fc")

print(f"Title: {detail['title']}")
print(f"Body: {detail['body_text']}")
print(f"Video: {detail['video']}")
print(f"Type: {detail['content_type']}")
print(f"Published: {detail['publish_time']}")

Command-Line Tool

# Fetch the details
python script/detail/xiaohongshu_detail.py --note-id "68d62e4500000000130085fc"

# Force a refresh
python script/detail/xiaohongshu_detail.py --note-id "68d62e4500000000130085fc" --force

API Reference

Function Signature

def get_xiaohongshu_detail(
    note_id: str,          # required: note ID
    force: bool = False,   # optional: force a refresh (ignore the cache)
) -> dict: ...

Return Value

{
  "channel_content_id": "68d62e4500000000130085fc",
  "link": "https://www.xiaohongshu.com/explore/68d62e4500000000130085fc",
  "comment_count": null,
  "images": [
    "http://res.cybertogether.net/crawler/image/bf6a0e92ed7252ae8414121edf26f2d3.jpeg"
  ],
  "like_count": 14,
  "body_text": "时隔两个月，终于有时间把穿越极圈的航拍视频剪辑出来了...",
  "title": "穿越北极圈，终生难忘",
  "collect_count": 6,
  "channel_account_id": "664954500000000007006ac0",
  "channel_account_name": "Colin SW",
  "content_type": "video",  # determined automatically from the video field
  "video": "http://sns-video-hw.xhscdn.com/stream/1/110/258/...",
  "publish_timestamp": 1758877418000,
  "publish_time": "2025-09-26 17:03:38"
}

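Field Reference

Field	Type	Description
channel_content_id	string/null	Note ID
link	string/null	Note URL
title	string/null	Title
body_text	string/null	Full body text
channel_account_name	string/null	Author name
channel_account_id	string/null	Author ID
like_count	number/null	Number of likes
comment_count	number/null	Number of comments
collect_count	number/null	Number of saves (collects)
images	array	List of image URLs (deduplicated)
video	string/null	Video URL
content_type	string	Content type ("video" or "normal")
publish_timestamp	number/null	Publish timestamp (milliseconds)
publish_time	string/null	Publish time (format: YYYY-MM-DD HH:MM:SS)

Notes:

Missing fields are always represented as null, never as an empty string or 0
Image URLs are deduplicated automatically, preserving their original order
content_type is determined automatically: "video" when the video field is non-null, otherwise "normal"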
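
Because missing fields come back as null (None in Python) rather than 0 or an empty string, calling code has to handle them explicitly. The sketch below shows one way to do that; format_count and the sample record are hypothetical and not part of the module.

```python
from typing import Optional


def format_count(value: Optional[int]) -> str:
    """Render a counter for display, keeping 'unknown' distinct from zero."""
    return "n/a" if value is None else str(value)


# Hypothetical record shaped like the return value documented above.
detail = {"like_count": 14, "comment_count": None, "body_text": None, "images": []}

likes = format_count(detail["like_count"])
comments = format_count(detail["comment_count"])
# Fall back to an empty string before slicing, since None cannot be sliced.
preview = (detail["body_text"] or "")[:100]

print(f"Likes: {likes}, comments: {comments}, preview: {preview!r}")
```

Keeping None separate from 0 preserves the difference between "the API did not report a value" and "the value is zero".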

Usage Examples

1. Basic usage

from script.detail import get_xiaohongshu_detail

# Fetch the note details
detail = get_xiaohongshu_detail("68d62e4500000000130085fc")

print(f"Title: {detail['title']}")
print(f"Body: {detail['body_text']}")
print(f"Likes: {detail['like_count']}")

2. Force a refresh

# Ignore the cache and call the API again
detail = get_xiaohongshu_detail("68d62e4500000000130085fc", force=True)

3. Check the content type

detail = get_xiaohongshu_detail("68d62e4500000000130085fc")

if detail['content_type'] == 'video':
    print(f"Video URL: {detail['video']}")
else:
    print(f"Image-and-text note, image count: {len(detail['images'])}")

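4. Search + details (full workflow)

from script.search import search_xiaohongshu
from script.detail import get_xiaohongshu_detail

# 1. Search for notes ("产品测试" = "product testing"; "一周内" = within the past week)
search_result = search_xiaohongshu("产品测试", publish_time="一周内")

# 2. Fetch the full details for the first 5 results
for note in search_result['notes'][:5]:
    note_id = note['channel_content_id']

    # Fetch the details
    detail = get_xiaohongshu_detail(note_id)

    # Text fields may be null (None), so fall back to '' before slicing
    print(f"\nTitle: {detail['title']}")
    print(f"Summary: {(note['desc'] or '')[:50]}...")  # summary from the search result
    print(f"Full body: {(detail['body_text'] or '')[:100]}...")  # full body from the detail result
    print(f"Likes: {detail['like_count']}")
    print(f"Type: {detail['content_type']}")

5. Fetch details in batch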

note_ids = [
    "68d62e4500000000130085fc",
    "68b69ea9000000001c035a4d",
    "6808c0e8000000001c00a771"
]

for note_id in note_ids:
    try:
        detail = get_xiaohongshu_detail(note_id)
        print(f"✓ {detail['title']}")
    except Exception as e:
        print(f"✗ {note_id}: {e}")

Command-Line Usage

Basic usage

python script/detail/xiaohongshu_detail.py --note-id "68d62e4500000000130085fc"

Force a refresh

python script/detail/xiaohongshu_detail.py --note-id "68d62e4500000000130085fc" --force

Disable the cache

python script/detail/xiaohongshu_detail.py --note-id "68d62e4500000000130085fc" --no-cache

All Options

Option	Default	Description
--note-id	required	Note ID
--force	False	Force a refresh (ignore the cache)
--no-cache	False	Disable caching
--results-dir	data/detail	Output directory
--timeout	30	Timeout (seconds)
--max-retries	5	Maximum number of retries
--retry-delay	2	Retry delay (seconds)
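
For readers who want to wrap or mirror the command-line interface, here is a minimal argparse sketch of the options in the table above. It is not the module's actual parser: the help texts and the integer types for the numeric options are assumptions, and build_parser is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="Fetch Xiaohongshu note details")
    parser.add_argument("--note-id", required=True, help="note ID")
    parser.add_argument("--force", action="store_true", help="force a refresh (ignore the cache)")
    parser.add_argument("--no-cache", action="store_true", help="disable caching")
    parser.add_argument("--results-dir", default="data/detail", help="output directory")
    # Assumption: numeric options are integers; the real script may accept floats.
    parser.add_argument("--timeout", type=int, default=30, help="timeout in seconds")
    parser.add_argument("--max-retries", type=int, default=5, help="maximum number of retries")
    parser.add_argument("--retry-delay", type=int, default=2, help="initial retry delay in seconds")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--note-id", "68d62e4500000000130085fc", "--force"])
print(args.note_id, args.force, args.no_cache, args.timeout)
```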

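Core Features

1. Automatic caching (enabled by default)

Repeated requests for the same note ID are served from the cache:

# First call: requests the API
detail1 = get_xiaohongshu_detail("68d62e4500000000130085fc")

# Second call: served from the cache (returns immediately)
detail2 = get_xiaohongshu_detail("68d62e4500000000130085fc")

# Force a refresh
detail3 = get_xiaohongshu_detail("68d62e4500000000130085fc", force=True)

2. Automatic retries (up to 5 retries on failure)

Timeout errors: retried automatically
Connection errors: retried automatically
5xx server errors: retried automatically
4xx client errors: not retried
API-reported failure (success=false): retried automatically

Exponential backoff: 2 s → 4 s → 8 s → 16 s → 32 s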
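
The retry rules above can be sketched roughly as follows. This is not the module's implementation: RetryableError and call_with_retry are hypothetical names, and the sketch assumes max_retries counts total attempts with no sleep after the last one, which is only one reading of the description. Any other exception (for example one raised for a 4xx response) propagates immediately without a retry.

```python
import time
from typing import Callable, List


class RetryableError(Exception):
    """Hypothetical marker for failures worth retrying (timeout, connection, 5xx, success=false)."""


def call_with_retry(
    fetch: Callable[[], dict],
    max_retries: int = 5,
    retry_delay: float = 2,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until it reports success, doubling the delay after each failure."""
    for attempt in range(1, max_retries + 1):
        try:
            result = fetch()
            if not result.get("success"):
                raise RetryableError("API reported success=false")
            return result
        except RetryableError:
            if attempt == max_retries:
                raise  # out of attempts: surface the last error
            sleep(retry_delay * 2 ** (attempt - 1))  # 2, 4, 8, 16, ...
    raise ValueError("max_retries must be at least 1")


# Demo with canned responses; the sleep function is replaced so nothing actually waits.
delays: List[float] = []
responses = iter([{"success": False}, {"success": False}, {"success": True, "result": []}])
result = call_with_retry(lambda: next(responses), sleep=delays.append)
print(result["success"], delays)
```

Injecting the sleep function keeps the backoff schedule testable without real waiting.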

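3. Automatic saving (handled behind the scenes)

Detail results are saved automatically to data/detail/xiaohongshu_detail/

Directory layout:

data/detail/xiaohongshu_detail/
└── {note_id}/
    ├── raw/                           # raw data (full API response)
    │   └── {timestamp}.json
    └── clean/                         # cleaned data (flattened structure)
        └── {timestamp}.json

Example file name:

20251113_144230.json

Note: only newly requested data is saved; a cache hit does not write another file.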
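
As an illustration of the layout above, the sketch below writes one raw file and one clean file under {note_id}/raw and {note_id}/clean, named with a timestamp in the YYYYMMDD_HHMMSS pattern of the example file name. save_result is a hypothetical helper, not the module's own code, and the demo writes to a temporary directory rather than data/detail.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Dict, Optional


def save_result(base_dir: str, note_id: str, raw: dict, clean: dict,
                now: Optional[datetime] = None) -> Dict[str, Path]:
    """Write raw and clean payloads to {base_dir}/{note_id}/{raw,clean}/{timestamp}.json."""
    timestamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    paths: Dict[str, Path] = {}
    for kind, payload in (("raw", raw), ("clean", clean)):
        target = Path(base_dir) / note_id / kind / f"{timestamp}.json"
        target.parent.mkdir(parents=True, exist_ok=True)
        # ensure_ascii=False keeps Chinese text readable in the saved file.
        target.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
        paths[kind] = target
    return paths


with tempfile.TemporaryDirectory() as tmp:
    paths = save_result(
        tmp,
        "68d62e4500000000130085fc",
        raw={"api_response": {"success": True}},
        clean={"title": "demo"},
        now=datetime(2025, 11, 13, 14, 42, 30),
    )
    saved_name = paths["clean"].name
    saved_clean = json.loads(paths["clean"].read_text(encoding="utf-8"))

print(saved_name)
```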

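4. Automatic image deduplication

Image URLs are deduplicated automatically, preserving order:

# The raw data may contain duplicates
# images: ["url1", "url1", "url2"]

# The returned data is deduplicated
# images: ["url1", "url2"]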
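
One common way to get order-preserving deduplication in Python is dict.fromkeys, as sketched below. Whether the module uses this exact approach is not stated; dedupe_preserving_order is a hypothetical name.

```python
from typing import Iterable, List


def dedupe_preserving_order(urls: Iterable[str]) -> List[str]:
    """Drop repeated URLs while keeping the first occurrence of each in place."""
    # dict keys are unique and keep insertion order (guaranteed since Python 3.7).
    return list(dict.fromkeys(urls))


images = dedupe_preserving_order(["url1", "url1", "url2"])
print(images)
```

Converting to a set would also remove duplicates but would not keep the original image order.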

5. Automatic content type detection

The content type is determined from the video field:

# With a video
detail['video'] = "http://..."
detail['content_type'] = "video"

# Without a video (JSON null is None in Python)
detail['video'] = None
detail['content_type'] = "normal"
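
Expressed as a function, the rule might look like the sketch below. infer_content_type is a hypothetical name, and the check is strictly for None, following the "non-null" wording in this README.

```python
from typing import Optional


def infer_content_type(video: Optional[str]) -> str:
    """Return "video" when a video URL is present (non-null), otherwise "normal"."""
    return "video" if video is not None else "normal"


with_video = infer_content_type("http://...")
without_video = infer_content_type(None)
print(with_video, without_video)
```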

6. Automatic time conversion

Timestamps are converted to a human-readable format automatically:

detail['publish_timestamp'] = 1758877418000  # millisecond timestamp
detail['publish_time'] = "2025-09-26 17:03:38"  # formatted time
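
The conversion can be reproduced with the standard library, as in the sketch below. The sample pair above matches only if the timestamp is rendered in UTC+8, so the fixed offset here is inferred from that example and not stated by the module; format_publish_time is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: publish_time is rendered in UTC+8, inferred from the sample values.
UTC_PLUS_8 = timezone(timedelta(hours=8))


def format_publish_time(timestamp_ms: Optional[int]) -> Optional[str]:
    """Convert a millisecond timestamp to 'YYYY-MM-DD HH:MM:SS', passing None through."""
    if timestamp_ms is None:
        return None
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=UTC_PLUS_8)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


formatted = format_publish_time(1758877418000)
print(formatted)  # 2025-09-26 17:03:38
```

Passing an explicit timezone makes the result independent of the machine's local time setting.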

Data Formats

Clean data (recommended)

{
  "channel_content_id": "68d62e4500000000130085fc",
  "link": "https://www.xiaohongshu.com/explore/68d62e4500000000130085fc",
  "comment_count": null,
  "images": [
    "http://res.cybertogether.net/crawler/image/bf6a0e92ed7252ae8414121edf26f2d3.jpeg"
  ],
  "like_count": 14,
  "body_text": "Full body text...",
  "title": "穿越北极圈，终生难忘",
  "collect_count": 6,
  "channel_account_id": "664954500000000007006ac0",
  "channel_account_name": "Colin SW",
  "content_type": "video",
  "video": "http://sns-video-hw.xhscdn.com/...",
  "publish_timestamp": 1758877418000,
  "publish_time": "2025-09-26 17:03:38"
}

Raw data

The full API response, including all metadata and nested structures:

{
  "note_id": "68d62e4500000000130085fc",
  "timestamp": "20251113_144230",
  "api_response": {
    "success": true,
    "result": [...],
    "tool_name": "get_xhs_detail_by_note_id",
    "call_type": "api"
  }
}

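FAQ

Q: How do I clear the cache?

Option 1: manually delete the data/detail/xiaohongshu_detail/{note_id}/ directory
Option 2: pass force=True to bypass the cache and fetch fresh data

Q: How can I tell whether the cache was used?

A: Check the console output:

Cache hit: ✓ 使用缓存数据: ... ("using cached data")
API request: 正在获取笔记详情: ... (尝试 1/5) ("fetching note details ... (attempt 1/5)")

Q: Why is the video field sometimes null?

The search API does not return the video field
The detail API returns the video field (if the note has a video)
Image-and-text notes have no video, so the video field is null

Q: Why is comment_count null?

A: Some fields may be missing or null in the data returned by the API. They are returned as-is and are not coerced to 0.

Q: How is content_type determined?

Automatically: "video" when the video field is present (non-null)
Otherwise "normal" (image-and-text note)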

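Technical Details

Internal Defaults

Timeout: 30 seconds
Maximum retries: 5
Retry delay: 2 seconds (grows exponentially)
Cache: enabled by default
Output directory: data/detail

Caching Mechanism

The cache directory is derived from the note ID
Cached files are sorted by modification time and the most recent one is returned
The cache is written only after a new request succeeds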
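
A cache lookup along these lines could be sketched as follows. load_latest_cached is a hypothetical helper, and reading from the clean/ subdirectory is an assumption based on the directory layout described in this README; the module's real lookup may differ.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def load_latest_cached(base_dir: str, note_id: str) -> Optional[dict]:
    """Return the most recently modified clean file for note_id, or None on a cache miss."""
    clean_dir = Path(base_dir) / note_id / "clean"
    if not clean_dir.is_dir():
        return None
    files = sorted(clean_dir.glob("*.json"), key=lambda p: p.stat().st_mtime)
    if not files:
        return None
    return json.loads(files[-1].read_text(encoding="utf-8"))


# Demo in a temporary directory with two cached files and explicit modification times.
with tempfile.TemporaryDirectory() as tmp:
    clean_dir = Path(tmp) / "note123" / "clean"
    clean_dir.mkdir(parents=True)
    for name, title, mtime in (
        ("20251113_144230.json", "old", 1_000),
        ("20251114_090000.json", "new", 2_000),
    ):
        path = clean_dir / name
        path.write_text(json.dumps({"title": title}), encoding="utf-8")
        os.utime(path, (mtime, mtime))
    latest = load_latest_cached(tmp, "note123")
    missing = load_latest_cached(tmp, "unknown")

print(latest, missing)
```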

API Success Check

A call counts as successful, and its result is cached, only when the API returns success: true; otherwise the request is retried.

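Using with the Search Module

The detail module is typically used together with the search module:

from script.search import search_xiaohongshu
from script.detail import get_xiaohongshu_detail

# 1. Search for notes (returns summaries); "产品测试" = "product testing"
search_result = search_xiaohongshu("产品测试")

# 2. Fetch details for the notes of interest (full body text and video)
for note in search_result['notes'][:5]:
    note_id = note['channel_content_id']
    detail = get_xiaohongshu_detail(note_id)

    # Summary from the search result
    print(f"Summary: {note['desc']}")

    # Full body from the detail result
    print(f"Full body: {detail['body_text']}")

    # Video (if any)
    if detail['video']:
        print(f"Video: {detail['video']}")

Key differences:

Search API: returns a summary (desc); does not return body_text or video
Detail API: returns the full body text (body_text) and the video URL (video)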
