
Xiaohongshu Search Module

Quick Start

Python API (recommended)

from script.search import search_xiaohongshu
from script.detail import get_xiaohongshu_detail

# Search for notes
data = search_xiaohongshu("产品测试")

# Fetch the details of each note
for note in data['notes']:
    note_id = note['channel_content_id']
    detail = get_xiaohongshu_detail(note_id)
    print(f"{detail['title']}")
    print(f"{detail['body_text']}")  # full body text

Command-line tools

# Search
python script/search/xiaohongshu_search.py --keyword "产品测试"

# Detail
python script/detail/xiaohongshu_detail.py --note-id "6915588b00000000040143b5"

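API Reference

1. Search interface

Function signature

data = search_xiaohongshu(
    keyword: str,           # required: search keyword
    content_type="不限",    # optional: "不限" (any), "视频" (video), "图文" (image and text)
    sort_type="综合",       # optional: "综合" (general), "最新" (newest), "最多点赞" (most liked), "最多评论" (most commented)
    publish_time="不限",    # optional: "不限" (any), "一天内" (past day), "一周内" (past week), "半年内" (past six months)
    page=1,                 # optional: page number (the cursor is handled automatically)
    force=False             # optional: force a refresh, bypassing the cache
)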

Return value

{
  "search_params": {      # the search parameters that were used
    "keyword": "产品测试",
    "content_type": "视频",
    "sort_type": "最新",
    "publish_time": "一周内",
    "cursor": "",
    "page": 1,
    "timestamp": "20251113_133258"
  },
  "has_more": True,       # whether more results are available
  "next_cursor": "...",   # cursor for the next page (internal use)
  "notes": [...]          # list of notes
}

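Note fields

Field	Type	Description
channel_content_id	string/null	Note ID
link	string/null	Note URL
title	string/null	Title
desc	string/null	Summary (as returned by the search interface)
body_text	null	Full body text (not returned by the search interface; call the detail interface)
channel_account_name	string/null	Author name
channel_account_id	string/null	Author ID
like_count	number/null	Number of likes
comment_count	number/null	Number of comments
collect_count	number/null	Number of collects (saves)
shared_count	number/null	Number of shares
images	array	List of image URLs (an empty array means no images)
video	null	Video URL (not returned by the search interface; call the detail interface)
content_type	string/null	Content type (video/note)

Note: Fields with no value are always represented as null, never as an empty string or 0.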
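Because count and text fields can be null (None in Python), code that sums or slices them needs a guard. The following is a minimal sketch of two helpers for that; the names count_or_zero and text_or_empty are hypothetical and are not part of this module.

```python
from typing import Any


def count_or_zero(note: dict[str, Any], field: str) -> int:
    """Return a numeric field, treating a missing key or None (JSON null) as 0."""
    value = note.get(field)
    return 0 if value is None else value


def text_or_empty(note: dict[str, Any], field: str) -> str:
    """Return a text field, treating a missing key or None (JSON null) as ''."""
    value = note.get(field)
    return "" if value is None else value


# Example note shaped like a search result (values are illustrative only).
note = {"title": None, "like_count": None, "comment_count": 3}
total = count_or_zero(note, "like_count") + count_or_zero(note, "comment_count")
preview = text_or_empty(note, "title")[:50]
```

Note that these helpers deliberately collapse "unknown" into 0 or an empty string, so they suit display and aggregation code but not code that must distinguish a missing value from a real zero.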

2. Detail interface

Function signature

detail = get_xiaohongshu_detail(
    note_id: str,          # required: note ID
    force=False            # optional: force a refresh, bypassing the cache
)

Return value

{
  "channel_content_id": "68d62e4500000000130085fc",
  "link": "https://www.xiaohongshu.com/explore/...",
  "comment_count": null,
  "images": ["http://res.cybertogether.net/..."],
  "like_count": 14,
  "body_text": "Full body text...",
  "title": "Note title",
  "collect_count": 6,
  "channel_account_id": "664954500000000007006ac0",
  "channel_account_name": "Author name",
  "content_type": "video",  # derived from the video field: "video" if a video is present, otherwise "normal"
  "video": "http://sns-video-hw.xhscdn.com/...",
  "publish_timestamp": 1758877418000,
  "publish_time": "2025-09-26 17:03:38"
}

Field descriptions

Field	Type	Description
channel_content_id	string/null	Note ID
link	string/null	Note URL
title	string/null	Title
body_text	string/null	Full body text
channel_account_name	string/null	Author name
channel_account_id	string/null	Author ID
like_count	number/null	Number of likes
comment_count	number/null	Number of comments
collect_count	number/null	Number of collects (saves)
images	array	List of image URLs (deduplicated)
video	string/null	Video URL
content_type	string	Content type ("video" or "normal", derived from the video field)
publish_timestamp	number/null	Publish timestamp (milliseconds)
publish_time	string/null	Publish time (format: YYYY-MM-DD HH:MM:SS)
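The two time fields carry the same instant in different forms. As a sketch of how they relate, the function below converts publish_timestamp (milliseconds) into the publish_time format. The UTC+8 offset is an assumption inferred from the sample values in the return value above, and format_publish_time is a hypothetical helper, not a function of this module.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: publish_time is rendered in China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))


def format_publish_time(publish_timestamp_ms: Optional[int]) -> Optional[str]:
    """Convert a millisecond timestamp to 'YYYY-MM-DD HH:MM:SS', keeping None as None."""
    if publish_timestamp_ms is None:
        return None
    moment = datetime.fromtimestamp(publish_timestamp_ms / 1000, tz=CST)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


print(format_publish_time(1758877418000))  # → 2025-09-26 17:03:38
```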

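Notes:

Fields with no value are always represented as null
Images are automatically deduplicated, preserving their original order
content_type is derived automatically: "video" if video is present, otherwise "normal"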
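The last two rules can be sketched in a few lines of Python. This is an illustration of the described behavior under stated assumptions, not the module's actual code; dedupe_in_order and infer_content_type are hypothetical names.

```python
from typing import Optional


def dedupe_in_order(urls: list[str]) -> list[str]:
    """Drop repeated URLs while keeping the first occurrence of each, in order."""
    # dict preserves insertion order, so its keys form an ordered unique list.
    return list(dict.fromkeys(urls))


def infer_content_type(video: Optional[str]) -> str:
    """Return "video" when a video URL is present, otherwise "normal"."""
    return "video" if video else "normal"


images = dedupe_in_order(["a.jpg", "b.jpg", "a.jpg", "c.jpg"])
print(images)                                  # → ['a.jpg', 'b.jpg', 'c.jpg']
print(infer_content_type(None))                # → normal
print(infer_content_type("http://v/1.mp4"))    # → video
```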

Usage example

from script.detail import get_xiaohongshu_detail

# Fetch the details of a note
detail = get_xiaohongshu_detail("68d62e4500000000130085fc")

print(f"Title: {detail['title']}")
print(f"Body: {detail['body_text']}")
print(f"Video: {detail['video']}")
print(f"Type: {detail['content_type']}")
print(f"Published: {detail['publish_time']}")

Usage Examples

1. Basic search

from script.search import search_xiaohongshu

data = search_xiaohongshu("产品测试")

print(f"Found {len(data['notes'])} notes")
for note in data['notes']:
    print(f"- {note['title']} ({note['like_count']} likes)")

2. Search with parameters

data = search_xiaohongshu(
    keyword="产品测试",
    content_type="视频",
    sort_type="最新",
    publish_time="一周内"
)

3. Paging (handled automatically)

# Pass the page number directly; the cursor is handled for you
page1 = search_xiaohongshu("产品测试", page=1)
page2 = search_xiaohongshu("产品测试", page=2)
page3 = search_xiaohongshu("产品测试", page=3)
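To collect notes across pages without hard-coding each call, a loop can stop as soon as has_more is false. The sketch below assumes only the documented page parameter and the has_more and notes keys of the return value; iter_notes is a hypothetical helper, and the search function is passed in so the loop stays independent of the import path.

```python
from typing import Callable, Iterator


def iter_notes(search: Callable[..., dict], keyword: str, max_pages: int = 3) -> Iterator[dict]:
    """Yield notes page by page, stopping at max_pages or when has_more is false."""
    # Pages are requested in ascending order, starting from page 1.
    for page in range(1, max_pages + 1):
        data = search(keyword, page=page)
        yield from data["notes"]
        if not data.get("has_more"):
            break
```

Typical use would be `for note in iter_notes(search_xiaohongshu, "产品测试"): ...`. The max_pages cap is a deliberate safety limit so that a popular keyword does not trigger an unbounded number of requests.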

4. Force refresh

# Ignore the cache and request the API again
data = search_xiaohongshu("产品测试", force=True)

5. Batch search

keywords = ["产品测试", "软件测试", "性能测试"]

for keyword in keywords:
    data = search_xiaohongshu(keyword)
    print(f"{keyword}: {len(data['notes'])} notes")

6. Data analysis

from script.search import search_xiaohongshu

def analyze_topic(keyword):
    """Analyze how popular a topic is."""
    data = search_xiaohongshu(
        keyword=keyword,
        sort_type="最新",
        publish_time="一周内"
    )

    notes = data['notes']
    # like_count may be None (JSON null), so count missing values as 0
    total_likes = sum(n['like_count'] or 0 for n in notes)
    avg_likes = total_likes / len(notes) if notes else 0

    print(f"Keyword: {keyword}")
    print(f"Notes: {len(notes)}")
    print(f"Total likes: {total_likes}")
    print(f"Average likes: {avg_likes:.1f}")

analyze_topic("产品测试")

7. Search + detail (full body text)

from script.search import search_xiaohongshu
from script.detail import get_xiaohongshu_detail

# Search for notes
data = search_xiaohongshu("产品测试", publish_time="一周内")

# Fetch full details for the first 3 notes
for note in data['notes'][:3]:
    note_id = note['channel_content_id']

    # Fetch the detail record
    detail = get_xiaohongshu_detail(note_id)

    # desc and body_text may be None (JSON null), so fall back to ''
    print(f"\nTitle: {note['title']}")
    print(f"Summary: {(note['desc'] or '')[:50]}...")
    print(f"Full body: {(detail['body_text'] or '')[:100]}...")
    print(f"Likes: {note['like_count']}")

Command-Line Usage

Search interface

Basic search

python script/search/xiaohongshu_search.py --keyword "产品测试"

Search with parameters

python script/search/xiaohongshu_search.py \
  --keyword "产品测试" \
  --content-type "视频" \
  --sort-type "最新" \
  --publish-time "一周内"

Force refresh

python script/search/xiaohongshu_search.py --keyword "产品测试" --force

Disable the cache

python script/search/xiaohongshu_search.py --keyword "产品测试" --no-cache

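All parameters

Parameter	Default	Description
--keyword	required	Search keyword
--content-type	"不限"	Content type: 不限 (any), 视频 (video), 图文 (image and text)
--sort-type	"综合"	Sort order: 综合 (general), 最新 (newest), 最多点赞 (most liked), 最多评论 (most commented)
--publish-time	"不限"	Publish time: 不限 (any), 一天内 (past day), 一周内 (past week), 半年内 (past six months)
--page	1	Page number
--cursor	""	Paging cursor
--force	False	Force refresh
--no-cache	False	Disable the cache
--results-dir	data/search	Output directory
--timeout	30	Timeout (seconds)
--max-retries	5	Maximum number of retries
--retry-delay	2	Retry delay (seconds)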
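For readers who want to wrap or mirror this interface, the table above maps naturally onto an argparse definition. The following is only a sketch built from the documented flags and defaults; it is not the script's actual parser, and both the value types (for example int for --retry-delay) and the restriction to the listed choices are assumptions.

```python
import argparse


def build_search_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented search flags and defaults."""
    parser = argparse.ArgumentParser(description="Xiaohongshu search (illustrative sketch)")
    parser.add_argument("--keyword", required=True, help="search keyword")
    parser.add_argument("--content-type", default="不限", choices=["不限", "视频", "图文"])
    parser.add_argument("--sort-type", default="综合",
                        choices=["综合", "最新", "最多点赞", "最多评论"])
    parser.add_argument("--publish-time", default="不限",
                        choices=["不限", "一天内", "一周内", "半年内"])
    parser.add_argument("--page", type=int, default=1, help="page number")
    parser.add_argument("--cursor", default="", help="paging cursor")
    parser.add_argument("--force", action="store_true", help="force refresh")
    parser.add_argument("--no-cache", action="store_true", help="disable the cache")
    parser.add_argument("--results-dir", default="data/search", help="output directory")
    parser.add_argument("--timeout", type=int, default=30, help="timeout in seconds")
    parser.add_argument("--max-retries", type=int, default=5, help="maximum retries")
    parser.add_argument("--retry-delay", type=int, default=2, help="retry delay in seconds")
    return parser


args = build_search_parser().parse_args(["--keyword", "产品测试", "--content-type", "视频", "--force"])
print(args.content_type, args.force, args.no_cache)  # → 视频 True False
```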

Detail interface

Basic usage

python script/detail/xiaohongshu_detail.py --note-id "6915588b00000000040143b5"

Force refresh

python script/detail/xiaohongshu_detail.py --note-id "6915588b00000000040143b5" --force

All parameters

Parameter	Default	Description
--note-id	required	Note ID
--force	False	Force refresh
--no-cache	False	Disable the cache
--results-dir	data/detail	Output directory
--timeout	30	Timeout (seconds)
--max-retries	5	Maximum number of retries
--retry-delay	2	Retry delay (seconds)

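Core Features

1. Automatic caching (enabled by default)

Searches with identical parameters are served from the cache:

# First call: requests the API
data1 = search_xiaohongshu("产品测试")

# Second call: served from the cache
data2 = search_xiaohongshu("产品测试")  # returns immediately

# Force a refresh
data3 = search_xiaohongshu("产品测试", force=True)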

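2. Automatic retry (up to 5 retries on failure)

Timeout errors: retried automatically
Connection errors: retried automatically
5xx server errors: retried automatically
4xx client errors: not retried

Exponential backoff: 2 s → 4 s → 8 s → 16 s → 32 s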
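The retry policy above can be sketched as a small wrapper. This is an illustration, not the module's implementation: call_with_retry is a hypothetical name, it assumes one initial attempt followed by up to max_retries retries, and it only covers timeout and connection errors (retrying on HTTP 5xx would need a check specific to the HTTP client in use).

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 2.0,
    retryable: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying retryable errors with delays of 2, 4, 8, 16, 32 seconds."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retryable:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * 2 ** attempt)  # delay doubles after each failure
    raise AssertionError("unreachable")
```

Errors outside the retryable tuple propagate immediately, which matches the rule that 4xx client errors are not retried. The sleep parameter is injectable so the schedule can be tested without waiting.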

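3. Automatic saving (done in the background)

Search results are saved automatically to data/search/xiaohongshu_search/

Directory layout:

data/search/xiaohongshu_search/
└── {keyword}/
    ├── raw/                           # raw data
    │   └── {timestamp}_page{page}_{params}.json
    └── clean/                         # cleaned data
        └── {timestamp}_page{page}_{params}.json

File name examples:

Default parameters: 20251113_133315_page1_不限_综合_不限.json
Custom parameters: 20251113_133258_page1_视频_最新_一周内.json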
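Judging from the two examples, the {params} part appears to be content_type, sort_type and publish_time joined with underscores. The sketch below reproduces that naming under this assumption; build_result_filename is a hypothetical helper and the module may assemble names differently.

```python
from datetime import datetime


def build_result_filename(
    timestamp: str,
    page: int,
    content_type: str = "不限",
    sort_type: str = "综合",
    publish_time: str = "不限",
) -> str:
    """Build '{timestamp}_page{page}_{content_type}_{sort_type}_{publish_time}.json'."""
    return f"{timestamp}_page{page}_{content_type}_{sort_type}_{publish_time}.json"


# The timestamp format matches the examples: YYYYMMDD_HHMMSS.
stamp = datetime(2025, 11, 13, 13, 33, 15).strftime("%Y%m%d_%H%M%S")
print(build_result_filename(stamp, 1))
# → 20251113_133315_page1_不限_综合_不限.json
```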

4. Automatic paging (the cursor is handled internally)

# No need to manage the cursor yourself
page1 = search_xiaohongshu("产品测试", page=1)
page2 = search_xiaohongshu("产品测试", page=2)  # automatically uses the cursor from page1
page3 = search_xiaohongshu("产品测试", page=3)  # automatically uses the cursor from page2

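5. Automatic keyword sanitization

Special characters in the keyword are replaced automatically so that the keyword is safe to use as a directory name:

# Special characters are replaced automatically
search_xiaohongshu("测试/产品:问题?")
# → directory name: 测试_产品_问题_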
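A sanitizer with this behavior can be written with a single regular expression. The exact character set the module replaces is not documented, so the set below (characters that are invalid in Windows file names) is an assumption, and sanitize_keyword is a hypothetical name.

```python
import re

# Assumption: replace the characters that are unsafe in file and directory names.
_UNSAFE_CHARS = re.compile(r'[\\/:*?"<>|]')


def sanitize_keyword(keyword: str) -> str:
    """Replace each unsafe character with an underscore."""
    return _UNSAFE_CHARS.sub("_", keyword)


print(sanitize_keyword("测试/产品:问题?"))  # → 测试_产品_问题_
```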

Data Format

Clean data (recommended)

{
  "search_params": {
    "keyword": "产品测试",
    "content_type": "视频",
    "sort_type": "最新",
    "publish_time": "一周内",
    "cursor": "",
    "page": 1,
    "timestamp": "20251113_133258"
  },
  "has_more": true,
  "next_cursor": "2@2fl1kgnh0gdx2oarsbpxc@...",
  "notes": [
    {
      "channel_content_id": "6915588b00000000040143b5",
      "link": "https://www.xiaohongshu.com/explore/6915588b00000000040143b5",
      "title": "Note title",
      "desc": "Note summary...",
      "body_text": "",
      "channel_account_name": "Author name",
      "channel_account_id": "5b1e2c0811be10762dee6859",
      "like_count": 2,
      "comment_count": 0,
      "collect_count": 1,
      "shared_count": 0,
      "images": ["https://..."],
      "video": "",
      "content_type": "video"
    }
  ]
}

Raw data

The complete API response, including all metadata and nested structures.

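Caveats

About desc and body_text

desc: the summary returned by the search interface (truncated)
body_text: the full body text (empty in search results; call the detail interface get_xiaohongshu_detail to obtain it)

About video

The search interface does not return the video URL
Call the detail interface to obtain it

Rate limiting

Leave 1 to 2 seconds between searches
Avoid sending many requests in a short period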
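One simple way to follow this advice in a batch job is to sleep between consecutive searches. The sketch below assumes nothing beyond a search function that takes a keyword; search_many is a hypothetical helper and the 1.5 second default is just a value inside the recommended range.

```python
import time
from typing import Callable, Iterable


def search_many(
    search: Callable[[str], dict],
    keywords: Iterable[str],
    interval: float = 1.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, dict]:
    """Run one search per keyword, pausing `interval` seconds between requests."""
    results: dict[str, dict] = {}
    for index, keyword in enumerate(keywords):
        if index > 0:
            sleep(interval)  # pause before every request except the first
        results[keyword] = search(keyword)
    return results
```

Typical use would be `search_many(search_xiaohongshu, ["产品测试", "软件测试"])`. Since cached results return immediately, the pause matters mostly for keywords that actually reach the API.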

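FAQ

Q: How do I get the full body text?

A: The search interface returns only a summary. To get the full body text, call the detail interface:

# 1. Search first to get the list of notes
data = search_xiaohongshu("产品测试")

# 2. Call the detail interface for the notes you are interested in
note_id = data['notes'][0]['channel_content_id']
# Call get_xiaohongshu_detail(note_id) to get the full body text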

Q: How do I clear the cache?

Option 1: manually delete the data/search/xiaohongshu_search/{keyword}/ directory
Option 2: pass force=True to force a refresh

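Q: How can I tell whether the cache was used?

A: Check the console output:

Cache hit: ✓ 使用缓存数据: ... ("using cached data")
API request: 正在搜索关键词: ... (尝试 1/3) ("searching keyword ... (attempt 1/3)")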

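Q: Where is the cursor when paging?

A: The cursor is handled automatically; you do not need to manage it:

# ✅ Recommended: pass the page number directly
page2 = search_xiaohongshu("产品测试", page=2)

# ❌ Not needed: passing the cursor manually
# page2 = search_xiaohongshu("产品测试", cursor="...")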

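Technical Details

Internal default configuration

Timeout: 30 seconds
Maximum retries: 5
Retry delay: 2 seconds (doubles after each retry)
Cache: enabled by default
Output directory: data/search

Caching mechanism

The cache key is built from the search parameters (keyword + content_type + sort_type + publish_time + cursor)
When several cache files match the same parameters, the most recent one is returned
Files are ordered by modification time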
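The lookup described above can be sketched as follows. How the module matches files to a cache key is not documented, so this sketch makes a loud assumption: it matches on the file name pattern {timestamp}_page{page}_{content_type}_{sort_type}_{publish_time}.json inside one keyword's clean directory, using the page number in place of the cursor. Both function names are hypothetical.

```python
from pathlib import Path
from typing import Optional, Tuple


def make_cache_key(keyword: str, content_type: str, sort_type: str,
                   publish_time: str, cursor: str) -> Tuple[str, ...]:
    """Combine the documented parameters into a hashable cache key."""
    return (keyword, content_type, sort_type, publish_time, cursor)


def latest_cache_file(clean_dir: Path, page: int, content_type: str,
                      sort_type: str, publish_time: str) -> Optional[Path]:
    """Return the most recently modified matching file, or None on a cache miss."""
    # Assumption: parameters are encoded in the file name, as in the examples.
    pattern = f"*_page{page}_{content_type}_{sort_type}_{publish_time}.json"
    candidates = list(clean_dir.glob(pattern))
    # Pick the newest file by modification time, as the rules above describe.
    return max(candidates, key=lambda path: path.stat().st_mtime, default=None)
```

A call such as `latest_cache_file(Path("data/search/xiaohongshu_search/产品测试/clean"), 1, "不限", "综合", "不限")` would then return the newest cached first page for the default parameters, if one exists.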

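How automatic paging works

# When page=2 is requested, the module automatically:
# 1. Reads the cached result for page=1
# 2. Extracts next_cursor from it
# 3. Requests page=2 with that cursor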
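The three steps can be expressed as a small function that resolves the cursor for a given page. This is a sketch of the described logic only: cursor_for_page and the load_cached_page callback are hypothetical, and what the module does when the previous page is not cached is not documented, so raising an error here is an assumption.

```python
from typing import Callable, Optional


def cursor_for_page(page: int, load_cached_page: Callable[[int], Optional[dict]]) -> str:
    """Return the cursor needed to request `page`, using the cached previous page."""
    if page <= 1:
        return ""  # the first page is requested with an empty cursor
    previous = load_cached_page(page - 1)  # step 1: read the cached previous page
    if previous is None:
        # Assumption: without a cached previous page the cursor cannot be resolved.
        raise LookupError(f"page {page - 1} is not cached; request it first")
    return previous["next_cursor"]  # step 2: extract next_cursor (step 3 uses it)
```

One consequence of this design is that pages are best requested in ascending order, since each page depends on the cached result of the one before it.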
