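Long-Form Novel SFT Data Generation

Reverse-engineers web novels and scripts into "thinking steps an AI can learn from" and generates three types of SFT training data.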

Design

Overall pipeline

Source txt
  │
  ▼  step1_analyze.py (one LLM call, 500K-character window)
Analysis JSON
  [outline / characters / beats]
  │
  ▼  step2_build_sft.py (2-3 LLM calls per beat)
Three JSONL files
  task1_structure_planning.jsonl
  task2_scene_continuation.jsonl
  task3_shuang_injection.jsonl

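Beat segmentation and anchoring

During analysis, the full text is split into beats (narrative units) following the Scene-Sequel structure.

Beat boundaries are located with text anchors: the LLM copies the first 20-25 characters of each beat verbatim from the source, and Python uses str.find() to pin down the exact character offset (falling back to progressively shorter prefixes, down to 8 characters). This does not depend on chapter-title formatting, so it works with any naming style.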
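
The anchor lookup described above can be sketched as follows. This is a minimal illustration, not the code in step1_analyze.py: the function names locate_anchor and beat_spans, the forward-moving search cursor, and the choice to skip beats whose anchor cannot be found are all assumptions.

```python
def locate_anchor(text, anchor, start=0, min_len=8):
    """Return the character index of `anchor` in `text`, or -1 if not found.

    Tries the full anchor first, then progressively shorter prefixes,
    stopping at `min_len` characters (or the anchor length if shorter).
    """
    if not anchor:
        return -1
    shortest = min(min_len, len(anchor))
    for length in range(len(anchor), shortest - 1, -1):
        pos = text.find(anchor[:length], start)
        if pos != -1:
            return pos
    return -1


def beat_spans(text, anchors):
    """Turn ordered beat anchors into (start, end) character spans."""
    starts = []
    cursor = 0
    for anchor in anchors:
        pos = locate_anchor(text, anchor, cursor)
        if pos == -1:
            continue  # assumption: beats with an unlocatable anchor are skipped
        starts.append(pos)
        cursor = pos + 1  # later beats must start after this one
    ends = starts[1:] + [len(text)]
    return list(zip(starts, ends))
```

Each beat here runs from its own anchor to the next beat's anchor, and the last beat runs to the end of the text.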

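The three SFT tasks

Task 1: Structure Planning

Goal: teach the model to plan the structure of the next Scene-Sequel unit, given the current story state.

Field	Content
Input	Story state (MICE threads, the previous Disaster/Decision, current position) + preceding text (last 800 characters)
Output	`<think>` narrative-state analysis + continuation decision `</think>` + structure-plan JSON

Output JSON fields: scene (goal/conflict_type/disaster/pacing), sequel (reaction/dilemma/decision), hooks, shuang_point, mice_advancement

Data source: the beat's actual structure serves as reference information, from which the LLM reverse-generates a CoT written from a "planning ahead" perspective. The user-side input does not contain the beat's actual content.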
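
To make the Task 1 output shape concrete, here is a hedged sketch of how an assistant message could be split into its `<think>` part and the plan JSON, then checked for the top-level fields listed above. The helper names are hypothetical, and it assumes the JSON follows the closing tag as raw text rather than inside a code fence.

```python
import json
import re

# Top-level fields named in this README for the structure-plan JSON.
EXPECTED_KEYS = {"scene", "sequel", "hooks", "shuang_point", "mice_advancement"}

THINK_RE = re.compile(r"<think>\s*(.*?)\s*</think>\s*(.*)", re.DOTALL)


def split_think(content):
    """Split assistant content into (think_text, remainder)."""
    match = THINK_RE.match(content.strip())
    if match is None:
        raise ValueError("missing <think>...</think> block")
    return match.group(1), match.group(2)


def parse_plan(content):
    """Return (think_text, plan_dict); raise if top-level fields are missing."""
    think, body = split_think(content)
    plan = json.loads(body)
    missing = EXPECTED_KEYS - plan.keys()
    if missing:
        raise ValueError(f"plan is missing fields: {sorted(missing)}")
    return think, plan
```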

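Task 2: Scene Continuation

Goal: teach the model to write prose from a structure plan.

Field	Content
Input	Preceding text (500-1500 characters) + the structure plan output by Task 1
Output	`<think>` understanding of the preceding text + writing decisions `</think>` + continuation text

Data source: the CoT is generated by the LLM (given the Task 1 plan plus the first 300 characters of the beat as a hint); the body uses the original beat text as ground truth.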
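
As an illustration of the two text slices mentioned above (the preceding context and the 300-character hint), a minimal sketch could look like this. The function names are hypothetical, and the scripts may pick the context length differently, for example by varying it within the 500-1500 range.

```python
def preceding_context(text, beat_start, context_chars=800):
    """Return up to `context_chars` characters immediately before the beat."""
    return text[max(0, beat_start - context_chars):beat_start]


def beat_hint(text, beat_start, beat_end, hint_chars=300):
    """Return the first `hint_chars` characters of the beat, never past its end."""
    return text[beat_start:min(beat_end, beat_start + hint_chars)]
```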

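Task 3: Shuang Point Injection

Goal: teach the model to rewrite a flat draft into a version with a shuang point (爽点, a gratifying payoff moment).

Field	Content
Input	Flat draft + shuang point type: 打脸 (face-slapping), 升级 (leveling up), 装逼 (showing off), 获得 (acquisition), 碾压 (crushing victory) + intensity (low/medium/high)
Output	`<think>` draft analysis + shuang point design `</think>` + enhanced text + notes on the changes

Data source: only beats marked has_shuang=true in the analysis are processed. The LLM generates a "flat draft" from the original text (removing the shuang point while keeping the plot), and the original text serves as the enhanced-version ground truth.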
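
The beat filter for Task 3 can be pictured with the sketch below. It assumes the analysis JSON holds a top-level "beats" list whose items carry a boolean has_shuang field; the exact layout of the analysis file is not documented here, so treat the key names other than has_shuang as hypothetical.

```python
def shuang_beats(analysis):
    """Return only the beats flagged has_shuang=true in an analysis dict."""
    return [beat for beat in analysis.get("beats", []) if beat.get("has_shuang") is True]
```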

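Multi-window coherence

Novels longer than 500K characters are processed in multiple windows. Each later window receives the previous window's character and plot-thread metadata via --prev-analysis, so that character relationships and MICE threads stay continuous across the whole book.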
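
A rough sketch of the window plan follows. It assumes fixed-size, non-overlapping cuts and reuses the analysis/w{N}.json naming from the file structure below; whether run_pipeline.py aligns cuts to beat or chapter boundaries is not stated, and plan_windows is a hypothetical name.

```python
def plan_windows(novel_len, window_size=500_000):
    """List the character range of each window and which analysis it inherits."""
    plans = []
    for index, start in enumerate(range(0, novel_len, window_size)):
        plans.append({
            "index": index,
            "start": start,
            "end": min(start + window_size, novel_len),
            # Window 0 has no predecessor; later windows would pass this
            # path to --prev-analysis.
            "prev_analysis": f"analysis/w{index - 1}.json" if index > 0 else None,
        })
    return plans
```

For a 1.2M-character novel this yields three windows, the last one covering the remaining 200K characters.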

File structure

sft/
  step1_analyze.py      # 500K-window analysis → analysis JSON
  step2_build_sft.py    # analysis JSON → three JSONL files
  run_pipeline.py       # one-command batch run with resume support
  README.md

runs/{book_title}/      # run output (created automatically by run_pipeline.py)
  analysis/
    w0.json             # analysis result for window 0
    w1.json             # analysis result for window 1 (if any)
  sft_raw/
    w0/                 # SFT data for window 0
      task1_structure_planning.jsonl
      task2_scene_continuation.jsonl
      task3_shuang_injection.jsonl
      stats.json
  merged/               # final data merged across all windows
    task1_structure_planning.jsonl
    task2_scene_continuation.jsonl
    task3_shuang_injection.jsonl
    stats.json
  pipeline.log          # run log

Environment setup

# .env (project root)
ALI_API_KEY=sk-...
ALI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1

Dependencies:

pip install openai python-dotenv

Usage

One-command run (recommended)

cd examples/analyze_story/sft

python run_pipeline.py --novel ../input/大奉打更人.txt

Resuming after an interruption (rerun the same command; completed windows are skipped automatically):

python run_pipeline.py --novel ../input/大奉打更人.txt
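
One plausible way to implement this kind of resume check is sketched below: a window counts as finished when its analysis file exists and parses as JSON. This is an assumption for illustration only; run_pipeline.py may use a different completion criterion, and is_done and pending_windows are hypothetical names.

```python
import json
from pathlib import Path


def is_done(path):
    """A window counts as done if its output file exists and parses as JSON."""
    path = Path(path)
    if not path.is_file():
        return False
    try:
        json.loads(path.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False  # a truncated file from an interrupted run is redone
    return True


def pending_windows(run_dir, n_windows, force=False):
    """Return the window indices that still need the analysis step."""
    analysis_dir = Path(run_dir) / "analysis"
    return [
        index
        for index in range(n_windows)
        if force or not is_done(analysis_dir / f"w{index}.json")
    ]
```

Passing force=True mirrors the documented --force flag by ignoring existing files.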

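Common options

--novel           Path to the novel txt file (required)
--output-dir      Output root directory (default: sft/runs/{book_title}/)
--window-size     Characters per window (default: 500000)
--model           Model name (default: qwen-plus)
--context-chars   Number of preceding-text characters, used by Task 1/2 (default: 800)
--concurrency     Number of concurrent calls in step2 (default: 5)
--skip-task N     Skip a task; may be given multiple times (e.g. --skip-task 3)
--only-step 1     Run analysis only, without generating SFT data
--only-step 2     Generate SFT data only (requires analysis/ to exist)
--force           Force a rerun, ignoring existing files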

Batch-processing multiple books

for f in ../input/*.txt; do
    python run_pipeline.py --novel "$f" --concurrency 8
done

Calling the steps individually

# Analyze only the first window
python step1_analyze.py \
  --novel ../input/大奉打更人.txt \
  --output runs/大奉打更人/analysis/w0.json

# Generate SFT data only
python step2_build_sft.py \
  --analysis runs/大奉打更人/analysis/w0.json \
  --novel ../input/大奉打更人.txt \
  --output-dir runs/大奉打更人/sft_raw/w0/

JSONL format

One training sample per line:

{
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user",   "content": "..."},
    {"role": "assistant", "content": "<think>\n...\n</think>\n\n..."}
  ],
  "metadata": {
    "task_type": "structure_planning | scene_continuation | shuang_injection",
    "source_file": "大奉打更人",
    "chapter": "第4章",
    "position_percent": 3.8,
    "mice_thread": "税银案",
    "beat_id": "beat_003",
    "word_count": 3200
  }
}
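
A small sanity check for samples in this format might look like the sketch below. It only verifies what the example above shows (the three message roles, the `<think>` wrapper, and the task_type values); validate_sample is a hypothetical helper and not part of the pipeline.

```python
import json

TASK_TYPES = {"structure_planning", "scene_continuation", "shuang_injection"}


def validate_sample(line):
    """Parse one JSONL line and check it matches the documented shape."""
    sample = json.loads(line)
    roles = [message["role"] for message in sample["messages"]]
    if roles != ["system", "user", "assistant"]:
        raise ValueError(f"unexpected roles: {roles}")
    assistant = sample["messages"][-1]["content"]
    if not assistant.startswith("<think>") or "</think>" not in assistant:
        raise ValueError("assistant content lacks a <think>...</think> block")
    task_type = sample["metadata"]["task_type"]
    if task_type not in TASK_TYPES:
        raise ValueError(f"unknown task_type: {task_type}")
    return sample
```

Running it over a file is then a matter of calling validate_sample on each non-empty line opened with encoding="utf-8".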