
AutoScraperX Configuration Guide

This document describes the configuration options of the AutoScraperX project.


Environment Configuration

Environment settings are read from the .env file. All available options are listed below:

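| Option | Description | Required | Default |
| --- | --- | --- | --- |
| ENV | Runtime environment (allowed values: prod, dev) | | prod |
| DB_HOST | Database host address | | |
| DB_PORT | Database port | | 3306 |
| DB_USER | Database username | | |
| DB_PASSWORD | Database password | | |
| DB_NAME | Database name | | |
| DB_CHARSET | Database character set | | |
| ROCKETMQ_ENDPOINT | RocketMQ endpoint | | |
| ROCKETMQ_ACCESS_KEY_ID | RocketMQ access key ID | | |
| ROCKETMQ_ACCESS_KEY_SECRET | RocketMQ access key secret | | |
| FEISHU_APPID | Feishu app ID | | |
| FEISHU_APPSECRET | Feishu app secret | | |
| ALIYUN_ACCESS_KEY_ID | Alibaba Cloud access key ID | | |
| ALIYUN_ACCESS_KEY_SECRET | Alibaba Cloud access key secret | | |
| REDIS_HOST | Redis host address | | |
| REDIS_PORT | Redis port | | 6379 |
| REDIS_PASSWORD | Redis password | | |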
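
As an illustration of how these options and their defaults could be resolved, here is a minimal sketch using only the Python standard library. The helper names `parse_env_text` and `resolve_settings` are hypothetical and not part of AutoScraperX; the project may well use a dedicated .env loader instead. Only the three defaults documented above are applied.

```python
# Hypothetical helpers for illustration; not taken from the AutoScraperX code base.

# Defaults documented in the table above; other options have no documented default.
DEFAULTS = {"ENV": "prod", "DB_PORT": "3306", "REDIS_PORT": "6379"}


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_settings(file_values):
    """Apply documented defaults, validate ENV, and convert ports to integers."""
    settings = {**DEFAULTS, **file_values}
    if settings["ENV"] not in ("prod", "dev"):
        raise ValueError(f"ENV must be 'prod' or 'dev', got {settings['ENV']!r}")
    settings["DB_PORT"] = int(settings["DB_PORT"])
    settings["REDIS_PORT"] = int(settings["REDIS_PORT"])
    return settings
```

For example, `resolve_settings(parse_env_text(open(".env").read()))` would return a dictionary in which `ENV` falls back to `prod` when the file does not set it.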

Spider Configuration

Spiders are configured in the config/spiders_config.yaml file.

Configuration Example

default:
  base_url: http://8.217.192.46:8889
  request_timeout: 30
  max_retries: 3
  headers:
    {"Content-Type": "application/json"}

benshanzhufurecommend:
  platform: benshanzhufu
  mode: recommend
  path: /crawler/ben_shan_zhu_fu/recommend
  method: post
  request_body:
    cursor: "{{next_cursor}}"
  loop_times: 50
  loop_interval:
    min: 30
    max: 60
  feishu_sheetid: "aTSJH4"
  response_parse:
    data: "$.data"
    next_cursor: "$.data.next_cursor"
    data_path: "$.data.data"
    fields:
      video_id: "$.nid"
      video_title: "$.title"
      play_cnt: 0
      publish_time_stamp: "$.update_time"
      out_user_id: "$.nid"
      cover_url: "$.video_cover"
      like_cnt: 0
      video_url: "$.video_url"
      out_video_id: "$.nid"


yuannifuqimanmanrecommend:
  platform: yuannifuqimanman
  mode: recommend
  path: /crawler/yuan_ni_fu_qi_man_man/recommend
  method: post
  request_body:
    cursor: "{{next_cursor}}"
  loop_times: 100
  loop_interval:
    min: 30
    max: 60
  feishu_sheetid: "golXy9"
  response_parse:
    data: "$.data"
    next_cursor: "$.data.next_cursor"
    data_path: "$.data.data"
    fields:
      video_id: "$.nid"
      video_title: "$.title"
      out_user_id: "$.nid"
      cover_url: "$.video_cover"
      video_url: "$.video_url"
      out_video_id: "$.nid"

xiaoniangaoauthor:
  platform: xiaoniangao
  mode: author
  path: /crawler/xiao_nian_gao_plus/blogger
  method: post
  request_body:
    cursor: "{{next_cursor}}"
    account_id: "{{uid}}" # uid from the database
  loop_times: 100
  loop_interval:
    min: 5
    max: 20
  feishu_sheetid: "golXy9"
  response_parse:
    uid: "$.uid" # uid from the database
    next_cursor: "$.cursor"
    data: "$.data"
    has_more: "$.data.has_more"
    data_path: "$.data.data"
    fields:
      video_title: "$.title"
      duration: "$.du"
      play_cnt: "$.play_pv"
      like_cnt: "$.favor.total"
      comment_cnt: "$.comment_count"
      share_cnt: "$.share"
      width: "$.w"
      height: "$.h"
      avatar_url: "$.user.hurl"
      cover_url: "$.url"
      video_url: "$.v_url"
      out_user_id: "$.user.mid"
      out_video_id: "$.vid"
      publish_time_stamp: "$.t"
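
The `{{next_cursor}}` and `{{uid}}` placeholders in `request_body` suggest that values are substituted before each request. The sketch below shows one way this could work. The function name `render_value` is hypothetical, and replacing a missing variable with an empty string (for instance on the first request, before any cursor exists) is an assumption rather than documented behavior.

```python
import re

# Matches {{name}} with optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_value(value, context):
    """Replace {{name}} placeholders in strings, recursing into dicts and lists."""
    if isinstance(value, dict):
        return {k: render_value(v, context) for k, v in value.items()}
    if isinstance(value, list):
        return [render_value(v, context) for v in value]
    if isinstance(value, str):
        whole = PLACEHOLDER.fullmatch(value)
        if whole:
            # A value that is exactly one placeholder keeps the type of the
            # context value; a missing name becomes "" (assumption).
            return context.get(whole.group(1), "")
        return PLACEHOLDER.sub(lambda m: str(context.get(m.group(1), "")), value)
    return value
```

With the `xiaoniangaoauthor` request body, `render_value(body, {"uid": 123})` would yield an empty `cursor` and `account_id` set to `123`.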



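Field Reference

Global Configuration Fields

| Field | Description |
| --- | --- |
| base_url | Base URL, used to build the full request URL |
| request_timeout | Request timeout (seconds) |
| max_retries | Maximum number of retries |
| headers | Request headers |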
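
To make the relationship between the `default` block and a spider entry concrete, the following sketch combines them into request settings. It assumes the YAML has already been loaded into a dictionary. The function name `build_request_settings`, the rule that an explicit `url` takes precedence over `base_url` + `path`, and the per-spider overrides are assumptions made for illustration.

```python
def build_request_settings(config, spider_name):
    """Combine the `default` block with one spider entry (illustrative only)."""
    defaults = config.get("default", {})
    spider = config[spider_name]

    # Assumption: an explicit `url` wins; otherwise join base_url and path
    # with exactly one slash between them.
    url = spider.get("url") or (
        defaults["base_url"].rstrip("/") + "/" + spider["path"].lstrip("/")
    )
    return {
        "url": url,
        "method": spider["method"].upper(),
        "timeout": spider.get("request_timeout", defaults.get("request_timeout")),
        "max_retries": spider.get("max_retries", defaults.get("max_retries")),
        # Spider-level headers, if any, override the global ones.
        "headers": {**defaults.get("headers", {}), **spider.get("headers", {})},
    }
```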

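Platform Configuration Fields

| Field | Description |
| --- | --- |
| platform | Platform name |
| mode | Crawl mode (e.g. recommend, author) |
| path | API path |
| url | Full request URL |
| method | HTTP request method |
| request_body | Request body parameters |
| loop_times | Number of loop iterations |
| loop_interval | Interval between iterations (min/max) |
| response_parse | Response parsing configuration |
| feishu_sheetid | Feishu sheet ID |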
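
The interaction of `loop_times` and `loop_interval` can be sketched as a loop that pauses for a random duration between requests. This is a hedged illustration: `run_loop` and `fetch_page` are hypothetical names, the interval is assumed to be in seconds (the unit is not stated above), and stopping early when no more data is available is also an assumption.

```python
import random
import time


def run_loop(spider, fetch_page, sleep=time.sleep, rng=random):
    """Call fetch_page up to loop_times, pausing a random interval between calls.

    fetch_page(i) is a caller-supplied function that returns True while more
    data is available. Returns the number of pages fetched.
    """
    interval = spider["loop_interval"]
    pages = 0
    for i in range(spider["loop_times"]):
        has_more = fetch_page(i)
        pages += 1
        if not has_more:
            break
        if i < spider["loop_times"] - 1:
            # Random delay in [min, max]; the unit is assumed to be seconds.
            sleep(rng.uniform(interval["min"], interval["max"]))
    return pages
```

Passing `sleep` and `rng` as parameters keeps the loop testable without real waiting.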

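Response Parsing Fields

| Field | Description |
| --- | --- |
| data_path | Path to the data list |
| next_cursor | Path to the next-page cursor |
| has_more | Path to the flag indicating whether more data is available |
| fields | Field mapping configuration |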
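
The following sketch shows how a `response_parse` block might be applied to a decoded JSON response. It handles only plain dotted paths such as `$.data.data`, which covers every path in the example configuration; the project itself may rely on a full JSONPath library. The function names are hypothetical, and treating non-path values (such as `play_cnt: 0`) as constants is an assumption based on the example.

```python
def resolve_path(obj, path):
    """Resolve a simple `$.a.b` path; return None when any key is missing."""
    current = obj
    for key in path[2:].split("."):  # "$.data.data" -> ["data", "data"]
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def map_fields(item, fields):
    """Strings starting with "$." are paths into the item; others are constants."""
    return {
        name: resolve_path(item, rule)
        if isinstance(rule, str) and rule.startswith("$.")
        else rule
        for name, rule in fields.items()
    }


def parse_response(response, response_parse):
    """Extract mapped items, the next cursor, and the optional has_more flag."""
    items = resolve_path(response, response_parse["data_path"]) or []
    has_more = None
    if "has_more" in response_parse:
        has_more = resolve_path(response, response_parse["has_more"])
    return {
        "items": [map_fields(item, response_parse["fields"]) for item in items],
        "next_cursor": resolve_path(response, response_parse["next_cursor"]),
        "has_more": has_more,
    }
```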

Current Configuration Status

  • Number of platform configurations: 3
  • Runtime environment: prod
  • Configuration file path: /AutoScraperX/config/spiders_config.yaml