wangkun 2 years ago
commit
bbc7eec75a
2 changed files with 92 additions and 0 deletions
  1. 63 0
      .gitignore
  2. 29 0
      README.MD

+ 63 - 0
.gitignore

@@ -0,0 +1,63 @@
+# ---> Python
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+env/
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*,cover
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+.DS_Store
+.idea/
+

+ 29 - 0
README.MD

@@ -0,0 +1,29 @@
+# WeChat Official Account Crawler
+
+#### Documentation links
+* [Git](https://git.yishihui.com/Server/crawler_gongzhonghao.git)
+* [Jenkins]()
+* [Official Account crawler sheet (信欣)](https://w42nne6hzg.feishu.cn/sheets/shtcna98M2mX7TbivTj9Sb7WKBN?sheet=47e39d)
+* [Requirements document](https://w42nne6hzg.feishu.cn/docx/KUuydSH8uouFoUxzfYmcxYmQnsf)
+
+#### Software architecture
+* python==3.10
+* loguru==0.6.0
+* oss2==2.15.0
+* psutil==5.9.2
+* requests==2.27.1
+* selenium==4.4.3
+* urllib3==1.26.9
+* ffmpeg==1.4
+* urllib3==1.26.9
+
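+#### Usage
+* cd ./crawler_gongzhonghao && sh gongzhonghao.sh
+* Alternatively, trigger a rebuild in Jenkins
+
+#### Changelog
+2023/01/17
+* Crawl every video in each article
+* In addition to deduplicating by video ID, deduplicate by video title similarity; titles more than 80% similar are treated as duplicate content
+* Sleep for 1 minute after crawling each account; sleep for 1 hour after crawling all accounts
+* In-site receiving accounts []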
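
Note on the 2023/01/17 changelog entry: the two-stage deduplication it describes (video ID first, then title similarity above 80%) could look roughly like the sketch below. The README does not say which similarity measure the crawler uses, so `difflib.SequenceMatcher` from the standard library is an assumption here, and the names `title_similarity`, `is_duplicate`, `dedupe`, and the `id`/`title` dictionary keys are hypothetical, not taken from the repository.

```python
from difflib import SequenceMatcher

# ">80%" comes from the changelog; the similarity measure itself is an assumption.
SIMILARITY_THRESHOLD = 0.8


def title_similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two titles."""
    return SequenceMatcher(None, a, b).ratio()


def is_duplicate(video_id, title, seen_ids, seen_titles,
                 threshold=SIMILARITY_THRESHOLD):
    """Stage 1: exact video ID match. Stage 2: near-identical title."""
    if video_id in seen_ids:
        return True
    return any(title_similarity(title, t) > threshold for t in seen_titles)


def dedupe(videos):
    """Keep the first occurrence of each video, dropping ID and title duplicates.

    `videos` is a list of dicts with hypothetical keys "id" and "title".
    """
    seen_ids = set()
    seen_titles = []
    kept = []
    for video in videos:
        if is_duplicate(video["id"], video["title"], seen_ids, seen_titles):
            continue
        seen_ids.add(video["id"])
        seen_titles.append(video["title"])
        kept.append(video)
    return kept
```

Comparing each new title against every title seen so far is quadratic in the number of videos, which is usually acceptable for a per-account crawl but would need an index or a cheaper pre-filter at larger scale.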