Add data engineering methodology structure

SamLee 4 days ago
parent
commit
e6d52b9bab

+ 56 - 0
AGENTS.md

@@ -0,0 +1,56 @@
+# Agent Guide
+
+This repository is a data-engineering knowledge base, not an application runtime repo.
+
+## Mission
+
+Help organize and verify business-domain analysis, technical implementation analysis, data lineage, evidence, and visualization-ready artifacts for AIGC-related modules.
+
+## Default Workflow
+
+1. Read this file, `README.md`, and the relevant `domains/<domain>/README.md`.
+2. If starting a new domain, create or update `01-recon-notes.md` before detailed analysis.
+3. Separate business logic, technical implementation, data lineage, and verification notes.
+4. Prefer code and DB verification over assumptions.
+5. Record stale, partial, blocked, or unverified claims explicitly.
+6. Keep domain-local facts under `domains/<domain>/`.
+7. Keep cross-domain joins and total graphs under `integrated/`.
+
+## Safety Rules
+
+- Do not print or copy raw passwords, tokens, AK/SK, API keys, or cookie values into reports.
+- DB verification should use read-only queries when possible.
+- If a table/file contains sensitive fields, describe the field existence and role without exposing secret values.
+- Do not silently overwrite manually curated Excel or Markdown files.
+- Do not move existing analysis files unless the user explicitly asks for that move.
+
+## Repository Roles
+
+- `skills/` defines reusable methods.
+- `templates/` defines fill-in artifacts for new business domains.
+- `schemas/` defines the target machine-readable lineage format.
+- `domains/` stores per-domain analysis and data files.
+- `integrated/` stores the cross-domain graph and aggregate artifacts.
+- `archive/` stores old or superseded material.
+
+## Expected Domain Files
+
+Each business domain should eventually contain:
+
+- `README.md`
+- `01-recon-notes.md`
+- `02-business-logic.md`
+- `03-technical-implementation.md`
+- `04-data-lineage.md`
+- `05-verification.md`
+- domain-specific `.xlsx` files
+- optional `lineage.snapshot.json`
+
+## Current Domain Names
+
+- `clustering`
+- `pattern-v2`
+- `demand-agent`
+- `external-sources`
+- `building-system-agent`
+- `search-agent`

+ 24 - 1
README.md

@@ -1,3 +1,26 @@
 # data-engineering
 
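-Data engineering tracking for each business domain of the aigc module
+Data-engineering tracking for the business domains of the AIGC modules.
+
+This repository collects data-engineering analysis, data lineage tracing, a unified Excel/JSON format, and visualization methodology across multiple business domains. The current focus domains are the clustering pipeline, Pattern V2, Demand Agent, external data sources, BuildingSystemAgent, and the upcoming Search Agent.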
+
+## Directory Map
+
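+- `AGENTS.md`: project-wide rules and default workflow for Codex / sub-agents.
+- `skills/`: holds the 7 reusable analysis skills.
+- `templates/`: Markdown / JSON templates to copy when analyzing a new business domain.
+- `schemas/`: unified lineage JSON / Excel field specification.
+- `domains/`: one folder per business domain, holding the business breakdown, technical breakdown, lineage, verification, and that domain's data files.
+- `integrated/`: cross-domain overall graph, overall Excel, and overall JSON.
+- `archive/`: historical versions, superseded definitions, and one-off intermediate material.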
+
+## Domain Workflow
+
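+Each business domain should be documented in this order:
+
+1. `01-recon-notes.md`: recon memo recording entrypoints, suspected tables, suspected artifacts, and open questions.
+2. `02-business-logic.md`: business-logic breakdown explaining how business objects become business results.
+3. `03-technical-implementation.md`: technical-implementation breakdown covering entrypoints, code, components, algorithms, and state control.
+4. `04-data-lineage.md`: data-lineage breakdown explaining where data comes from, how it is processed, and where it is written.
+5. `05-verification.md`: verification record covering code, DB, samples, risks, and drift.
+6. `lineage.snapshot.json`: the unified structured artifact for future consumption by visualization and automation.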

+ 5 - 0
archive/README.md

@@ -0,0 +1,5 @@
+# Archive
+
+Use this folder for old versions, superseded reports, obsolete Excel files, and one-off intermediate material.
+
+Do not delete historical materials unless the user explicitly asks.

+ 14 - 0
domains/README.md

@@ -0,0 +1,14 @@
+# Domains
+
+Each business domain gets one folder. Domain files should explain the business logic, technical implementation, data lineage, and verification evidence for that domain.
+
+Current planned domains:
+
+- `clustering`
+- `pattern-v2`
+- `demand-agent`
+- `external-sources`
+- `building-system-agent`
+- `search-agent`
+
+Do not move shared cross-domain artifacts here; put them under `integrated/`.

+ 13 - 0
domains/building-system-agent/README.md

@@ -0,0 +1,13 @@
+# BuildingSystemAgent
+
+Put BuildingSystemAgent-domain artifacts here.
+
+Recommended first lineage scope:
+
+- Demand or user input
+- `topic_build_*`
+- `script_build_*`
+- `external_search_case_log`
+- item/source/relation traces
+- decision traces
+- upstream Pattern API evidence anchors

+ 13 - 0
domains/clustering/README.md

@@ -0,0 +1,13 @@
+# Clustering
+
+Put clustering-domain artifacts here.
+
+Recommended files:
+
+- `01-recon-notes.md`
+- `02-business-logic.md`
+- `03-technical-implementation.md`
+- `04-data-lineage.md`
+- `05-verification.md`
+- `data_clustering_statistic.xlsx`
+- `lineage.snapshot.json`

+ 13 - 0
domains/demand-agent/README.md

@@ -0,0 +1,13 @@
+# Demand Agent
+
+Put Demand Agent domain artifacts here.
+
+Recommended files:
+
+- `01-recon-notes.md`
+- `02-business-logic.md`
+- `03-technical-implementation.md`
+- `04-data-lineage.md`
+- `05-verification.md`
+- `demand_engineering_statistic.xlsx`
+- `lineage.snapshot.json`

+ 5 - 0
domains/external-sources/README.md

@@ -0,0 +1,5 @@
+# External Sources
+
+Put external-source artifacts here.
+
+This domain should track ODPS tables, crawler/API inputs, external search services, model/embedding APIs, local JSON imports, browser-helper databases, and other non-owned upstream systems.

+ 12 - 0
domains/pattern-v2/README.md

@@ -0,0 +1,12 @@
+# Pattern V2
+
+Put Pattern V2 domain artifacts here.
+
+Recommended files:
+
+- `01-recon-notes.md`
+- `02-business-logic.md`
+- `03-technical-implementation.md`
+- `04-data-lineage.md`
+- `05-verification.md`
+- `lineage.snapshot.json`

+ 5 - 0
domains/search-agent/README.md

@@ -0,0 +1,5 @@
+# Search Agent
+
+Put future Search Agent artifacts here.
+
+This domain is reserved for the planned Search Agent, which will build on traceable feature-engineering and data-engineering outputs.

+ 12 - 0
integrated/README.md

@@ -0,0 +1,12 @@
+# Integrated Lineage
+
+Use this folder for cross-domain artifacts.
+
+Examples:
+
+- full lineage overview
+- cross-domain Excel
+- cross-domain `lineage.snapshot.json`
+- visualization-ready merged graph
+
+Domain-local facts should stay under `domains/<domain>/`; this folder should reference and connect them.

+ 12 - 0
schemas/README.md

@@ -0,0 +1,12 @@
+# Schemas
+
+This folder defines the target machine-readable formats for data lineage artifacts.
+
+Current target:
+
+- `lineage-snapshot.md`: human-readable field specification for `lineage.snapshot.json`.
+
+Future targets:
+
+- JSON Schema for automated validation.
+- Excel sheet specification generated from the JSON schema.

+ 28 - 0
schemas/lineage-snapshot.md

@@ -0,0 +1,28 @@
+# Lineage Snapshot Schema
+
+`lineage.snapshot.json` is the canonical machine-readable artifact for a business domain.
+
+## Top-Level Fields
+
+| Field | Meaning |
+| --- | --- |
+| `schema_version` | Format version. |
+| `domain_id` | Stable domain identifier. |
+| `generated_at` | ISO 8601 timestamp of when the snapshot was generated. |
+| `modules` | Domain or sub-domain lanes. |
+| `data_nodes` | Tables, files, APIs, caches, memory objects, business objects. |
+| `process_steps` | Business/technical processing steps. |
+| `edges` | Relationships between data nodes and process steps. |
+| `field_mappings` | Important field-level lineage. |
+| `evidence` | Code, SQL, report, sample, or DB evidence. |
+| `validation_checks` | Verification state and drift notes. |
+| `layout_overrides` | Visualization-only metadata. |
+
+## Validation Status
+
+- `verified`
+- `code_verified`
+- `db_verified`
+- `partial`
+- `stale`
+- `blocked`
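
The field table and status list above could be checked mechanically until a real JSON Schema exists. The following is a minimal sketch, not part of this commit: the function name `check_snapshot` is hypothetical, the requirement that the eight collection fields are lists is inferred from the `[]` defaults in the template, and the `status` key inside each `validation_checks` entry is an assumption because the schema does not yet define the shape of those entries.

```python
# Top-level fields from the "Top-Level Fields" table.
REQUIRED_FIELDS = {
    "schema_version", "domain_id", "generated_at", "modules", "data_nodes",
    "process_steps", "edges", "field_mappings", "evidence",
    "validation_checks", "layout_overrides",
}
# Assumption: every field except the three scalars is a JSON array.
LIST_FIELDS = REQUIRED_FIELDS - {"schema_version", "domain_id", "generated_at"}
# Values from the "Validation Status" list.
VALID_STATUSES = {
    "verified", "code_verified", "db_verified", "partial", "stale", "blocked",
}


def check_snapshot(snapshot: dict) -> list[str]:
    """Return human-readable problems; an empty list means no problem was found."""
    problems = []
    present = set(snapshot)
    for field in sorted(REQUIRED_FIELDS - present):
        problems.append(f"missing field: {field}")
    for field in sorted(LIST_FIELDS & present):
        if not isinstance(snapshot[field], list):
            problems.append(f"{field} must be a list")
    checks = snapshot.get("validation_checks")
    if isinstance(checks, list):
        for i, check in enumerate(checks):
            # Assumption: each check is an object carrying a "status" key.
            status = check.get("status") if isinstance(check, dict) else None
            if status not in VALID_STATUSES:
                problems.append(
                    f"validation_checks[{i}] has unknown status: {status!r}"
                )
    return problems
```

Returning a list of problems rather than raising on the first one lets a reviewer see every gap in a snapshot in a single pass.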

+ 33 - 0
skills/01-domain-recon/SKILL.md

@@ -0,0 +1,33 @@
+# Domain Recon Skill
+
+Use this skill when a business domain is still unclear and needs initial orientation before formal analysis.
+
+## Goal
+
+Produce a reconnaissance memo that identifies the likely business boundary, entrypoints, data sources, output artifacts, run anchors, and unknowns.
+
+## Inputs
+
+- Repository path
+- Existing notes or reports
+- Suspected module name
+- Optional sample IDs, execution IDs, build IDs, or filenames
+
+## Procedure
+
+1. Read README, configs, entrypoints, and likely service scripts.
+2. Search for domain keywords, table names, API routes, job schedulers, and agent/tool names.
+3. Identify likely inputs, outputs, and state anchors.
+4. Mark unknowns instead of guessing.
+5. Produce `01-recon-notes.md`.
+
+## Output Sections
+
+- Business domain name
+- Suspected purpose
+- Suspected entrypoints
+- Suspected core tables/files/APIs
+- Suspected final business outputs
+- Important run anchors
+- Known risks or ambiguity
+- Next deep-dive questions

+ 28 - 0
skills/02-business-logic/SKILL.md

@@ -0,0 +1,28 @@
+# Business Logic Skill
+
+Use this skill after recon has established the rough domain boundary.
+
+## Goal
+
+Explain what the business domain does in business terms, while still naming the technical components that materially shape the business result.
+
+## Procedure
+
+1. Define the business problem.
+2. List input business objects.
+3. List intermediate business objects.
+4. Explain algorithm, agent, or human judgment points.
+5. Identify final business outputs.
+6. Map upstream and downstream boundaries.
+
+## Output Sections
+
+- One-sentence conclusion
+- Business goal
+- Input objects
+- Processing stages
+- Decision points
+- Final outputs
+- Upstream dependencies
+- Downstream consumers
+- Common misunderstandings

+ 27 - 0
skills/03-technical-implementation/SKILL.md

@@ -0,0 +1,27 @@
+# Technical Implementation Skill
+
+Use this skill to verify how the business logic is actually implemented in code.
+
+## Goal
+
+Tie business stages to real entrypoints, functions, tools, jobs, APIs, databases, local files, caches, and status controls.
+
+## Procedure
+
+1. Identify API, CLI, scheduler, script, or agent entrypoints.
+2. Trace the main code path.
+3. Identify DB managers, ORM models, table names, and external APIs.
+4. Identify algorithm and component boundaries.
+5. Separate active default paths from optional, old, disabled, or compatibility paths.
+6. Record exact code evidence.
+
+## Output Sections
+
+- Runtime entrypoints
+- Core code path
+- Core components
+- Configuration and secrets handling
+- State and execution controls
+- Active vs optional paths
+- Code evidence index
+- Technical risks and drift

+ 26 - 0
skills/04-data-lineage/SKILL.md

@@ -0,0 +1,26 @@
+# Data Lineage Skill
+
+Use this skill when the business logic and technical implementation are clear enough to model data flow.
+
+## Goal
+
+Represent where data comes from, how it is processed, what intermediate artifacts exist, what final outputs are written, and how downstream modules consume them.
+
+## Procedure
+
+1. Convert business stages into process steps.
+2. Convert tables/files/APIs/caches/memory objects into data nodes.
+3. Add lineage edges between nodes and steps.
+4. Add field-level mappings for important IDs and business fields.
+5. Mark intermediate artifacts separately from final business outputs.
+6. Record evidence for every material edge.
+
+## Output Sections
+
+- Data source inventory
+- Process-step lineage
+- Field-level lineage
+- Intermediate artifacts
+- Final outputs
+- Downstream consumption
+- Evidence and verification status

+ 35 - 0
skills/05-data-verification/SKILL.md

@@ -0,0 +1,35 @@
+# Data Verification Skill
+
+Use this skill to verify that reports, Excel files, and lineage claims match code and live or historical data.
+
+## Goal
+
+Assign evidence-backed confidence to each lineage claim.
+
+## Verification Status
+
+- `verified`: code and DB/sample evidence both checked.
+- `code_verified`: code checked, DB not checked.
+- `db_verified`: DB checked, code path not fully traced.
+- `partial`: only some sources checked.
+- `stale`: previously true, but live data has since drifted.
+- `blocked`: cannot verify because of missing access, client, credentials, or environment.
+
+## Procedure
+
+1. Prefer read-only DB queries.
+2. Query metadata before row values.
+3. Verify row counts, schema fields, statuses, current flags, and sample IDs.
+4. Compare live data with existing reports.
+5. Record drift explicitly.
+6. Never expose secret values.
+
+## Output Sections
+
+- Verification scope
+- DB checks
+- Code checks
+- Sample checks
+- Drift findings
+- Blockers
+- Evidence log
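
The status definitions in this skill can be read as a small decision rule. The sketch below is one possible reading, not part of this commit: the function name `derive_status`, its parameters, and the precedence order (`blocked` before `stale` before `partial`) are all assumptions, since the skill does not say which status wins when several apply.

```python
def derive_status(
    code_checked: bool,
    db_checked: bool,
    *,
    sources_complete: bool = True,
    drifted: bool = False,
    blocked: bool = False,
) -> str:
    """Map what was actually checked onto one of the six verification statuses."""
    if blocked:
        # Missing access, client, credentials, or environment.
        return "blocked"
    if drifted:
        # The claim used to hold, but live data no longer matches it.
        return "stale"
    if not (code_checked or db_checked):
        # Assumption: an unchecked claim gets no status at all.
        raise ValueError("nothing was checked; record a blocker instead")
    if not sources_complete:
        return "partial"
    if code_checked and db_checked:
        return "verified"
    return "code_verified" if code_checked else "db_verified"
```

For example, `derive_status(True, False)` yields `code_verified`, which matches the definition "code checked, DB not checked".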

+ 34 - 0
skills/06-unified-format/SKILL.md

@@ -0,0 +1,34 @@
+# Unified Format Skill
+
+Use this skill to turn domain-specific notes and Excel files into a standard Excel/JSON format.
+
+## Goal
+
+Produce a stable, machine-readable lineage representation while keeping a human-readable Excel/Markdown review layer.
+
+## Canonical Objects
+
+- `modules`
+- `data_nodes`
+- `process_steps`
+- `edges`
+- `field_mappings`
+- `evidence`
+- `validation_checks`
+- `layout_overrides`
+
+## Procedure
+
+1. Normalize business stages into `process_steps`.
+2. Normalize tables/files/APIs/caches into `data_nodes`.
+3. Normalize flow relations into `edges`.
+4. Normalize important field transformations into `field_mappings`.
+5. Attach evidence and verification states.
+6. Export domain-local `lineage.snapshot.json`.
+7. Optionally generate or update the domain Excel view.
+
+## Output
+
+- `lineage.snapshot.json`
+- optional domain-specific `.xlsx`
+- optional schema validation report
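
As an illustration of the procedure above, the following sketch assembles a snapshot from the canonical objects and serializes it. It is not part of this commit. The node, step, and edge entries (`raw_posts`, `cluster_posts`) and their keys (`id`, `type`, `source`, `target`, `status`) are hypothetical, because the schema does not yet define the shape of individual entries; only the top-level field names and the `0.1.0` version come from the template.

```python
import json
from datetime import datetime, timezone


def new_snapshot(domain_id: str) -> dict:
    """Create an empty snapshot with the top-level fields from the template."""
    return {
        "schema_version": "0.1.0",
        "domain_id": domain_id,
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "modules": [],
        "data_nodes": [],
        "process_steps": [],
        "edges": [],
        "field_mappings": [],
        "evidence": [],
        "validation_checks": [],
        "layout_overrides": [],
    }


snapshot = new_snapshot("clustering")
# Hypothetical entries; the per-entry keys are assumptions, not schema.
snapshot["data_nodes"].append({"id": "raw_posts", "type": "table"})
snapshot["process_steps"].append({"id": "cluster_posts"})
snapshot["edges"].append({"source": "raw_posts", "target": "cluster_posts"})
snapshot["validation_checks"].append(
    {"target": "raw_posts", "status": "code_verified"}
)

# ensure_ascii=False keeps any non-ASCII business names readable in the file.
text = json.dumps(snapshot, indent=2, ensure_ascii=False)
```

Writing `text` to `domains/<domain>/lineage.snapshot.json` would complete step 6; the sketch stops at the string so that it never overwrites a curated file.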

+ 24 - 0
skills/07-visualization-mapping/SKILL.md

@@ -0,0 +1,24 @@
+# Visualization Mapping Skill
+
+Use this skill to map unified lineage JSON into a visual graph.
+
+## Goal
+
+Render lineage in a way that supports module overview, module-level drilldown, field evidence, validation status, and cross-domain handoffs.
+
+## Procedure
+
+1. Use `modules` as swimlanes or top-level tabs.
+2. Use `data_nodes` and `process_steps` as graph nodes.
+3. Use `edges` for graph connections.
+4. Use `validation_checks` to style risk and confidence.
+5. Keep layout metadata separate from business data.
+6. Prefer drilldown over one giant unreadable graph.
+
+## Output Views
+
+- Cross-domain overview
+- Single-domain detailed graph
+- Node detail panel
+- Field lineage panel
+- Evidence and verification panel
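
The mapping steps above could look roughly like this in code. This is a sketch under stated assumptions, not part of this commit: `to_graph`, the colour palette, and the per-entry keys (`id`, `module`, `source`, `target`, `status`) are hypothetical, and the output is a plain dictionary rather than the input format of any particular graph library.

```python
# Assumed palette; the skill only says status should drive styling.
STATUS_COLORS = {
    "verified": "green",
    "code_verified": "yellowgreen",
    "db_verified": "yellowgreen",
    "partial": "orange",
    "stale": "red",
    "blocked": "gray",
}


def to_graph(snapshot: dict) -> dict:
    """Turn a lineage snapshot into lanes, styled nodes, and links."""
    # Assumption: each validation check names the node it covers via "target".
    status_by_target = {
        check["target"]: check["status"]
        for check in snapshot.get("validation_checks", [])
    }
    nodes = []
    for kind in ("data_nodes", "process_steps"):
        for item in snapshot.get(kind, []):
            status = status_by_target.get(item["id"])
            nodes.append({
                "id": item["id"],
                "kind": kind,
                "lane": item.get("module"),
                # Nodes without any check fall back to a neutral colour.
                "color": STATUS_COLORS.get(status, "lightgray"),
            })
    links = [
        {"from": edge["source"], "to": edge["target"]}
        for edge in snapshot.get("edges", [])
    ]
    return {
        "lanes": [module["id"] for module in snapshot.get("modules", [])],
        "nodes": nodes,
        "links": links,
        # Layout metadata is passed through untouched, separate from the nodes.
        "layout": snapshot.get("layout_overrides", []),
    }
```

Keeping `layout` as its own key mirrors step 5: a renderer can apply or ignore the overrides without touching business data.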

+ 15 - 0
skills/README.md

@@ -0,0 +1,15 @@
+# Skills
+
+These skills define the reusable methodology for turning a messy business/code/data area into verified lineage artifacts.
+
+Recommended order:
+
+1. `01-domain-recon`
+2. `02-business-logic`
+3. `03-technical-implementation`
+4. `04-data-lineage`
+5. `05-data-verification`
+6. `06-unified-format`
+7. `07-visualization-mapping`
+
+The skills are intentionally separated so a human or agent can stop after any phase, review the output, and then continue with the next phase.

+ 13 - 0
templates/README.md

@@ -0,0 +1,13 @@
+# Templates
+
+Copy these templates into a `domains/<domain>/` folder when starting or standardizing a business domain.
+
+Recommended file mapping:
+
+- `domain-readme.md` -> `README.md`
+- `recon-notes.md` -> `01-recon-notes.md`
+- `business-logic.md` -> `02-business-logic.md`
+- `technical-implementation.md` -> `03-technical-implementation.md`
+- `data-lineage.md` -> `04-data-lineage.md`
+- `verification.md` -> `05-verification.md`
+- `lineage.snapshot.template.json` -> `lineage.snapshot.json`
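
The file mapping above lends itself to a small scaffolding helper. The sketch below is not part of this commit; `scaffold_domain` is a hypothetical name. It skips any target that already exists, which follows the AGENTS.md rule against silently overwriting manually curated files.

```python
import shutil
from pathlib import Path

# Template name -> target name, as listed in templates/README.md.
TEMPLATE_MAPPING = {
    "domain-readme.md": "README.md",
    "recon-notes.md": "01-recon-notes.md",
    "business-logic.md": "02-business-logic.md",
    "technical-implementation.md": "03-technical-implementation.md",
    "data-lineage.md": "04-data-lineage.md",
    "verification.md": "05-verification.md",
    "lineage.snapshot.template.json": "lineage.snapshot.json",
}


def scaffold_domain(templates_dir: Path, domain_dir: Path) -> list[str]:
    """Copy templates into domain_dir and return the file names created."""
    domain_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for template_name, target_name in TEMPLATE_MAPPING.items():
        source = templates_dir / template_name
        target = domain_dir / target_name
        # Never overwrite existing work, and skip templates that are absent.
        if target.exists() or not source.is_file():
            continue
        shutil.copyfile(source, target)
        created.append(target_name)
    return created
```

A call such as `scaffold_domain(Path("templates"), Path("domains/search-agent"))` would fill in only the files that domain is still missing.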

+ 23 - 0
templates/business-logic.md

@@ -0,0 +1,23 @@
+# Business Logic
+
+## One-Sentence Conclusion
+
+## Business Goal
+
+## Input Business Objects
+
+## Intermediate Business Objects
+
+## Processing Stages
+
+## Decision Points
+
+## Technical Components Used
+
+Examples: agent, LLM, embedding, DBSCAN, FP-Growth, PrefixSpan, ODPS, PostgreSQL, MySQL, HTTP API, local cache.
+
+## Final Business Outputs
+
+## Upstream / Downstream Boundary
+
+## Common Misunderstandings

+ 28 - 0
templates/data-lineage.md

@@ -0,0 +1,28 @@
+# Data Lineage
+
+## Overview
+
+```text
+source -> process -> intermediate -> final output -> downstream
+```
+
+## Data Nodes
+
+| Node | Type | Location | Business Meaning | Final Output |
+| --- | --- | --- | --- | --- |
+
+## Process Steps
+
+| Step | Business Question | Input | Processing | Output | Evidence |
+| --- | --- | --- | --- | --- | --- |
+
+## Field Mappings
+
+| Source Field | Transform | Target Field | Join / Run Key | Evidence |
+| --- | --- | --- | --- | --- |
+
+## Intermediate Artifacts
+
+## Final Outputs
+
+## Cross-Domain Handoffs

+ 35 - 0
templates/domain-readme.md

@@ -0,0 +1,35 @@
+# <Domain Name>
+
+## Current Status
+
+- Recon:
+- Business logic:
+- Technical implementation:
+- Data lineage:
+- Verification:
+- Visualization:
+
+## Boundary
+
+Describe what this domain owns and what it only consumes from upstream systems.
+
+## Key Outputs
+
+- 
+
+## Upstream
+
+- 
+
+## Downstream
+
+- 
+
+## Files
+
+- `01-recon-notes.md`
+- `02-business-logic.md`
+- `03-technical-implementation.md`
+- `04-data-lineage.md`
+- `05-verification.md`
+- `lineage.snapshot.json`

+ 13 - 0
templates/lineage.snapshot.template.json

@@ -0,0 +1,13 @@
+{
+  "schema_version": "0.1.0",
+  "domain_id": "<domain-id>",
+  "generated_at": "<iso8601>",
+  "modules": [],
+  "data_nodes": [],
+  "process_steps": [],
+  "edges": [],
+  "field_mappings": [],
+  "evidence": [],
+  "validation_checks": [],
+  "layout_overrides": []
+}

+ 21 - 0
templates/recon-notes.md

@@ -0,0 +1,21 @@
+# Recon Notes
+
+## Domain
+
+## Initial Hypothesis
+
+## Suspected Entrypoints
+
+## Suspected Data Sources
+
+## Suspected Intermediate Artifacts
+
+## Suspected Final Outputs
+
+## Run Anchors
+
+Examples: `execution_id`, `build_id`, `task_id`, `dt`, `post_id`, `videoid`.
+
+## Unknowns
+
+## Next Checks

+ 24 - 0
templates/technical-implementation.md

@@ -0,0 +1,24 @@
+# Technical Implementation
+
+## Runtime Entrypoints
+
+## Schedulers / API / CLI
+
+## Main Code Path
+
+## Core Components
+
+## Data Access Layer
+
+## Algorithms / Agents / Tools
+
+## Status and Execution Controls
+
+## Active vs Optional / Legacy Paths
+
+## Code Evidence
+
+| Object | Path | Note |
+| --- | --- | --- |
+
+## Risks and Drift

+ 24 - 0
templates/verification.md

@@ -0,0 +1,24 @@
+# Verification
+
+## Scope
+
+## Code Verification
+
+| Claim | Evidence | Status |
+| --- | --- | --- |
+
+## DB Verification
+
+| Object | Query / Method | Result | Status |
+| --- | --- | --- | --- |
+
+## Sample Verification
+
+| Sample ID | Path | Result | Status |
+| --- | --- | --- | --- |
+
+## Drift / Risk
+
+## Blockers
+
+## Checked At