Open
Description
Feature: Ingest GitHub Issue Comments into RAG pipeline
Summary
Currently, the RAG ingestion pipeline indexes GitHub issue bodies (see Issues ingestion component). However, the most valuable troubleshooting information and workarounds are typically found in issue comments.
This issue proposes adding a new Kubeflow Pipelines component to ingest GitHub Issue comments and attach them to the issue content for indexing.
Motivation
Indexing only issue title/body limits retrieval quality. With comments ingestion, the assistant can answer:
- "What workaround exists for error X?"
- "What did maintainers recommend in issue Y?"
- "Is this fixed? Which version?"
Proposed Solution
Add a new KFP component:
download_github_issue_comments
A component that:
- Takes a JSONL dataset of issues (output of download_github_issues)
- Fetches comments for each issue using the GitHub REST API: GET /repos/{owner}/{repo}/issues/{issue_number}/comments
- Appends comments to the issue content (including comment author and timestamp)
- Outputs JSONL in the same schema for compatibility with chunk_and_embed
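The append step listed above could look roughly like the sketch below. It is an illustration under assumptions, not the final component: the field names `number` and `content` are guesses at the schema produced by download_github_issues (it is not spelled out in this issue), and `fetch_comments` is a hypothetical callable that returns comment objects shaped like the GitHub REST API response (`user.login`, `created_at`, `body`).

```python
import json
from typing import Callable, Iterable, Iterator


def format_comment(comment: dict) -> str:
    """Render one comment together with its author and timestamp."""
    author = (comment.get("user") or {}).get("login", "unknown")
    created = comment.get("created_at", "")
    body = (comment.get("body") or "").strip()
    return f"Comment by {author} at {created}:\n{body}"


def append_comments(issue: dict, comments: Iterable[dict]) -> dict:
    """Return a copy of the issue with rendered comments appended to its content.

    Assumption: the issue text lives in a "content" field. No new keys are
    added, so the output keeps the same schema as the input.
    """
    rendered = [format_comment(c) for c in comments]
    enriched = dict(issue)
    if rendered:
        enriched["content"] = issue.get("content", "") + "\n\n" + "\n\n".join(rendered)
    return enriched


def enrich_jsonl(
    lines: Iterable[str],
    fetch_comments: Callable[[int], Iterable[dict]],
) -> Iterator[str]:
    """Read issue records from JSONL lines and yield enriched JSONL lines."""
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the input file
        issue = json.loads(line)
        # Assumption: the issue number is stored under "number".
        yield json.dumps(append_comments(issue, fetch_comments(issue["number"])))
```

Passing the fetcher in as a parameter keeps the formatting logic testable without network access; inside a KFP component the same functions would be wrapped with file input and output paths.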
Pipeline Flow
download_github_issues (existing/new)
↓
download_github_issue_comments (new)
↓
chunk_and_embed (existing)
↓
store_milvus (existing)
Acceptance Criteria
- Comments are ingested and appended to issue content
- PRs are still skipped
- Output remains JSONL and is compatible with existing embedding + Milvus steps
- Handles pagination and rate limits gracefully
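As a rough sketch of how the PR-skipping, pagination, and rate-limit criteria might be met: the GitHub REST API marks pull requests returned by the issues endpoints with a `pull_request` key, exposes the next page through the `Link` response header, and reports quota through `X-RateLimit-Remaining`, `X-RateLimit-Reset`, and (on some limited responses) `Retry-After`. Everything else here is an assumption for illustration: `http_get` is a hypothetical callable returning `(parsed_json_list, headers_dict)`, and retrying failed requests (for example 403 or 429 responses) is left to that callable.

```python
import re
import time
from typing import Callable, List, Mapping, Optional, Tuple


def is_pull_request(issue: dict) -> bool:
    """Issues endpoints also return PRs; those carry a "pull_request" key."""
    return "pull_request" in issue


def next_page_url(link_header: Optional[str]) -> Optional[str]:
    """Extract the rel="next" URL from a Link header, or None on the last page."""
    if not link_header:
        return None
    match = re.search(r'<([^>]+)>;\s*rel="next"', link_header)
    return match.group(1) if match else None


def seconds_to_wait(headers: Mapping[str, str], now: float) -> float:
    """Return how long to pause before the next request (0 if quota remains)."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    if "retry-after" in h:
        return float(h["retry-after"])
    if h.get("x-ratelimit-remaining") == "0":
        # X-RateLimit-Reset is a UTC epoch timestamp in seconds.
        return max(0.0, float(h.get("x-ratelimit-reset", now)) - now)
    return 0.0


def fetch_all_comments(
    url: Optional[str],
    http_get: Callable[[str], Tuple[List[dict], Mapping[str, str]]],
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.time,
) -> List[dict]:
    """Follow Link headers until there is no next page, pausing when out of quota."""
    comments: List[dict] = []
    while url:
        page, headers = http_get(url)
        comments.extend(page)
        h = {k.lower(): v for k, v in headers.items()}
        url = next_page_url(h.get("link"))
        wait = seconds_to_wait(headers, now())
        if url and wait > 0:
            sleep(wait)  # only wait if another request is actually needed
    return comments
```

Starting from a URL with `per_page=100` (the maximum the API allows) keeps the number of requests per issue low, which matters most when running unauthenticated or against large repositories.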
/cc @kubeflow/docs-agent-maintainers