Feat: Ingest GitHub Issue Comments into RAG pipeline #9

@Sayan4496

Feature: Ingest GitHub Issue Comments into RAG pipeline

Summary

Currently, the RAG ingestion pipeline indexes GitHub issue bodies (see Issues ingestion component). However, the most valuable troubleshooting information and workarounds are typically found in issue comments.

This issue proposes adding a new Kubeflow Pipelines component to ingest GitHub Issue comments and attach them to the issue content for indexing.

Motivation

Indexing only the issue title and body limits retrieval quality. With comment ingestion, the assistant can answer questions such as:

  • "What workaround exists for error X?"
  • "What did maintainers recommend in issue Y?"
  • "Is this fixed? Which version?"

Proposed Solution

Add a new KFP component:

download_github_issue_comments

A component that:

  • Takes a JSONL dataset of issues (output of download_github_issues)
  • Fetches comments for each issue using the GitHub REST API:
    GET /repos/{owner}/{repo}/issues/{issue_number}/comments
  • Appends comments to the issue content (including comment author and timestamp)
  • Outputs JSONL in the same schema for compatibility with chunk_and_embed
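
The append step above could look roughly like the sketch below. It assumes each issue record keeps its text under a "content" key, which is a hypothetical field name since this issue does not spell out the schema. The comment fields (user.login, created_at, body) are the ones returned by the GitHub REST API. The helper names format_comment, append_comments, and to_jsonl are illustrative only.

```python
import json
from typing import Iterable


def format_comment(comment: dict) -> str:
    """Render one GitHub comment object as text with author and timestamp."""
    author = (comment.get("user") or {}).get("login", "unknown")
    created = comment.get("created_at", "unknown time")
    body = (comment.get("body") or "").strip()
    return f"Comment by {author} at {created}:\n{body}"


def append_comments(issue: dict, comments: Iterable[dict]) -> dict:
    """Return a copy of the issue record with comments appended to its content.

    The "content" key is an assumed field name, not a confirmed schema.
    All other keys are passed through unchanged so downstream steps
    see the same record shape.
    """
    parts = [issue.get("content", "")]
    parts.extend(format_comment(c) for c in comments)
    enriched = dict(issue)
    enriched["content"] = "\n\n".join(p for p in parts if p)
    return enriched


def to_jsonl(records: Iterable[dict]) -> str:
    """Serialize records as JSONL: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)
```

Keeping the key set identical to the input record is what would let chunk_and_embed consume the output without changes, provided the real schema matches the assumption above.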

Pipeline Flow

download_github_issues (existing/new)
             ↓
download_github_issue_comments (new)
             ↓
chunk_and_embed (existing)
             ↓
store_milvus (existing)

Acceptance Criteria

  • Comments are ingested and appended to issue content
  • PRs are still skipped (the GitHub issues API also returns pull requests as issues)
  • Output remains JSONL and is compatible with existing embedding + Milvus steps
  • Handles pagination and rate limits gracefully
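
One way the pagination, rate-limit, and PR-skipping criteria might be met is sketched below using only the standard library. The function names, the retry cap, and the 60-second fallback wait are assumptions made for illustration. The per_page and page query parameters, the retry-after and x-ratelimit-* response headers, and the pull_request key on issue objects come from the GitHub REST API. The is_pull_request check only works if the issue records preserve that raw key, which is not confirmed here.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Optional

API = "https://api.github.com"
PER_PAGE = 100  # maximum page size accepted by the GitHub REST API


def http_get(url: str, token: Optional[str] = None):
    """Perform one GET request; return (status, lowercased headers, JSON body)."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request) as resp:
            hdrs = {k.lower(): v for k, v in resp.headers.items()}
            return resp.status, hdrs, json.load(resp)
    except urllib.error.HTTPError as err:
        return err.code, {k.lower(): v for k, v in err.headers.items()}, None


def seconds_until_retry(headers: dict, now: float) -> float:
    """Work out how long to wait after a rate-limited response."""
    if "retry-after" in headers:
        return float(headers["retry-after"])
    if headers.get("x-ratelimit-remaining") == "0" and "x-ratelimit-reset" in headers:
        return max(0.0, float(headers["x-ratelimit-reset"]) - now)
    return 60.0  # assumed fallback when the response gives no hint


def is_pull_request(issue: dict) -> bool:
    """GitHub marks pull requests in issue listings with a "pull_request" key."""
    return "pull_request" in issue


def fetch_all_comments(owner, repo, issue_number,
                       get=http_get, sleep=time.sleep, max_retries=3):
    """Collect every comment on one issue, page by page."""
    comments, page, retries = [], 1, 0
    while True:
        url = (f"{API}/repos/{owner}/{repo}/issues/{issue_number}/comments"
               f"?per_page={PER_PAGE}&page={page}")
        status, headers, body = get(url)
        if status in (403, 429):
            # Treated as rate limiting; a genuine 403 fails once retries run out.
            if retries >= max_retries:
                raise RuntimeError(f"still rate limited after {max_retries} retries")
            retries += 1
            sleep(seconds_until_retry(headers, time.time()))
            continue
        if status != 200:
            raise RuntimeError(f"GET {url} returned HTTP {status}")
        retries = 0
        comments.extend(body)
        if len(body) < PER_PAGE:  # a short page means there are no more pages
            return comments
        page += 1
```

The get and sleep parameters are injectable so the loop can be exercised without network access. To authenticate, a caller could pass something like get=lambda url: http_get(url, token=my_token), where my_token is a placeholder.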

/cc @kubeflow/docs-agent-maintainers
