List of project ideas for contributors applying to the Google Summer of Code program in 2026 (GSoC 2026).
CocoIndex is an ultra-performant data transformation framework for AI, with its core engine written in Rust. It helps keep AI systems fresh and reliable with incremental processing, at any scale.
- Repository: github.com/cocoindex-io/cocoindex
- Documentation: cocoindex.io/docs
- License: Apache 2.0
Please always refer to the official timeline of Google Summer of Code.
If you have not done so yet, first read the contributor guide, which explains the whole process and how the program works overall. Use its left-side menu to jump to the sections that interest you most, although we recommend reading everything.
This is a required step unless you have dived into the existing codebase and understood everything perfectly (very hard) and the idea you prefer is on the list below.
If your idea is not listed, please discuss it with the mentors in the available contact channels. We're always open to new ideas and won't hesitate to choose them if you show that you are a good candidate!
- You're committing to a project and we may ask you to publicly publish your weekly progress on it.
- We will ask you to give feedback on our mentorship and guidance.
- You wholeheartedly agree with our community values of being inclusive, welcoming, and supportive.
- You must tell us if there's any proposed idea that you don't think would fit the timeline or could be boring (yes, we're asking for feedback).
We recommend following Google's guide to Writing a Proposal, as we won't be too harsh on the format and we won't provide any template. But hey, we're giving you a starting point!
You can send the proposal link in any readable format you wish: Google Docs, plain text, markdown... and preferably hosted online, accessible with a common browser without downloading anything.
We highly recommend asking the community or mentor candidates for a review at any time before the contributor application deadline. It's much easier to get feedback early than to wait until the last moment.
You can also propose your own ideas!
Skills: Rust, TypeScript/JavaScript, Node.js, napi-rs or wasm-bindgen, npm packaging
Expected size of the project: Large (~350 hours)
Difficulty rating: Hard
Description:
CocoIndex currently provides a Python SDK that wraps its high-performance Rust core engine. This project aims to bring CocoIndex to the JavaScript/TypeScript ecosystem by building a complete SDK that enables Node.js developers to use CocoIndex's incremental data transformation capabilities.
The project involves three major components:
- Rust Bindings for Node.js: Create native Node.js bindings using napi-rs (recommended) or WebAssembly. This mirrors the existing Python bindings and exposes the core engine to JavaScript.
- TypeScript/JavaScript Library: Build the high-level SDK that provides an idiomatic JavaScript/TypeScript API, including function decorators/wrappers, type-safe target state declarations, async/await integration, and full TypeScript type definitions.
- Node.js Examples: Create example applications demonstrating common use cases (file processing, database sync, etc.) that run on Node.js.
Expected outcomes:
- Rust crate with napi-rs bindings exposing the core engine to Node.js
- TypeScript package with idiomatic APIs matching Python SDK patterns
- 10 working examples
- npm package published and installable
- Documentation for getting started with the JS/TS SDK
- CI/CD pipeline for building and testing the bindings
Possible mentors:
- George He - georgehe0, cofounder & maintainer of CocoIndex, ex-Google Infra lead
- Linghua Jin - badmonster0, cofounder & maintainer of CocoIndex, ex-Google Tech lead
Resources:
- napi-rs documentation
- Existing Python SDK related code
- PyO3 documentation - understand current binding patterns
Skills: Python, code parsing (tree-sitter), vector databases, LLM APIs, MCP (Model Context Protocol)
Expected size of the project: Medium (~175 hours)
Difficulty rating: Medium
Description:
Build an intelligent code understanding engine powered by CocoIndex that extracts, indexes, and maintains a knowledge graph from Python codebases. The engine combines structural code analysis with LLM-powered summarization to enable AI agents to understand and reason about code.
Key capabilities:
- Structured Code Extraction: Parse Python codebases using tree-sitter to extract coarse-grained entities (classes and functions) and their relationships (imports, calls). Extract existing docstrings and comments.
- LLM Summarization: Generate summaries for classes and functions using LLM APIs, providing semantic understanding of what each code component does.
- Incremental Updates: Leverage CocoIndex's incremental processing to efficiently update the knowledge graph when code changes, re-parsing and re-summarizing only the modified entities.
- MCP Server for AI Agents: Expose the indexed knowledge through a Model Context Protocol (MCP) server with a focused set of tools (e.g., search entities, get entity details, list relationships, get file overview).
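The extraction and incremental-update capabilities above can be illustrated with a small sketch. It makes two loud assumptions: it uses Python's standard-library `ast` module as a stand-in for tree-sitter, and it detects modified entities by hashing each entity's source text. The helper names `extract_entities` and `changed_entities` are hypothetical and are not part of the CocoIndex API; in a real project CocoIndex's own incremental processing would decide what to recompute.

```python
import ast
import hashlib


def extract_entities(source: str) -> dict[str, dict]:
    """Return top-level classes and functions keyed by name.

    Assumption: the stdlib ast module stands in for tree-sitter here.
    """
    tree = ast.parse(source)
    entities = {}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            text = ast.get_source_segment(source, node) or ""
            entities[node.name] = {
                "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                "docstring": ast.get_docstring(node),
                # The hash of the entity's source is the change signal.
                "hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
            }
    return entities


def changed_entities(old: dict[str, dict], new: dict[str, dict]) -> set[str]:
    """Names that are new or whose source hash differs (candidates for re-summarizing)."""
    return {
        name
        for name, entity in new.items()
        if name not in old or old[name]["hash"] != entity["hash"]
    }


V1 = '''
def load(path):
    """Read a file."""
    return open(path).read()

class Index:
    """Holds entities."""
    pass
'''

# Only the body of load() differs between the two versions.
V2 = V1.replace("return open(path).read()", "return open(path, encoding='utf-8').read()")

before = extract_entities(V1)
after = extract_entities(V2)
to_resummarize = changed_entities(before, after)
print(sorted(to_resummarize))
```

In this sketch only `load` would be sent back to the LLM for a new summary; `Index` is untouched, so its existing summary could be reused.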
Expected outcomes:
- Code parsing pipeline for Python using tree-sitter
- Knowledge graph capturing classes, functions, relationships, and summaries
- Incremental indexing that efficiently handles code changes
- MCP server with 3-4 essential tools for code understanding
- Example integration showing the MCP server working with an AI agent
- Documentation and usage guide
Optional/Stretch goals:
- Additional language support (TypeScript, Rust)
- Hierarchical aggregation (module-level summaries derived from their components)
Possible mentors:
- George He - georgehe0, cofounder & maintainer of CocoIndex, ex-Google Infra lead
- Linghua Jin - badmonster0, cofounder & maintainer of CocoIndex, ex-Google Tech lead
Resources:
- Tree-sitter - incremental parsing library
- Model Context Protocol (MCP) - protocol for AI tool integration
- CocoIndex documentation - understanding incremental processing
Skills: Python, graph algorithms, LLM APIs, understanding of RAG systems
Expected size of the project: Medium (~175 hours)
Difficulty rating: Medium
Description:
Implement Microsoft's GraphRAG technique using CocoIndex, creating a GraphRAG system that supports incremental processing. GraphRAG enhances retrieval-augmented generation by building a knowledge graph from documents—extracting entities and relationships, detecting communities, and generating summaries at multiple levels of abstraction.
GraphRAG involves multiple processing stages with different incrementalization characteristics:
- Easy to incrementalize: Entity/relationship extraction, text chunking, and per-chunk processing can run independently on each input document.
- Hard to incrementalize: Community detection and global summarization require a holistic view of the entire graph, making incremental updates challenging.
Scope:
Implement the full GraphRAG pipeline with CocoIndex. Incrementalize the straightforward stages (chunking, entity extraction, relationship extraction)—this is where most processing cost lies. For global stages (community detection, hierarchical summarization), rerun the entire stage when any input changes. Expose query capabilities through a simple MCP server.
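The split described in the scope can be sketched in plain Python. This is an illustration under stated assumptions, not CocoIndex or GraphRAG code: the per-document stage is a toy extractor that links capitalized words sharing a sentence (a real pipeline would call an LLM), the global stage uses connected components as a stand-in for a real community detection algorithm, and the `Pipeline` class with its content-hash cache is hypothetical.

```python
import hashlib
import re
from itertools import combinations


def extract(doc: str) -> set[tuple[str, str]]:
    """Toy per-document stage: link capitalized words that share a sentence."""
    edges = set()
    for sentence in doc.split("."):
        names = sorted(set(re.findall(r"\b[A-Z][a-z]+\b", sentence)))
        edges.update(combinations(names, 2))
    return edges


def communities(edges: set[tuple[str, str]]) -> list[set[str]]:
    """Toy global stage: connected components of the entity graph (union-find)."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups: dict[str, set[str]] = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(groups.values(), key=lambda g: sorted(g))


class Pipeline:
    """Hypothetical driver: incremental per-document stage, full-rerun global stage."""

    def __init__(self) -> None:
        self.cache: dict[str, tuple[str, set]] = {}  # doc id -> (hash, edges)
        self.extract_calls = 0  # how often the per-document stage actually ran
        self.result: list[set[str]] = []

    def run(self, docs: dict[str, str]) -> list[set[str]]:
        changed = set(self.cache) != set(docs)  # documents added or removed
        for doc_id in set(self.cache) - set(docs):
            del self.cache[doc_id]
        for doc_id, text in docs.items():
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if doc_id not in self.cache or self.cache[doc_id][0] != digest:
                self.cache[doc_id] = (digest, extract(text))
                self.extract_calls += 1
                changed = True
        if changed:  # global stage: rerun in full over all cached edges
            all_edges = set().union(*(edges for _, edges in self.cache.values()))
            self.result = communities(all_edges)
        return self.result


p = Pipeline()
docs = {"a": "Alice met Bob.", "b": "Carol met Dave."}
first = p.run(docs)   # both documents go through extraction
docs["b"] = "Carol met Bob."
second = p.run(docs)  # only document "b" is re-extracted; communities are recomputed
print(p.extract_calls, first, second)
```

The per-document cache is where the savings come from in this sketch: editing one document triggers one extraction, while the cheap-to-describe but global community step simply runs again over the full edge set.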
Expected outcomes:
- Full GraphRAG pipeline implementation using CocoIndex
- Incremental processing for document-level stages (entity/relationship extraction)
- Working global stages (community detection, summarization) that rerun on changes
- Simple MCP server with tools for local and global GraphRAG queries
- Documentation and example usage
Optional/Stretch goals:
- Advanced incremental processing for global stages (may require CocoIndex engine work)
- Performance benchmarks comparing incremental vs. full reprocessing
Possible mentors:
- George He - georgehe0, cofounder & maintainer of CocoIndex, ex-Google Infra lead
- Linghua Jin - badmonster0, cofounder & maintainer of CocoIndex, ex-Google Tech lead
Resources:
- GraphRAG Paper - original research paper
- Microsoft GraphRAG - reference implementation
- CocoIndex documentation - understanding incremental processing
- Model Context Protocol (MCP) - protocol for AI tool integration
Skills: Python, performance profiling, data visualization, CI/CD pipelines
Expected size of the project: Medium (~175 hours)
Difficulty rating: Medium
Description:
Create a comprehensive benchmarking framework for CocoIndex that enables measuring, comparing, and reporting performance across different use cases. The framework serves three distinct audiences with different needs:
Use cases:
- Engine Evaluation (for engine developers): Benchmark the CocoIndex core engine using curated application code, curated input datasets, and curated connectors. This helps measure engine performance improvements across releases and identify regressions.
- Connector Evaluation (for connector developers): Benchmark specific connectors using curated application code and input data. Includes glue code adapters that feed standardized test data into each connector, enabling fair comparisons between connector implementations.
- Application Evaluation (for application developers): Provide reusable infrastructure so developers can benchmark their own applications with their own data, using the same tooling and reporting capabilities.
Key components:
- Benchmark Suite: Curated set of representative workloads (small/medium/large datasets, various processing patterns)
- Runner Infrastructure: Automated execution with resource monitoring (CPU, memory, I/O, time)
- Scoring System: Standardized metrics for throughput, latency, incremental update efficiency, and resource usage
- Reporting: Generate human-readable reports and machine-readable outputs for CI integration
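As a rough illustration of how the runner, scoring, and reporting components might fit together, here is a minimal sketch using only the Python standard library. All names (`BenchmarkResult`, `run_benchmark`, `to_json`, the `word_count` workload) are hypothetical, and `tracemalloc` only tracks memory allocated by Python itself, so a real runner measuring a Rust engine would need process-level monitoring instead.

```python
import json
import time
import tracemalloc
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass
class BenchmarkResult:
    name: str
    items: int
    seconds: float
    peak_bytes: int

    @property
    def items_per_second(self) -> float:
        # Clamp the denominator to avoid division by zero on very fast runs.
        return self.items / max(self.seconds, 1e-9)


def run_benchmark(name: str, workload: Callable[[list], object], items: list) -> BenchmarkResult:
    """Run one workload once, recording wall time and peak Python-level memory."""
    tracemalloc.start()
    start = time.perf_counter()
    workload(items)
    seconds = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return BenchmarkResult(name, len(items), seconds, peak_bytes)


def to_json(results: list[BenchmarkResult]) -> str:
    """Machine-readable report, e.g. for a CI job to compare against a baseline."""
    return json.dumps(
        [{**asdict(r), "items_per_second": r.items_per_second} for r in results],
        indent=2,
    )


def word_count(docs: list[str]) -> int:
    """Placeholder workload standing in for a curated application."""
    return sum(len(d.split()) for d in docs)


docs = ["incremental processing at any scale"] * 1000
results = [run_benchmark("word_count_1k", word_count, docs)]
report = to_json(results)
print(report)
```

A fuller framework would repeat each run several times and report a distribution rather than a single sample, and would add an incremental-update metric by timing a second run after changing a small fraction of the input.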
Expected outcomes:
- Benchmark suite with curated applications, datasets, and connector adapters
- CLI tool for running benchmarks locally
- Scoring system with well-defined metrics
- Report generation (HTML reports, JSON output for CI)
- Documentation for using the framework and adding new benchmarks
Optional/Stretch goals:
- CI integration for automated performance regression testing
- Historical tracking and trend visualization
- Comparison mode for A/B testing engine or connector changes
Possible mentors:
- George He - georgehe0, cofounder & maintainer of CocoIndex, ex-Google Infra lead
- Linghua Jin - badmonster0, cofounder & maintainer of CocoIndex, ex-Google Tech lead
Resources:
- CocoIndex documentation
- CocoIndex examples
- pytest-benchmark - Python benchmarking reference
- Criterion.rs - Rust benchmarking patterns
Skills: Python, testing frameworks, understanding of incremental processing and state management
Expected size of the project: Medium (~175 hours)
Difficulty rating: Medium
Description:
Build a comprehensive testing infrastructure for CocoIndex connectors. CocoIndex is an incremental processing engine—connectors bridge "states" and "changes" while application developers only think in terms of desired states. This means connector implementations must correctly handle a variety of state transition scenarios and edge cases that application developers never see.
Testing scenarios to cover:
- Basic state transitions:
  - Items added (new entries appear in source)
  - Items deleted (entries removed from source)
  - Items updated (entries modified in source)
  - Mixed operations (combinations of add/delete/update in a single run)
- Edge cases and failure modes:
  - Idempotency: If an interrupt occurs between committing output to the target and committing metadata to CocoIndex's internal storage, the next run will re-trigger the same commit, so connectors must handle this gracefully
  - Partial failures: Some items succeed, others fail
  - Empty states: No items, all items deleted
  - Large batches: Many items changing at once
- Incremental correctness:
  - Verify that incremental updates produce the same final state as full reprocessing
  - Detect state drift over multiple incremental runs
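The scenarios above can be made concrete with a toy model. The sketch below assumes a connector's target is a plain dictionary and that changes are expressed as replay-safe `upsert` and `delete` operations; `diff` and `apply_changes` are hypothetical helpers, not CocoIndex APIs. It walks through add, update, delete, and empty-state snapshots, replays every commit twice, and then compares the incremental result with a full rebuild.

```python
def diff(previous: dict, desired: dict) -> list[tuple]:
    """Turn two desired-state snapshots into a list of upsert/delete changes."""
    changes = []
    for key in previous.keys() - desired.keys():
        changes.append(("delete", key, None))
    for key, value in desired.items():
        if key not in previous or previous[key] != value:
            changes.append(("upsert", key, value))
    return changes


def apply_changes(target: dict, changes: list[tuple]) -> None:
    """Apply changes to the target; both operations are safe to replay."""
    for op, key, value in changes:
        if op == "upsert":
            target[key] = value
        else:
            target.pop(key, None)  # deleting a missing key is a no-op


snapshots = [
    {"a": 1},          # item added
    {"a": 2, "b": 1},  # mixed: update plus add
    {"b": 1},          # item deleted
    {},                # empty state: everything deleted
    {"c": 3},          # add again after the empty state
]

incremental_target: dict = {}
previous: dict = {}
for desired in snapshots:
    changes = diff(previous, desired)
    apply_changes(incremental_target, changes)
    apply_changes(incremental_target, changes)  # replay the same commit
    previous = desired

# Full reprocessing: build the final snapshot from scratch.
full_target: dict = {}
apply_changes(full_target, diff({}, snapshots[-1]))

print(incremental_target == full_target)
```

A test harness would run the same comparison against a real target (a database table, a vector index) instead of a dictionary, which is where drift between incremental and full runs tends to show up.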
Key components:
- Test Harness: Framework for simulating state changes and running connectors through test scenarios
- Scenario Library: Pre-built test scenarios covering common and edge cases
- Fault Injection: Tools to simulate failures (interrupts, partial commits, network errors)
- State Verification: Utilities to compare expected vs. actual target state after operations
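A fault-injection test for the idempotency case could look roughly like the following. Everything here is a hypothetical model: `ToyConnector` keeps the target and the engine's metadata as two dictionaries, and the `inject_fault` flag simulates a crash after the target commit but before the metadata commit. Real connectors and CocoIndex's internal storage will look different.

```python
class Interrupted(Exception):
    """Raised by the injected fault."""


class ToyConnector:
    def __init__(self) -> None:
        self.target: dict = {}    # stands in for the external target system
        self.tracked: dict = {}   # stands in for the engine's internal metadata
        self.inject_fault = False

    def run(self, desired: dict) -> None:
        # Step 1: commit output to the target using replay-safe operations.
        for key in self.tracked.keys() - desired.keys():
            self.target.pop(key, None)
        for key, value in desired.items():
            if key not in self.tracked or self.tracked[key] != value:
                self.target[key] = value
        # Injected fault: crash after the target commit, before the metadata commit.
        if self.inject_fault:
            self.inject_fault = False
            raise Interrupted("crashed between target and metadata commits")
        # Step 2: commit metadata so the next run diffs against the new state.
        self.tracked = dict(desired)


connector = ToyConnector()
connector.run({"a": 1, "b": 2})

desired = {"a": 10, "c": 3}
connector.inject_fault = True
try:
    connector.run(desired)
except Interrupted:
    pass  # the target is already updated, but the metadata is stale

stale_metadata = dict(connector.tracked)
connector.run(desired)  # the next run re-triggers the same commit

# State verification: expected vs. actual target state after recovery.
state_ok = connector.target == desired and connector.tracked == desired
print(state_ok)
```

The test passes in this model only because the target operations tolerate being applied twice; swapping the upsert for an insert that rejects existing keys would make the recovery run fail, which is the kind of bug such a harness is meant to catch.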
Expected outcomes:
- Reusable test harness for connector developers
- Library of test scenarios (basic transitions, edge cases, failure modes)
- Fault injection utilities for testing idempotency and recovery
- Documentation and examples for testing new connectors
- Tests applied to existing connectors as validation
Optional/Stretch goals:
- Property-based testing for generating random state transition sequences
- Integration with CI for automated connector testing
Possible mentors:
- George He - georgehe0, cofounder & maintainer of CocoIndex, ex-Google Infra lead
- Linghua Jin - badmonster0, cofounder & maintainer of CocoIndex, ex-Google Tech lead
Resources:
- CocoIndex documentation
- CocoIndex connectors
- pytest - Python testing framework
Join our community to discuss project ideas, get help, and connect with mentors:
- Discord:
- GitHub: github.com/cocoindex-io/cocoindex
- Twitter/X: @cocoindex_io
- YouTube: youtube.com/@cocoindex-io
- Email: hi@cocoindex.io
Before applying, we recommend familiarizing yourself with CocoIndex:
- Read the documentation: cocoindex.io/docs
- Try the quickstart: Getting Started Guide
- Watch tutorials: YouTube Channel
- Explore examples: github.com/cocoindex-io/cocoindex/tree/main/examples
- Join our Discord: Ask questions and introduce yourself!
This page explains how to write a strong GSoC proposal for CocoIndex, what we expect it to include, and how to get in touch with mentors.
We strongly prefer contributors who have already interacted with CocoIndex a bit:
- Read the CocoIndex docs and quickstart to understand what the project does and where your skills fit.
- Explore the GitHub repo (code layout, issues, examples).
- Join our public communication channels and introduce yourself.
- If possible, make at least one small contribution (docs, tests, or a “good first issue” PR).
We recommend the following structure.
- Title and project idea
  - A short, descriptive title that includes “CocoIndex” and the idea name (for example, “CocoIndex: Incremental connector for X”).
  - Link to the idea from our Ideas page or clearly mark it as a self‑proposed idea.
- About you
  - Name, email, GitHub, time zone, and expected weekly availability.
  - Briefly describe your relevant experience: Rust, Python, data pipelines, databases, or AI/ML.
  - Link to any open‑source work (including CocoIndex contributions, if any).
- Project overview and motivation
  - 3–5 sentences explaining what you want to build, who it helps, and why it matters for CocoIndex.
  - In your own words, describe the problem you are solving and show that you’ve read the relevant docs/code.
- Technical plan
  - Break the work into clear phases (design, prototype, implementation, tests, docs, examples).
  - For each phase, describe your approach: which components of CocoIndex you’ll touch, technologies used, and any initial design ideas.
  - Mention risks or unknowns and how you plan to de‑risk them (spikes, early prototypes, mentor check‑ins).
- Timeline and milestones
  - Provide a week‑by‑week or phase‑by‑phase schedule that aligns with the GSoC timeline.
  - List specific deliverables by the midterm evaluation and by the final evaluation (e.g., “connector supports basic CRUD and passes tests,” “benchmark suite for N scenarios,” “example notebook published”).
  - Note any periods when you’ll be unavailable (exams, holidays, etc.).
- Communication plan
  - How often you plan to send progress updates (we expect at least two public updates per week plus a weekly mentor meeting).
  - Your preferred communication channels (GitHub Discussions, chat, email) and how you’ll ask for help when blocked.
- After GSoC
  - A short paragraph on how you’d like to continue contributing to CocoIndex after the program (maintaining your feature, fixing bugs, writing docs, or mentoring new contributors).
We generally expect the following before we accept a proposal:
- You have successfully built and run CocoIndex locally (Rust and/or Python as appropriate).
- You have made at least one small public contribution to CocoIndex (or a clearly related repo), or you have had a substantial technical discussion with mentors about your idea.
- You have discussed your draft proposal with a mentor and refined scope based on their feedback.
- Check the CocoIndex GSoC page (linked from our README and docs) for:
  - Current GSoC ideas and mentors.
  - Links to our communication channels.
- Start by posting in the public channel or idea discussion with:
  - A short intro,
  - The idea you’re considering, and
  - Any initial questions or draft plans.
Please also read the official GSoC guides on writing and submitting proposals, and submit your proposal early so mentors have time to review and give feedback before the deadline.
We're excited to welcome GSoC contributors to the CocoIndex community!