In this example, we use a CocoIndex Custom Source to fetch recent HackerNews content through the HackerNews API. We build an index of HackerNews threads and their comments, and use an LLM to extract trending topics from the text.
The pipeline uses ExtractByLlm to identify topics like product names, technologies, models, and company names mentioned in threads and comments, storing them in canonical form (avoiding acronyms unless very popular).
We appreciate a star ⭐ at CocoIndex GitHub if this is helpful.
- Custom Source Integration: Fetches HackerNews threads and comments via API
- LLM Topic Extraction: Automatically extracts topics using the `ExtractByLlm` function
- Canonical Topic Forms: Topics are stored in canonical form (e.g., "Large Language Model" instead of "LLM")
- Multiple Query Handlers:
  - `search_by_topic`: Search content by specific topic
  - `get_trending_topics`: Get trending topics ranked by mention count
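To make the two query handlers concrete, here is a minimal in-memory sketch of what they compute. It is not the example's actual implementation, which queries the target database; the `TopicRow` type and the sample rows are hypothetical stand-ins for rows of the topics index.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicRow:
    """Hypothetical stand-in for one (topic, message_id) row."""
    topic: str
    message_id: str


def get_trending_topics(rows, limit=20):
    """Rank topics by the number of distinct messages that mention them."""
    counts = Counter(row.topic for row in set(rows))
    # Sort by count descending, then by topic name for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]


def search_by_topic(rows, topic):
    """Return the sorted IDs of messages that mention the given topic."""
    return sorted({row.message_id for row in rows if row.topic == topic})


# Made-up sample data, for illustration only.
rows = [
    TopicRow("Claude", "thread/1"),
    TopicRow("Claude", "comment/11"),
    TopicRow("PostgreSQL", "thread/1"),
    TopicRow("Rust", "thread/2"),
    TopicRow("Claude", "comment/21"),
]

top_two = get_trending_topics(rows, limit=2)
claude_messages = search_by_topic(rows, "Claude")
```

Counting distinct (topic, message) pairs means a topic repeated several times inside one comment still counts as a single mention.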
- We define a custom source connector `HackerNews` to get recent HackerNews threads by calling the HackerNews API.
- For each thread and comment, we extract topics using an LLM (`ExtractByLlm`).
- We build two indexes:
  - `hn_messages`: Full text of threads and comments
  - `hn_topics`: Extracted topics with references to their source content, keyed by (topic, message_id)
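The flow above can be sketched without CocoIndex at all. The following is a minimal sketch, assuming a thread arrives as a nested dictionary with `id`, `title`, `text`, and `children` fields (an assumed shape, not necessarily what the HackerNews API returns). The `extract_topics` keyword matcher is a hypothetical placeholder for the LLM call, and plain dictionaries stand in for the two indexes.

```python
def iter_messages(thread):
    """Yield (message_id, text) for a thread and all of its nested comments."""
    stack = [thread]
    while stack:
        node = stack.pop()
        # Threads carry a title, comments carry text; join whatever is present.
        text = " ".join(p for p in (node.get("title"), node.get("text")) if p)
        if text:
            yield str(node["id"]), text
        # Push children in reverse so they are visited in their original order.
        stack.extend(reversed(node.get("children") or []))


# Hypothetical topic list; the real pipeline asks an LLM instead.
KNOWN_TOPICS = ("PostgreSQL", "Rust")


def extract_topics(text):
    """Placeholder for LLM extraction: case-insensitive keyword match."""
    lowered = text.lower()
    return [topic for topic in KNOWN_TOPICS if topic.lower() in lowered]


def build_indexes(threads):
    """Build dict stand-ins for the hn_messages and hn_topics indexes."""
    hn_messages = {}  # message_id -> full text
    hn_topics = {}    # (topic, message_id) -> row
    for thread in threads:
        for message_id, text in iter_messages(thread):
            hn_messages[message_id] = text
            for topic in extract_topics(text):
                hn_topics[(topic, message_id)] = {
                    "topic": topic,
                    "message_id": message_id,
                }
    return hn_messages, hn_topics


# Made-up sample thread, for illustration only.
sample_thread = {
    "id": 1,
    "title": "Show HN: A Rust client for PostgreSQL",
    "text": None,
    "children": [
        {"id": 11, "text": "Does it support PostgreSQL 16?", "children": [
            {"id": 111, "text": "Yes.", "children": []},
        ]},
        {"id": 12, "text": "Nice work", "children": []},
    ],
}

hn_messages, hn_topics = build_indexes([sample_thread])
```

Keying `hn_topics` by (topic, message_id) makes each mention idempotent: re-processing the same message overwrites its rows instead of duplicating them.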
Install Postgres if you don't have one.
Install dependencies:
pip install -e .
Update the target:
cocoindex update main
Each time you run the update command, CocoIndex re-processes only the threads that have changed and keeps the target in sync with the 500 most recent threads from HackerNews.
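The incremental behavior described above can be pictured as a diff of per-thread fingerprints between runs. This is a simplified sketch of the idea, not CocoIndex's actual mechanism; the `fingerprint` and `sync` helpers and the in-memory `state` dictionary are hypothetical.

```python
import hashlib
import json


def fingerprint(thread):
    """Stable content hash of a thread (hypothetical change detector)."""
    payload = json.dumps(thread, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def sync(state, threads):
    """Update `state` in place and return the IDs that need re-processing.

    `state` maps thread ID -> fingerprint recorded by the previous run.
    `threads` maps thread ID -> current content of the recent threads.
    """
    changed = []
    for thread_id, thread in threads.items():
        fp = fingerprint(thread)
        if state.get(thread_id) != fp:
            changed.append(thread_id)  # new or modified: re-process
            state[thread_id] = fp
    # Assumption: threads that left the recent window are dropped from the target.
    for thread_id in list(state):
        if thread_id not in threads:
            del state[thread_id]
    return changed


state = {}
first_run = sync(state, {"1": {"title": "A"}, "2": {"title": "B"}})
second_run = sync(state, {"2": {"title": "B, edited"}, "3": {"title": "C"}})
third_run = sync(state, {"2": {"title": "B, edited"}, "3": {"title": "C"}})
```

In this sketch the first run processes every thread, the second touches only the edited and the new one, and a third run over unchanged input does no work.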
You can also run the update command in live mode, which keeps the target in sync with the source continuously:
cocoindex update -L main.py
After running the pipeline, you can query the extracted topics:
# Get trending topics
cocoindex query main.py get_trending_topics --limit 20
# Search content by specific topic
cocoindex query main.py search_by_topic --topic "Claude"
# Search by text content
cocoindex query main.py search_text --query "artificial intelligence"
I used CocoInsight (free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline. It connects to your local CocoIndex server with zero pipeline data retention. Run the following command to start CocoInsight:
cocoindex server -ci -L main
Then open the CocoInsight UI at https://cocoindex.io/cocoinsight.