
Conversation

@alltheseas
Contributor

@alltheseas alltheseas commented Dec 25, 2025

Summary

Adds dynamic sitemap generation and robots.txt to improve search engine discoverability of Nostr content.

Closes #26

New Routes

| Route | Content-Type | Cache | Description |
|---|---|---|---|
| `/robots.txt` | `text/plain` | 24h | Crawler directives + sitemap reference |
| `/sitemap.xml` | `application/xml` | 1h | Dynamic sitemap from nostrdb cache |

Sitemap Content

Queries local nostrdb cache for:

  • Notes (kind:1) → note1xxx URLs, 90-day lookback
  • Articles (kind:30023) → naddr1xxx URLs, 365-day lookback (evergreen content)
  • Profiles (kind:0) → npub1xxx URLs

Limited to 10,000 URLs per content type.
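The query-then-render flow above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `SitemapEntry` and `render_sitemap` are hypothetical names, and the real module builds its `loc` values from nostrdb query results (note1/naddr1/npub1 URLs).

```rust
// Hypothetical sketch: assemble sitemap <url> entries into an XML document.
struct SitemapEntry {
    loc: String,     // absolute, already XML-escaped URL
    lastmod: String, // YYYY-MM-DD
}

fn render_sitemap(entries: &[SitemapEntry]) -> String {
    let mut xml = String::from(
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n\
         <urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n",
    );
    for e in entries {
        xml.push_str(&format!(
            "  <url><loc>{}</loc><lastmod>{}</lastmod></url>\n",
            e.loc, e.lastmod
        ));
    }
    xml.push_str("</urlset>\n");
    xml
}

fn main() {
    let entries = vec![SitemapEntry {
        loc: "https://damus.io/note1example".into(),
        lastmod: "2025-12-25".into(),
    }];
    let xml = render_sitemap(&entries);
    assert!(xml.contains("<loc>https://damus.io/note1example</loc>"));
    assert!(xml.ends_with("</urlset>\n"));
    println!("{xml}");
}
```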

robots.txt

```
User-agent: *
Allow: /
Allow: /.well-known/nostr.json
Disallow: /metrics
Disallow: /*.json

Sitemap: https://damus.io/sitemap.xml
```

Prometheus Metrics

| Metric | Type | Description |
|---|---|---|
| `sitemap_generations_total` | counter | Generation requests |
| `sitemap_generation_duration_seconds` | gauge | Time to generate |
| `sitemap_urls_total` | gauge | Total URLs |
| `sitemap_notes_count` | gauge | Notes included |
| `sitemap_articles_count` | gauge | Articles included |
| `sitemap_profiles_count` | gauge | Profiles included |

Configuration

```
NOTECRUMBS_BASE_URL=https://damus.io  # Required for correct sitemap URLs
```
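The commit history mentions caching this value with `OnceLock` so the missing-variable warning is logged only once. A minimal sketch of that pattern follows; the fallback URL and warning text here are hypothetical, not the PR's actual defaults:

```rust
use std::sync::OnceLock;

static BASE_URL: OnceLock<String> = OnceLock::new();

// Resolve the env var once, warn once if absent, and normalize
// trailing slashes. Fallback value is illustrative only.
fn base_url() -> &'static str {
    BASE_URL.get_or_init(|| match std::env::var("NOTECRUMBS_BASE_URL") {
        Ok(url) => url.trim_end_matches('/').to_string(),
        Err(_) => {
            eprintln!("warning: NOTECRUMBS_BASE_URL not set; sitemap URLs may be wrong");
            "http://localhost:8080".to_string()
        }
    })
}

fn main() {
    let url = base_url();
    assert!(!url.ends_with('/')); // normalization invariant
    println!("sitemap base URL: {url}");
}
```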

Test plan

  • Verify /robots.txt returns correct content
  • Verify /sitemap.xml generates valid XML
  • Check metrics appear at /metrics
  • Submit sitemap to Google Search Console after deploy

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added /sitemap.xml endpoint that generates dynamic sitemaps including homepage, recent notes, articles, and user profiles for search engine indexing.
    • Added /robots.txt endpoint to guide search engine crawlers.
    • Implemented caching strategies to optimize performance and server load.


alltheseas and others added 6 commits December 24, 2025 22:34
Add dynamic sitemap generation from nostrdb cache to improve search
engine discoverability of Nostr content.

New routes:
- GET /robots.txt - crawler directives with sitemap reference
- GET /sitemap.xml - dynamic sitemap from cached notes/profiles/articles

The sitemap queries local nostrdb for:
- Notes (kind:1) → note1xxx URLs
- Long-form articles (kind:30023) → naddr1xxx URLs
- Profiles (kind:0) → npub1xxx URLs

Ref: damus-io#26

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Track aggregate stats (privacy-preserving, no user tracking):
- sitemap_generations_total: counter for generation requests
- sitemap_generation_duration_seconds: time to generate
- sitemap_urls_total: total URLs in sitemap
- sitemap_notes_count: notes included
- sitemap_articles_count: articles included
- sitemap_profiles_count: profiles included

Metrics available at /metrics endpoint.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Skip kind:30023 entries with missing/empty d-tag to avoid ambiguous
  URLs and potential collisions across authors
- Add since filter (90 days) to notes and articles queries to prioritize
  recent content for SEO freshness
- Log warning when NOTECRUMBS_BASE_URL is not set, to surface potential
  misconfiguration in production

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Cache base URL with OnceLock to avoid logging warning on every request
- Use separate lookback periods: 90 days for notes, 365 days for
  evergreen article content (kind:30023)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use early returns and let-else patterns to reduce nesting depth in
generate_sitemap loops. Improves readability by making the happy path
linear instead of deeply indented.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@alltheseas
Contributor Author

@coderabbitai review please

@coderabbitai

coderabbitai bot commented Jan 16, 2026

@alltheseas I'll review the changes in this PR for you.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai bot commented Jan 16, 2026

📝 Walkthrough

Walkthrough

Introduces sitemap and robots.txt generation by adding a new sitemap module and two HTTP routes. The sitemap module queries the database for notes, articles, and profiles to generate an XML sitemap, and produces a standard robots.txt. The routes serve these responses with appropriate caching headers and error handling.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **HTTP Routes**<br>`src/main.rs` | Added two new routes: GET /robots.txt (returns robots.txt with long-term caching) and GET /sitemap.xml (queries database to generate sitemap XML with short-term caching; logs and returns 500 on generation errors). Module declaration for sitemap added. |
| **Sitemap Generation**<br>`src/sitemap.rs` | New module implementing generate_sitemap() that queries notes (kind 1, 90-day lookback), articles (kind 30023, 365-day lookback), and profiles (kind 0), building an XML sitemap with proper escaping and ISO date formatting. Implements generate_robots_txt() for standard robots.txt output. Includes base URL resolution from NOTECRUMBS_BASE_URL environment variable, date utilities with leap year handling, metrics integration, and comprehensive test coverage. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A sitemap hops into view,
XML paths for Google to pursue,
Robots welcomed with open paws,
Notes and articles indexed with cause! 📝✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped; CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The PR title accurately and concisely describes the main feature being added: sitemap.xml and robots.txt for SEO purposes. |
| Linked Issues Check | ✅ Passed | The PR implements all key requirements from issue #26: provides sitemap.xml for search engine discovery, enables Google Search Console integration, and facilitates search as a user-acquisition channel. |
| Out of Scope Changes Check | ✅ Passed | All changes directly support sitemap.xml and robots.txt generation as specified in issue #26; no unrelated modifications detected. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 86.67%, which is sufficient. The required threshold is 80.00%. |


✨ Finishing touches
  • 📝 Generate docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/sitemap.rs`:
- Around line 269-276: Update the metrics calls to use the metrics 0.21.0 API:
replace the incorrect usage of counter!("sitemap_generations_total", 1) with
either the increment_counter! convenience macro or call
counter!("sitemap_generations_total").increment(1), and replace
gauge!("sitemap_generation_duration_seconds", duration.as_secs_f64()) and the
other gauge calls for "sitemap_urls_total", "sitemap_notes_count",
"sitemap_articles_count", and "sitemap_profiles_count" with
gauge("...").set(value) (or use a suitable gauge-set helper) so the macros
return handles before invoking .increment(...) or .set(...); use the existing
local variables start, duration (computed from start.elapsed()), entries.len(),
notes_count, articles_count, and profiles_count as the values passed to the
handle methods.
🧹 Nitpick comments (2)
src/sitemap.rs (2)

71-79: Redundant datetime conversion.

Lines 75-79 create a datetime from timestamp, then immediately extract the same value back as secs_since_epoch. This is unnecessary; you can use timestamp directly.

♻️ Suggested simplification
```diff
 fn format_lastmod(timestamp: u64) -> String {
-    use std::time::{Duration, UNIX_EPOCH};
-
-    let datetime = UNIX_EPOCH + Duration::from_secs(timestamp);
-    let secs_since_epoch = datetime
-        .duration_since(UNIX_EPOCH)
-        .unwrap_or_default()
-        .as_secs();
-
     // Simple date formatting without external dependencies
-    let days_since_epoch = secs_since_epoch / 86400;
+    let days_since_epoch = timestamp / 86400;
     let mut year = 1970i32;
     let mut remaining_days = days_since_epoch as i32;
```

160-160: Consider logging query failures for observability.

Using unwrap_or_default() provides resilience, but query failures are silently ignored. Consider logging a warning when queries fail to help diagnose issues in production. The same applies to lines 187 and 235.

♻️ Optional: Log query errors
```diff
-    let results = ndb.query(&txn, &[notes_filter], MAX_SITEMAP_URLS as i32).unwrap_or_default();
+    let results = ndb
+        .query(&txn, &[notes_filter], MAX_SITEMAP_URLS as i32)
+        .unwrap_or_else(|e| {
+            tracing::warn!("Failed to query notes for sitemap: {e}");
+            vec![]
+        });
```
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ad9d797 and 34db03a.

📒 Files selected for processing (2)
  • src/main.rs
  • src/sitemap.rs
🧰 Additional context used
🧬 Code graph analysis (1)
src/main.rs (1)
src/sitemap.rs (2)
  • generate_robots_txt (282-294)
  • generate_sitemap (129-279)
🔇 Additional comments (10)
src/sitemap.rs (7)

1-22: LGTM!

Constants are well-documented and reasonable. Using OnceLock for lazy initialization of the base URL is appropriate for avoiding repeated environment variable lookups.


24-44: LGTM!

The lazy initialization pattern with a one-time warning for missing configuration is a good approach. The normalization correctly handles trailing slashes.


46-53: LGTM!

Good defensive coding with unwrap_or_default() and saturating_sub() to handle edge cases gracefully.


55-69: LGTM!

Correct XML escaping implementation covering all required special characters with appropriate capacity pre-allocation.
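For illustration, escaping of this kind typically covers the five XML special characters that must be entity-encoded inside `<loc>` values. A self-contained sketch (not the PR's exact code):

```rust
// Escape the five XML special characters for use in sitemap <loc> values.
fn escape_xml(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&apos;"),
            _ => out.push(c),
        }
    }
    out
}

fn main() {
    assert_eq!(escape_xml("a&b<c>"), "a&amp;b&lt;c&gt;");
    // Query strings are the common case for sitemap URLs:
    assert_eq!(
        escape_xml("https://damus.io/?a=1&b=2"),
        "https://damus.io/?a=1&amp;b=2"
    );
    println!("ok");
}
```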


80-118: LGTM!

The date calculation logic is correct, including proper leap year handling with the standard formula.
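A dependency-free days-to-date conversion of the kind the review describes can be sketched like this; the function name and loop structure are illustrative, not the PR's exact implementation:

```rust
/// Convert days since 1970-01-01 to (year, month, day),
/// handling leap years with the standard Gregorian rule.
fn days_to_ymd(mut days: i64) -> (i32, u32, u32) {
    fn is_leap(y: i32) -> bool {
        (y % 4 == 0 && y % 100 != 0) || y % 400 == 0
    }
    let mut year = 1970;
    loop {
        let len = if is_leap(year) { 366 } else { 365 };
        if days < len {
            break;
        }
        days -= len;
        year += 1;
    }
    let month_lens = [
        31, if is_leap(year) { 29 } else { 28 }, 31, 30, 31, 30,
        31, 31, 30, 31, 30, 31,
    ];
    let mut month = 0;
    while days >= month_lens[month] {
        days -= month_lens[month];
        month += 1;
    }
    (year, (month + 1) as u32, (days + 1) as u32)
}

fn main() {
    assert_eq!(days_to_ymd(0), (1970, 1, 1));
    // 2000-02-29 is 11016 days after the epoch (2000 is a leap year).
    assert_eq!(days_to_ymd(11016), (2000, 2, 29));
    println!("ok");
}
```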


281-294: LGTM!

The robots.txt output is well-structured with appropriate directives. The wildcard pattern /*.json is supported by major search engines (Google, Bing).


296-356: LGTM!

Good test coverage for the utility functions with meaningful test cases including edge cases like leap years.

src/main.rs (3)

36-36: LGTM!

Module declaration is correctly placed.


143-150: LGTM!

The route follows the established pattern for static content routes. The 24-hour cache duration is appropriate for robots.txt.


151-167: LGTM!

Good error handling pattern with logging before returning 500. The 1-hour cache is appropriate for dynamic content, and the XML content type is correct.

Note: generate_sitemap performs synchronous database queries, which is consistent with how other routes in this file use ndb. If sitemap generation becomes slow with larger databases, consider offloading to a blocking task with tokio::task::spawn_blocking.


Comment on lines +269 to +276
```rust
// Record metrics (aggregate stats, not user-tracking)
let duration = start.elapsed();
metrics::counter!("sitemap_generations_total", 1);
metrics::gauge!("sitemap_generation_duration_seconds", duration.as_secs_f64());
metrics::gauge!("sitemap_urls_total", entries.len() as f64);
metrics::gauge!("sitemap_notes_count", notes_count as f64);
metrics::gauge!("sitemap_articles_count", articles_count as f64);
metrics::gauge!("sitemap_profiles_count", profiles_count as f64);
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```shell
# Check for metrics crate version in Cargo.toml
find . -name "Cargo.toml" -type f | head -5 | xargs grep -l "metrics" 2>/dev/null
```

Repository: damus-io/notecrumbs

Length of output: 75


🏁 Script executed:

```shell
# Also verify the actual code at the specified lines
if [ -f "src/sitemap.rs" ]; then
  sed -n '269,276p' src/sitemap.rs
fi
```

Repository: damus-io/notecrumbs

Length of output: 565


🌐 Web query:

metrics crate Rust 0.21.0 counter macro syntax increment

💡 Result:

Use the counter! macro to get a Counter handle, then call increment(n); or use the convenience macro increment_counter! to add 1 (or pass labels). Examples:

  • Get handle and increment by n:
    let c = counter!("my_metric");
    c.increment(5);
    [1]

  • Increment by 1 directly:
    increment_counter!("my_metric");
    // or with labels:
    increment_counter!("my_metric", "service" => "http");
    [2]

Docs (macro reference and examples) show the counter! and increment_counter! usage. [1] [2]

Sources:
[1] docs.rs: metrics::counter! macro.
[2] docs.rs / metrics macro increment_counter! (examples).


🏁 Script executed:

```shell
# Read the Cargo.toml to find the metrics version
cat Cargo.toml | grep -A 2 "^\[dependencies\]" | head -20
```

Repository: damus-io/notecrumbs

Length of output: 147


🏁 Script executed:

```shell
# Or search more directly for metrics
grep "metrics" Cargo.toml
```

Repository: damus-io/notecrumbs

Length of output: 116


🌐 Web query:

metrics crate Rust gauge macro syntax 0.21

💡 Result:

The gauge! macro (as of metrics 0.21) accepts these forms:

  • gauge!( $name:expr $(, $label_key:expr $(=> $label_value:expr)? )* $(,)? )
  • gauge!( level: $level:expr, $name:expr $(, $label_key:expr $(=> $label_value:expr)? )* $(,)? )
  • gauge!( target: $target:expr, $name:expr $(, $label_key:expr $(=> $label_value:expr)? )* $(,)? )
  • gauge!( target: $target:expr, level: $level:expr, $name:expr $(, $label_key:expr $(=> $label_value:expr)? )* $(,)? )

Examples:

  • let g = gauge!("my_gauge");
  • let g = gauge!("my_gauge", "service" => "http");
  • let g = gauge!(level: Level::DEBUG, "my_gauge");
  • let g = gauge!(format!("{}{}", prefix, "g"), &labels);

(You get a Gauge handle which supports .increment(), .decrement(), .set().) (docs.rs)

If you want the docs page for that macro/version, see the metrics macro docs and the 0.21.0 changelog. (docs.rs)



Fix metrics macro syntax for version 0.21.0.

The counter!() and gauge!() macros in metrics 0.21.0 return handles that require method calls. The current syntax is incorrect:

  • counter!("name", value) should be counter!("name").increment(value) (or use increment_counter!("name") convenience macro)
  • gauge!("name", value) should be gauge!("name").set(value)

All metrics calls in lines 269-276 need to be updated to use the correct API.
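Per the review's finding for metrics 0.21, the fix amounts to switching each call to the handle-returning form (shown for two of the calls; the remaining gauges follow the same pattern):

```diff
-    metrics::counter!("sitemap_generations_total", 1);
-    metrics::gauge!("sitemap_generation_duration_seconds", duration.as_secs_f64());
+    metrics::counter!("sitemap_generations_total").increment(1);
+    metrics::gauge!("sitemap_generation_duration_seconds").set(duration.as_secs_f64());
```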


Successfully merging this pull request may close issue: SEO: sitemap.xml