Merged
22 changes: 22 additions & 0 deletions src/aggregator.js
@@ -75,8 +75,30 @@ function smartTruncate(text, maxLength = 500) {
  return truncated.trim() + '...';
}

/**
 * HTML-escape a string so it is safe to insert into HTML contexts.
 * Converts &, <, and > to their corresponding entities.
 * @param {string} input
 * @returns {string}
 */
function htmlEscape(input) {
  if (!input) {
    return '';
  }
  return input
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Sanitize and process articles
function sanitizeArticle(article, sourceName, tags, category) {
  const rawSummary = htmlEscape(
    article.contentSnippet?.replace(/<[^>]*>/g, '') || ''
  );

  return {
    title: htmlEscape(article.title?.replace(/<[^>]*>/g, '') || '').slice(0, 200) || 'Untitled',
Comment on lines +78 to +101
Copilot AI Jan 8, 2026

HTML-escaping the title and summary at the data layer causes issues in multiple contexts where the data is used:

  1. Markdown generation (lines 199-201): HTML entities such as &lt;, &gt;, and &amp; will appear as literal text in the generated README markdown, making content harder to read.

  2. LinkedIn posts (line 249): If re-enabled, the LinkedIn API would receive HTML entities in the post text, which would be displayed to users as &lt; instead of <.

  3. Reader HTML (reader.html:469): Uses textContent to insert the title. textContent does not decode entities, so pre-escaped text would be shown to users literally (for example, &amp; instead of &).

  4. Stats HTML (stats.html:436): Uses DOMPurify, which expects raw text rather than pre-escaped text, leading to visible entities.

The original regex-based sanitization approach (removing HTML tags) was more appropriate for this use case. If HTML injection is a concern, it should be addressed at the presentation layer (where it's already done with DOMPurify and textContent), not at the data storage layer. Consider reverting the htmlEscape calls and relying on the existing output-side protections.
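
To make the concern concrete, here is a minimal sketch of the two approaches. The escaping function copies the one in the diff; `stripTags`, `toPlainTextPost`, `toHtmlListItem`, and the sample title are hypothetical names invented for illustration and are not taken from `src/aggregator.js`.

```javascript
// Hypothetical helper: the same tag-stripping regex used in the diff.
function stripTags(input) {
  return (input || '').replace(/<[^>]*>/g, '');
}

// Same escaping logic as the htmlEscape function under review.
function htmlEscape(input) {
  if (!input) {
    return '';
  }
  return input
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical plain-text consumer (e.g. a social post body).
function toPlainTextPost(title) {
  return `New article: ${title}`;
}

// Hypothetical HTML consumer that escapes at output time.
function toHtmlListItem(title) {
  return `<li>${htmlEscape(title)}</li>`;
}

const feedTitle = 'Tom & Jerry <em>returns</em>'; // made-up sample input

// Data layer stores plain text only (tags stripped, nothing escaped).
const storedPlain = stripTags(feedTitle);
// Data layer also escapes, as in the PR under review.
const storedEscaped = htmlEscape(stripTags(feedTitle));

console.log(toPlainTextPost(storedPlain));   // New article: Tom & Jerry returns
console.log(toPlainTextPost(storedEscaped)); // New article: Tom &amp; Jerry returns
console.log(toHtmlListItem(storedPlain));    // <li>Tom &amp; Jerry returns</li>
console.log(toHtmlListItem(storedEscaped));  // <li>Tom &amp;amp; Jerry returns</li>
```

With plain text in storage, each consumer applies the escaping its own context needs. With pre-escaped text in storage, plain-text consumers leak entities and any HTML consumer that also escapes ends up double-escaping.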

Suggested change
/**
 * HTML-escape a string so it is safe to insert into HTML contexts.
 * Converts &, <, and > to their corresponding entities.
 * @param {string} input
 * @returns {string}
 */
function htmlEscape(input) {
  if (!input) {
    return '';
  }
  return input
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}
// Sanitize and process articles
function sanitizeArticle(article, sourceName, tags, category) {
  const rawSummary = htmlEscape(
    article.contentSnippet?.replace(/<[^>]*>/g, '') || ''
  );
  return {
    title: htmlEscape(article.title?.replace(/<[^>]*>/g, '') || '').slice(0, 200) || 'Untitled',
// Note: HTML escaping is intentionally handled at the presentation layer
// (e.g., via textContent and DOMPurify). At the data layer we only strip
// HTML tags (see sanitizeText) and keep the underlying text unescaped.
// Sanitize and process articles
function sanitizeArticle(article, sourceName, tags, category) {
  const rawSummary = sanitizeText(article.contentSnippet) || '';

  return {