Skip to content

Potential fix for code scanning alert no. 12: Incomplete multi-character sanitization#11

Merged
EthanThePhoenix38 merged 2 commits intomainfrom
alert-autofix-12
Jan 8, 2026
Merged

Potential fix for code scanning alert no. 12: Incomplete multi-character sanitization#11
EthanThePhoenix38 merged 2 commits intomainfrom
alert-autofix-12

Conversation

@EthanThePhoenix38
Copy link
Member

Potential fix for https://github.com/ThePhoenixAgency/AI-Pulse/security/code-scanning/12

In general, the safest fix is to stop doing ad-hoc HTML sanitization with hand-written regexes and instead either: (a) rely on proper HTML escaping at the sink, or (b) use a well-tested sanitization/escaping library when you need HTML-safe output. Here, the analyzer specifically worries about <script surviving the regex; we can mitigate this within the shown snippet by (1) using a more robust sanitization approach for article.title and rawSummary, or (2) explicitly neutralizing any < / > characters that remain after the current regex.

Given the constraints (only editing this file, minimal behavior change, and no knowledge of how titles are rendered later), the safest local change is to ensure the resulting title and summary cannot introduce HTML elements at all. We can do this by HTML-escaping <, >, and & in the text after stripping tags, which preserves the visible text but makes it safe for insertion into HTML. Concretely:

  • Factor out a small helper htmlEscape that replaces &, <, and > with their HTML entities. This uses only standard JS, no extra imports.
  • Use this helper in sanitizeArticle:
    • For rawSummary: keep the existing replace(/<[^>]*>/g, '') and then htmlEscape the result before passing to smartTruncate.
    • For title: apply the same replace and then htmlEscape, then slice(0, 200).
  • This ensures that even if something like <script or stray < characters remain or are introduced by edge cases, they will be rendered as text and not interpreted as HTML.

All changes are within src/aggregator.js, near sanitizeArticle. We add the helper above sanitizeArticle and update the lines building rawSummary and title accordingly.

Suggested fixes powered by Copilot Autofix. Review carefully before merging.

…ter sanitization

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
@EthanThePhoenix38
Copy link
Member Author

Il faut tout corriger.

@EthanThePhoenix38 EthanThePhoenix38 marked this pull request as ready for review January 8, 2026 20:33
Copilot AI review requested due to automatic review settings January 8, 2026 20:33
@EthanThePhoenix38
Copy link
Member Author

@copilot Corrige tout

Copy link
Contributor

Copilot AI commented Jan 8, 2026

@EthanThePhoenix38 I've opened a new pull request, #17, to work on those changes. Once the pull request is ready, I'll request review from you.

@EthanThePhoenix38 EthanThePhoenix38 merged commit aef78d4 into main Jan 8, 2026
5 checks passed
@EthanThePhoenix38 EthanThePhoenix38 deleted the alert-autofix-12 branch January 8, 2026 20:36
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR attempts to fix a code scanning alert (#12) about incomplete multi-character sanitization by adding HTML escaping to article titles and summaries in the RSS feed aggregation pipeline. The change introduces a new htmlEscape function that converts &, <, and > to their HTML entity equivalents (&amp;, &lt;, &gt;) and applies this escaping to the data layer.

Key changes:

  • Added htmlEscape helper function to convert special HTML characters to entities
  • Applied HTML escaping to article titles and content snippets after tag removal
  • Modified the sanitization flow to escape HTML entities at the data storage layer

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +78 to +101
/**
* HTML-escape a string so it is safe to insert into HTML contexts.
* Converts &, <, and > to their corresponding entities.
* @param {string} input
* @returns {string}
*/
function htmlEscape(input) {
if (!input) {
return '';
}
return input
.replace(/&/g, '&amp;')
.replace(/</g, '&lt;')
.replace(/>/g, '&gt;');
}

// Sanitize and process articles
function sanitizeArticle(article, sourceName, tags, category) {
const rawSummary = htmlEscape(
article.contentSnippet?.replace(/<[^>]*>/g, '') || ''
);

return {
title: htmlEscape(article.title?.replace(/<[^>]*>/g, '') || '').slice(0, 200) || 'Untitled',
Copy link

Copilot AI Jan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HTML-escaping the title and summary at the data layer causes issues in multiple contexts where the data is used:

  1. Markdown generation (line 199-201): HTML entities like &lt;, &gt;, and &amp; will appear as literal text in the generated README markdown, making content harder to read.

  2. LinkedIn posts (line 249): If re-enabled, the LinkedIn API would receive HTML entities in the post text, which would be displayed to users as &lt; instead of <.

  3. Reader HTML (reader.html:469): Uses textContent to insert the title, which means HTML entities would be double-escaped and shown to users literally.

  4. Stats HTML (stats.html:436): Uses DOMPurify which expects raw text, not pre-escaped text, leading to visible entities.

The original regex-based sanitization approach (removing HTML tags) was more appropriate for this use case. If HTML injection is a concern, it should be addressed at the presentation layer (where it's already done with DOMPurify and textContent), not at the data storage layer. Consider reverting the htmlEscape calls and relying on the existing output-side protections.

Suggested change
/**
* HTML-escape a string so it is safe to insert into HTML contexts.
* Converts &, <, and > to their corresponding entities.
* @param {string} input
* @returns {string}
*/
function htmlEscape(input) {
if (!input) {
return '';
}
return input
.replace(/&/g, '&amp;')
.replace(/</g, '&lt;')
.replace(/>/g, '&gt;');
}
// Sanitize and process articles
function sanitizeArticle(article, sourceName, tags, category) {
const rawSummary = htmlEscape(
article.contentSnippet?.replace(/<[^>]*>/g, '') || ''
);
return {
title: htmlEscape(article.title?.replace(/<[^>]*>/g, '') || '').slice(0, 200) || 'Untitled',
// Note: HTML escaping is intentionally handled at the presentation layer
// (e.g., via textContent and DOMPurify). At the data layer we only strip
// HTML tags (see sanitizeText) and keep the underlying text unescaped.
// Sanitize and process articles
function sanitizeArticle(article, sourceName, tags, category) {

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants