Potential fix for code scanning alert no. 12: Incomplete multi-character sanitization#11
Conversation
…ter sanitization Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
|
Il faut tout corriger. |
|
@copilot Corrige tout |
|
@EthanThePhoenix38 I've opened a new pull request, #17, to work on those changes. Once the pull request is ready, I'll request review from you. |
There was a problem hiding this comment.
Pull request overview
This PR attempts to fix a code scanning alert (#12) about incomplete multi-character sanitization by adding HTML escaping to article titles and summaries in the RSS feed aggregation pipeline. The change introduces a new htmlEscape function that converts &, <, and > to their HTML entity equivalents (&, <, >) and applies this escaping to the data layer.
Key changes:
- Added
htmlEscapehelper function to convert special HTML characters to entities - Applied HTML escaping to article titles and content snippets after tag removal
- Modified the sanitization flow to escape HTML entities at the data storage layer
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /** | ||
| * HTML-escape a string so it is safe to insert into HTML contexts. | ||
| * Converts &, <, and > to their corresponding entities. | ||
| * @param {string} input | ||
| * @returns {string} | ||
| */ | ||
| function htmlEscape(input) { | ||
| if (!input) { | ||
| return ''; | ||
| } | ||
| return input | ||
| .replace(/&/g, '&') | ||
| .replace(/</g, '<') | ||
| .replace(/>/g, '>'); | ||
| } | ||
|
|
||
| // Sanitize and process articles | ||
| function sanitizeArticle(article, sourceName, tags, category) { | ||
| const rawSummary = htmlEscape( | ||
| article.contentSnippet?.replace(/<[^>]*>/g, '') || '' | ||
| ); | ||
|
|
||
| return { | ||
| title: htmlEscape(article.title?.replace(/<[^>]*>/g, '') || '').slice(0, 200) || 'Untitled', |
There was a problem hiding this comment.
HTML-escaping the title and summary at the data layer causes issues in multiple contexts where the data is used:
-
Markdown generation (line 199-201): HTML entities like
<,>, and&will appear as literal text in the generated README markdown, making content harder to read. -
LinkedIn posts (line 249): If re-enabled, the LinkedIn API would receive HTML entities in the post text, which would be displayed to users as
<instead of<. -
Reader HTML (reader.html:469): Uses
textContentto insert the title, which means HTML entities would be double-escaped and shown to users literally. -
Stats HTML (stats.html:436): Uses DOMPurify which expects raw text, not pre-escaped text, leading to visible entities.
The original regex-based sanitization approach (removing HTML tags) was more appropriate for this use case. If HTML injection is a concern, it should be addressed at the presentation layer (where it's already done with DOMPurify and textContent), not at the data storage layer. Consider reverting the htmlEscape calls and relying on the existing output-side protections.
| /** | |
| * HTML-escape a string so it is safe to insert into HTML contexts. | |
| * Converts &, <, and > to their corresponding entities. | |
| * @param {string} input | |
| * @returns {string} | |
| */ | |
| function htmlEscape(input) { | |
| if (!input) { | |
| return ''; | |
| } | |
| return input | |
| .replace(/&/g, '&') | |
| .replace(/</g, '<') | |
| .replace(/>/g, '>'); | |
| } | |
| // Sanitize and process articles | |
| function sanitizeArticle(article, sourceName, tags, category) { | |
| const rawSummary = htmlEscape( | |
| article.contentSnippet?.replace(/<[^>]*>/g, '') || '' | |
| ); | |
| return { | |
| title: htmlEscape(article.title?.replace(/<[^>]*>/g, '') || '').slice(0, 200) || 'Untitled', | |
| // Note: HTML escaping is intentionally handled at the presentation layer | |
| // (e.g., via textContent and DOMPurify). At the data layer we only strip | |
| // HTML tags (see sanitizeText) and keep the underlying text unescaped. | |
| // Sanitize and process articles | |
| function sanitizeArticle(article, sourceName, tags, category) { |
Potential fix for https://github.com/ThePhoenixAgency/AI-Pulse/security/code-scanning/12
In general, the safest fix is to stop doing ad-hoc HTML sanitization with hand-written regexes and instead either: (a) rely on proper HTML escaping at the sink, or (b) use a well-tested sanitization/escaping library when you need HTML-safe output. Here, the analyzer specifically worries about
<scriptsurviving the regex; we can mitigate this within the shown snippet by (1) using a more robust sanitization approach forarticle.titleandrawSummary, or (2) explicitly neutralizing any</>characters that remain after the current regex.Given the constraints (only editing this file, minimal behavior change, and no knowledge of how titles are rendered later), the safest local change is to ensure the resulting
titleandsummarycannot introduce HTML elements at all. We can do this by HTML-escaping<,>, and&in the text after stripping tags, which preserves the visible text but makes it safe for insertion into HTML. Concretely:htmlEscapethat replaces&,<, and>with their HTML entities. This uses only standard JS, no extra imports.sanitizeArticle:rawSummary: keep the existingreplace(/<[^>]*>/g, '')and thenhtmlEscapethe result before passing tosmartTruncate.title: apply the samereplaceand thenhtmlEscape, thenslice(0, 200).<scriptor stray<characters remain or are introduced by edge cases, they will be rendered as text and not interpreted as HTML.All changes are within
src/aggregator.js, nearsanitizeArticle. We add the helper abovesanitizeArticleand update the lines buildingrawSummaryandtitleaccordingly.Suggested fixes powered by Copilot Autofix. Review carefully before merging.