Merged
68 changes: 62 additions & 6 deletions src/aggregator.js
@@ -260,6 +260,14 @@ function addUTMParams(url, category = 'general') {
if (mediumHosts.includes(hostname)) {
url = `https://freedium.cloud/${url}`;
}

// List of domains with strict paywalls
const paywalledHosts = ['ft.com', 'wsj.com', 'economist.com', 'bloomberg.com', 'investing.com'];

// Add Archive.ph as a query parameter for fallback
if (paywalledHosts.some(host => hostname.includes(host))) {
Copilot AI Feb 18, 2026

The hostname matching uses includes() which is vulnerable to subdomain bypass attacks. For example, a malicious domain like "evil-ft.com.attacker.com" would match "ft.com". Use the existing hostnameMatches() helper function (defined at line 584) which implements secure suffix-based matching, or use the pattern: host === domain || host.endsWith('.'+domain).

Suggested change
if (paywalledHosts.some(host => hostname.includes(host))) {
if (paywalledHosts.some(domain => hostname === domain || hostname.endsWith('.' + domain))) {
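For illustration, the suffix-based matching the review describes can be sketched as a small helper; the actual `hostnameMatches()` at line 584 is not shown in this diff, so this stand-in is an assumption:

```javascript
// Minimal sketch of secure suffix-based hostname matching, as the review
// suggests. The real hostnameMatches() helper (line 584) may differ.
function hostnameMatches(hostname, domain) {
  // Exact match, or a dot boundary followed by the domain suffix:
  // "www.ft.com" matches "ft.com"; "evil-ft.com.attacker.com" does not.
  return hostname === domain || hostname.endsWith('.' + domain);
}

// Usage against the PR's paywalledHosts list:
const paywalledHosts = ['ft.com', 'wsj.com', 'economist.com', 'bloomberg.com', 'investing.com'];
const isPaywalled = (hostname) => paywalledHosts.some((d) => hostnameMatches(hostname, d));
```

The dot boundary is what defeats the bypass: `includes()` matches anywhere in the string, while `endsWith('.' + domain)` only matches a true subdomain.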

// Don't change the URL here; it will be used as a fallback
}
Comment on lines +263 to +270
Copilot AI Feb 18, 2026

This code block checks for paywalled hosts but doesn't perform any action. The comment indicates the URL will be used as a fallback, but no fallback mechanism is implemented here or referenced elsewhere in the addUTMParams function. This check should either implement the fallback logic or be removed as dead code.

Suggested change
// List of domains with strict paywalls
const paywalledHosts = ['ft.com', 'wsj.com', 'economist.com', 'bloomberg.com', 'investing.com'];
// Add Archive.ph as a query parameter for fallback
if (paywalledHosts.some(host => hostname.includes(host))) {
// Don't change the URL here; it will be used as a fallback
}

} catch (e) {
// URL parsing error; continue without modification
}
@@ -1408,9 +1416,42 @@ async function processArticle(article, sourceName, tags, category, feedLang) {
}
const lang = detectedLang || feedLang || 'en';

// Try Archive.ph or other bypass services
const tryPaywallBypass = async (url) => {
const bypassServices = [
{
name: 'archive.ph',
transform: (u) => `https://archive.ph/?url=${encodeURIComponent(u)}`
},
{
name: 'scribe.rip',
transform: (u) => u.includes('medium.com') ? u.replace('medium.com', 'scribe.rip') : null
Copilot AI Feb 18, 2026

The scribe.rip transform uses a simple string replacement with includes() and replace(). This approach is unsafe as it doesn't properly validate hostnames and could match partial strings. For example, "notmedium.com" would match and be incorrectly transformed. Use the hostnameMatches() helper function to ensure proper domain matching.

Suggested change
transform: (u) => u.includes('medium.com') ? u.replace('medium.com', 'scribe.rip') : null
transform: (u) => {
try {
const urlObj = new URL(u);
if (!hostnameMatches || !hostnameMatches(urlObj.hostname, 'medium.com')) {
return null;
}
urlObj.hostname = 'scribe.rip';
return urlObj.toString();
} catch {
return null;
}
}

},
{
name: 'web.archive.org',
transform: (u) => `https://web.archive.org/web/*/${u}`
Comment on lines +1429 to +1432
Copilot AI Feb 18, 2026

The web.archive.org URL pattern uses a wildcard timestamp selector which will attempt to fetch a search results page rather than a specific archived version. This is unlikely to work for content extraction. Consider either using a specific timestamp API or removing this service as it won't reliably return article content that can be parsed by Readability.

Suggested change
},
{
name: 'web.archive.org',
transform: (u) => `https://web.archive.org/web/*/${u}`

}
];

for (const service of bypassServices) {
const bypassUrl = service.transform(url);
Copilot AI Feb 18, 2026

The URL parameter passed to the transform functions is not validated before being used in external service URLs. The URL should be validated using normalizeExternalUrl() (defined at line 319) to ensure it has a valid http/https protocol before constructing bypass service URLs. This prevents potential URL injection attacks.
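As a hedged sketch of the validation this comment asks for: `normalizeExternalUrl()` (line 319) is not shown in the diff, so the `isSafeExternalUrl` stand-in below is an assumption about what such a guard might check, not the PR's actual helper:

```javascript
// Assumed protocol check before building a bypass-service URL; the real
// normalizeExternalUrl() at line 319 may do more (or different) validation.
function isSafeExternalUrl(raw) {
  try {
    const { protocol } = new URL(raw);
    return protocol === 'http:' || protocol === 'https:';
  } catch {
    return false; // unparseable input is never passed to an external service
  }
}

// A transform would then bail out early on anything else:
const toArchiveUrl = (u) =>
  isSafeExternalUrl(u) ? `https://archive.ph/?url=${encodeURIComponent(u)}` : null;
```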

if (!bypassUrl) continue;
try {
const response = await axios.get(bypassUrl, {
timeout: 5000,
headers: { 'User-Agent': 'AI-Pulse/3.0' }
});
return { success: true, html: response.data, service: service.name };
Comment on lines +1421 to +1444
Copilot AI Feb 18, 2026

The bypass services (archive.ph, scribe.rip, web.archive.org) are external third-party services that may have rate limits, availability issues, or terms of service restrictions on automated access. There's no rate limiting, caching, or error tracking implemented. Consider: 1) implementing rate limiting to avoid being blocked, 2) caching successful bypass results to avoid repeated requests, 3) tracking service availability to skip consistently failing services, and 4) reviewing each service's terms of service for compliance with automated access policies.
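The caching suggestion (point 2) could look like the sketch below. All names (`bypassCache`, `getCachedBypass`, `setCachedBypass`) and the one-hour TTL are illustrative assumptions, not code from the PR:

```javascript
// Illustrative in-memory TTL cache for bypass results, per the review's
// suggestion; a persistent or shared cache would need a different design.
const bypassCache = new Map();
const BYPASS_CACHE_TTL_MS = 60 * 60 * 1000; // assumed 1-hour freshness window

function getCachedBypass(url) {
  const entry = bypassCache.get(url);
  if (entry && Date.now() - entry.at < BYPASS_CACHE_TTL_MS) return entry.result;
  bypassCache.delete(url); // expired or missing
  return null;
}

function setCachedBypass(url, result) {
  bypassCache.set(url, { result, at: Date.now() });
}
```

Checking `getCachedBypass(url)` before looping over `bypassServices` would avoid re-hitting the same third-party endpoints for articles seen in recent runs.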

} catch (e) {
// Continue to the next service
Copilot AI Feb 18, 2026

The error handling silently continues to the next service without any logging. This makes debugging difficult when paywall bypass attempts fail. Consider adding debug logging (similar to the pattern used in shouldSuppressExtractionLog) to track which services were attempted and why they failed, especially useful for improving the bypass logic over time.

Suggested change
// Continue to the next service
// Continue to the next service, but log in debug mode if enabled
if (process.env.DEBUG_PAYWALL_BYPASS === '1' || process.env.DEBUG_PAYWALL_BYPASS === 'true') {
console.debug(
`[paywall-bypass] Service "${service.name}" failed for URL "${url}" (bypass URL: "${bypassUrl}"):`,
e && e.message ? e.message : e
);
}

}
}
return { success: false };
Comment on lines +1436 to +1449
Copilot AI Feb 18, 2026

The 5-second timeout for each bypass service could lead to significant delays when processing articles. With 3 services configured, a paywalled article could take up to 15 seconds before falling back. Consider: 1) reducing the timeout to 3000ms, 2) implementing parallel requests with Promise.race(), or 3) adding a configurable overall timeout for the entire bypass attempt to prevent feed processing delays.

Suggested change
for (const service of bypassServices) {
const bypassUrl = service.transform(url);
if (!bypassUrl) continue;
try {
const response = await axios.get(bypassUrl, {
timeout: 5000,
headers: { 'User-Agent': 'AI-Pulse/3.0' }
});
return { success: true, html: response.data, service: service.name };
} catch (e) {
// Continue to the next service
}
}
return { success: false };
// Per-service timeout plus an overall timeout for the whole attempt
const SERVICE_TIMEOUT_MS = 3000;
const OVERALL_TIMEOUT_MS = 7000;
return new Promise((resolve) => {
const overallTimer = setTimeout(() => {
resolve({ success: false, reason: 'timeout' });
}, OVERALL_TIMEOUT_MS);
(async () => {
for (const service of bypassServices) {
const bypassUrl = service.transform(url);
if (!bypassUrl) continue;
try {
const response = await axios.get(bypassUrl, {
timeout: SERVICE_TIMEOUT_MS,
headers: { 'User-Agent': 'AI-Pulse/3.0' }
});
clearTimeout(overallTimer);
return resolve({ success: true, html: response.data, service: service.name });
} catch (e) {
// Continue to the next service on error or timeout
}
}
clearTimeout(overallTimer);
return resolve({ success: false });
})();
});
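The `Promise.race()` alternative mentioned in the comment can also be expressed as a small standalone helper that caps the total wait; `withOverallTimeout` is an illustrative name, not a helper from the PR:

```javascript
// Race the whole bypass attempt against one overall deadline. On timeout the
// result mirrors the suggested shape: { success: false, reason: 'timeout' }.
function withOverallTimeout(promise, ms) {
  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(() => resolve({ success: false, reason: 'timeout' }), ms);
  });
  // clearTimeout in finally so a fast win doesn't keep the process alive.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// e.g. withOverallTimeout(tryPaywallBypass(url), 7000) bounds the total delay
// no matter how many services are attempted sequentially inside.
```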

};
Comment on lines +1420 to +1450
Copilot AI Feb 18, 2026

The codebase already implements a proactive paywall bypass strategy for Medium at line 260-262 where Medium URLs are rewritten to use freedium.cloud before the article is fetched. This new reactive approach (checking for paywall text after fetching) is inconsistent with the existing pattern. Consider either: 1) extending the proactive approach at lines 265-270 to rewrite URLs for paywalledHosts upfront (e.g., using archive.ph), or 2) documenting why a reactive approach is preferred for these specific domains.


const writeFallbackLocalArticle = () => {
const safeTitle = sanitizeText(article.title) || 'Untitled';
const safeSummary = smartTruncate(cleanupNoiseText(rawSummary || ''), 1200) || 'Summary unavailable for this article.';
const safeSummary = smartTruncate(cleanupNoiseText(rawSummary || ''), 1200) || (rawSummary ? sanitizeText(rawSummary) : 'Summary unavailable for this article.');
Copilot AI Feb 18, 2026

The fallback logic here uses both smartTruncate result and a ternary with sanitizeText(rawSummary) as fallback. However, the conditional rawSummary check may be redundant since rawSummary is already used in the first part of the OR expression. If smartTruncate returns an empty string when rawSummary is empty, the ternary expression (rawSummary ? sanitizeText(rawSummary) : 'Summary unavailable...') will still check rawSummary again. Consider simplifying to: smartTruncate(cleanupNoiseText(rawSummary || ''), 1200) || 'Summary unavailable for this article.'

Suggested change
const safeSummary = smartTruncate(cleanupNoiseText(rawSummary || ''), 1200) || (rawSummary ? sanitizeText(rawSummary) : 'Summary unavailable for this article.');
const safeSummary = smartTruncate(cleanupNoiseText(rawSummary || ''), 1200) || 'Summary unavailable for this article.';

const fallbackHtml = `<!DOCTYPE html>
<html lang="${lang}">
<head>
@@ -1518,11 +1559,26 @@ async function processArticle(article, sourceName, tags, category, feedLang) {

if (articleContent && articleContent.textContent) {
if (isPaywallText(articleContent.textContent)) {
writeFallbackLocalArticle();
} else {
if (isLikelyBoilerplateExtraction(articleContent.textContent)) {
writeFallbackLocalArticle();
// Try the bypass services before giving up
const bypassResult = await tryPaywallBypass(resolvedArticleUrl);
if (bypassResult.success) {
const bypassDom = createSafeDom(bypassResult.html, resolvedArticleUrl);
const bypassReader = new Readability(bypassDom.window.document);
const bypassContent = bypassReader.parse();
if (bypassContent && bypassContent.textContent && !isPaywallText(bypassContent.textContent) && bypassContent.textContent.length > 200) {
articleContent = bypassContent;
// Continue normal processing with the bypassed content
} else {
writeFallbackLocalArticle();
}
} else {
writeFallbackLocalArticle();
}
Comment on lines +1568 to +1576
Copilot AI Feb 18, 2026

When the paywall bypass succeeds and articleContent is updated (line 1569), the code continues but doesn't return early. This means execution falls through to line 1577 (the else block), which expects articleContent to not have paywall text. However, since we're still in the "if (isPaywallText)" block (line 1561), we never reach the processing code in the else block at lines 1577-1691. The successful bypass scenario should either: 1) set a flag and restructure the logic to process the bypassed content, or 2) refactor to extract the processing logic into a shared code path that can be reached from both the bypass success and normal content paths.
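Option 2 from this comment (a shared processing path) can be sketched as below. Every helper name here is an illustrative stand-in for logic inside `processArticle`, not code from the PR:

```javascript
// Assumed restructure: one entry point that both the bypass-success path and
// the normal-content path flow through, so bypassed content is not stranded
// inside the isPaywallText branch.
function handleExtractedContent(content, helpers) {
  const { isPaywallText, tryBypass, processContent, writeFallback } = helpers;
  if (!content || !content.textContent) return writeFallback();
  if (isPaywallText(content.textContent)) {
    const bypassed = tryBypass(); // returns replacement content or null
    return bypassed ? processContent(bypassed) : writeFallback();
  }
  return processContent(content);
}
```

Both successful branches converge on the same `processContent` call, which is the property the fall-through in the current diff loses.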

} else {
// Normal content, no paywall
if (isLikelyBoilerplateExtraction(articleContent.textContent)) {
writeFallbackLocalArticle();
} else {
if (!computedSummary || computedSummary.trim().length < 20) {
computedSummary = trimPromotionalTailText(cleanupNoiseText(sanitizeText(articleContent.textContent.slice(0, 1400))));
}
@@ -1633,10 +1689,10 @@ async function processArticle(article, sourceName, tags, category, feedLang) {

// Save the HTML file locally
fs.writeFileSync(localPath, cleanHtml);
}
}
Comment on lines +1692 to 1693
Copilot AI Feb 18, 2026

Inconsistent indentation: this line should be indented to match the level of the closing braces above and below it (lines 1693, 1695). The fs.writeFileSync call at line 1691 is at the correct indentation level, so this closing brace should align with the opening structure.

Suggested change
}
}
}
}

}
}
}
} catch (e) {
if (!shouldSuppressExtractionLog(resolvedArticleUrl, e)) {
console.error(` Could not extract content for: ${articleUrl}`);