feat: Add Archive.ph and paywall bypass services for better article e…#122

Merged
EthanThePhoenix38 merged 1 commit into main from claude/paywall-bypass-G1Gdx on Feb 18, 2026

Conversation

@EthanThePhoenix38 EthanThePhoenix38 commented Feb 18, 2026

…xtraction


@EthanThePhoenix38 EthanThePhoenix38 marked this pull request as ready for review February 18, 2026 19:48
Copilot AI review requested due to automatic review settings February 18, 2026 19:48
@EthanThePhoenix38 EthanThePhoenix38 merged commit 4694407 into main Feb 18, 2026
5 checks passed
@EthanThePhoenix38 EthanThePhoenix38 deleted the claude/paywall-bypass-G1Gdx branch February 18, 2026 19:48

Copilot AI left a comment

Pull request overview

This PR adds paywall bypass functionality using third-party services (archive.ph, scribe.rip, web.archive.org) to extract content from paywalled articles. The implementation adds a reactive bypass mechanism that attempts to fetch content from bypass services when paywall text is detected after the initial article fetch fails.

Changes:

  • Added paywalledHosts list in addUTMParams() function (currently non-functional)
  • Implemented tryPaywallBypass() function with fallback logic through multiple services
  • Integrated bypass attempts into the article processing flow when paywall content is detected
  • Modified fallback summary logic to handle cases where smartTruncate returns empty

];

for (const service of bypassServices) {
const bypassUrl = service.transform(url);

Copilot AI Feb 18, 2026

The URL parameter passed to the transform functions is not validated before being used in external service URLs. The URL should be validated using normalizeExternalUrl() (defined at line 319) to ensure it has a valid http/https protocol before constructing bypass service URLs. This prevents potential URL injection attacks.
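
A minimal sketch of the validation step described above. `toSafeHttpUrl` and `buildArchivePhUrl` are hypothetical names used only for illustration; the project's own `normalizeExternalUrl()` is not shown in this review, so its exact behavior is assumed rather than reproduced here.

```javascript
// Hypothetical stand-in for normalizeExternalUrl(): returns a normalized
// http(s) URL string, or null when the input is not an absolute http(s) URL.
function toSafeHttpUrl(raw) {
  let parsed;
  try {
    parsed = new URL(String(raw).trim());
  } catch {
    return null; // not parseable as an absolute URL
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return null; // rejects javascript:, ftp:, file:, etc.
  }
  return parsed.href;
}

// Build the archive.ph lookup URL only from a validated input.
function buildArchivePhUrl(raw) {
  const safe = toSafeHttpUrl(raw);
  return safe ? `https://archive.ph/?url=${encodeURIComponent(safe)}` : null;
}
```

Returning `null` for rejected input fits the existing loop, which already skips a service when its `transform` yields a falsy value.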

});
return { success: true, html: response.data, service: service.name };
} catch (e) {
// Continue to the next service

Copilot AI Feb 18, 2026

The error handling silently continues to the next service without any logging. This makes debugging difficult when paywall bypass attempts fail. Consider adding debug logging (similar to the pattern used in shouldSuppressExtractionLog) to track which services were attempted and why they failed, especially useful for improving the bypass logic over time.

Suggested change
// Continue to the next service
// Continue to the next service, but log in debug mode when enabled
if (process.env.DEBUG_PAYWALL_BYPASS === '1' || process.env.DEBUG_PAYWALL_BYPASS === 'true') {
console.debug(
`[paywall-bypass] Service "${service.name}" failed for URL "${url}" (bypass URL: "${bypassUrl}"):`,
e && e.message ? e.message : e
);
}

const writeFallbackLocalArticle = () => {
const safeTitle = sanitizeText(article.title) || 'Untitled';
const safeSummary = smartTruncate(cleanupNoiseText(rawSummary || ''), 1200) || 'Summary unavailable for this article.';
const safeSummary = smartTruncate(cleanupNoiseText(rawSummary || ''), 1200) || (rawSummary ? sanitizeText(rawSummary) : 'Summary unavailable for this article.');

Copilot AI Feb 18, 2026

The fallback here combines the smartTruncate result with a ternary that falls back to sanitizeText(rawSummary). The rawSummary check inside the ternary may be redundant, since rawSummary is already used in the first part of the OR expression: when rawSummary is empty, smartTruncate returns an empty string and the ternary checks rawSummary a second time before reaching 'Summary unavailable...'. The two forms only differ when rawSummary is non-empty but cleanupNoiseText strips all of it. If that case does not need to surface the raw text, consider simplifying to: smartTruncate(cleanupNoiseText(rawSummary || ''), 1200) || 'Summary unavailable for this article.'

Suggested change
const safeSummary = smartTruncate(cleanupNoiseText(rawSummary || ''), 1200) || (rawSummary ? sanitizeText(rawSummary) : 'Summary unavailable for this article.');
const safeSummary = smartTruncate(cleanupNoiseText(rawSummary || ''), 1200) || 'Summary unavailable for this article.';

Comment on lines +1421 to +1444
const bypassServices = [
{
name: 'archive.ph',
transform: (u) => `https://archive.ph/?url=${encodeURIComponent(u)}`
},
{
name: 'scribe.rip',
transform: (u) => u.includes('medium.com') ? u.replace('medium.com', 'scribe.rip') : null
},
{
name: 'web.archive.org',
transform: (u) => `https://web.archive.org/web/*/${u}`
}
];

for (const service of bypassServices) {
const bypassUrl = service.transform(url);
if (!bypassUrl) continue;
try {
const response = await axios.get(bypassUrl, {
timeout: 5000,
headers: { 'User-Agent': 'AI-Pulse/3.0' }
});
return { success: true, html: response.data, service: service.name };

Copilot AI Feb 18, 2026

The bypass services (archive.ph, scribe.rip, web.archive.org) are external third-party services that may have rate limits, availability issues, or terms of service restrictions on automated access. There's no rate limiting, caching, or error tracking implemented. Consider: 1) implementing rate limiting to avoid being blocked, 2) caching successful bypass results to avoid repeated requests, 3) tracking service availability to skip consistently failing services, and 4) reviewing each service's terms of service for compliance with automated access policies.
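
As a rough illustration of points 2 and 3 above (caching and availability tracking), the sketch below keeps both in memory. `BypassTracker` and all of its option names are hypothetical and not part of the PR; rate limiting (point 1) and persistence across runs are deliberately left out.

```javascript
// Hypothetical in-memory bookkeeping for bypass attempts.
class BypassTracker {
  constructor({ maxFailures = 3, cacheTtlMs = 60 * 60 * 1000, now = Date.now } = {}) {
    this.maxFailures = maxFailures;
    this.cacheTtlMs = cacheTtlMs;
    this.now = now;             // injectable clock, mainly for tests
    this.cache = new Map();     // article URL -> { html, service, storedAt }
    this.failures = new Map();  // service name -> consecutive failure count
  }

  // Return a cached result for the article URL, or null if absent or expired.
  getCached(url) {
    const entry = this.cache.get(url);
    if (!entry) return null;
    if (this.now() - entry.storedAt > this.cacheTtlMs) {
      this.cache.delete(url);
      return null;
    }
    return entry;
  }

  // A success resets the failure streak and stores the fetched HTML.
  recordSuccess(url, service, html) {
    this.failures.set(service, 0);
    this.cache.set(url, { html, service, storedAt: this.now() });
  }

  recordFailure(service) {
    this.failures.set(service, (this.failures.get(service) || 0) + 1);
  }

  // Skip a service once it has failed maxFailures times in a row.
  shouldSkip(service) {
    return (this.failures.get(service) || 0) >= this.maxFailures;
  }
}
```

In the loop under review, one would presumably check `getCached(url)` before any request, call `shouldSkip(service.name)` before each attempt, and record the outcome in the `try`/`catch` branches.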

Comment on lines +1420 to +1450
const tryPaywallBypass = async (url) => {
const bypassServices = [
{
name: 'archive.ph',
transform: (u) => `https://archive.ph/?url=${encodeURIComponent(u)}`
},
{
name: 'scribe.rip',
transform: (u) => u.includes('medium.com') ? u.replace('medium.com', 'scribe.rip') : null
},
{
name: 'web.archive.org',
transform: (u) => `https://web.archive.org/web/*/${u}`
}
];

for (const service of bypassServices) {
const bypassUrl = service.transform(url);
if (!bypassUrl) continue;
try {
const response = await axios.get(bypassUrl, {
timeout: 5000,
headers: { 'User-Agent': 'AI-Pulse/3.0' }
});
return { success: true, html: response.data, service: service.name };
} catch (e) {
// Continue to the next service
}
}
return { success: false };
};

Copilot AI Feb 18, 2026

The codebase already implements a proactive paywall bypass strategy for Medium at lines 260-262, where Medium URLs are rewritten to use freedium.cloud before the article is fetched. This new reactive approach (checking for paywall text after fetching) is inconsistent with the existing pattern. Consider either: 1) extending the proactive approach at lines 265-270 to rewrite URLs for paywalledHosts upfront (e.g., using archive.ph), or 2) documenting why a reactive approach is preferred for these specific domains.

Comment on lines +1429 to +1432
},
{
name: 'web.archive.org',
transform: (u) => `https://web.archive.org/web/*/${u}`

Copilot AI Feb 18, 2026

The web.archive.org URL pattern uses a wildcard timestamp selector (`/web/*/`), which resolves to a listing of captures rather than a specific archived version. This is unlikely to work for content extraction. Consider either resolving a specific snapshot first (for example through a timestamp or availability API) or removing this service, since it won't reliably return article content that Readability can parse.

Suggested change
},
{
name: 'web.archive.org',
transform: (u) => `https://web.archive.org/web/*/${u}`
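
One way to resolve a specific snapshot, sketched under assumptions: the Wayback Machine exposes an availability endpoint at `https://archive.org/wayback/available` that returns JSON with an `archived_snapshots.closest` entry. The helper names below are hypothetical, and the response shape should be checked against the Internet Archive's documentation before relying on it.

```javascript
// Build a Wayback Machine availability lookup for an article URL.
function buildWaybackAvailabilityUrl(articleUrl) {
  return `https://archive.org/wayback/available?url=${encodeURIComponent(articleUrl)}`;
}

// Extract the closest snapshot URL from the parsed JSON body, or null
// when no usable snapshot is reported. Assumed shape:
// { archived_snapshots: { closest: { available: true, url: '...' } } }
function pickSnapshotUrl(body) {
  const closest = body && body.archived_snapshots && body.archived_snapshots.closest;
  if (!closest || !closest.available || typeof closest.url !== 'string') {
    return null;
  }
  return closest.url;
}
```

Unlike the other services this requires two requests (lookup, then snapshot fetch), so it would not fit the single `transform` shape without a small change to the loop.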

Comment on lines +1692 to 1693
}
}

Copilot AI Feb 18, 2026

Inconsistent indentation: this line should be indented to match the level of the closing braces above and below it (lines 1693, 1695). The fs.writeFileSync call at line 1691 is at the correct indentation level, so this closing brace should align with the opening structure.

Suggested change
}
}
}
}

Comment on lines +1436 to +1449
for (const service of bypassServices) {
const bypassUrl = service.transform(url);
if (!bypassUrl) continue;
try {
const response = await axios.get(bypassUrl, {
timeout: 5000,
headers: { 'User-Agent': 'AI-Pulse/3.0' }
});
return { success: true, html: response.data, service: service.name };
} catch (e) {
// Continue to the next service
}
}
return { success: false };

Copilot AI Feb 18, 2026

The 5-second timeout for each bypass service could lead to significant delays when processing articles. With 3 services configured, a paywalled article could take up to 15 seconds before falling back. Consider: 1) reducing the timeout to 3000ms, 2) implementing parallel requests with Promise.race(), or 3) adding a configurable overall timeout for the entire bypass attempt to prevent feed processing delays.

Suggested change
for (const service of bypassServices) {
const bypassUrl = service.transform(url);
if (!bypassUrl) continue;
try {
const response = await axios.get(bypassUrl, {
timeout: 5000,
headers: { 'User-Agent': 'AI-Pulse/3.0' }
});
return { success: true, html: response.data, service: service.name };
} catch (e) {
// Continue to the next service
}
}
return { success: false };
// Per-service timeout plus an overall timeout for the whole attempt
const SERVICE_TIMEOUT_MS = 3000;
const OVERALL_TIMEOUT_MS = 7000;
return new Promise((resolve) => {
const overallTimer = setTimeout(() => {
resolve({ success: false, reason: 'timeout' });
}, OVERALL_TIMEOUT_MS);
(async () => {
for (const service of bypassServices) {
const bypassUrl = service.transform(url);
if (!bypassUrl) continue;
try {
const response = await axios.get(bypassUrl, {
timeout: SERVICE_TIMEOUT_MS,
headers: { 'User-Agent': 'AI-Pulse/3.0' }
});
clearTimeout(overallTimer);
return resolve({ success: true, html: response.data, service: service.name });
} catch (e) {
// Continue to the next service on error or timeout
}
}
clearTimeout(overallTimer);
return resolve({ success: false });
})();
});
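
The suggestion above keeps the services sequential. For the parallel option (point 2), a possible sketch follows. It uses `Promise.any` rather than `Promise.race` for the service attempts, because `race` would settle on the first failure; `Promise.any` needs Node 15 or later. `firstSuccessfulBypass` and the `{ name, run }` fetcher shape are hypothetical.

```javascript
// `fetchers` is a hypothetical array of { name, run }, where run() returns
// a promise for the page HTML (for example a thin wrapper around axios.get).
async function firstSuccessfulBypass(fetchers, overallTimeoutMs = 7000) {
  if (fetchers.length === 0) return { success: false };

  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(
      () => resolve({ success: false, reason: 'timeout' }),
      overallTimeoutMs
    );
  });

  // Start every attempt at once; a synchronous throw in run() becomes a rejection.
  const attempts = fetchers.map(({ name, run }) =>
    Promise.resolve()
      .then(() => run())
      .then((html) => ({ success: true, html, service: name }))
  );

  try {
    // Promise.any fulfills with the first successful attempt and rejects
    // only when every attempt has failed.
    return await Promise.race([Promise.any(attempts), timeout]);
  } catch {
    return { success: false };
  } finally {
    clearTimeout(timer);
  }
}
```

This trades latency for load: every service is hit for every paywalled article, and requests still in flight are not cancelled when a winner is found, which matters given the rate-limit concerns raised elsewhere in this review.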

Comment on lines +263 to +270

// List of domains with strict paywalls
const paywalledHosts = ['ft.com', 'wsj.com', 'economist.com', 'bloomberg.com', 'investing.com'];

// Add Archive.ph as a query parameter for fallback
if (paywalledHosts.some(host => hostname.includes(host))) {
// The URL is not changed here; it will be used as a fallback
}

Copilot AI Feb 18, 2026

This code block checks for paywalled hosts but doesn't perform any action. The comment indicates the URL will be used as a fallback, but no fallback mechanism is implemented here or referenced elsewhere in the addUTMParams function. This check should either implement the fallback logic or be removed as dead code.

Suggested change
// List of domains with strict paywalls
const paywalledHosts = ['ft.com', 'wsj.com', 'economist.com', 'bloomberg.com', 'investing.com'];
// Add Archive.ph as a query parameter for fallback
if (paywalledHosts.some(host => hostname.includes(host))) {
// The URL is not changed here; it will be used as a fallback
}

const paywalledHosts = ['ft.com', 'wsj.com', 'economist.com', 'bloomberg.com', 'investing.com'];

// Add Archive.ph as a query parameter for fallback
if (paywalledHosts.some(host => hostname.includes(host))) {

Copilot AI Feb 18, 2026

The hostname matching uses includes() which is vulnerable to subdomain bypass attacks. For example, a malicious domain like "evil-ft.com.attacker.com" would match "ft.com". Use the existing hostnameMatches() helper function (defined at line 584) which implements secure suffix-based matching, or use the pattern: host === domain || host.endsWith('.'+domain).

Suggested change
if (paywalledHosts.some(host => hostname.includes(host))) {
if (paywalledHosts.some(domain => hostname === domain || hostname.endsWith('.' + domain))) {
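
The suffix-matching pattern can be checked in isolation with a small sketch. `hostMatchesDomain` and `isPaywalledHost` are hypothetical names; the project's own `hostnameMatches()` helper is not shown in this review and may differ in details such as case handling.

```javascript
// True when hostname is exactly `domain` or a subdomain of it.
// Lowercasing and stripping a trailing dot are extra normalization steps
// assumed here; they are not part of the suggested one-liner.
function hostMatchesDomain(hostname, domain) {
  const host = String(hostname).toLowerCase().replace(/\.$/, '');
  const dom = String(domain).toLowerCase();
  return host === dom || host.endsWith('.' + dom);
}

const paywalledHosts = ['ft.com', 'wsj.com', 'economist.com', 'bloomberg.com', 'investing.com'];

function isPaywalledHost(hostname) {
  return paywalledHosts.some((domain) => hostMatchesDomain(hostname, domain));
}

console.log(isPaywalledHost('www.ft.com'));                // true
console.log(isPaywalledHost('evil-ft.com.attacker.com'));  // false
console.log(isPaywalledHost('notft.com'));                 // false
```

The leading dot in `'.' + dom` is what rules out lookalikes such as `notft.com`, which a bare `endsWith('ft.com')` would accept.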

3 participants