feat: Add Archive.ph and paywall bypass services for better article extraction #122
EthanThePhoenix38 merged 1 commit into main
Conversation
Pull request overview
This PR adds paywall bypass functionality using third-party services (archive.ph, scribe.rip, web.archive.org) to extract content from paywalled articles. The implementation adds a reactive bypass mechanism that attempts to fetch content from bypass services when paywall text is detected after the initial article fetch fails.
Changes:
- Added `paywalledHosts` list in the `addUTMParams()` function (currently non-functional)
- Implemented `tryPaywallBypass()` function with fallback logic through multiple services
- Integrated bypass attempts into the article processing flow when paywall content is detected
- Modified fallback summary logic to handle cases where `smartTruncate` returns empty
```js
  ];

  for (const service of bypassServices) {
    const bypassUrl = service.transform(url);
```
The URL parameter passed to the transform functions is not validated before being used in external service URLs. The URL should be validated using normalizeExternalUrl() (defined at line 319) to ensure it has a valid http/https protocol before constructing bypass service URLs. This prevents potential URL injection attacks.
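A minimal sketch of the validation gate the comment describes. The repo's actual `normalizeExternalUrl()` helper (line 319) is not shown in this review, so the version below is an assumed stand-in that only accepts parseable absolute http/https URLs; the point is that every transform runs on a pre-validated URL.

```javascript
// Sketch only: an assumed stand-in for the repo's normalizeExternalUrl() helper.
// It rejects anything that is not a parseable absolute http(s) URL.
const normalizeExternalUrl = (raw) => {
  try {
    const parsed = new URL(String(raw));
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') return null;
    return parsed.href;
  } catch (e) {
    return null; // not a parseable absolute URL
  }
};

// Validate once, before any transform runs, so every bypass URL is built
// from a known-good http(s) URL rather than raw attacker-controlled input.
const buildBypassUrls = (url, services) => {
  const safeUrl = normalizeExternalUrl(url);
  if (!safeUrl) return [];
  return services
    .map((s) => ({ name: s.name, bypassUrl: s.transform(safeUrl) }))
    .filter((s) => s.bypassUrl);
};
```

With this shape, a `javascript:` or otherwise malformed URL short-circuits to an empty list and no bypass request is ever constructed.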
```js
      });
      return { success: true, html: response.data, service: service.name };
    } catch (e) {
      // Continue to the next service
```
The error handling silently continues to the next service without any logging. This makes debugging difficult when paywall bypass attempts fail. Consider adding debug logging (similar to the pattern used in shouldSuppressExtractionLog) to track which services were attempted and why they failed, especially useful for improving the bypass logic over time.
Suggested change:

```diff
-      // Continue to the next service
+      // Continue to the next service, but log in debug mode if enabled
+      if (process.env.DEBUG_PAYWALL_BYPASS === '1' || process.env.DEBUG_PAYWALL_BYPASS === 'true') {
+        console.debug(
+          `[paywall-bypass] Service "${service.name}" failed for URL "${url}" (bypass URL: "${bypassUrl}"):`,
+          e && e.message ? e.message : e
+        );
+      }
```
```diff
   const writeFallbackLocalArticle = () => {
     const safeTitle = sanitizeText(article.title) || 'Untitled';
-    const safeSummary = smartTruncate(cleanupNoiseText(rawSummary || ''), 1200) || 'Summary unavailable for this article.';
+    const safeSummary = smartTruncate(cleanupNoiseText(rawSummary || ''), 1200) || (rawSummary ? sanitizeText(rawSummary) : 'Summary unavailable for this article.');
```
The added ternary is redundant: rawSummary already feeds the first operand of the OR expression, so when smartTruncate returns an empty string for an empty rawSummary, the fallback (rawSummary ? sanitizeText(rawSummary) : 'Summary unavailable...') just re-checks the same value. Consider simplifying to: smartTruncate(cleanupNoiseText(rawSummary || ''), 1200) || 'Summary unavailable for this article.'
Suggested change:

```diff
-    const safeSummary = smartTruncate(cleanupNoiseText(rawSummary || ''), 1200) || (rawSummary ? sanitizeText(rawSummary) : 'Summary unavailable for this article.');
+    const safeSummary = smartTruncate(cleanupNoiseText(rawSummary || ''), 1200) || 'Summary unavailable for this article.';
```
```js
  const bypassServices = [
    {
      name: 'archive.ph',
      transform: (u) => `https://archive.ph/?url=${encodeURIComponent(u)}`
    },
    {
      name: 'scribe.rip',
      transform: (u) => u.includes('medium.com') ? u.replace('medium.com', 'scribe.rip') : null
    },
    {
      name: 'web.archive.org',
      transform: (u) => `https://web.archive.org/web/*/${u}`
    }
  ];

  for (const service of bypassServices) {
    const bypassUrl = service.transform(url);
    if (!bypassUrl) continue;
    try {
      const response = await axios.get(bypassUrl, {
        timeout: 5000,
        headers: { 'User-Agent': 'AI-Pulse/3.0' }
      });
      return { success: true, html: response.data, service: service.name };
```
The bypass services (archive.ph, scribe.rip, web.archive.org) are external third-party services that may have rate limits, availability issues, or terms of service restrictions on automated access. There's no rate limiting, caching, or error tracking implemented. Consider: 1) implementing rate limiting to avoid being blocked, 2) caching successful bypass results to avoid repeated requests, 3) tracking service availability to skip consistently failing services, and 4) reviewing each service's terms of service for compliance with automated access policies.
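The caching and failure-tracking ideas above can be sketched as a small wrapper. This is illustrative only, not the PR's code: the `fetcher` function is injected (it would be an `axios.get` call in practice) so the policy itself needs no network access, and the TTL and failure threshold are made-up defaults.

```javascript
// Sketch: cache successful bypass results and skip services that keep failing.
// `fetcher` is injected so the policy can be exercised without real requests.
const createBypassClient = ({ fetcher, maxFailures = 3, cacheTtlMs = 60 * 60 * 1000 }) => {
  const cache = new Map();     // url -> { html, service, expiresAt }
  const failures = new Map();  // service name -> consecutive failure count

  return async (url, services) => {
    const hit = cache.get(url);
    if (hit && hit.expiresAt > Date.now()) {
      return { success: true, html: hit.html, service: hit.service, cached: true };
    }
    for (const service of services) {
      if ((failures.get(service.name) || 0) >= maxFailures) continue; // skip dead services
      const bypassUrl = service.transform(url);
      if (!bypassUrl) continue;
      try {
        const html = await fetcher(bypassUrl);
        failures.set(service.name, 0); // reset on success
        cache.set(url, { html, service: service.name, expiresAt: Date.now() + cacheTtlMs });
        return { success: true, html, service: service.name };
      } catch (e) {
        failures.set(service.name, (failures.get(service.name) || 0) + 1);
      }
    }
    return { success: false };
  };
};
```

Rate limiting (e.g. a minimum delay between requests to the same service) would slot into the same loop; the terms-of-service question remains a manual review item.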
```js
const tryPaywallBypass = async (url) => {
  const bypassServices = [
    {
      name: 'archive.ph',
      transform: (u) => `https://archive.ph/?url=${encodeURIComponent(u)}`
    },
    {
      name: 'scribe.rip',
      transform: (u) => u.includes('medium.com') ? u.replace('medium.com', 'scribe.rip') : null
    },
    {
      name: 'web.archive.org',
      transform: (u) => `https://web.archive.org/web/*/${u}`
    }
  ];

  for (const service of bypassServices) {
    const bypassUrl = service.transform(url);
    if (!bypassUrl) continue;
    try {
      const response = await axios.get(bypassUrl, {
        timeout: 5000,
        headers: { 'User-Agent': 'AI-Pulse/3.0' }
      });
      return { success: true, html: response.data, service: service.name };
    } catch (e) {
      // Continue to the next service
    }
  }
  return { success: false };
};
```
The codebase already implements a proactive paywall bypass strategy for Medium at line 260-262 where Medium URLs are rewritten to use freedium.cloud before the article is fetched. This new reactive approach (checking for paywall text after fetching) is inconsistent with the existing pattern. Consider either: 1) extending the proactive approach at lines 265-270 to rewrite URLs for paywalledHosts upfront (e.g., using archive.ph), or 2) documenting why a reactive approach is preferred for these specific domains.
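A hypothetical sketch of option 1 (the proactive approach), mirroring the existing medium.com → freedium.cloud rewrite the comment refers to. The `archive.ph/newest/<url>` redirect-to-latest-snapshot pattern is an assumption about archive.today's URL scheme; the host list is taken from the PR's own `paywalledHosts`.

```javascript
// Hypothetical: rewrite known-paywalled URLs upfront, before the fetch,
// instead of reacting to paywall text after the fetch fails.
const PAYWALLED_HOSTS = ['ft.com', 'wsj.com', 'economist.com', 'bloomberg.com', 'investing.com'];

const rewritePaywalledUrl = (rawUrl) => {
  const { hostname } = new URL(rawUrl);
  const isPaywalled = PAYWALLED_HOSTS.some(
    (domain) => hostname === domain || hostname.endsWith('.' + domain)
  );
  // archive.ph's "newest" path is assumed to redirect to the latest snapshot.
  return isPaywalled ? `https://archive.ph/newest/${rawUrl}` : rawUrl;
};
```

Either way, picking one of the two strategies (proactive rewrite or reactive bypass) keeps the fetch pipeline predictable.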
```js
    },
    {
      name: 'web.archive.org',
      transform: (u) => `https://web.archive.org/web/*/${u}`
```
The web.archive.org URL pattern uses a wildcard timestamp selector which will attempt to fetch a search results page rather than a specific archived version. This is unlikely to work for content extraction. Consider either using a specific timestamp API or removing this service as it won't reliably return article content that can be parsed by Readability.
Suggested change (remove the web.archive.org entry):

```diff
-    },
-    {
-      name: 'web.archive.org',
-      transform: (u) => `https://web.archive.org/web/*/${u}`
```
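If the service is kept, the "specific timestamp API" alternative can use the Wayback Machine's availability endpoint, which returns the closest archived snapshot as JSON and yields a concrete archived page instead of the wildcard search results. The HTTP client is injected below so the parsing logic stands alone; wiring it into `bypassServices` would require an async transform.

```javascript
// Resolve a concrete Wayback snapshot URL via the availability API.
// `httpGet` is an injected JSON fetcher, e.g. (u) => axios.get(u).then(r => r.data).
const resolveWaybackSnapshot = async (url, httpGet) => {
  const api = `https://archive.org/wayback/available?url=${encodeURIComponent(url)}`;
  const body = await httpGet(api);
  const closest = body && body.archived_snapshots && body.archived_snapshots.closest;
  // Return the snapshot URL only when the API reports it as available.
  return closest && closest.available ? closest.url : null;
};
```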
```js
  }
}
```
Inconsistent indentation: this line should be indented to match the level of the closing braces above and below it (lines 1693, 1695). The fs.writeFileSync call at line 1691 is at the correct indentation level, so this closing brace should align with the opening structure.
Suggested change: indent the closing brace to match its opening structure.
```js
  for (const service of bypassServices) {
    const bypassUrl = service.transform(url);
    if (!bypassUrl) continue;
    try {
      const response = await axios.get(bypassUrl, {
        timeout: 5000,
        headers: { 'User-Agent': 'AI-Pulse/3.0' }
      });
      return { success: true, html: response.data, service: service.name };
    } catch (e) {
      // Continue to the next service
    }
  }
  return { success: false };
```
The 5-second timeout for each bypass service could lead to significant delays when processing articles. With 3 services configured, a paywalled article could take up to 15 seconds before falling back. Consider: 1) reducing the timeout to 3000ms, 2) implementing parallel requests with Promise.race(), or 3) adding a configurable overall timeout for the entire bypass attempt to prevent feed processing delays.
Suggested change:

```js
  // Per-service timeout and a global timeout for the whole bypass attempt
  const SERVICE_TIMEOUT_MS = 3000;
  const OVERALL_TIMEOUT_MS = 7000;
  return new Promise((resolve) => {
    const overallTimer = setTimeout(() => {
      resolve({ success: false, reason: 'timeout' });
    }, OVERALL_TIMEOUT_MS);
    (async () => {
      for (const service of bypassServices) {
        const bypassUrl = service.transform(url);
        if (!bypassUrl) continue;
        try {
          const response = await axios.get(bypassUrl, {
            timeout: SERVICE_TIMEOUT_MS,
            headers: { 'User-Agent': 'AI-Pulse/3.0' }
          });
          clearTimeout(overallTimer);
          return resolve({ success: true, html: response.data, service: service.name });
        } catch (e) {
          // Continue to the next service on error or timeout
        }
      }
      clearTimeout(overallTimer);
      return resolve({ success: false });
    })();
  });
```
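The review also mentions parallel requests as an option; `Promise.any` (first fulfilled attempt wins) fits that case more directly than `Promise.race`, since failed services should not decide the outcome. A sketch, with the fetcher injected so it can be exercised without network access; total latency is then bounded by the slowest single service rather than the sum of the timeouts.

```javascript
// Sketch: fire all bypass requests at once and take the first success.
const tryPaywallBypassParallel = async (url, services, fetcher) => {
  const attempts = services
    .map((s) => ({ service: s, bypassUrl: s.transform(url) }))
    .filter((a) => a.bypassUrl)
    .map((a) =>
      fetcher(a.bypassUrl).then((html) => ({ success: true, html, service: a.service.name }))
    );
  if (attempts.length === 0) return { success: false };
  try {
    return await Promise.any(attempts); // resolves with the first fulfilled attempt
  } catch (e) {
    return { success: false };          // AggregateError: every service failed
  }
};
```

The trade-off is extra load on the bypass services, since every request is issued even when the first one would have succeeded.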
```js
  // List of domains with strict paywalls
  const paywalledHosts = ['ft.com', 'wsj.com', 'economist.com', 'bloomberg.com', 'investing.com'];

  // Add Archive.ph as a query parameter for fallback
  if (paywalledHosts.some(host => hostname.includes(host))) {
    // We don't change the URL here; it will be used as a fallback
  }
```
This code block checks for paywalled hosts but doesn't perform any action. The comment indicates the URL will be used as a fallback, but no fallback mechanism is implemented here or referenced elsewhere in the addUTMParams function. This check should either implement the fallback logic or be removed as dead code.
Suggested change (remove the dead code):

```diff
-  // List of domains with strict paywalls
-  const paywalledHosts = ['ft.com', 'wsj.com', 'economist.com', 'bloomberg.com', 'investing.com'];
-  // Add Archive.ph as a query parameter for fallback
-  if (paywalledHosts.some(host => hostname.includes(host))) {
-    // We don't change the URL here; it will be used as a fallback
-  }
```
```js
  const paywalledHosts = ['ft.com', 'wsj.com', 'economist.com', 'bloomberg.com', 'investing.com'];

  // Add Archive.ph as a query parameter for fallback
  if (paywalledHosts.some(host => hostname.includes(host))) {
```
The hostname matching uses includes() which is vulnerable to subdomain bypass attacks. For example, a malicious domain like "evil-ft.com.attacker.com" would match "ft.com". Use the existing hostnameMatches() helper function (defined at line 584) which implements secure suffix-based matching, or use the pattern: host === domain || host.endsWith('.'+domain).
Suggested change:

```diff
-  if (paywalledHosts.some(host => hostname.includes(host))) {
+  if (paywalledHosts.some(domain => hostname === domain || hostname.endsWith('.' + domain))) {
```
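The bypass the comment describes is easy to demonstrate side by side: `includes()` matches the domain anywhere in the string, while suffix matching only accepts the exact domain or a true subdomain of it.

```javascript
// includes() matches the substring anywhere in the hostname.
const matchesByIncludes = (hostname, domain) => hostname.includes(domain);

// Suffix matching accepts only the apex domain or a real subdomain of it.
const matchesBySuffix = (hostname, domain) =>
  hostname === domain || hostname.endsWith('.' + domain);

// A hostile hostname like "evil-ft.com.attacker.com" embeds "ft.com",
// so it fools includes() but is rejected by suffix matching.
```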