feat: Add Archive.ph and paywall bypass services for better article e… #122
```diff
@@ -260,6 +260,14 @@ function addUTMParams(url, category = 'general') {
   if (mediumHosts.includes(hostname)) {
     url = `https://freedium.cloud/${url}`;
   }
+
+  // List of domains with strict paywalls
+  const paywalledHosts = ['ft.com', 'wsj.com', 'economist.com', 'bloomberg.com', 'investing.com'];
+
+  // Add Archive.ph as a query parameter for fallback
+  if (paywalledHosts.some(host => hostname.includes(host))) {
+    // We don't change the URL here; we'll use it as a fallback
+  }
```
Comment on lines +263 to +270:

```javascript
// List of domains with strict paywalls
const paywalledHosts = ['ft.com', 'wsj.com', 'economist.com', 'bloomberg.com', 'investing.com'];
// Add Archive.ph as a query parameter for fallback
if (paywalledHosts.some(host => hostname.includes(host))) {
  // We don't change the URL here; we'll use it as a fallback
}
```
Copilot AI (Feb 18, 2026):
The scribe.rip transform uses a simple string replacement with includes() and replace(). This approach is unsafe as it doesn't properly validate hostnames and could match partial strings. For example, "notmedium.com" would match and be incorrectly transformed. Use the hostnameMatches() helper function to ensure proper domain matching.
Suggested change:

```diff
-transform: (u) => u.includes('medium.com') ? u.replace('medium.com', 'scribe.rip') : null
+transform: (u) => {
+  try {
+    const urlObj = new URL(u);
+    if (!hostnameMatches || !hostnameMatches(urlObj.hostname, 'medium.com')) {
+      return null;
+    }
+    urlObj.hostname = 'scribe.rip';
+    return urlObj.toString();
+  } catch {
+    return null;
+  }
+}
```
Copilot AI (Feb 18, 2026):
The web.archive.org URL pattern uses a wildcard timestamp selector which will attempt to fetch a search results page rather than a specific archived version. This is unlikely to work for content extraction. Consider either using a specific timestamp API or removing this service as it won't reliably return article content that can be parsed by Readability.
```javascript
},
{
  name: 'web.archive.org',
  transform: (u) => `https://web.archive.org/web/*/${u}`
```
Copilot AI (Feb 18, 2026):
The URL parameter passed to the transform functions is not validated before being used in external service URLs. The URL should be validated using normalizeExternalUrl() (defined at line 319) to ensure it has a valid http/https protocol before constructing bypass service URLs. This prevents potential URL injection attacks.
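The `normalizeExternalUrl()` helper the comment refers to is defined elsewhere in the codebase and is not visible in this diff. As a rough sketch, such a protocol validator might look like the following; the behavior shown here is an assumption, not the actual implementation:

```javascript
// Hypothetical sketch of the validation the reviewer asks for; the real
// normalizeExternalUrl() at line 319 may differ in name and behavior.
function normalizeExternalUrl(raw) {
  try {
    const u = new URL(raw);
    // Only allow http/https; reject javascript:, data:, file:, etc.
    if (u.protocol !== 'http:' && u.protocol !== 'https:') return null;
    return u.toString();
  } catch {
    return null; // not a parseable absolute URL
  }
}

console.log(normalizeExternalUrl('https://ft.com/article')); // 'https://ft.com/article'
console.log(normalizeExternalUrl('javascript:alert(1)'));    // null
```

Calling this before each `service.transform(url)` would ensure only http/https URLs ever reach the bypass services.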
Copilot AI (Feb 18, 2026):
The bypass services (archive.ph, scribe.rip, web.archive.org) are external third-party services that may have rate limits, availability issues, or terms of service restrictions on automated access. There's no rate limiting, caching, or error tracking implemented. Consider: 1) implementing rate limiting to avoid being blocked, 2) caching successful bypass results to avoid repeated requests, 3) tracking service availability to skip consistently failing services, and 4) reviewing each service's terms of service for compliance with automated access policies.
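As a sketch of point 3 (tracking service availability to skip consistently failing services), a simple consecutive-failure counter could look like this. All names here are illustrative, not from the codebase:

```javascript
// Hypothetical per-service failure tracking; markFailure, markSuccess and
// shouldSkip are illustrative names, not existing helpers in the codebase.
const serviceFailures = new Map(); // service name -> consecutive failure count
const MAX_CONSECUTIVE_FAILURES = 3;

function markFailure(name) {
  serviceFailures.set(name, (serviceFailures.get(name) || 0) + 1);
}

function markSuccess(name) {
  serviceFailures.delete(name); // any success resets the counter
}

function shouldSkip(name) {
  // Skip a service once it has failed several times in a row.
  return (serviceFailures.get(name) || 0) >= MAX_CONSECUTIVE_FAILURES;
}

markFailure('archive.ph');
markFailure('archive.ph');
markFailure('archive.ph');
console.log(shouldSkip('archive.ph')); // true
markSuccess('archive.ph');
console.log(shouldSkip('archive.ph')); // false
```

The bypass loop would call `shouldSkip(service.name)` before attempting each service and `markFailure`/`markSuccess` after each attempt; a real implementation would likely also add a cooldown period rather than skipping a service forever.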
Copilot AI (Feb 18, 2026):
The error handling silently continues to the next service without any logging. This makes debugging difficult when paywall bypass attempts fail. Consider adding debug logging (similar to the pattern used in shouldSuppressExtractionLog) to track which services were attempted and why they failed, especially useful for improving the bypass logic over time.
Suggested change:

```diff
-// Continue to the next service
+// Continue to the next service, but log in debug mode if enabled
+if (process.env.DEBUG_PAYWALL_BYPASS === '1' || process.env.DEBUG_PAYWALL_BYPASS === 'true') {
+  console.debug(
+    `[paywall-bypass] Service "${service.name}" failed for URL "${url}" (bypass URL: "${bypassUrl}"):`,
+    e && e.message ? e.message : e
+  );
+}
```
Copilot AI (Feb 18, 2026):
The 5-second timeout for each bypass service could lead to significant delays when processing articles. With 3 services configured, a paywalled article could take up to 15 seconds before falling back. Consider: 1) reducing the timeout to 3000ms, 2) implementing parallel requests with Promise.race(), or 3) adding a configurable overall timeout for the entire bypass attempt to prevent feed processing delays.
Suggested change:

```diff
-for (const service of bypassServices) {
-  const bypassUrl = service.transform(url);
-  if (!bypassUrl) continue;
-  try {
-    const response = await axios.get(bypassUrl, {
-      timeout: 5000,
-      headers: { 'User-Agent': 'AI-Pulse/3.0' }
-    });
-    return { success: true, html: response.data, service: service.name };
-  } catch (e) {
-    // Continue to the next service
-  }
-}
-return { success: false };
+// Per-service timeout plus an overall timeout for the whole bypass attempt
+const SERVICE_TIMEOUT_MS = 3000;
+const OVERALL_TIMEOUT_MS = 7000;
+return new Promise((resolve) => {
+  const overallTimer = setTimeout(() => {
+    resolve({ success: false, reason: 'timeout' });
+  }, OVERALL_TIMEOUT_MS);
+  (async () => {
+    for (const service of bypassServices) {
+      const bypassUrl = service.transform(url);
+      if (!bypassUrl) continue;
+      try {
+        const response = await axios.get(bypassUrl, {
+          timeout: SERVICE_TIMEOUT_MS,
+          headers: { 'User-Agent': 'AI-Pulse/3.0' }
+        });
+        clearTimeout(overallTimer);
+        return resolve({ success: true, html: response.data, service: service.name });
+      } catch (e) {
+        // Continue to the next service on error or timeout
+      }
+    }
+    clearTimeout(overallTimer);
+    return resolve({ success: false });
+  })();
+});
```
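The review also mentions parallel requests with `Promise.race()` as an alternative to the sequential loop. A hedged sketch of that approach, with generic fetcher functions standing in for the axios calls in the diff (names and shapes here are illustrative):

```javascript
// Sketch only: fire all bypass attempts at once and take the first success,
// bounded by an overall deadline. `fetchers` stands in for the per-service
// axios.get calls; each returns a promise for a result object.
async function firstSuccess(fetchers, overallTimeoutMs) {
  const deadline = new Promise((resolve) =>
    setTimeout(() => resolve({ success: false, reason: 'timeout' }), overallTimeoutMs)
  );
  const attempts = fetchers.map((fn) =>
    // Convert each failure into a promise that never settles, so only
    // successes (or the deadline) can win the race.
    fn().catch(() => new Promise(() => {}))
  );
  return Promise.race([...attempts, deadline]);
}

// Usage with stand-in fetchers:
firstSuccess(
  [
    () => Promise.reject(new Error('archive.ph down')),
    () => Promise.resolve({ success: true, service: 'scribe.rip' }),
  ],
  1000
).then((r) => console.log(r.service)); // 'scribe.rip'
```

The trade-off versus the sequential version is extra load: every service gets hit for every paywalled article, which conflicts with the rate-limiting concern raised in an earlier comment.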
Copilot AI (Feb 18, 2026):
The codebase already implements a proactive paywall bypass strategy for Medium at line 260-262 where Medium URLs are rewritten to use freedium.cloud before the article is fetched. This new reactive approach (checking for paywall text after fetching) is inconsistent with the existing pattern. Consider either: 1) extending the proactive approach at lines 265-270 to rewrite URLs for paywalledHosts upfront (e.g., using archive.ph), or 2) documenting why a reactive approach is preferred for these specific domains.
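A sketch of option 1, extending the proactive freedium.cloud pattern to the strict-paywall hosts. The `archive.ph/newest/` URL shape is an assumption for illustration, not something confirmed by the diff:

```javascript
// Hypothetical proactive rewrite mirroring the existing freedium.cloud
// pattern for Medium; the archive.ph/newest/ URL form is an assumption.
const paywalledHosts = ['ft.com', 'wsj.com', 'economist.com', 'bloomberg.com', 'investing.com'];

function rewritePaywalledUrl(url) {
  const hostname = new URL(url).hostname;
  const isPaywalled = paywalledHosts.some(
    (host) => hostname === host || hostname.endsWith('.' + host)
  );
  return isPaywalled ? `https://archive.ph/newest/${url}` : url;
}

console.log(rewritePaywalledUrl('https://www.ft.com/content/abc'));
// 'https://archive.ph/newest/https://www.ft.com/content/abc'
console.log(rewritePaywalledUrl('https://example.com/a'));
// 'https://example.com/a' (unchanged)
```

Note this uses the suffix-based hostname check rather than `includes()`, per the security comments elsewhere in this review.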
Copilot AI (Feb 18, 2026):
The fallback logic here uses both smartTruncate result and a ternary with sanitizeText(rawSummary) as fallback. However, the conditional rawSummary check may be redundant since rawSummary is already used in the first part of the OR expression. If smartTruncate returns an empty string when rawSummary is empty, the ternary expression (rawSummary ? sanitizeText(rawSummary) : 'Summary unavailable...') will still check rawSummary again. Consider simplifying to: smartTruncate(cleanupNoiseText(rawSummary || ''), 1200) || 'Summary unavailable for this article.'
Suggested change:

```diff
-const safeSummary = smartTruncate(cleanupNoiseText(rawSummary || ''), 1200) || (rawSummary ? sanitizeText(rawSummary) : 'Summary unavailable for this article.');
+const safeSummary = smartTruncate(cleanupNoiseText(rawSummary || ''), 1200) || 'Summary unavailable for this article.';
```
Copilot AI (Feb 18, 2026):
When the paywall bypass succeeds and articleContent is updated (line 1569), the code continues but doesn't return early. This means execution falls through to line 1577 (the else block), which expects articleContent to not have paywall text. However, since we're still in the "if (isPaywallText)" block (line 1561), we never reach the processing code in the else block at lines 1577-1691. The successful bypass scenario should either: 1) set a flag and restructure the logic to process the bypassed content, or 2) refactor to extract the processing logic into a shared code path that can be reached from both the bypass success and normal content paths.
Copilot AI (Feb 18, 2026):
Inconsistent indentation: this line should be indented to match the level of the closing braces above and below it (lines 1693, 1695). The fs.writeFileSync call at line 1691 is at the correct indentation level, so this closing brace should align with the opening structure.
Suggested change (indentation only; the exact whitespace was lost in extraction):

```javascript
}
}
```
Copilot AI (Feb 18, 2026):
The hostname matching uses includes() which is vulnerable to subdomain bypass attacks. For example, a malicious domain like "evil-ft.com.attacker.com" would match "ft.com". Use the existing hostnameMatches() helper function (defined at line 584) which implements secure suffix-based matching, or use the pattern: host === domain || host.endsWith('.'+domain).
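A minimal sketch of the secure suffix-based matching described here; the codebase's actual `hostnameMatches()` helper at line 584 is not shown in this diff, so this is an assumed shape:

```javascript
// Hypothetical re-implementation of the suffix-based check the reviewer
// describes; the real hostnameMatches() helper may differ in detail.
function hostnameMatches(hostname, domain) {
  hostname = hostname.toLowerCase();
  domain = domain.toLowerCase();
  // Exact match, or a true subdomain (dot-delimited suffix).
  return hostname === domain || hostname.endsWith('.' + domain);
}

// 'evil-ft.com.attacker.com'.includes('ft.com') is true, but the
// suffix-based check correctly rejects it:
console.log(hostnameMatches('ft.com', 'ft.com'));                   // true
console.log(hostnameMatches('www.ft.com', 'ft.com'));               // true
console.log(hostnameMatches('evil-ft.com.attacker.com', 'ft.com')); // false
console.log(hostnameMatches('notft.com', 'ft.com'));                // false
```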