feat: Add Archive.ph and paywall bypass services for better article extraction #122
EthanThePhoenix38 merged 1 commit into main
Conversation
Pull request overview
This PR adds paywall bypass functionality using third-party services (archive.ph, scribe.rip, web.archive.org) to extract content from paywalled articles. The implementation adds a reactive bypass mechanism that attempts to fetch content from bypass services when paywall text is detected after the initial article fetch fails.
Changes:
- Added `paywalledHosts` list in the `addUTMParams()` function (currently non-functional)
- Implemented `tryPaywallBypass()` function with fallback logic through multiple services
- Integrated bypass attempts into the article processing flow when paywall content is detected
- Modified fallback summary logic to handle cases where `smartTruncate` returns empty
```js
  ];

  for (const service of bypassServices) {
    const bypassUrl = service.transform(url);
```
The URL parameter passed to the transform functions is not validated before being used in external service URLs. The URL should be validated using normalizeExternalUrl() (defined at line 319) to ensure it has a valid http/https protocol before constructing bypass service URLs. This prevents potential URL injection attacks.
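A minimal sketch of the validation gate the comment describes. The repo's actual `normalizeExternalUrl()` helper (line 319) is not shown in this review, so the version below is an assumed stand-in that only accepts parseable absolute http/https URLs; the point is that every transform runs on a pre-validated URL.

```javascript
// Sketch only: an assumed stand-in for the repo's normalizeExternalUrl() helper.
// It rejects anything that is not a parseable absolute http(s) URL.
const normalizeExternalUrl = (raw) => {
  try {
    const parsed = new URL(String(raw));
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') return null;
    return parsed.href;
  } catch (e) {
    return null; // not a parseable absolute URL
  }
};

// Validate once, before any transform runs, so every bypass URL is built
// from a known-good http(s) URL rather than raw attacker-controlled input.
const buildBypassUrls = (url, services) => {
  const safeUrl = normalizeExternalUrl(url);
  if (!safeUrl) return [];
  return services
    .map((s) => ({ name: s.name, bypassUrl: s.transform(safeUrl) }))
    .filter((s) => s.bypassUrl);
};
```

With this shape, a `javascript:` or otherwise malformed URL short-circuits to an empty list and no bypass request is ever constructed.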
```js
      });
      return { success: true, html: response.data, service: service.name };
    } catch (e) {
      // Continue to the next service
```
The error handling silently continues to the next service without any logging. This makes debugging difficult when paywall bypass attempts fail. Consider adding debug logging (similar to the pattern used in shouldSuppressExtractionLog) to track which services were attempted and why they failed, especially useful for improving the bypass logic over time.
Suggested change:

```diff
-      // Continue to the next service
+      // Continue to the next service, but log in debug mode if enabled
+      if (process.env.DEBUG_PAYWALL_BYPASS === '1' || process.env.DEBUG_PAYWALL_BYPASS === 'true') {
+        console.debug(
+          `[paywall-bypass] Service "${service.name}" failed for URL "${url}" (bypass URL: "${bypassUrl}"):`,
+          e && e.message ? e.message : e
+        );
+      }
```
```diff
   const writeFallbackLocalArticle = () => {
     const safeTitle = sanitizeText(article.title) || 'Untitled';
-    const safeSummary = smartTruncate(cleanupNoiseText(rawSummary || ''), 1200) || 'Summary unavailable for this article.';
+    const safeSummary = smartTruncate(cleanupNoiseText(rawSummary || ''), 1200) || (rawSummary ? sanitizeText(rawSummary) : 'Summary unavailable for this article.');
```
The added ternary is redundant: rawSummary already feeds the first operand of the OR expression, so when smartTruncate returns an empty string for an empty rawSummary, the fallback (rawSummary ? sanitizeText(rawSummary) : 'Summary unavailable...') just re-checks the same value. Consider simplifying to: smartTruncate(cleanupNoiseText(rawSummary || ''), 1200) || 'Summary unavailable for this article.'
Suggested change:

```diff
-    const safeSummary = smartTruncate(cleanupNoiseText(rawSummary || ''), 1200) || (rawSummary ? sanitizeText(rawSummary) : 'Summary unavailable for this article.');
+    const safeSummary = smartTruncate(cleanupNoiseText(rawSummary || ''), 1200) || 'Summary unavailable for this article.';
```
```js
  const bypassServices = [
    {
      name: 'archive.ph',
      transform: (u) => `https://archive.ph/?url=${encodeURIComponent(u)}`
    },
    {
      name: 'scribe.rip',
      transform: (u) => u.includes('medium.com') ? u.replace('medium.com', 'scribe.rip') : null
    },
    {
      name: 'web.archive.org',
      transform: (u) => `https://web.archive.org/web/*/${u}`
    }
  ];

  for (const service of bypassServices) {
    const bypassUrl = service.transform(url);
    if (!bypassUrl) continue;
    try {
      const response = await axios.get(bypassUrl, {
        timeout: 5000,
        headers: { 'User-Agent': 'AI-Pulse/3.0' }
      });
      return { success: true, html: response.data, service: service.name };
```
The bypass services (archive.ph, scribe.rip, web.archive.org) are external third-party services that may have rate limits, availability issues, or terms of service restrictions on automated access. There's no rate limiting, caching, or error tracking implemented. Consider: 1) implementing rate limiting to avoid being blocked, 2) caching successful bypass results to avoid repeated requests, 3) tracking service availability to skip consistently failing services, and 4) reviewing each service's terms of service for compliance with automated access policies.
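The caching and failure-tracking ideas above can be sketched as a small wrapper. This is illustrative only, not the PR's code: the `fetcher` function is injected (it would be an `axios.get` call in practice) so the policy itself needs no network access, and the TTL and failure threshold are made-up defaults.

```javascript
// Sketch: cache successful bypass results and skip services that keep failing.
// `fetcher` is injected so the policy can be exercised without real requests.
const createBypassClient = ({ fetcher, maxFailures = 3, cacheTtlMs = 60 * 60 * 1000 }) => {
  const cache = new Map();     // url -> { html, service, expiresAt }
  const failures = new Map();  // service name -> consecutive failure count

  return async (url, services) => {
    const hit = cache.get(url);
    if (hit && hit.expiresAt > Date.now()) {
      return { success: true, html: hit.html, service: hit.service, cached: true };
    }
    for (const service of services) {
      if ((failures.get(service.name) || 0) >= maxFailures) continue; // skip dead services
      const bypassUrl = service.transform(url);
      if (!bypassUrl) continue;
      try {
        const html = await fetcher(bypassUrl);
        failures.set(service.name, 0); // reset on success
        cache.set(url, { html, service: service.name, expiresAt: Date.now() + cacheTtlMs });
        return { success: true, html, service: service.name };
      } catch (e) {
        failures.set(service.name, (failures.get(service.name) || 0) + 1);
      }
    }
    return { success: false };
  };
};
```

Rate limiting (e.g. a minimum delay between requests to the same service) would slot into the same loop; the terms-of-service question remains a manual review item.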
```js
const tryPaywallBypass = async (url) => {
  const bypassServices = [
    {
      name: 'archive.ph',
      transform: (u) => `https://archive.ph/?url=${encodeURIComponent(u)}`
    },
    {
      name: 'scribe.rip',
      transform: (u) => u.includes('medium.com') ? u.replace('medium.com', 'scribe.rip') : null
    },
    {
      name: 'web.archive.org',
      transform: (u) => `https://web.archive.org/web/*/${u}`
    }
  ];

  for (const service of bypassServices) {
    const bypassUrl = service.transform(url);
    if (!bypassUrl) continue;
    try {
      const response = await axios.get(bypassUrl, {
        timeout: 5000,
        headers: { 'User-Agent': 'AI-Pulse/3.0' }
      });
      return { success: true, html: response.data, service: service.name };
    } catch (e) {
      // Continue to the next service
    }
  }
  return { success: false };
};
```
The codebase already implements a proactive paywall bypass strategy for Medium at line 260-262 where Medium URLs are rewritten to use freedium.cloud before the article is fetched. This new reactive approach (checking for paywall text after fetching) is inconsistent with the existing pattern. Consider either: 1) extending the proactive approach at lines 265-270 to rewrite URLs for paywalledHosts upfront (e.g., using archive.ph), or 2) documenting why a reactive approach is preferred for these specific domains.
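A hypothetical sketch of option 1 (the proactive approach), mirroring the existing medium.com → freedium.cloud rewrite the comment refers to. The `archive.ph/newest/<url>` redirect-to-latest-snapshot pattern is an assumption about archive.today's URL scheme; the host list is taken from the PR's own `paywalledHosts`.

```javascript
// Hypothetical: rewrite known-paywalled URLs upfront, before the fetch,
// instead of reacting to paywall text after the fetch fails.
const PAYWALLED_HOSTS = ['ft.com', 'wsj.com', 'economist.com', 'bloomberg.com', 'investing.com'];

const rewritePaywalledUrl = (rawUrl) => {
  const { hostname } = new URL(rawUrl);
  const isPaywalled = PAYWALLED_HOSTS.some(
    (domain) => hostname === domain || hostname.endsWith('.' + domain)
  );
  // archive.ph's "newest" path is assumed to redirect to the latest snapshot.
  return isPaywalled ? `https://archive.ph/newest/${rawUrl}` : rawUrl;
};
```

Either way, picking one of the two strategies (proactive rewrite or reactive bypass) keeps the fetch pipeline predictable.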
```js
    },
    {
      name: 'web.archive.org',
      transform: (u) => `https://web.archive.org/web/*/${u}`
```
The web.archive.org URL pattern uses a wildcard timestamp selector which will attempt to fetch a search results page rather than a specific archived version. This is unlikely to work for content extraction. Consider either using a specific timestamp API or removing this service as it won't reliably return article content that can be parsed by Readability.
Suggested change (remove the web.archive.org entry):

```diff
-    },
-    {
-      name: 'web.archive.org',
-      transform: (u) => `https://web.archive.org/web/*/${u}`
```
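If the service is kept, the "specific timestamp API" alternative can use the Wayback Machine's availability endpoint, which returns the closest archived snapshot as JSON and yields a concrete archived page instead of the wildcard search results. The HTTP client is injected below so the parsing logic stands alone; wiring it into `bypassServices` would require an async transform.

```javascript
// Resolve a concrete Wayback snapshot URL via the availability API.
// `httpGet` is an injected JSON fetcher, e.g. (u) => axios.get(u).then(r => r.data).
const resolveWaybackSnapshot = async (url, httpGet) => {
  const api = `https://archive.org/wayback/available?url=${encodeURIComponent(url)}`;
  const body = await httpGet(api);
  const closest = body && body.archived_snapshots && body.archived_snapshots.closest;
  // Return the snapshot URL only when the API reports it as available.
  return closest && closest.available ? closest.url : null;
};
```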
```js
  }
}
```
Inconsistent indentation: this line should be indented to match the level of the closing braces above and below it (lines 1693, 1695). The fs.writeFileSync call at line 1691 is at the correct indentation level, so this closing brace should align with the opening structure.
Suggested change: indent the closing brace to match its opening structure.
```js
  for (const service of bypassServices) {
    const bypassUrl = service.transform(url);
    if (!bypassUrl) continue;
    try {
      const response = await axios.get(bypassUrl, {
        timeout: 5000,
        headers: { 'User-Agent': 'AI-Pulse/3.0' }
      });
      return { success: true, html: response.data, service: service.name };
    } catch (e) {
      // Continue to the next service
    }
  }
  return { success: false };
```
The 5-second timeout for each bypass service could lead to significant delays when processing articles. With 3 services configured, a paywalled article could take up to 15 seconds before falling back. Consider: 1) reducing the timeout to 3000ms, 2) implementing parallel requests with Promise.race(), or 3) adding a configurable overall timeout for the entire bypass attempt to prevent feed processing delays.
Suggested change:

```js
  // Per-service timeout and a global timeout for the whole bypass attempt
  const SERVICE_TIMEOUT_MS = 3000;
  const OVERALL_TIMEOUT_MS = 7000;
  return new Promise((resolve) => {
    const overallTimer = setTimeout(() => {
      resolve({ success: false, reason: 'timeout' });
    }, OVERALL_TIMEOUT_MS);
    (async () => {
      for (const service of bypassServices) {
        const bypassUrl = service.transform(url);
        if (!bypassUrl) continue;
        try {
          const response = await axios.get(bypassUrl, {
            timeout: SERVICE_TIMEOUT_MS,
            headers: { 'User-Agent': 'AI-Pulse/3.0' }
          });
          clearTimeout(overallTimer);
          return resolve({ success: true, html: response.data, service: service.name });
        } catch (e) {
          // Continue to the next service on error or timeout
        }
      }
      clearTimeout(overallTimer);
      return resolve({ success: false });
    })();
  });
```
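The review also mentions parallel requests as an option; `Promise.any` (first fulfilled attempt wins) fits that case more directly than `Promise.race`, since failed services should not decide the outcome. A sketch, with the fetcher injected so it can be exercised without network access; total latency is then bounded by the slowest single service rather than the sum of the timeouts.

```javascript
// Sketch: fire all bypass requests at once and take the first success.
const tryPaywallBypassParallel = async (url, services, fetcher) => {
  const attempts = services
    .map((s) => ({ service: s, bypassUrl: s.transform(url) }))
    .filter((a) => a.bypassUrl)
    .map((a) =>
      fetcher(a.bypassUrl).then((html) => ({ success: true, html, service: a.service.name }))
    );
  if (attempts.length === 0) return { success: false };
  try {
    return await Promise.any(attempts); // resolves with the first fulfilled attempt
  } catch (e) {
    return { success: false };          // AggregateError: every service failed
  }
};
```

The trade-off is extra load on the bypass services, since every request is issued even when the first one would have succeeded.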
```js
  // List of domains with strict paywalls
  const paywalledHosts = ['ft.com', 'wsj.com', 'economist.com', 'bloomberg.com', 'investing.com'];

  // Add Archive.ph as a query parameter for fallback
  if (paywalledHosts.some(host => hostname.includes(host))) {
    // We don't change the URL here; it will be used as a fallback
  }
```
This code block checks for paywalled hosts but doesn't perform any action. The comment indicates the URL will be used as a fallback, but no fallback mechanism is implemented here or referenced elsewhere in the addUTMParams function. This check should either implement the fallback logic or be removed as dead code.
Suggested change (remove the dead code):

```diff
-  // List of domains with strict paywalls
-  const paywalledHosts = ['ft.com', 'wsj.com', 'economist.com', 'bloomberg.com', 'investing.com'];
-  // Add Archive.ph as a query parameter for fallback
-  if (paywalledHosts.some(host => hostname.includes(host))) {
-    // We don't change the URL here; it will be used as a fallback
-  }
```
```js
  const paywalledHosts = ['ft.com', 'wsj.com', 'economist.com', 'bloomberg.com', 'investing.com'];

  // Add Archive.ph as a query parameter for fallback
  if (paywalledHosts.some(host => hostname.includes(host))) {
```
The hostname matching uses includes() which is vulnerable to subdomain bypass attacks. For example, a malicious domain like "evil-ft.com.attacker.com" would match "ft.com". Use the existing hostnameMatches() helper function (defined at line 584) which implements secure suffix-based matching, or use the pattern: host === domain || host.endsWith('.'+domain).
Suggested change:

```diff
-  if (paywalledHosts.some(host => hostname.includes(host))) {
+  if (paywalledHosts.some(domain => hostname === domain || hostname.endsWith('.' + domain))) {
```
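The bypass the comment describes is easy to demonstrate side by side: `includes()` matches the domain anywhere in the string, while suffix matching only accepts the exact domain or a true subdomain of it.

```javascript
// includes() matches the substring anywhere in the hostname.
const matchesByIncludes = (hostname, domain) => hostname.includes(domain);

// Suffix matching accepts only the apex domain or a real subdomain of it.
const matchesBySuffix = (hostname, domain) =>
  hostname === domain || hostname.endsWith('.' + domain);

// A hostile hostname like "evil-ft.com.attacker.com" embeds "ft.com",
// so it fools includes() but is rejected by suffix matching.
```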