
Server URL Extractor & Validator

Fast, reliable extraction and validation of URLs from dynamic pages, using a curl-first approach with a Playwright fallback.


Important

This project was built as part of real-world work experience for a company.

Designed for:

  • Efficient batch processing of server-side URL drops
  • Smart duplicate prevention
  • DNS-aware validation

Built with resilience and scale in mind, making it well suited to processing large datasets without repeating work that has already been done.

Tip

A SQLite-based version is available in a dedicated branch for lightweight, persistent storage.


🔧 Features

  • Automated navigation across multiple /md/xxxxx.html pages, decrementing the numeric page ID in the URL.
  • Dual-source extraction from <textarea> placeholders and valid anchor <a href=""> tags within each drop.
  • Robust regex filters to exclude placeholder and anchor patterns, targeting only real URLs with allowed TLDs and excluding false positives.
  • Smart skipping logic:
    • Skips scraping if a dropId is already present in dropLinks.json
    • Skips validation if a batchId is fully present in validatedLinks.json
  • Batch-based processing saves links incrementally as dropId_drop_N batches to control memory and improve clarity.
  • Duplicate-free batching: avoids saving the same link twice within a batch.
  • Status validation:
    • Uses curl for fast, lightweight URL status checking
    • Automatically falls back to Playwright for rich browser-level checks if curl fails or gives uncertain output.
  • Redirection detection compares normalized final URLs to identify real redirects and capture redirected_url.
  • DNS error detection classifies failures such as ENOTFOUND and EAI_AGAIN, recording them distinctly with a status of 0.
  • Secure credential injection using .env variables for login automation.
  • Memory usage tracking logs RAM snapshots after every 10 placeholder tabs processed.
  • Detailed console logging helps monitor:
    • URL extraction steps
    • Status checks
    • Validation decisions (curl vs playwright)
    • Skip reasons and timing
  • Structured JSON output:
    • Scraped links → data/dropLinks.json
    • Validated links → data/validatedLinks.json
    • Grouped by batchId, each link contains (see the example after this list):
      • original: source URL
      • status: HTTP status code
      • redirection: true/false
      • redirected_url: final URL if redirection happened
      • included: boolean match for known target IDs
      • method: "curl" or "playwright"
      • error: if present (e.g. "DNS could not be resolved")
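
For reference, a single entry in validatedLinks.json might look like this; the field names come from the list above, while the values and exact nesting are illustrative:

```json
{
  "dropId_drop_1": [
    {
      "original": "http://example.com/page",
      "status": 301,
      "redirection": true,
      "redirected_url": "https://example.com/page",
      "included": false,
      "method": "curl"
    }
  ]
}
```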

✅ Core Tasks Done

1. Link Extraction

  • Extracted URLs from placeholders in textareas with regex, including http(s), www, and protocol-relative URLs (//...).
  • Built a helper to extract both placeholder links and real anchor <a href=""> links per drop.
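
A minimal sketch of the extraction step, assuming a helper named extractUrls (the project's real regex is stricter and also enforces its allowed-TLD list):

```ts
// Matches http(s)://, protocol-relative (//...), and www. URLs.
// Illustrative pattern only; the repo's version filters more aggressively.
const URL_PATTERN = /(?:https?:\/\/|\/\/|www\.)[a-zA-Z0-9][\w.-]*\.[a-zA-Z]{2,}(?:\/[^\s"'<>]*)?/g;

function extractUrls(text: string): string[] {
  const matches = text.match(URL_PATTERN) ?? [];
  // Drop anything carrying a template placeholder such as {{contact.first_name}}
  return matches.filter((url) => !url.includes("{{"));
}
```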

2. Duplicate Handling

  • Used Set logic to avoid duplicate URLs within each drop batch.
  • Skipped already saved drops (dropLinks.json) and already validated batches (validatedLinks.json) to prevent reprocessing.
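
The duplicate handling reduces to a Set per batch plus a key check against what is already on disk; a sketch, with saved standing in for the parsed dropLinks.json:

```ts
// Skip drops that were already scraped, and dedupe links within a batch.
function collectNewLinks(
  dropId: string,
  links: string[],
  saved: Record<string, string[]>, // parsed contents of data/dropLinks.json
): string[] | null {
  if (Object.keys(saved).some((key) => key.startsWith(dropId))) {
    return null; // drop already scraped, skip it
  }
  return [...new Set(links)]; // duplicate-free batch
}
```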

3. Batch Accumulation & Saving

  • Grouped links into drop-specific batches: dropId_drop_N.
  • Merged links from placeholders and anchors into a single batch.
  • Saved batches incrementally to JSON to avoid memory overflow.
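
A sketch of the incremental save, assuming batches are keyed dropId_drop_N as described above:

```ts
import { readFileSync, writeFileSync } from "node:fs";

// Append one merged batch (placeholders + anchors) to data/dropLinks.json,
// writing after every batch so memory stays bounded.
function saveBatch(dropId: string, batchNumber: number, links: string[]): void {
  const path = "data/dropLinks.json";
  const store = JSON.parse(readFileSync(path, "utf8")) as Record<string, string[]>;
  store[`${dropId}_drop_${batchNumber}`] = links;
  writeFileSync(path, JSON.stringify(store, null, 2));
}
```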

4. Navigation & Validation Loop

  • Decremented through /md/{id}.html pages in a loop using Playwright automation.
  • Validated extracted links using curl for speed.
  • Automatically fell back to Playwright for browser-level validation if curl failed or gave ambiguous results.
  • Captured and stored HTTP status, redirection info, final URL, and method used.
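
The curl-first, Playwright-fallback idea in miniature; the curl flags and Playwright calls are standard, but the decision logic here is a simplified stand-in for the project's own:

```ts
import { execFile } from "node:child_process";
import { promisify } from "node:util";
import { chromium } from "playwright";

const run = promisify(execFile);

// curl reports the final status code and effective URL after redirects.
async function checkWithCurl(url: string) {
  const { stdout } = await run("curl", [
    "-sL", "-o", "/dev/null", "--max-time", "10",
    "-w", "%{http_code} %{url_effective}", url,
  ]);
  const [status, finalUrl] = stdout.trim().split(" ");
  return { status: Number(status), finalUrl };
}

async function validate(url: string) {
  try {
    const result = await checkWithCurl(url);
    // A zero status means curl could not resolve or connect; try the browser.
    if (result.status > 0) return { ...result, method: "curl" as const };
  } catch {
    // fall through to Playwright
  }
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    const response = await page.goto(url, { timeout: 15_000 });
    return {
      status: response?.status() ?? 0,
      finalUrl: page.url(),
      method: "playwright" as const,
    };
  } finally {
    await browser.close();
  }
}
```

Redirection can then be detected by comparing the normalized original and final URLs, as the feature list describes.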

5. Inclusion Mapping (Optional Analysis)

  • Compared resolved URLs against a predefined list of numeric target IDs.
  • Marked each validated link with included: true/false depending on match.
  • Enables later filtering and analysis based on external reference lists.
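
A sketch of the inclusion check, with TARGET_IDS standing in for the predefined reference list:

```ts
// Mark a resolved URL as included when it contains a known numeric target ID.
const TARGET_IDS = ["12345", "67890"]; // placeholder values

function isIncluded(resolvedUrl: string): boolean {
  return TARGET_IDS.some((id) => resolvedUrl.includes(id));
}
```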

6. Regex Improvements

  • Refined regex patterns to allow a wide variety of real URLs while filtering out false positives like {{contact.first_name}}.
  • Added support for extended TLDs and shorteners (.me, .li, .in, .moe, etc.).
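
The TLD allowlist can be applied as a post-filter on extracted candidates; a sketch with an abbreviated list:

```ts
// Partial, illustrative allowlist; the repo's list is longer.
const ALLOWED_TLDS = new Set(["com", "net", "org", "me", "li", "in", "moe"]);

function hasAllowedTld(url: string): boolean {
  // Normalize www. and protocol-relative forms so URL() can parse them.
  const withScheme = /^https?:\/\//.test(url)
    ? url
    : `https://${url.replace(/^\/\//, "")}`;
  const host = new URL(withScheme).hostname;
  return ALLOWED_TLDS.has(host.split(".").pop() ?? "");
}
```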

7. Memory Management & Debugging

  • Logged memory usage every 10 tabs to track performance.
  • Introduced async timeouts and batch size limits to keep Playwright stable during heavy runs.
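
The memory logging amounts to a periodic process.memoryUsage() snapshot; a minimal version:

```ts
// Log an RSS/heap snapshot after every 10 processed tabs.
function logMemory(tabCount: number): void {
  if (tabCount % 10 !== 0) return;
  const { rss, heapUsed } = process.memoryUsage();
  const mb = (bytes: number) => (bytes / 1024 / 1024).toFixed(1);
  console.log(`[mem] tabs=${tabCount} rss=${mb(rss)}MB heap=${mb(heapUsed)}MB`);
}
```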

8. Environment Handling

  • Introduced .env config for secure credentials (SERVER_EMAIL, SERVER_PASSWORD).
  • Included .env.example for team usage without exposing secrets.
  • Uses .env credentials in Playwright login tests with strict TypeScript handling.
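
Under strict TypeScript, process.env values are typed string | undefined, so the credentials are best read through a guard; a sketch:

```ts
import "dotenv/config";

// Fail fast if a required credential is missing.
function requireEnv(name: "SERVER_EMAIL" | "SERVER_PASSWORD"): string {
  const value = process.env[name];
  if (!value) throw new Error(`Missing required env var: ${name}`);
  return value;
}

const email = requireEnv("SERVER_EMAIL");
const password = requireEnv("SERVER_PASSWORD");
```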

🚀 Usage

This section covers everything you need to set up the project, run the server, and execute the web scrape for the JSON branch.

1. Install dependencies

npm install

2. Set up your environment

cp .env.example .env

Then define your credentials:

SERVER_EMAIL=your@email.com
SERVER_PASSWORD=yourPassword

3. Prepare storage

Copy the example JSON files before running:

cp data/dropLinks.example.json data/dropLinks.json
cp data/validatedLinks.example.json data/validatedLinks.json

4. Run the Web Scrape

.\run-tests.bat

  • Executes the full web scraping and URL validation workflow
  • Saves results to the JSON files: dropLinks.json and validatedLinks.json

⚠️ Server runs on localhost:3000 by default. Example endpoint: /api/validated-links
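
With the server running, the endpoint can be queried directly:

```bash
curl http://localhost:3000/api/validated-links
```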

📄 License

This project is licensed under the MIT License.