Fast, reliable extraction & validation of URLs from dynamic pages — using curl-first with Playwright fallback.
> [!IMPORTANT]
> This project was built as part of real-world work experience for a company.
Designed for:
- Efficient batch processing of server-side URL drops
- Smart duplicate prevention
- DNS-aware validation
Built with resilience and scale in mind — perfect for processing large datasets without reprocessing the same work twice.
> [!TIP]
> A SQLite-based version is available in a dedicated branch for lightweight, persistent storage.
- Automated navigation across multiple `/md/xxxxx.html` pages, decrementing through URLs.
- Dual-source extraction from `<textarea>` placeholders and valid anchor `<a href="">` tags within each drop.
- Robust regex filters to exclude placeholder and anchor patterns, targeting only real URLs with allowed TLDs and excluding false positives (see the sketch below).
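A minimal sketch of this kind of regex-based extraction, assuming an illustrative pattern and helper name (the project's actual filter is more extensive):

```ts
// extractUrls.ts: illustrative only; pattern and names are assumptions
const URL_PATTERN =
  /(?:https?:\/\/|www\.|\/\/)[\w.-]+\.(?:com|net|org|io|me|li|in|moe)(?:\/[^\s"'<>]*)?/gi;

export function extractUrls(text: string): string[] {
  // Strip template placeholders (e.g. {{contact.first_name}}) so they never match
  const cleaned = text.replace(/\{\{[^}]*\}\}/g, "");
  return [...new Set(cleaned.match(URL_PATTERN) ?? [])];
}
```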
- Smart skipping logic (see the sketch below):
  - Skips scraping if a dropId is already present in `dropLinks.json`
  - Skips validation if a batchId is fully present in `validatedLinks.json`
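A rough sketch of how such skip checks can work against the two JSON files (the file shapes and key format are assumptions based on the output described below):

```ts
// skipChecks.ts: illustrative; not the project's exact schema
import { readFileSync, existsSync } from "node:fs";

function loadJson(path: string): Record<string, unknown> {
  return existsSync(path) ? JSON.parse(readFileSync(path, "utf8")) : {};
}

export function shouldSkipScrape(dropId: string): boolean {
  // Skip if any batch key for this drop already exists, e.g. "123_drop_1"
  const drops = loadJson("data/dropLinks.json");
  return Object.keys(drops).some((key) => key.startsWith(`${dropId}_drop_`));
}

export function shouldSkipValidation(batchId: string): boolean {
  // Skip if the batch has already been validated
  const validated = loadJson("data/validatedLinks.json");
  return batchId in validated;
}
```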
- Batch-based processing: saves links incrementally as `dropId_drop_N` batches to control memory and improve clarity.
- Duplicate-free batching: avoids saving the same link twice within a batch (see the sketch below).
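For example, a `Set` keeps each `dropId_drop_N` group duplicate-free (a sketch; the real batching logic may differ):

```ts
// batching.ts: illustrative sketch of duplicate-free, drop-scoped batching
export function buildBatch(
  dropId: string,
  batchNumber: number,
  links: string[],
): { batchId: string; links: string[] } {
  // A Set silently drops URLs already seen in this batch
  const unique = new Set(links);
  return { batchId: `${dropId}_drop_${batchNumber}`, links: [...unique] };
}
```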
- Status validation (see the sketch below):
  - Uses `curl` for fast, lightweight URL status checking.
  - Automatically falls back to `Playwright` for rich browser-level checks if curl fails or gives uncertain output.
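A condensed sketch of the curl-first check with a Playwright fallback (the flags, timeouts, and fallback condition are assumptions, not the project's exact logic):

```ts
// checkStatus.ts: illustrative sketch, not the project's exact implementation
import { execFile } from "node:child_process";
import { promisify } from "node:util";
import { chromium } from "playwright";

const execFileAsync = promisify(execFile);

export async function checkStatus(url: string) {
  try {
    // -L follows redirects; -w prints the final status code and effective URL
    const { stdout } = await execFileAsync("curl", [
      "-sS", "-o", "/dev/null", "-L", "--max-time", "15",
      "-w", "%{http_code} %{url_effective}", url,
    ]);
    const [code, finalUrl] = stdout.trim().split(" ");
    const status = Number(code);
    if (status > 0) return { status, finalUrl, method: "curl" as const };
  } catch {
    // curl failed entirely; fall through to the browser check
  }

  // Browser-level fallback when curl fails or returns an unusable status
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    const response = await page.goto(url, { timeout: 30_000 });
    return { status: response?.status() ?? 0, finalUrl: page.url(), method: "playwright" as const };
  } finally {
    await browser.close();
  }
}
```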
- Redirection detection: compares normalized final URLs to identify real redirects and capture `redirected_url` (see the sketch below).
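Normalization before comparing, sketched roughly (the exact normalization rules are an assumption):

```ts
// redirects.ts: illustrative; real normalization rules may differ
function normalizeUrl(raw: string): string {
  const withProtocol = /^https?:\/\//i.test(raw) ? raw : `https://${raw.replace(/^\/\//, "")}`;
  const u = new URL(withProtocol);
  // Ignore differences that do not indicate a real redirect (www prefix, trailing slash, case)
  return `${u.hostname.replace(/^www\./, "")}${u.pathname.replace(/\/$/, "")}`.toLowerCase();
}

export function detectRedirect(original: string, finalUrl: string) {
  const redirected = normalizeUrl(original) !== normalizeUrl(finalUrl);
  return { redirection: redirected, redirected_url: redirected ? finalUrl : undefined };
}
```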
- DNS error detection: classifies failures such as `ENOTFOUND` and `EAI_AGAIN` and treats them distinctly with a status of 0 (see the sketch below).
- Secure credential injection: uses `.env` variables for login automation.
- Memory usage tracking: logs RAM snapshots after every 10 placeholder tabs processed.
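DNS failures can be classified roughly like this (the error strings checked here follow Node and curl conventions; treat the mapping as an assumption):

```ts
// dnsErrors.ts: illustrative classification of DNS-level failures
const DNS_ERROR_MARKERS = ["ENOTFOUND", "EAI_AGAIN", "Could not resolve host"];

export function classifyDnsError(message: string) {
  const isDnsError = DNS_ERROR_MARKERS.some((marker) => message.includes(marker));
  return isDnsError
    ? { status: 0, error: "DNS could not be resolved" } // zero status, stored with the link
    : null;
}
```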
- Detailed console logging helps monitor:
  - URL extraction steps
  - Status checks
  - Validation decisions (curl vs Playwright)
  - Skip reasons and timing
- Structured JSON output (shape sketched below):
  - Scraped links → `data/dropLinks.json`
  - Validated links → `data/validatedLinks.json`
  - Grouped by `batchId`, each link contains:
    - `original`: source URL
    - `status`: HTTP status code
    - `redirection`: true/false
    - `redirected_url`: final URL if redirection happened
    - `included`: boolean match for known target IDs
    - `method`: `"curl"` or `"playwright"`
    - `error`: if present (e.g. `"DNS could not be resolved"`)
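The per-link record described above could be typed roughly like this (a sketch of the shape, not the project's actual type definitions):

```ts
// types.ts: illustrative shape of an entry in validatedLinks.json
interface ValidatedLink {
  original: string;        // source URL as scraped
  status: number;          // HTTP status code (0 for DNS failures)
  redirection: boolean;    // true if the final URL differs after normalization
  redirected_url?: string; // final URL when a redirect happened
  included: boolean;       // matches one of the known target IDs
  method: "curl" | "playwright";
  error?: string;          // e.g. "DNS could not be resolved"
}

// Entries are grouped by batchId, e.g. "123_drop_1"
type ValidatedLinks = Record<string, ValidatedLink[]>;
```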
- Extracted URLs from placeholders in textareas with regex, including `http(s)`, `www`, and protocol-relative URLs (`//...`).
- Built a helper to extract both placeholder links and real anchor `<a href="">` links per drop.
- Used `Set` logic to avoid duplicate URLs within each drop batch.
- Skipped already saved drops (`dropLinks.json`) and already validated batches (`validatedLinks.json`) to prevent reprocessing.
- Grouped links into drop-specific batches: `dropId_drop_N`.
- Merged links from placeholders and anchors into a single batch.
- Saved batches incrementally to JSON to avoid memory overflow.
- Decremented through `/md/{id}.html` pages in a loop using Playwright automation (see the sketch below).
- Validated extracted links using `curl` for speed.
- Automatically fell back to Playwright for browser-level validation if curl failed or gave ambiguous results.
- Captured and stored HTTP status, redirection info, final URL, and method used.
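The navigation loop, sketched under stated assumptions (the base URL, starting ID, and selectors are placeholders, not the project's values):

```ts
// scrapeDrops.ts: illustrative loop; BASE_URL, START_ID, and selectors are assumptions
import { chromium } from "playwright";

const BASE_URL = "https://example.com/md"; // placeholder
const START_ID = 99999;                    // placeholder

export async function scrapeDrops(count: number): Promise<void> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  try {
    for (let id = START_ID; id > START_ID - count; id--) {
      await page.goto(`${BASE_URL}/${id}.html`);
      // Collect textarea placeholder text and anchor hrefs for this drop
      const textareas = await page.locator("textarea").allTextContents();
      const anchors = await page.locator("a[href]").evaluateAll(
        (els) => els.map((el) => (el as HTMLAnchorElement).href),
      );
      console.log(`drop ${id}: ${textareas.length} textareas, ${anchors.length} anchors`);
    }
  } finally {
    await browser.close();
  }
}
```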
- Compared resolved URLs against a predefined list of numeric target IDs.
- Marked each validated link with `included: true/false` depending on the match (see the sketch below).
- Enables later filtering and analysis based on external reference lists.
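Matching against the target-ID list, sketched (how the IDs appear inside the resolved URL is an assumption):

```ts
// targetMatch.ts: illustrative; assumes target IDs appear as numbers in the resolved URL
const TARGET_IDS = ["12345", "67890"]; // placeholder list

export function isIncluded(resolvedUrl: string): boolean {
  return TARGET_IDS.some((id) => resolvedUrl.includes(id));
}
```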
- Refined regex patterns to allow a wide variety of real URLs while filtering out false positives like `{{contact.first_name}}`.
- Added support for extended TLDs and shorteners (`.me`, `.li`, `.in`, `.moe`, etc.).
- Logged memory usage every 10 tabs to track performance (see the sketch below).
- Introduced async timeouts and batch size limits to keep Playwright stable during heavy runs.
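Memory snapshots can be logged with `process.memoryUsage()`, roughly like this (the interval of 10 tabs mirrors the behaviour described above):

```ts
// memoryLog.ts: illustrative RAM snapshot, logged every 10 processed tabs
export function logMemory(tabCount: number): void {
  if (tabCount % 10 !== 0) return;
  const { rss, heapUsed } = process.memoryUsage();
  const mb = (bytes: number) => (bytes / 1024 / 1024).toFixed(1);
  console.log(`[memory] after ${tabCount} tabs: rss=${mb(rss)}MB heapUsed=${mb(heapUsed)}MB`);
}
```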
- Introduced `.env` config for secure credentials (`SERVER_EMAIL`, `SERVER_PASSWORD`).
- Included `.env.example` for team usage without exposing secrets.
- Uses `.env` credentials in Playwright login tests with strict TypeScript handling.
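Credential injection in a Playwright login step might look like this (the selectors and login URL are placeholders; loading `.env` via `dotenv` is an assumption):

```ts
// login.ts: illustrative; selectors and URL are placeholders
import "dotenv/config";
import type { Page } from "playwright";

export async function login(page: Page): Promise<void> {
  // Strict TypeScript: fail fast if the variables are missing rather than passing undefined
  const email = process.env.SERVER_EMAIL;
  const password = process.env.SERVER_PASSWORD;
  if (!email || !password) throw new Error("SERVER_EMAIL and SERVER_PASSWORD must be set in .env");

  await page.goto("http://localhost:3000/login"); // placeholder URL
  await page.fill('input[name="email"]', email);
  await page.fill('input[name="password"]', password);
  await page.click('button[type="submit"]');
}
```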
This section covers everything you need to set up, run the server, and execute the web scrape for the JSON branch.
Install dependencies and create your environment file:

```bash
npm install
cp .env.example .env
```

Then define your credentials:

```
SERVER_EMAIL=your@email.com
SERVER_PASSWORD=yourPassword
```

- Copy the example JSON files before running:

```bash
cp data/dropLinks.example.json data/dropLinks.json
cp data/validatedLinks.example.json data/validatedLinks.json
```

- Run the test workflow:

```bat
.\run-tests.bat
```

- Executes the full web scraping and URL validation workflow
- Saves results to the JSON files: `dropLinks.json` and `validatedLinks.json`
⚠️ Server runs on `localhost:3000` by default. Example endpoint: `/api/validated-links`
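Once the server is running, the validated links can be fetched from that endpoint, for example (a sketch using Node's built-in `fetch`; the response shape is assumed to match the JSON output above):

```ts
// fetchValidated.ts: illustrative query against the default local endpoint
const res = await fetch("http://localhost:3000/api/validated-links");
const validated: Record<string, unknown[]> = await res.json();
console.log(`${Object.keys(validated).length} validated batches`);
```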
This project is licensed under the MIT License.