
Server URL Extractor & Validator

Fast, reliable extraction and validation of URLs from dynamic pages, using a curl-first approach with a Playwright fallback.


Important

This project was built as part of real-world work experience for a company.

Designed for:

  • Efficient batch processing of server-side URL drops
  • Smart duplicate prevention
  • DNS-aware validation

Built with resilience and scale in mind, making it well suited to processing large datasets without repeating work that has already been done.

Tip

A SQLite-based version is available in a dedicated branch for lightweight, persistent storage.


🔧 Features

  • Automated navigation across multiple /md/xxxxx.html pages, decrementing the numeric page ID in the URL.
  • Dual-source extraction from <textarea> placeholders and valid anchor <a href=""> tags within each drop.
  • Robust regex filters to exclude placeholder and anchor patterns, targeting only real URLs with allowed TLDs and excluding false positives.
  • Smart skipping logic:
    • Skips scraping if a dropId is already present in dropLinks.json
    • Skips validation if a batchId is fully present in validatedLinks.json
  • Batch-based processing saves links incrementally as dropId_drop_N batches to control memory and improve clarity.
  • Duplicate-free batching: avoids saving the same link twice within a batch.
  • Status validation:
    • Uses curl for fast, lightweight URL status checking
    • Automatically falls back to Playwright for rich browser-level checks if curl fails or gives uncertain output.
  • Redirection detection compares normalized final URLs to identify real redirects and capture redirected_url.
  • DNS error detection classifies failures such as ENOTFOUND and EAI_AGAIN, recording them distinctly with a status of 0.
  • Secure credential injection using .env variables for login automation.
  • Memory usage tracking logs RAM snapshots after every 10 placeholder tabs processed.
  • Detailed console logging helps monitor:
    • URL extraction steps
    • Status checks
    • Validation decisions (curl vs playwright)
    • Skip reasons and timing
  • Structured JSON output:
    • Scraped links → data/dropLinks.json
    • Validated links → data/validatedLinks.json
    • Grouped by batchId, each link contains (see the example after this list):
      • original: source URL
      • status: HTTP status code
      • redirection: true/false
      • redirected_url: final URL if redirection happened
      • included: boolean match for known target IDs
      • method: "curl" or "playwright"
      • error: if present (e.g. "DNS could not be resolved")
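
For reference, a single entry in validatedLinks.json might look like this; the field names come from the list above, while the values and exact nesting are illustrative:

```json
{
  "dropId_drop_1": [
    {
      "original": "http://example.com/page",
      "status": 301,
      "redirection": true,
      "redirected_url": "https://example.com/page",
      "included": false,
      "method": "curl"
    }
  ]
}
```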

✅ Core Tasks Done

1. Link Extraction

  • Extracted URLs from placeholders in textareas with regex, including http(s), www, and protocol-relative URLs (//...).
  • Built a helper to extract both placeholder links and real anchor <a href=""> links per drop.
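
A minimal sketch of the extraction step, assuming a helper named extractUrls (the project's real regex is stricter and also enforces its allowed-TLD list):

```ts
// Matches http(s)://, protocol-relative (//...), and www. URLs.
// Illustrative pattern only; the repo's version filters more aggressively.
const URL_PATTERN = /(?:https?:\/\/|\/\/|www\.)[a-zA-Z0-9][\w.-]*\.[a-zA-Z]{2,}(?:\/[^\s"'<>]*)?/g;

function extractUrls(text: string): string[] {
  const matches = text.match(URL_PATTERN) ?? [];
  // Drop anything carrying a template placeholder such as {{contact.first_name}}
  return matches.filter((url) => !url.includes("{{"));
}
```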

2. Duplicate Handling

  • Used Set logic to avoid duplicate URLs within each drop batch.
  • Skipped already saved drops (dropLinks.json) and already validated batches (validatedLinks.json) to prevent reprocessing.
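
The duplicate handling reduces to a Set per batch plus a key check against what is already on disk; a sketch, with saved standing in for the parsed dropLinks.json:

```ts
// Skip drops that were already scraped, and dedupe links within a batch.
function collectNewLinks(
  dropId: string,
  links: string[],
  saved: Record<string, string[]>, // parsed contents of data/dropLinks.json
): string[] | null {
  if (Object.keys(saved).some((key) => key.startsWith(dropId))) {
    return null; // drop already scraped, skip it
  }
  return [...new Set(links)]; // duplicate-free batch
}
```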

3. Batch Accumulation & Saving

  • Grouped links into drop-specific batches: dropId_drop_N.
  • Merged links from placeholders and anchors into a single batch.
  • Saved batches incrementally to JSON to avoid memory overflow.
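
A sketch of the incremental save, assuming batches are keyed dropId_drop_N as described above:

```ts
import { readFileSync, writeFileSync } from "node:fs";

// Append one merged batch (placeholders + anchors) to data/dropLinks.json,
// writing after every batch so memory stays bounded.
function saveBatch(dropId: string, batchNumber: number, links: string[]): void {
  const path = "data/dropLinks.json";
  const store = JSON.parse(readFileSync(path, "utf8")) as Record<string, string[]>;
  store[`${dropId}_drop_${batchNumber}`] = links;
  writeFileSync(path, JSON.stringify(store, null, 2));
}
```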

4. Navigation & Validation Loop

  • Decremented through /md/{id}.html pages in a loop using Playwright automation.
  • Validated extracted links using curl for speed.
  • Automatically fell back to Playwright for browser-level validation if curl failed or gave ambiguous results.
  • Captured and stored HTTP status, redirection info, final URL, and method used.
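
The curl-first, Playwright-fallback idea in miniature; the curl flags and Playwright calls are standard, but the decision logic here is a simplified stand-in for the project's own:

```ts
import { execFile } from "node:child_process";
import { promisify } from "node:util";
import { chromium } from "playwright";

const run = promisify(execFile);

// curl reports the final status code and effective URL after redirects.
async function checkWithCurl(url: string) {
  const { stdout } = await run("curl", [
    "-sL", "-o", "/dev/null", "--max-time", "10",
    "-w", "%{http_code} %{url_effective}", url,
  ]);
  const [status, finalUrl] = stdout.trim().split(" ");
  return { status: Number(status), finalUrl };
}

async function validate(url: string) {
  try {
    const result = await checkWithCurl(url);
    // A zero status means curl could not resolve or connect; try the browser.
    if (result.status > 0) return { ...result, method: "curl" as const };
  } catch {
    // fall through to Playwright
  }
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    const response = await page.goto(url, { timeout: 15_000 });
    return {
      status: response?.status() ?? 0,
      finalUrl: page.url(),
      method: "playwright" as const,
    };
  } finally {
    await browser.close();
  }
}
```

Redirection can then be detected by comparing the normalized original and final URLs, as the feature list describes.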

5. Inclusion Mapping (Optional Analysis)

  • Compared resolved URLs against a predefined list of numeric target IDs.
  • Marked each validated link with included: true/false depending on match.
  • Enables later filtering and analysis based on external reference lists.
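
A sketch of the inclusion check, with TARGET_IDS standing in for the predefined reference list:

```ts
// Mark a resolved URL as included when it contains a known numeric target ID.
const TARGET_IDS = ["12345", "67890"]; // placeholder values

function isIncluded(resolvedUrl: string): boolean {
  return TARGET_IDS.some((id) => resolvedUrl.includes(id));
}
```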

6. Regex Improvements

  • Refined regex patterns to allow a wide variety of real URLs while filtering out false positives like {{contact.first_name}}.
  • Added support for extended TLDs and shorteners (.me, .li, .in, .moe, etc.).
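
The TLD allowlist can be applied as a post-filter on extracted candidates; a sketch with an abbreviated list:

```ts
// Partial, illustrative allowlist; the repo's list is longer.
const ALLOWED_TLDS = new Set(["com", "net", "org", "me", "li", "in", "moe"]);

function hasAllowedTld(url: string): boolean {
  // Normalize www. and protocol-relative forms so URL() can parse them.
  const withScheme = /^https?:\/\//.test(url)
    ? url
    : `https://${url.replace(/^\/\//, "")}`;
  const host = new URL(withScheme).hostname;
  return ALLOWED_TLDS.has(host.split(".").pop() ?? "");
}
```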

7. Memory Management & Debugging

  • Logged memory usage every 10 tabs to track performance.
  • Introduced async timeouts and batch size limits to keep Playwright stable during heavy runs.
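
The memory logging amounts to a periodic process.memoryUsage() snapshot; a minimal version:

```ts
// Log an RSS/heap snapshot after every 10 processed tabs.
function logMemory(tabCount: number): void {
  if (tabCount % 10 !== 0) return;
  const { rss, heapUsed } = process.memoryUsage();
  const mb = (bytes: number) => (bytes / 1024 / 1024).toFixed(1);
  console.log(`[mem] tabs=${tabCount} rss=${mb(rss)}MB heap=${mb(heapUsed)}MB`);
}
```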

8. Environment Handling

  • Introduced .env config for secure credentials (SERVER_EMAIL, SERVER_PASSWORD).
  • Included .env.example for team usage without exposing secrets.
  • Uses .env credentials in Playwright login tests with strict TypeScript handling.
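
Under strict TypeScript, process.env values are typed string | undefined, so the credentials are best read through a guard; a sketch:

```ts
import "dotenv/config";

// Fail fast if a required credential is missing.
function requireEnv(name: "SERVER_EMAIL" | "SERVER_PASSWORD"): string {
  const value = process.env[name];
  if (!value) throw new Error(`Missing required env var: ${name}`);
  return value;
}

const email = requireEnv("SERVER_EMAIL");
const password = requireEnv("SERVER_PASSWORD");
```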

🚀 Usage

This section covers everything you need to set up the project, run the server, and execute the web scrape for the JSON branch.

1. Install dependencies

npm install

2. Set up your environment

cp .env.example .env

Then define your credentials:

SERVER_EMAIL=your@email.com
SERVER_PASSWORD=yourPassword

3. Prepare storage

Copy the example JSON files before running:

cp data/dropLinks.example.json data/dropLinks.json
cp data/validatedLinks.example.json data/validatedLinks.json

4. Run the Web Scrape

.\run-tests.bat

  • Executes the full web scraping and URL validation workflow
  • Saves results to the JSON files: dropLinks.json and validatedLinks.json

⚠️ Server runs on localhost:3000 by default. Example endpoint: /api/validated-links
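
With the server running, the endpoint can be queried directly:

```bash
curl http://localhost:3000/api/validated-links
```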

📄 License

This project is licensed under the MIT License.