extract-content

A lightweight web scraping service that extracts data from any website using CSS selectors. Available in both Node.js and Deno versions.

🚀 Quick Start

Deno Version (Recommended)

# Start the server
deno task start

# Or with hot reload for development
deno task dev

# Run tests
deno test --allow-net --allow-env

Node.js Version (Legacy)

npm start

📖 Resources

📡 API Endpoints

GET / - Extract Text Content

Extracts text content from HTML elements using CSS selectors.

Parameters:

Parameter   Description                               Required
from        URL to fetch content from                 Yes
extract     JSON object mapping names to selectors    Yes

Response: JSON object with extracted text
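
For example, assuming the server is running locally on port 8000 (as in the Quick Start above), a request can be built and sent like this; the response shown is illustrative:

// Minimal sketch: extract the <h1> text from example.com via a local server
const extract = { title: "h1" };
const url = "http://localhost:8000/?" + new URLSearchParams({
  from: "https://example.com",
  extract: JSON.stringify(extract),
});
const data = await fetch(url).then((res) => res.json());
console.log(data); // e.g. { "title": "Example Domain" }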

GET /html - Extract HTML Content

Extracts HTML markup or returns raw HTML from a page.

Parameters:

Parameter   Description                               Required
from        URL to fetch content from                 Yes
extract     JSON object mapping names to selectors    No

Response: JSON object with extracted HTML (or { html: rawHtml } if no extract parameter)
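
Along the same lines, a sketch of calling /html (again assuming a local server on port 8000):

// Sketch: get the markup of all <p> elements on example.com via /html
const url = "http://localhost:8000/html?" + new URLSearchParams({
  from: "https://example.com",
  extract: JSON.stringify({ paragraphs: "p" }),
});
const data = await fetch(url).then((res) => res.json());
// data.paragraphs holds the elements' HTML markup rather than their text;
// omitting the extract parameter would return { html: rawHtml } instead.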

GET /raw - Raw Proxy

Returns the raw HTML from the target URL (acts as a simple proxy).

Parameters:

Parameter   Description                 Required
from        URL to fetch content from   Yes

Response: Raw HTML content

💡 Examples

All examples below use localhost:8000. Start the server first with deno task dev or deno task start.

Note: The extract parameter must be URL-encoded. Use encodeURIComponent() in JavaScript or any URL encoder.

Example 1: Extract Wikipedia Article Title and First Paragraph

Extract the title and introduction from a Wikipedia article:

// What we want to extract
const extract = {
  "title": "h1",
  "intro": ".mw-parser-output > p",
};

// URL-encode it
const encoded = encodeURIComponent(JSON.stringify(extract));
// Result: %7B%22title%22%3A%22h1%22%2C%22intro%22%3A%22.mw-parser-output%20%3E%20p%22%7D

Try it: Extract from Wikipedia - Deno
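
Calling the service with this extract object from JavaScript could look like the sketch below (assuming the server is listening on localhost:8000; the logged values depend on the live article):

const from = encodeURIComponent("https://en.wikipedia.org/wiki/Deno_(software)");
const result = await fetch(
  `http://localhost:8000/?from=${from}&extract=${encoded}`,
).then((res) => res.json());

console.log(result.title); // e.g. "Deno (software)"
console.log(result.intro); // first paragraph(s) of the article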

Example 2: Extract GitHub Repository Info

Get repository name, description, and star count:

const extract = {
  "repoName": "h1 strong a",
  "description": "p.f4",
  "stars": "#repo-stars-counter-star",
};

Try it: Extract from GitHub - Deno Repo

Example 3: Extract Stack Overflow Question

Get question title, votes, and tags:

const extract = {
  "question": "h1 a",
  "votes": ".js-vote-count",
  "tags": ".post-tag",
};

Try it: Extract from Stack Overflow

Example 4: Extract News Headlines

Get all headlines from a news site (returns an array):

const extract = {
  "headlines": "h2 a",
  "timestamps": "time",
};

Try it: Extract from Hacker News

Example 5: Extract HTML Instead of Text

Get the actual HTML markup of specific elements:
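
For instance, reusing the Wikipedia selectors from Example 1 against the /html endpoint (an illustrative sketch; the returned markup depends on the live page):

// Same selectors as Example 1, but sent to /html instead of /
const extract = {
  "title": "h1",
  "intro": ".mw-parser-output > p",
};
// GET /html?from=...&extract=... then returns the matched elements' markup,
// e.g. { "title": "<h1 ...>Deno (software)</h1>", "intro": "<p>...</p>" }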

Try it: Extract HTML from Wikipedia

Example 6: Get Entire Page HTML

No extraction, just fetch the raw HTML:

Try it: Get raw HTML from example.com

Example 7: Use Raw Proxy

Bypass CORS and fetch any page:
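
From browser code that would otherwise be blocked by CORS, the page can be fetched through the proxy instead of directly (a sketch assuming a local server on port 8000):

// Fetch a third-party page through the local /raw proxy
const target = "https://example.com";
const html = await fetch(
  "http://localhost:8000/raw?from=" + encodeURIComponent(target),
).then((res) => res.text());
console.log(html.slice(0, 200)); // start of the raw HTML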

Try it: Proxy example.com

Example 8: Extract Reddit Post Info

const extract = {
  "title": "h1",
  "author": "[data-testid='post-author']",
  "upvotes": "[data-testid='vote-button-up']",
};

Try it: Extract from Reddit

🔧 Response Format

Single Element

If a selector matches one element, returns a string:

{
  "title": "Deno - A modern runtime for JavaScript and TypeScript"
}

Multiple Elements

If a selector matches multiple elements, returns an array:

{
  "headlines": [
    "First headline",
    "Second headline",
    "Third headline"
  ]
}

No Match

If a selector matches no elements, returns an empty string:

{
  "missing": ""
}
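
Because a value can be a string, an array, or an empty string depending on how many elements matched, client code may want to normalize every value to an array. One possible helper (not part of the service itself):

// Normalize an extracted value to an array, regardless of match count
function asArray(value) {
  if (Array.isArray(value)) return value;
  return value === "" ? [] : [value];
}

// Usage
asArray("First headline");    // ["First headline"]
asArray(["First", "Second"]); // ["First", "Second"]
asArray("");                  // []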

🛠️ Building Extract URLs

JavaScript Helper Function

function buildExtractUrl(baseUrl, from, extract) {
  const params = new URLSearchParams({
    from: from,
    extract: JSON.stringify(extract),
  });
  return `${baseUrl}?${params.toString()}`;
}

// Usage
const url = buildExtractUrl(
  "http://localhost:8000",
  "https://github.com/denoland/deno",
  { stars: "#repo-stars-counter-star" },
);
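
Combined with fetch, the helper can be wrapped into a small client; the sketch below is one way to do it, and the logged star count is purely illustrative:

async function extractFrom(from, extract) {
  const url = buildExtractUrl("http://localhost:8000", from, extract);
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Request failed: ${res.status}`);
  return res.json();
}

// Usage
const data = await extractFrom("https://github.com/denoland/deno", {
  stars: "#repo-stars-counter-star",
});
console.log(data.stars); // e.g. "105k"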

Command Line (curl)

# Extract title from Wikipedia
curl "http://localhost:8000/?from=https://en.wikipedia.org/wiki/Deno_(software)&extract=%7B%22title%22%3A%22h1%22%7D"

# Get raw HTML
curl "http://localhost:8000/raw?from=https://example.com"

📚 Documentation

For complete documentation, see the specs/ directory.

🤝 Contributing

This project follows Specification-Driven Development (SDD). Please review the specs before contributing.

📄 License

Created by Sten Hougaard, March 2018. @netsi1964
