A lightweight web scraping service that extracts data from any website using CSS selectors. Available in both Node.js and Deno versions.
```bash
# Start the server
deno task start

# Or with hot reload for development
deno task dev

# Run tests
deno test --allow-net --allow-env
```

```bash
# Node.js version
npm start
```

- Interactive Demo: Try it on CodePen
- Tutorial Series: Building a content extraction endpoint
- Example Pen: See it in action
Extracts text content from HTML elements using CSS selectors.
Parameters:

| Parameter | Description | Required |
|---|---|---|
| `from` | URL to fetch content from | Yes |
| `extract` | JSON object mapping names to selectors | Yes |

Response: JSON object with extracted text
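As a quick illustration, a minimal client call against a local instance might look like the sketch below (it assumes the server is running on `localhost:8000`, as in the examples later in this document):

```javascript
// Minimal sketch: query the text-extraction endpoint of a local instance.
// Assumes the server is running on localhost:8000.
const extract = { title: "h1" };
const url =
  "http://localhost:8000/?from=" +
  encodeURIComponent("https://example.com") +
  "&extract=" +
  encodeURIComponent(JSON.stringify(extract));

const response = await fetch(url);
const data = await response.json();
console.log(data); // e.g. { "title": "Example Domain" }
```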
Extracts HTML markup or returns raw HTML from a page.
Parameters:

| Parameter | Description | Required |
|---|---|---|
| `from` | URL to fetch content from | Yes |
| `extract` | JSON object mapping names to selectors | No |

Response: JSON object with extracted HTML (or `{ html: rawHtml }` if no `extract` parameter)
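A sketch of a call to this endpoint follows; the `/html` route name is only a placeholder here, not a confirmed path, so check the API specification in `specs/` for the actual route:

```javascript
// Sketch only: the "/html" route below is a placeholder, not a confirmed path.
// See the API specification for the real endpoint.
const extract = { intro: ".mw-parser-output > p" };
const url =
  "http://localhost:8000/html?from=" +
  encodeURIComponent("https://en.wikipedia.org/wiki/Deno_(software)") +
  "&extract=" +
  encodeURIComponent(JSON.stringify(extract));

const data = await (await fetch(url)).json();
console.log(data.intro); // HTML markup of the matched element(s)
```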
Returns the raw HTML from the target URL (acts as a simple proxy).
Parameters:

| Parameter | Description | Required |
|---|---|---|
| `from` | URL to fetch content from | Yes |

Response: Raw HTML content
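For example, fetching a page through a local instance (a sketch, assuming the default `localhost:8000` address):

```javascript
// Sketch: fetch a page's raw HTML through the /raw endpoint of a local instance.
const target = encodeURIComponent("https://example.com");
const response = await fetch(`http://localhost:8000/raw?from=${target}`);
const html = await response.text();
console.log(html.slice(0, 200)); // first 200 characters of the markup
```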
All examples below use `localhost:8000`. Start the server first with `deno task dev` or `deno task start`.

Note: The `extract` parameter must be URL-encoded. Use `encodeURIComponent()` in JavaScript or any URL encoder.
Extract the title and introduction from a Wikipedia article:
```javascript
// What we want to extract
const extract = {
  "title": "h1",
  "intro": ".mw-parser-output > p",
};

// URL-encode it
const encoded = encodeURIComponent(JSON.stringify(extract));
// Result: %7B%22title%22%3A%22h1%22%2C%22intro%22%3A%22.mw-parser-output%20%3E%20p%22%7D
```

Try it: Extract from Wikipedia - Deno
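Continuing the snippet above, one way to send the request and read the response (a sketch, assuming a local server on port 8000):

```javascript
// Sketch: reuse `encoded` from the snippet above to build and send the request.
const from = encodeURIComponent("https://en.wikipedia.org/wiki/Deno_(software)");
const url = `http://localhost:8000/?from=${from}&extract=${encoded}`;

const { title, intro } = await (await fetch(url)).json();
console.log(title); // the article heading, e.g. "Deno (software)"
```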
Get repository name, description, and star count:
```javascript
const extract = {
  "repoName": "h1 strong a",
  "description": "p.f4",
  "stars": "#repo-stars-counter-star",
};
```

Try it: Extract from GitHub - Deno Repo
Get question title, votes, and tags:
```javascript
const extract = {
  "question": "h1 a",
  "votes": ".js-vote-count",
  "tags": ".post-tag",
};
```

Try it: Extract from Stack Overflow
Get all headlines from a news site (returns an array):
```javascript
const extract = {
  "headlines": "h2 a",
  "timestamps": "time",
};
```

Try it: Extract from Hacker News
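Because a selector that matches several elements yields an array while a single match yields a plain string (see the response format notes further below), client code may want to normalize the result; a possible sketch:

```javascript
// Sketch: normalize an extracted value to an array, whether the selector
// matched one element (string) or many (array of strings).
function asArray(value) {
  if (Array.isArray(value)) return value;
  return value === "" ? [] : [value]; // empty string means no match
}

// Hypothetical usage with a response like { headlines: [...] } or { headlines: "..." }:
// const data = await (await fetch(url)).json();
// for (const headline of asArray(data.headlines)) console.log(headline);
```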
Get the actual HTML markup of specific elements:
Try it: Extract HTML from Wikipedia
No extraction, just fetch the raw HTML:
Try it: Get raw HTML from example.com
Bypass CORS and fetch any page:
Try it: Proxy example.com
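From browser code, this could look roughly like the sketch below (the instance address and the `#preview` element are placeholders):

```javascript
// Sketch: use the /raw endpoint as a CORS-friendly proxy from the browser.
// Replace localhost:8000 with your deployed instance; #preview is a placeholder element.
const proxied =
  "http://localhost:8000/raw?from=" + encodeURIComponent("https://example.com");

const html = await (await fetch(proxied)).text();
document.querySelector("#preview").innerHTML = html;
```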
Extract post title, author, and upvotes from a thread:

```javascript
const extract = {
  "title": "h1",
  "author": "[data-testid='post-author']",
  "upvotes": "[data-testid='vote-button-up']",
};
```

Try it: Extract from Reddit
If a selector matches one element, returns a string:
```json
{
  "title": "Deno - A modern runtime for JavaScript and TypeScript"
}
```

If a selector matches multiple elements, returns an array:
```json
{
  "headlines": [
    "First headline",
    "Second headline",
    "Third headline"
  ]
}
```

If a selector matches no elements, returns an empty string:
```json
{
  "missing": ""
}
```

Build request URLs programmatically with a small helper:

```javascript
function buildExtractUrl(baseUrl, from, extract) {
  const params = new URLSearchParams({
    from: from,
    extract: JSON.stringify(extract),
  });
  return `${baseUrl}?${params.toString()}`;
}

// Usage
const url = buildExtractUrl(
  "http://localhost:8000",
  "https://github.com/denoland/deno",
  { stars: "#repo-stars-counter-star" },
);
```

Or from the command line with curl:

```bash
# Extract title from Wikipedia
curl "http://localhost:8000/?from=https://en.wikipedia.org/wiki/Deno_(software)&extract=%7B%22title%22%3A%22h1%22%7D"

# Get raw HTML
curl "http://localhost:8000/raw?from=https://example.com"
```

For complete documentation, see the `specs/` directory:
- API Specification - Complete API documentation
- Architecture - Technical architecture
- Deployment - Deployment guide
- Testing - Testing strategy
This project follows Specification-Driven Development (SDD). Please review the specs before contributing.
Created by Sten Hougaard, March 2018. @netsi1964