extract-content

A lightweight web scraping service that extracts data from any website using CSS selectors. Available in both Node.js and Deno versions.

🚀 Quick Start

Deno Version (Recommended)

# Start the server
deno task start

# Or with hot reload for development
deno task dev

# Run tests
deno test --allow-net --allow-env

Node.js Version (Legacy)

npm start

📖 Resources

📡 API Endpoints

GET / - Extract Text Content

Extracts text content from HTML elements using CSS selectors.

Parameters:

Parameter   Description                               Required
from        URL to fetch content from                 Yes
extract     JSON object mapping names to selectors    Yes

Response: JSON object with extracted text
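
For example, assuming the server is running locally on port 8000 (as in the Quick Start above), a request can be built and sent like this; the response shown is illustrative:

// Minimal sketch: extract the <h1> text from example.com via a local server
const extract = { title: "h1" };
const url = "http://localhost:8000/?" + new URLSearchParams({
  from: "https://example.com",
  extract: JSON.stringify(extract),
});
const data = await fetch(url).then((res) => res.json());
console.log(data); // e.g. { "title": "Example Domain" }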

GET /html - Extract HTML Content

Extracts HTML markup or returns raw HTML from a page.

Parameters:

Parameter   Description                               Required
from        URL to fetch content from                 Yes
extract     JSON object mapping names to selectors    No

Response: JSON object with extracted HTML (or { html: rawHtml } if no extract parameter)
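
Along the same lines, a sketch of calling /html (again assuming a local server on port 8000):

// Sketch: get the markup of all <p> elements on example.com via /html
const url = "http://localhost:8000/html?" + new URLSearchParams({
  from: "https://example.com",
  extract: JSON.stringify({ paragraphs: "p" }),
});
const data = await fetch(url).then((res) => res.json());
// data.paragraphs holds the elements' HTML markup rather than their text;
// omitting the extract parameter would return { html: rawHtml } instead.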

GET /raw - Raw Proxy

Returns the raw HTML from the target URL (acts as a simple proxy).

Parameters:

Parameter   Description                 Required
from        URL to fetch content from   Yes

Response: Raw HTML content

💡 Examples

All examples below use localhost:8000. Start the server first with deno task dev or deno task start.

Note: The extract parameter must be URL-encoded. Use encodeURIComponent() in JavaScript or any URL encoder.

Example 1: Extract Wikipedia Article Title and First Paragraph

Extract the title and introduction from a Wikipedia article:

// What we want to extract
const extract = {
  "title": "h1",
  "intro": ".mw-parser-output > p",
};

// URL-encode it
const encoded = encodeURIComponent(JSON.stringify(extract));
// Result: %7B%22title%22%3A%22h1%22%2C%22intro%22%3A%22.mw-parser-output%20%3E%20p%22%7D

Try it: Extract from Wikipedia - Deno
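
Calling the service with this extract object from JavaScript could look like the sketch below (assuming the server is listening on localhost:8000; the logged values depend on the live article):

const from = encodeURIComponent("https://en.wikipedia.org/wiki/Deno_(software)");
const result = await fetch(
  `http://localhost:8000/?from=${from}&extract=${encoded}`,
).then((res) => res.json());

console.log(result.title); // e.g. "Deno (software)"
console.log(result.intro); // first paragraph(s) of the article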

Example 2: Extract GitHub Repository Info

Get repository name, description, and star count:

const extract = {
  "repoName": "h1 strong a",
  "description": "p.f4",
  "stars": "#repo-stars-counter-star",
};

Try it: Extract from GitHub - Deno Repo

Example 3: Extract Stack Overflow Question

Get question title, votes, and tags:

const extract = {
  "question": "h1 a",
  "votes": ".js-vote-count",
  "tags": ".post-tag",
};

Try it: Extract from Stack Overflow

Example 4: Extract News Headlines

Get all headlines from a news site (returns an array):

const extract = {
  "headlines": "h2 a",
  "timestamps": "time",
};

Try it: Extract from Hacker News

Example 5: Extract HTML Instead of Text

Get the actual HTML markup of specific elements:
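
For instance, reusing the Wikipedia selectors from Example 1 against the /html endpoint (an illustrative sketch; the returned markup depends on the live page):

// Same selectors as Example 1, but sent to /html instead of /
const extract = {
  "title": "h1",
  "intro": ".mw-parser-output > p",
};
// GET /html?from=...&extract=... then returns the matched elements' markup,
// e.g. { "title": "<h1 ...>Deno (software)</h1>", "intro": "<p>...</p>" }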

Try it: Extract HTML from Wikipedia

Example 6: Get Entire Page HTML

No extraction, just fetch the raw HTML:

Try it: Get raw HTML from example.com

Example 7: Use Raw Proxy

Bypass CORS and fetch any page:
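
From browser code that would otherwise be blocked by CORS, the page can be fetched through the proxy instead of directly (a sketch assuming a local server on port 8000):

// Fetch a third-party page through the local /raw proxy
const target = "https://example.com";
const html = await fetch(
  "http://localhost:8000/raw?from=" + encodeURIComponent(target),
).then((res) => res.text());
console.log(html.slice(0, 200)); // start of the raw HTML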

Try it: Proxy example.com

Example 8: Extract Reddit Post Info

const extract = {
  "title": "h1",
  "author": "[data-testid='post-author']",
  "upvotes": "[data-testid='vote-button-up']",
};

Try it: Extract from Reddit

🔧 Response Format

Single Element

If a selector matches one element, returns a string:

{
  "title": "Deno - A modern runtime for JavaScript and TypeScript"
}

Multiple Elements

If a selector matches multiple elements, returns an array:

{
  "headlines": [
    "First headline",
    "Second headline",
    "Third headline"
  ]
}

No Match

If a selector matches no elements, returns an empty string:

{
  "missing": ""
}
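
Because a value can be a string, an array, or an empty string depending on how many elements matched, client code may want to normalize every value to an array. One possible helper (not part of the service itself):

// Normalize an extracted value to an array, regardless of match count
function asArray(value) {
  if (Array.isArray(value)) return value;
  return value === "" ? [] : [value];
}

// Usage
asArray("First headline");    // ["First headline"]
asArray(["First", "Second"]); // ["First", "Second"]
asArray("");                  // []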

🛠️ Building Extract URLs

JavaScript Helper Function

function buildExtractUrl(baseUrl, from, extract) {
  const params = new URLSearchParams({
    from: from,
    extract: JSON.stringify(extract),
  });
  return `${baseUrl}?${params.toString()}`;
}

// Usage
const url = buildExtractUrl(
  "http://localhost:8000",
  "https://github.com/denoland/deno",
  { stars: "#repo-stars-counter-star" },
);
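
Combined with fetch, the helper can be wrapped into a small client; the sketch below is one way to do it, and the logged star count is purely illustrative:

async function extractFrom(from, extract) {
  const url = buildExtractUrl("http://localhost:8000", from, extract);
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Request failed: ${res.status}`);
  return res.json();
}

// Usage
const data = await extractFrom("https://github.com/denoland/deno", {
  stars: "#repo-stars-counter-star",
});
console.log(data.stars); // e.g. "105k"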

Command Line (curl)

# Extract title from Wikipedia
curl "http://localhost:8000/?from=https://en.wikipedia.org/wiki/Deno_(software)&extract=%7B%22title%22%3A%22h1%22%7D"

# Get raw HTML
curl "http://localhost:8000/raw?from=https://example.com"

📚 Documentation

For complete documentation, see the specs/ directory.

🤝 Contributing

This project follows Specification-Driven Development (SDD). Please review the specs before contributing.

📄 License

Created by Sten Hougaard, March 2018. @netsi1964
