
Snuffy

Please scribble on this page. Answer unanswered questions. Pose more. Go nuts.

Goal

Extract the textual content of a web page.

More specifically, extract the "main attraction" content: the content the user went there to read. Equivalently, extract the content the URL stably refers to, not the ads, navigation, or other noise that changes dynamically.

Our proximal use case is to provide input to a full-text indexer, likely client-side, to augment Awesome Bar results.

Prior art

  • Readability (used in Safari, FF Reader View). This is a very good start but has high standards for getting the answer "right", giving up altogether when it lacks confidence. For Awesome Bar purposes, it's more important to index something (even if it's every textual thing on the page) than nothing. Err on the side of extracting too much.
  • OmniWeb's full-text indexing of visited pages
  • Chrome's distraction-free-browsing mode

Indicators

Things we can look at to identify The Content (a rough scoring sketch follows the list):

  • A div tag with p tags in it (Readability)
  • HTML density
  • Regions that are visually largest on the page (unprecedented)
  • Regions with id≈"content"
  • Link density
  • Repeatedness of phrases (expensive) (unprecedented)
  • Stability over time (vs. changing ads etc.)
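
To make a couple of these indicators concrete, here is a minimal sketch (not Snuffy's actual algorithm) that scores each div by the amount of p-tag text it contains, penalized by its link density. It assumes BeautifulSoup is available; the function names and the simple multiplicative weighting are placeholders, not tuned choices.

```python
# Rough sketch: pick the div with the most p-tag text, discounted by link
# density. Combines two indicators from the list above; weights are guesses.
from bs4 import BeautifulSoup


def link_density(node):
    """Fraction of a node's text that lives inside <a> tags."""
    total = len(node.get_text())
    if total == 0:
        return 0.0
    linked = sum(len(a.get_text()) for a in node.find_all("a"))
    return linked / total


def score(div):
    """Reward p-tag text mass; punish nav-like regions full of links."""
    p_text = sum(len(p.get_text()) for p in div.find_all("p"))
    return p_text * (1.0 - link_density(div))


def best_content_div(html):
    """Return the highest-scoring div, or None if the page has no divs."""
    soup = BeautifulSoup(html, "html.parser")
    divs = soup.find_all("div")
    return max(divs, key=score) if divs else None
```

For indexing purposes we would probably not stop at the single best div (that is Readability's trap); erring toward too much, we might instead keep every div scoring above some low threshold.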

People to talk to

  • Olivier over in Content Services, Rebecca Weis, and Chuck Harmston worked on automatic page categorization.

Crazy ideas

  • Crowdsource-train the thing.
