This repository was archived by the owner on Nov 18, 2025. It is now read-only.
Peter Elmers edited this page Mar 27, 2016 · 16 revisions

Please scribble on this page. Answer unanswered questions. Pose more. Go nuts.

Initial Goal

Extract the textual content of a web page.

More specifically, extract the "main attraction" content: the content the user went there to read. Equivalently, extract the content the URL stably refers to, not the ads, navigation, or other noise that changes dynamically.

Our proximal use case is to provide input to a full-text indexer, likely client-side, to augment Awesome Bar results.

Other Possible Uses

  • Provide lighter page downloads for people with lower bandwidth or battery.
  • Meet accessibility needs in clever new ways.
  • Feed into (or be) a categorizer of web pages so we could, for example, cluster ActivityStream entries.

Prior art

Indicators

Things we can look at to identify The Content or other metadata:

  • A div tag with p tags in it (Readability)
  • HTML density
  • Regions that are visually largest on the page (unprecedented)
  • Regions with id≈"content"
  • Link density
  • Repeatedness of phrases (expensive) (unprecedented)
  • Stability over time (vs. changing ads etc.)
  • oEmbed embeds
  • Microformats (which Firefox has a full parser for already)
  • Open Graph data
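
To make one of these indicators concrete, here is a minimal sketch of link density: the share of a fragment's visible text that sits inside `<a>` tags. Navigation blocks tend to score near 1.0 and article paragraphs near 0. The class name `LinkDensityParser` and the function `link_density` are made up for illustration, and the sketch uses only Python's standard `html.parser`. It is not how Readability or any particular tool computes it.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Count visible characters overall and those inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.total_chars = 0
        self.link_chars = 0
        self._anchor_depth = 0  # > 0 while we are inside at least one <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1

    def handle_data(self, data):
        # Ignore leading/trailing whitespace so indentation does not count as text.
        n = len(data.strip())
        self.total_chars += n
        if self._anchor_depth > 0:
            self.link_chars += n


def link_density(html_fragment: str) -> float:
    """Return link text length divided by total text length (0.0 if no text)."""
    parser = LinkDensityParser()
    parser.feed(html_fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars


nav = '<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>'
article = '<p>The quick brown fox jumps over the lazy dog. <a href="/x">More</a></p>'
print(link_density(nav), link_density(article))
```

A real extractor would presumably compute this per candidate region and combine it with the other indicators rather than rely on it alone. Any cutoff for "too linky" would be a tuning choice, not something this page specifies.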

Extract

  • Content
  • Whether a page appears to be one thing (an article) or a list of things (an index, etc.)
  • Dominant colors
  • Icons (beyond favicon?)
  • Page category (recipe, comic, article, photo, video). What if Firefox had a recipe box, for instance? You could search for recipes which used certain ingredients (that you had)—or didn't (vegetarian). Really opens up the potential for semantic querying.
  • Any Next and Previous buttons (so we can standardize navigation)
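
As a rough illustration of the Next/Previous item, the sketch below looks for links that declare `rel="next"` or `rel="prev"` (standard HTML link types) and falls back to matching the visible link text. The `TEXT_HINTS` table, the class `NavLinkParser`, and the function `find_nav_links` are hypothetical names chosen for this example. A real page would need more hints, such as localized labels or arrow glyphs.

```python
from html.parser import HTMLParser

# Assumed fallback labels; purely illustrative and English-only.
TEXT_HINTS = {
    "next": "next",
    "next page": "next",
    "prev": "prev",
    "previous": "prev",
    "previous page": "prev",
}


class NavLinkParser(HTMLParser):
    """Collect the first Next and Previous link targets found in a page."""

    def __init__(self):
        super().__init__()
        self.found = {}          # maps "next" / "prev" to an href
        self._open_href = None   # href of the <a> currently being read
        self._open_text = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel may hold several space-separated tokens, or be absent.
        rels = (attr.get("rel") or "").lower().split()
        for rel, key in (("next", "next"), ("prev", "prev"), ("previous", "prev")):
            if rel in rels:
                self.found.setdefault(key, href)
        if tag == "a":
            self._open_href = href
            self._open_text = []

    def handle_data(self, data):
        if self._open_href is not None:
            self._open_text.append(data)

    def handle_endtag(self, tag):
        if tag != "a" or self._open_href is None:
            return
        label = "".join(self._open_text).strip().lower()
        key = TEXT_HINTS.get(label)
        if key:
            self.found.setdefault(key, self._open_href)
        self._open_href = None


def find_nav_links(html: str) -> dict:
    parser = NavLinkParser()
    parser.feed(html)
    parser.close()
    return parser.found


page = '<link rel="next" href="/page/3"><a href="/page/1">Previous</a><a href="/about">About</a>'
print(find_nav_links(page))
```

Using `setdefault` means the first match in document order wins, whether it came from `rel` or from the link text. That is a simplification: a fuller version might always prefer explicit `rel` values over the text heuristic.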

People to talk to

  • Olivier over in Content Services, Rebecca Weis, and Chuck Harmston worked on automatic page categorization.

Crazy ideas

  • Crowdsource-train the thing.
