mgladkova edited this page Nov 14, 2015 · 21 revisions

Harvest is the format in which MathWebSearch accepts crawled data. It is an extension of MathML that introduces a few tags and attributes to specify meta-information about the crawled Content MathML data.

The introduced tags are harvest, expr and data, all defined in the mws namespace. The root element of an MWSHarvest document is an mws:harvest element. Its children are mws:expr and mws:data nodes.

mws:expr nodes contain the actual Content MathML and have the following attributes:

  • url specifies the URL+UUID of the m:math element from which the content was extracted.
  • data_id specifies the id of a mws:data node previously defined in this document. The respective data will be associated with this expression.

mws:data nodes contain arbitrary XML data and have the following attributes:

  • data_id specifies a unique identifier within this XML document.

Note that mws:data was introduced in version 2 of the mws:harvest format; the change is backward compatible.
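The structure described above can be sketched in code. The following is a minimal illustration, assuming Python's standard xml.etree.ElementTree module; the build_harvest helper and its (data_id, url, content) input tuples are hypothetical and not part of MathWebSearch. Following the example harvest on this page, data_id is written in the mws namespace while url is left unqualified.

```python
import xml.etree.ElementTree as ET

MWS_NS = "http://search.mathweb.org/ns"
M_NS = "http://www.w3.org/1998/Math/MathML"

# Ask ElementTree to serialize these namespaces with the usual prefixes.
ET.register_namespace("mws", MWS_NS)
ET.register_namespace("m", M_NS)


def build_harvest(entries):
    """Build an mws:harvest element (hypothetical helper).

    entries is a list of (data_id, url, content) tuples, where content
    is an Element holding the Content MathML of one expression.
    """
    root = ET.Element(f"{{{MWS_NS}}}harvest")
    for data_id, url, content in entries:
        # The mws:data node comes first, so that the mws:expr below
        # refers to a previously defined id.
        ET.SubElement(root, f"{{{MWS_NS}}}data",
                      {f"{{{MWS_NS}}}data_id": data_id})
        expr = ET.SubElement(root, f"{{{MWS_NS}}}expr",
                             {"url": url, f"{{{MWS_NS}}}data_id": data_id})
        expr.append(content)
    return root
```

Calling ET.tostring(root, encoding="unicode") on the result yields the serialized harvest. The mws:data nodes are left empty here; a crawler would fill them with whatever XML metadata it wants associated with the expression.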

An example harvest document is provided here:

<?xml version="1.0"?>
<mws:harvest xmlns:mws="http://search.mathweb.org/ns"
             xmlns:m="http://www.w3.org/1998/Math/MathML">
  <mws:data mws:data_id="foo">
    <!-- arbitrary XML data -->
  </mws:data>
  <mws:expr url="http://math.example.org/article123456#e123456" mws:data_id="foo">
    <m:apply>
      <m:eq/>
      <m:apply>
        <m:apply>
          <m:csymbol cd="ambiguous">subscript</m:csymbol>
          <m:limit/>
          <m:apply>
            <m:ci>&#x2192;</m:ci>
            <m:ci>x</m:ci>
            <m:cn>0</m:cn>
          </m:apply>
        </m:apply>
        <m:apply>
          <m:divide/>
          <m:cn>1</m:cn>
          <m:apply>
            <m:csymbol cd="ambiguous">superscript</m:csymbol>
            <m:ci>x</m:ci>
            <m:cn>2</m:cn>
          </m:apply>
        </m:apply>
      </m:apply>
      <m:infinity/>
    </m:apply>
  </mws:expr>
  <!-- More mws:data and mws:expr nodes -->
</mws:harvest>
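To illustrate how a consumer might read a document like the example above, here is a hedged sketch that walks the children of mws:harvest in order and pairs each mws:expr with the mws:data node it references. It assumes Python's standard xml.etree.ElementTree module; resolve_expressions is a hypothetical name, and this is not how MathWebSearch itself processes harvests. Treating a missing data_id on mws:expr as allowed is also an assumption, based on mws:data being a later, backward-compatible addition.

```python
import xml.etree.ElementTree as ET

MWS_NS = "http://search.mathweb.org/ns"
DATA_ID = f"{{{MWS_NS}}}data_id"  # namespaced, as in the example document


def resolve_expressions(harvest_xml):
    """Return a list of (url, data_element_or_None), one per mws:expr.

    Raises ValueError if the root is not mws:harvest, if a data_id is
    defined twice, or if an mws:expr refers to a data_id that has not
    been defined earlier in the document.
    """
    root = ET.fromstring(harvest_xml)
    if root.tag != f"{{{MWS_NS}}}harvest":
        raise ValueError("root element is not mws:harvest")

    data_nodes = {}  # data_id -> mws:data element seen so far
    resolved = []
    for child in root:
        if child.tag == f"{{{MWS_NS}}}data":
            data_id = child.get(DATA_ID)
            if data_id in data_nodes:
                raise ValueError(f"duplicate data_id: {data_id!r}")
            data_nodes[data_id] = child
        elif child.tag == f"{{{MWS_NS}}}expr":
            data_id = child.get(DATA_ID)
            # Assumption: an mws:expr without data_id is acceptable.
            if data_id is not None and data_id not in data_nodes:
                raise ValueError(f"undefined data_id: {data_id!r}")
            resolved.append((child.get("url"), data_nodes.get(data_id)))
    return resolved
```

Processing the children in document order is what enforces the "previously defined" requirement on data_id: a reference to an mws:data node that only appears later in the file is rejected.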
