Tom Wiesing edited this page Jun 3, 2019 · 21 revisions

Harvest is the format in which MathWebSearch accepts crawled data. It is an extension of MathML that introduces a few tags and attributes to specify meta-information about the crawled Content MathML data.

The introduced tags are harvest, expr and data, all defined in the mws namespace. The root element of an MWSHarvest document is an mws:harvest element. Its children are mws:expr and mws:data nodes.

mws:expr nodes contain the actual Content MathML and have the following attributes:

  • url specifies the URL of the source document followed by # and the id (UUID) of the m:math element from which the content was extracted
  • data_id specifies the id of a mws:data node previously defined in this document. The respective data will be associated with this expression.

mws:data nodes contain arbitrary XML data and have the following attributes:

  • data_id specifies a unique identifier within this XML document.

Note that mws:data was introduced in version 2 of mws:harvest; the format remains backward compatible.

An example harvest document is provided here:

<?xml version="1.0"?>
<mws:harvest xmlns:mws="http://search.mathweb.org/ns" xmlns:m="http://www.w3.org/1998/Math/MathML">
    <mws:data mws:data_id="foo">
       <!-- in principle arbitrary XML data -->
    </mws:data>
    <mws:expr url="http://math.example.org/article123456#e123456" mws:data_id="foo">
        <m:apply>
            <m:eq/>
            <m:apply>
                <m:apply>
                    <m:csymbol cd="ambiguous">subscript</m:csymbol>
                    <m:limit/>
                    <m:apply>
                        <m:ci>&#x2192;</m:ci>
                        <m:ci>x</m:ci>
                        <m:cn>0</m:cn>
                    </m:apply>
                </m:apply>
                <m:apply>
                    <m:divide/>
                    <m:cn>1</m:cn>
                    <m:apply>
                        <m:csymbol cd="ambiguous">superscript</m:csymbol>
                        <m:ci>x</m:ci>
                        <m:cn>2</m:cn>
                    </m:apply>
                </m:apply>
            </m:apply>
            <m:infinity/>
        </m:apply>
    </mws:expr>
    <!--More mws:data and mws:expr nodes-->
</mws:harvest>

This specifies that the content of the mws:expr was extracted from the m:math node with id e123456 in the document http://math.example.org/article123456.
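As a rough illustration of how a consumer might read such a document, the following Python sketch parses a shortened harvest with the standard library, splits each url into document and element id, and checks whether the referenced mws:data node was defined earlier. The function name read_harvest, the record keys, and the shortened sample are assumptions made for this example; they are not part of MathWebSearch.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

MWS_NS = "http://search.mathweb.org/ns"

# Shortened sample harvest (assumption: a trimmed-down variant of the example above).
HARVEST = """<?xml version="1.0"?>
<mws:harvest xmlns:mws="http://search.mathweb.org/ns" xmlns:m="http://www.w3.org/1998/Math/MathML">
    <mws:data mws:data_id="foo"><id>foo</id></mws:data>
    <mws:expr url="http://math.example.org/article123456#e123456" mws:data_id="foo">
        <m:apply><m:eq/><m:ci>x</m:ci><m:cn>0</m:cn></m:apply>
    </mws:expr>
</mws:harvest>"""


def read_harvest(xml_text):
    """Return one record per mws:expr (hypothetical helper, not an MWS API)."""
    root = ET.fromstring(xml_text)
    data_tag = f"{{{MWS_NS}}}data"
    expr_tag = f"{{{MWS_NS}}}expr"
    # Prefixed attributes such as mws:data_id are keyed by their namespace URI.
    data_attr = f"{{{MWS_NS}}}data_id"

    seen_data_ids = set()
    records = []
    # Children are visited in document order, so a data node only counts
    # if it appears before the expression that refers to it.
    for child in root:
        if child.tag == data_tag:
            seen_data_ids.add(child.get(data_attr))
        elif child.tag == expr_tag:
            document, fragment = urldefrag(child.get("url", ""))
            data_id = child.get(data_attr)
            records.append({
                "document": document,   # URL of the source document
                "fragment": fragment,   # id of the m:math element
                "data_id": data_id,
                "has_data": data_id in seen_data_ids,
            })
    return records


if __name__ == "__main__":
    for record in read_harvest(HARVEST):
        print(record)
```

The sketch reads the url attribute without a namespace and data_id with the mws namespace because that is how the example document writes them.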

The mws:data node can in principle contain any kind of XML data. However, to be usable within an appliance, and in particular to reconstruct substitutions returned by MathWebSearch, it should correspond to a single document and contain the following data structure:

<mws:data mws:data_id="foo">
  <id>foo</id><!-- the full url to the document represented, should be the same as data_id -->
  <text>Hello world math1</text><!-- the text contained in the document with math substituted by ids -->

  <!-- ids of all math elements that have been replaced above -->
  <math local_id="1"><!-- raw text representing source of the math element --></math>
  <math local_id="2"><!-- raw text representing source of the math element --></math>
</mws:data>
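One possible way to produce this structure is sketched below in Python. It assumes that a placeholder in the text is the word math followed by the local_id, as the "Hello world math1" example suggests; the function name build_data_node and the segment format are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET

MWS_NS = "http://search.mathweb.org/ns"
# Serialize the namespace with the mws prefix instead of an auto-generated one.
ET.register_namespace("mws", MWS_NS)


def build_data_node(doc_id, segments):
    """Build an mws:data element (hypothetical helper, not an MWS API).

    segments is a list of ("text", str) or ("math", str) pairs in document
    order; the string of a math pair is the raw source of the formula.
    """
    data = ET.Element(f"{{{MWS_NS}}}data", {f"{{{MWS_NS}}}data_id": doc_id})
    ET.SubElement(data, "id").text = doc_id
    text_el = ET.SubElement(data, "text")

    words = []
    local_id = 0
    for kind, value in segments:
        if kind == "math":
            local_id += 1
            # Assumed placeholder convention: "math" + local_id.
            words.append(f"math{local_id}")
            math_el = ET.SubElement(data, "math", {"local_id": str(local_id)})
            math_el.text = value
        else:
            words.append(value)
    text_el.text = " ".join(words)
    return data


if __name__ == "__main__":
    node = build_data_node("foo", [("text", "Hello world"), ("math", "x^2")])
    print(ET.tostring(node, encoding="unicode"))
```

Creating the text element before the loop keeps the child order of the structure above: id, then text, then one math element per formula.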
