Tom Wiesing edited this page Jul 15, 2019 · 21 revisions

Harvest is the format in which MathWebSearch accepts crawled data. It is an extension of MathML that introduces a few tags and attributes to specify meta-information about the crawled Content MathML data.

The introduced tags are harvest, expr and data, all defined in the mws namespace. The root element of an MWS harvest is mws:harvest. Its children are mws:expr and mws:data nodes.

mws:expr nodes contain the actual Content MathML and have the following attributes:

  • url specifies the URL+UUID of the m:math from which the content was extracted
  • data_id specifies the id of a mws:data node previously defined in this document. The respective data will be associated with this expression.

mws:data nodes contain arbitrary XML data and have the following attributes:

  • data_id specifies a unique identifier within this XML document.

Note that mws:data has been introduced in version 2 of the mws:harvest, but it is backward compatible.
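
The rule that a data_id on an mws:expr must point to an mws:data node defined earlier in the same document can be checked mechanically. The following is a minimal sketch, not part of MathWebSearch itself: the function name undefined_data_refs is hypothetical, and accepting both the prefixed (mws:data_id) and the plain (data_id) attribute spelling is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

MWS_NS = "http://search.mathweb.org/ns"


def undefined_data_refs(harvest_xml):
    """Return data_id values used by an mws:expr before a matching mws:data."""
    root = ET.fromstring(harvest_xml)
    seen = set()      # data_ids defined so far, in document order
    missing = []      # data_ids referenced before being defined
    for child in root:
        # Assumption: accept both the prefixed and the plain attribute name.
        data_id = child.get("{%s}data_id" % MWS_NS) or child.get("data_id")
        if child.tag == "{%s}data" % MWS_NS:
            seen.add(data_id)
        elif child.tag == "{%s}expr" % MWS_NS:
            if data_id is not None and data_id not in seen:
                missing.append(data_id)
    return missing
```

An empty result means every mws:expr refers to data defined above it; an mws:expr without any data_id is not reported, since version 1 harvests have none.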

An example harvest document is provided here:

<?xml version="1.0"?>
<mws:harvest xmlns:mws="http://search.mathweb.org/ns" xmlns:m="http://www.w3.org/1998/Math/MathML">
    <mws:data mws:data_id="foo">
       <!-- in principle arbitrary XML data -->
    </mws:data>
    <mws:expr url="http://math.example.org/article123456#e123456" mws:data_id="foo">
        <m:apply>
            <m:eq/>
            <m:apply>
                <m:apply>
                    <m:csymbol cd="ambiguous">subscript</m:csymbol>
                    <m:limit/>
                    <m:apply>
                        <m:ci>&#x2192;</m:ci>
                        <m:ci>x</m:ci>
                        <m:cn>0</m:cn>
                    </m:apply>
                </m:apply>
                <m:apply>
                    <m:divide/>
                    <m:cn>1</m:cn>
                    <m:apply>
                        <m:csymbol cd="ambiguous">superscript</m:csymbol>
                        <m:ci>x</m:ci>
                        <m:cn>2</m:cn>
                    </m:apply>
                </m:apply>
            </m:apply>
            <m:infinity/>
        </m:apply>
    </mws:expr>
    <!--More mws:data and mws:expr nodes-->
</mws:harvest>

This specifies that the data contained in the mws:expr was extracted from an m:math node with id e123456, found in the document http://math.example.org/article123456.
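
As an illustration of how the url attribute decomposes, the sketch below splits it into the document URL and the id of the m:math node using Python's standard library. It assumes the id is carried in the URL fragment, as in the example; the helper name split_expr_url is hypothetical.

```python
from urllib.parse import urldefrag


def split_expr_url(url):
    """Split an mws:expr url into (document URL, id of the m:math node)."""
    document_url, math_id = urldefrag(url)
    return document_url, math_id


print(split_expr_url("http://math.example.org/article123456#e123456"))
# prints ('http://math.example.org/article123456', 'e123456')
```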

The mws:data node can in principle contain any kind of XML data. However, to be usable within an appliance, and in particular to reconstruct substitutions returned by MathWebSearch, it should correspond to a single document and have the following structure:

<mws:data mws:data_id="foo">
  <id>foo</id><!-- the full url to the document represented, should be the same as id -->
  <text>Hello world math1</text><!-- the text contained in the document with math substituted by ids -->

  <!-- ids of all math elements that have been replaced above -->
  <math local_id="1"><!-- raw text representing source of the math element --></math>
  <math local_id="2"><!-- raw text representing source of the math element --></math>
</mws:data>
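
To show how this structure can be consumed, the sketch below puts the stored math sources back into the text by replacing each "math" + local_id placeholder. It is an assumption-laden illustration rather than what an appliance actually does: the function name expand_placeholders and the sample source x^2 are hypothetical, and the mws prefix is declared on the element so that the snippet parses on its own.

```python
import re
import xml.etree.ElementTree as ET

DATA = """<mws:data xmlns:mws="http://search.mathweb.org/ns" mws:data_id="foo">
  <id>foo</id>
  <text>Hello world math1</text>
  <math local_id="1">x^2</math>
</mws:data>"""


def expand_placeholders(data_xml):
    """Replace each 'math<local_id>' placeholder in <text> by its math source."""
    node = ET.fromstring(data_xml)
    sources = {m.get("local_id"): (m.text or "") for m in node.findall("math")}
    text = node.findtext("text", default="")
    if not sources:
        return text
    # Try longer ids first so that 'math10' is not mistaken for 'math1' + '0'.
    ids = sorted(sources, key=len, reverse=True)
    pattern = re.compile("math(" + "|".join(re.escape(i) for i in ids) + ")")
    return pattern.sub(lambda m: sources[m.group(1)], text)


print(expand_placeholders(DATA))
# prints Hello world x^2
```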

Dummy Implementation

Below is some pseudocode that generates a harvest file for a set of MathML formulae contained in a document at a specific URL. This implementation requires a document with some MathML (containing both Content and Presentation) and uses the following variables:

  • uuid: Some (globally) unique identifier for this harvest. Usually a constantly increasing integer.
  • id: Id of the document in question, usually its URI.
  • text: Text contained within the document, with formulae substituted with the string "math" + id. Only used for TemaSearch, may be empty otherwise.
  • formulae: Set of formulae within the document. Each has the following properties:
    • id: Id of the formula, unique within the document.
    • math: HTML-escaped Presentation + Content MathML representation of the formula. Usually starts with something along the lines of <math>. In addition to custom XML namespaces, it may use the "mws" and "m" namespaces. Presentation and Content should be linked with "xref" attributes.
    • cmml: Content MathML node of the math element above. This will be indexed by MathWebSearch. It should not be HTML-escaped.

<?xml version="1.0"?>
<mws:harvest xmlns:mws="http://search.mathweb.org/ns" xmlns:m="http://www.w3.org/1998/Math/MathML">
  <mws:data mws:data_id="{uuid}">
    <id>{id}</id>
    <text>{text}</text>
    <metadata></metadata>
    {% for f in formulae %}
    <math local_id="{f.id}">{f.math}</math>
    {% endfor %}
  </mws:data>
  {% for f in formulae %}
  <mws:expr url="{f.id}" mws:data_id="{uuid}">{f.cmml}</mws:expr>
  {% endfor %}
</mws:harvest>
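
The template above can also be sketched in runnable form. The version below uses Python's xml.etree.ElementTree and is only one possible rendering, not the reference implementation: the function name build_harvest and the dictionary keys are hypothetical, it writes the mws:data_id attribute as in the first example on this page, and it assumes that each cmml string declares the m namespace itself so that it can be parsed on its own.

```python
import xml.etree.ElementTree as ET

MWS = "http://search.mathweb.org/ns"
M = "http://www.w3.org/1998/Math/MathML"
# Make the serializer emit the conventional prefixes instead of ns0, ns1.
ET.register_namespace("mws", MWS)
ET.register_namespace("m", M)


def build_harvest(uuid, doc_id, text, formulae):
    """Build a harvest document; formulae is a list of dicts with id, math, cmml."""
    data_id = {"{%s}data_id" % MWS: str(uuid)}
    root = ET.Element("{%s}harvest" % MWS)
    data = ET.SubElement(root, "{%s}data" % MWS, data_id)
    ET.SubElement(data, "id").text = doc_id
    ET.SubElement(data, "text").text = text
    ET.SubElement(data, "metadata")
    for f in formulae:
        math = ET.SubElement(data, "math", {"local_id": f["id"]})
        # Pass the raw markup here: the serializer escapes text content,
        # which yields the HTML-escaped form the template expects.
        math.text = f["math"]
    for f in formulae:
        attrs = {"url": f["id"]}
        attrs.update(data_id)
        expr = ET.SubElement(root, "{%s}expr" % MWS, attrs)
        # cmml is inserted as real XML nodes, not as escaped text.
        expr.append(ET.fromstring(f["cmml"]))
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode")


formula = {
    "id": "1",
    "math": "<math><mi>x</mi></math>",
    "cmml": '<m:ci xmlns:m="http://www.w3.org/1998/Math/MathML">x</m:ci>',
}
print(build_harvest(1, "http://math.example.org/article123456", "Hello math1", [formula]))
```

Building the tree with an XML library rather than string substitution means escaping of text, id and math is handled by the serializer, while cmml stays as real markup.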
