MWS Harvests
Harvest is the format in which MathWebSearch accepts crawled data. It extends MathML with a few tags and attributes that carry meta-information about the crawled Content MathML data.
The introduced tags are harvest, expr and data, all defined in the mws namespace. The root element of an MWS harvest is mws:harvest. Its children are mws:expr and mws:data nodes.
mws:expr nodes contain the actual Content MathML and have the following attributes:
- url specifies the URL+UUID of the m:math from which the content was extracted
- data_id specifies the id of a mws:data node previously defined in this document. The respective data will be associated with this expression.
mws:data nodes contain arbitrary XML data and have the following attributes:
- data_id specifies a unique identifier within this XML document.
Note that mws:data was introduced in version 2 of the mws:harvest format, which remains backward compatible.
An example harvest document is provided here:
<?xml version="1.0"?>
<mws:harvest xmlns:mws="http://search.mathweb.org/ns" xmlns:m="http://www.w3.org/1998/Math/MathML">
<mws:data mws:data_id="foo">
<!-- in principle arbitrary XML data -->
</mws:data>
<mws:expr url="http://math.example.org/article123456#e123456" mws:data_id="foo">
<m:apply>
<m:eq/>
<m:apply>
<m:apply>
<m:csymbol cd="ambiguous">subscript</m:csymbol>
<m:limit/>
<m:apply>
<m:ci>→</m:ci>
<m:ci>x</m:ci>
<m:cn>0</m:cn>
</m:apply>
</m:apply>
<m:apply>
<m:divide/>
<m:cn>1</m:cn>
<m:apply>
<m:csymbol cd="ambiguous">superscript</m:csymbol>
<m:ci>x</m:ci>
<m:cn>2</m:cn>
</m:apply>
</m:apply>
</m:apply>
<m:infinity/>
</m:apply>
</mws:expr>
<!--More mws:data and mws:expr nodes-->
</mws:harvest>
This specifies that the content of the mws:expr was extracted from the m:math node with id e123456 found in the document http://math.example.org/article123456.
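As an illustration of how a consumer might read this structure, the following is a minimal Python sketch using only the standard library. It is not part of MathWebSearch: the names read_harvest and get_data_id and the shortened HARVEST sample are hypothetical. It assumes the part of the url after # is the id of the m:math element, and it accepts the data_id attribute both with and without the mws prefix, since the example above writes mws:data_id.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

MWS_NS = "http://search.mathweb.org/ns"

# Shortened, hypothetical harvest in the same shape as the example above.
HARVEST = """<?xml version="1.0"?>
<mws:harvest xmlns:mws="http://search.mathweb.org/ns" xmlns:m="http://www.w3.org/1998/Math/MathML">
  <mws:data mws:data_id="foo"><note>arbitrary XML</note></mws:data>
  <mws:expr url="http://math.example.org/article123456#e123456" mws:data_id="foo">
    <m:apply><m:eq/><m:ci>x</m:ci><m:cn>0</m:cn></m:apply>
  </mws:expr>
</mws:harvest>"""


def get_data_id(element):
    # Assumption: accept both the prefixed form used in the example
    # (mws:data_id) and the unprefixed form named in the prose (data_id).
    return element.get(f"{{{MWS_NS}}}data_id") or element.get("data_id")


def read_harvest(xml_text):
    """Return one dict per mws:expr, linked to its mws:data node if any."""
    root = ET.fromstring(xml_text)
    if root.tag != f"{{{MWS_NS}}}harvest":
        raise ValueError("root element is not mws:harvest")

    data_nodes = {}   # data_id -> mws:data element, filled in document order
    expressions = []
    for child in root:
        if child.tag == f"{{{MWS_NS}}}data":
            data_nodes[get_data_id(child)] = child
        elif child.tag == f"{{{MWS_NS}}}expr":
            # Split "document#fragment" into the document URL and the math id.
            document_url, math_id = urldefrag(child.get("url", ""))
            data_id = get_data_id(child)
            expressions.append({
                "document_url": document_url,
                "math_id": math_id,
                "data_id": data_id,
                # Only data nodes defined earlier in the document are visible.
                "data": data_nodes.get(data_id),
                "content": list(child),  # the Content MathML children
            })
    return expressions


if __name__ == "__main__":
    for expr in read_harvest(HARVEST):
        print(expr["document_url"], expr["math_id"], expr["data_id"])
```

Because the lookup table is filled while iterating in document order, an mws:expr that refers to a mws:data node defined later in the file gets no associated data, which mirrors the "previously defined" requirement stated above.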
The mws:data node can in principle contain any kind of XML data.
However, to be usable within an appliance, and in particular to reconstruct substitutions returned by MathWebSearch, it should correspond to a single document and contain the following data structure:
<mws:data mws:data_id="foo">
<id>foo</id><!-- the full url to the document represented, should be the same as id -->
<text>Hello world math1</text><!-- the text contained in the document with math substituted by ids -->
<!-- ids of all math elements that have been replaced above -->
<math local_id="1"><!-- raw text representing source of the math element --></math>
<math local_id="2"><!-- raw text representing source of the math element --></math>
</mws:data>
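To show how such a data node might be consumed, here is a minimal Python sketch that puts the raw math sources back into the text. It rests on an assumption inferred from the sample text "Hello world math1": that each placeholder is the word math followed by the local_id of the corresponding math element. The function name expand_text and the sample math sources are hypothetical and not part of MathWebSearch.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical data node in the shape described above; the math sources
# ("x^2", "y") are made-up sample values.
DATA = """<mws:data xmlns:mws="http://search.mathweb.org/ns" mws:data_id="foo">
  <id>foo</id>
  <text>Hello world math1</text>
  <math local_id="1">x^2</math>
  <math local_id="2">y</math>
</mws:data>"""


def expand_text(data_xml):
    """Replace mathN placeholders in <text> with the source of math element N."""
    node = ET.fromstring(data_xml)
    text = node.findtext("text", default="")
    # Map each local_id to the raw source text of its math element.
    sources = {m.get("local_id"): (m.text or "") for m in node.findall("math")}

    def replace(match):
        # Leave unknown placeholders untouched rather than dropping them.
        return sources.get(match.group(1), match.group(0))

    # Assumption: a placeholder is the token "math" directly followed by digits.
    return re.sub(r"\bmath(\d+)\b", replace, text)


if __name__ == "__main__":
    print(expand_text(DATA))
```

Math elements that are never referenced in the text, such as local_id 2 in this sample, are simply ignored.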