holtzermann17/planetmath-docs

PlanetMath metadata

Closed this issue · 11 comments

I am reindexing PlanetMath.org with NNexus today, trying to get all issues sorted out.
However, the metadata still seems shaky and I am not sure how to index it best. Here is a good example that summarizes most of my concerns:

http://planetmath.org/isoscelestriangle

We see the following metadata:

<div class="ltx_rdf" property="dct:identifier" content="IsoscelesTriangle"/>
<div class="ltx_rdf" property="dct:created" datatype="xsd:date" content="2013-03-21 12:22:02"/>
<div class="ltx_rdf" property="dct:modified" datatype="xsd:date" content="2013-03-21 12:22:02"/>
<div class="ltx_rdf" resource="pmuser:drini" property="pm:owner"/>
<div class="ltx_rdf" resource="pmuser:drini" property="pm:modifier"/>
<div class="ltx_rdf" property="dct:title" content="isosceles triangle"/>
<div class="ltx_rdf" property="dct:hasVersion" content="16"/>
<div class="ltx_rdf" property="pm:privacy" datatype="xsd:integer" content="1"/>
<div class="ltx_rdf" resource="pmuser:drini" property="dct:creator"/>
<div class="ltx_rdf" property="dct:type" content="Definition"/>
<div class="ltx_rdf" resource="msc:51-00" property="dct:subject"/>
<div class="ltx_rdf" resource="msc:49J20" property="dct:subject"/>
<div class="ltx_rdf" resource="msc:49J30" property="dct:subject"/>
<div class="ltx_rdf" resource="msc:49-01" property="dct:subject"/>
<div class="ltx_rdf" about="pmconcept:IsoscelesTriangle" property="pm:synonym" content="isosceles"/>
<div class="ltx_rdf" resource="pmarticle:Triangle" property="pm:related"/>
<div class="ltx_rdf" resource="pmarticle:RightTriangle" property="pm:related"/>
<div class="ltx_rdf" resource="pmarticle:EquilateralTriangle" property="pm:related"/>
<div class="ltx_rdf" resource="pmarticle:EquivalentConditionsForTriangles" property="pm:related"/>
<div class="ltx_rdf" resource="pmarticle:EquiangularTriangle" property="pm:related"/>
<div class="ltx_rdf" resource="pmarticle:RegularTriangle" property="pm:related"/>
<div class="ltx_rdf" property="pm:defines" content="pmconcept:base angle"/>
<div class="ltx_rdf" property="pm:defines" content="pmconcept:vertex angle"/>
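
A minimal sketch of how such a listing might be read into (property, value) pairs, assuming Python's standard `html.parser`. The names `RdfDivParser` and `extract_pairs` are hypothetical and are not part of NNexus or LaTeXML.

```python
from html.parser import HTMLParser


class RdfDivParser(HTMLParser):
    """Collect (property, value) pairs from <div class="ltx_rdf" .../> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <div .../> tags also end up here, because the default
        # handle_startendtag delegates to handle_starttag.
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag != "div" or "ltx_rdf" not in classes:
            return
        prop = attrs.get("property")
        # Literal values sit in `content`, links to other things in `resource`.
        value = attrs.get("content") or attrs.get("resource")
        if prop and value:
            self.pairs.append((prop, value))


def extract_pairs(html):
    """Return the metadata of one article page as a list of (property, value)."""
    parser = RdfDivParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


sample = """
<div class="ltx_rdf" property="dct:title" content="isosceles triangle"/>
<div class="ltx_rdf" resource="msc:51-00" property="dct:subject"/>
<div class="ltx_rdf" about="pmconcept:IsoscelesTriangle" property="pm:synonym" content="isosceles"/>
"""
pairs = extract_pairs(sample)
```

Divs that lack a `property`, or that have neither `content` nor `resource`, are silently dropped here, which is one simple way to stay robust against incomplete entries.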

The article has pm:title _isosceles triangle_ (emitted as dct:title in the markup above) and provides the synonym _isosceles_ for it. However, it also pm:defines two other concepts, namely _base angle_ and _vertex angle_.

My first thought was to index only articles that have pm:defines, but that would clearly omit this article, which also defines its pm:title. Then again, there are articles that really don't define anything, such as this one, which I don't want to index (as expected, they have no pm:defines). But maybe I have to live with some junk in the index...

Is this sane:

  • Index all PM encyclopedia articles, assuming their pm:title is a concept name
  • All synonyms go towards the pm:title concept
  • Additional pm:defines get indexed as separate concepts with no synonyms.
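
The three bullets above could be sketched roughly as follows. This is only an illustration under stated assumptions: the input is a list of (property, value) pairs taken from the metadata, the title may arrive as either `dct:title` or `pm:title`, and `Concept` and `concepts_from_metadata` are hypothetical names, not NNexus code.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    name: str
    synonyms: list = field(default_factory=list)
    categories: list = field(default_factory=list)


def strip_prefix(value):
    """Drop a leading namespace such as 'msc:' or 'pmconcept:' if present."""
    return value.split(":", 1)[-1]


def concepts_from_metadata(pairs):
    """Turn one article's (property, value) pairs into indexable concepts."""
    title = None
    synonyms, defines, categories = [], [], []
    for prop, value in pairs:
        if prop in ("dct:title", "pm:title"):
            title = value
        elif prop == "pm:synonym":
            synonyms.append(value)
        elif prop == "pm:defines":
            defines.append(strip_prefix(value))
        elif prop == "dct:subject":
            categories.append(strip_prefix(value))
    if title is None:
        return []
    # Bullets 1 and 2: the title is a concept and owns all synonyms.
    concepts = [Concept(title, synonyms, categories)]
    # Bullet 3: each pm:defines is a separate concept without synonyms.
    concepts += [Concept(name, [], list(categories)) for name in defines]
    return concepts


article = [
    ("dct:title", "isosceles triangle"),
    ("dct:subject", "msc:51-00"),
    ("pm:synonym", "isosceles"),
    ("pm:defines", "pmconcept:base angle"),
    ("pm:defines", "pmconcept:vertex angle"),
]
concepts = concepts_from_metadata(article)
```

For the triangle article this gives three concepts, with _isosceles_ attached only to the title concept.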

That would cover the triangle article, and I will have to live with the junk from the other article; it is in any case too specific to ever get linked against. We should have a metadata-curation initiative for the PM articles at some point.

Oh, there are many examples of articles that define concepts but don't use pm:defines, seemingly on the assumption that their pm:title will be indexed as a concept name. Here is one example.

@dginev - pm:title as defined term is indeed the legacy way of thinking about things. pm:defines is for extra definitions that aren't equivalent to the title, whereas pm:synonym is for extra terms that are equivalent to the title.

In short: Your plan in bullet points above does seem like the right thing to do.

I am working on that... another addition is that I will skip any definitions without an MSC class specified, as they cause more problems. Or should I instead link them to an arbitrary top-level class, e.g. 00-XX for general?

(There are a lot of broken metadata fields on the PM site right now, btw; I am currently making my indexing robust against them. I suspect I should then also reinforce LaTeXML so that it does not produce garbage metadata.)

Some synonyms use TeX's math syntax to try to specify partial or entire math expressions as synonyms. I have updated my indexer to ignore such entries for now; in the long run we should convert them to MathML via LaTeXML so that they can be indexed as MathML. But that's a late-summer task.

Rather than skipping definitions without MSC, how about creating a category called XX-XX and assigning them to that?

It needs to be a real category, otherwise it would definitely confuse the disambiguation mechanism... But I agree that it would be nice not to lose concepts because of missing classification.

Then again, the disambiguation can just have a custom rule that checks for XX-XX, which unambiguously marks "no category", so... OK, I accept your suggestion.
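
A rough sketch of that fallback, assuming categories are kept as plain MSC strings per concept. `NO_CATEGORY`, `categories_or_placeholder` and `is_unclassified` are hypothetical names chosen for illustration.

```python
NO_CATEGORY = "XX-XX"  # placeholder meaning "no MSC class was given"


def categories_or_placeholder(categories):
    """Return the cleaned MSC classes, or [XX-XX] if none are usable."""
    cleaned = [c.strip() for c in categories if c and c.strip()]
    return cleaned or [NO_CATEGORY]


def is_unclassified(categories):
    """Custom rule for disambiguation: True when only the placeholder is set."""
    return list(categories) == [NO_CATEGORY]


unclassified = categories_or_placeholder([])
classified = categories_or_placeholder(["51-00", " "])
```

A disambiguation step could call `is_unclassified` first and skip any category-based comparison for such concepts, so the placeholder never gets treated as a real MSC class.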

Btw, I am currently indexing PlanetMath.org, so let me know if it bogs down the server too much - if so I will space out my requests.
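
Spacing out requests can be as simple as enforcing a minimum interval between them. The sketch below is an assumption about how one might do it, not how the indexer actually works; `Throttle` is a hypothetical helper, and the clock and sleep functions are injectable only so the behaviour can be checked without waiting.

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds pass between successive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous request, if any

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
        self.last = self.clock()
```

Calling `wait()` before each page fetch keeps the crawl below one request per `min_interval` seconds; raising the interval is then a one-line change if the server starts to struggle.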

I noticed it was a little slower than usual but not SO bad. It reminded me to ask Constantin about some Javascript fixes (MathHubInfo/Legacy-planetary#356).

On 18.4.13 21:00, Deyan Ginev wrote:

(there is a lot of broken metadata fields on the PM site right now
btw, currently making my indexing robust to guard against them. I
suspect I should then reinforce LaTeXML as well to not produce garbage
metadata)

Is there any thought of running a bot over PM that fixes the metadata?
If I remember correctly, Wikipedia does something like this.
Correcting the source would have the advantage that the author (or a
maintainer) can fix things if the bot gets it wrong.

Michael


Prof. Dr. Michael Kohlhase, Jacobs University Bremen

Ok, I think the scheme I now support is reasonably sane. Things that need to be addressed in the future:

  • Special characters in titles and concept names - a fraction of the articles use a variety of non-alphanumeric characters, such as slashes, parentheses and semicolons. Right now I am ignoring such entries.
  • Math characters in titles and concept names - similarly, people are using TeX math markup ($, ^, _) as well as some arithmetic operations. Again, I am ignoring them for now; ideally we should have MathML there.
  • What to do with the missing category information (XX-XX)
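
The first two points amount to a filter over candidate names. A minimal sketch follows; the exact character sets are assumptions based on the examples in the list, and `classify_name` is a hypothetical helper. The hyphen is deliberately left out of the math set in this sketch, since it is common in ordinary names.

```python
# Characters taken as a sign of TeX math markup or arithmetic (assumed set).
MATH_CHARS = set("$^_\\+*=")
# Non-alphanumeric characters mentioned above: slash, parentheses, semicolon.
SPECIAL_CHARS = set("/();")


def classify_name(name):
    """Return 'math', 'special' or 'plain' for a title or concept name."""
    if any(ch in MATH_CHARS for ch in name):
        return "math"
    if any(ch in SPECIAL_CHARS for ch in name):
        return "special"
    return "plain"


def indexable(names):
    """Keep only the names the indexer would currently accept."""
    return [n for n in names if classify_name(n) == "plain"]


kept = indexable(["isosceles triangle", "$L^p$ space", "ring (algebra)"])
```

Keeping the classification separate from the filtering means the "math" entries can later be routed to a MathML conversion step instead of being dropped.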

But for now I am happy: I am almost done indexing PlanetMath and am closing the ticket.