microformats/microformats2-parsing

u- parsing should always do relative URL resolution

Zegnat opened this issue · 16 comments

This question is separate from but affects #9.

Currently the parsing description for u- properties is as follows:

  • if a.u-x[href] or area.u-x[href], then get the href attribute
  • else if img.u-x[src] or audio.u-x[src] or video.u-x[src] or source.u-x[src], then get the src attribute
  • else if video.u-x[poster], then get the poster attribute
  • else if object.u-x[data], then get the data attribute
  • if there is a gotten value, return the normalized absolute URL of it, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element, if any).
  • else parse the element for the value-class-pattern. If a value is found, return it.
  • else if abbr.u-x[title], then return the title attribute
  • else if data.u-x[value] or input.u-x[value], then return the value attribute
  • else return the textContent of the element after removing all leading/trailing whitespace and nested <script> & <style> elements.

Note that URL normalisation is applied on the fifth point. Values gained from VCP, abbr, data, or input are never normalised. Is this really correct?
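To make the ordering concrete, here is a minimal Python sketch of the rule order as currently written. It is not taken from any real parser: the dict-based element model, the `parse_u_current` name, and the example URLs are all assumptions for illustration, the value-class-pattern step is left out, and `urljoin` stands in for "normalized absolute URL".

```python
from urllib.parse import urljoin

# Hypothetical element model: a dict with "tag", "attrs" and "text".
URL_ATTRS = [
    ({"a", "area"}, "href"),
    ({"img", "audio", "video", "source"}, "src"),
    ({"video"}, "poster"),
    ({"object"}, "data"),
]


def parse_u_current(el, base_url):
    """Sketch of the current rule order: only URL attributes get resolved."""
    for tags, attr in URL_ATTRS:
        if el["tag"] in tags and attr in el["attrs"]:
            # Point five: the "gotten value" is resolved and returned here.
            return urljoin(base_url, el["attrs"][attr])
    # The value-class-pattern step is omitted from this sketch.
    if el["tag"] == "abbr" and "title" in el["attrs"]:
        return el["attrs"]["title"]
    if el["tag"] in {"data", "input"} and "value" in el["attrs"]:
        return el["attrs"]["value"]
    return el.get("text", "").strip()


BASE = "https://example.com/notes/"  # assumed document URL
link = {"tag": "a", "attrs": {"href": "/about"}, "text": "About"}
data = {"tag": "data", "attrs": {"value": "/about"}, "text": "About"}

print(parse_u_current(link, BASE))  # https://example.com/about
print(parse_u_current(data, BASE))  # /about
```

The same relative value comes back absolute from the a-element and untouched from the data-element, which is the asymmetry in question.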

I ran into an issue here when implementing a partial feed. In this case I did not want the feed title to link to itself as that made no sense in relation to the surrounding HTML. Thus I opted for data instead of a:

<div class="h-feed" id="partial-feed">
  <h2 class="p-name"><data class="u-url" value="#partial-feed">Partial Feed</data></h2>
</div>

However, because data[value] is never normalised, I am forced to write an absolute URL in there. That will hurt portability of the code.

I also think it is bad for input-based values. My reasoning here is that a microformats editor should be able to use the same parsing algorithm on the editing view and on the saved output. But if someone writes #fragment in an input-element text field, the algorithm will output #fragment, and if this is converted to an a-element on save, the same algorithm will output https://example.com/#fragment.

I propose moving the 5th point (“if there is a gotten value, return the normalized absolute URL […]”) as far down the list as possible. Is there any reason why for specific elements this should not be done? I am not sure of abbr but can’t come up with any abbr.u-x use-cases either.

If people can come up with good reasons why outputs for u- properties should not always be normalised on VCP and abbr I still propose to move the data/input case to be above the normalisation step.

Use-case makes sense to me. And the change is relatively simple (move the relative URL resolution step after all the sources of retrieving the value).
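As a rough illustration of that reordering, the sketch below first collects the raw value from whichever source applies and only then resolves it. The names `get_u_value` and `parse_u_proposed`, the dict-based element model, and the base URL are assumptions made for this example, not any parser's actual API, and the value-class-pattern step is again omitted.

```python
from urllib.parse import urljoin

# Sources of a u-* value in rule order (value-class-pattern omitted).
SOURCES = (
    (("a", "area"), "href"),
    (("img", "audio", "video", "source"), "src"),
    (("video",), "poster"),
    (("object",), "data"),
    (("abbr",), "title"),
    (("data", "input"), "value"),
)


def get_u_value(el):
    """Return the raw, unresolved value from the first matching source."""
    for tags, attr in SOURCES:
        if el["tag"] in tags and attr in el["attrs"]:
            return el["attrs"][attr]
    return el.get("text", "").strip()


def parse_u_proposed(el, base_url):
    """Resolve whatever value was found, regardless of where it came from."""
    return urljoin(base_url, get_u_value(el))


BASE = "https://example.com/feeds"  # assumed document URL
data = {"tag": "data", "attrs": {"value": "#partial-feed"}, "text": "Partial Feed"}
print(parse_u_proposed(data, BASE))  # https://example.com/feeds#partial-feed
```

Keeping value retrieval and resolution as two separate steps means every source, including the textContent fallback, gets the same treatment.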

From a compat perspective it shouldn't break any existing working content, because such relative URLs outside of URL attributes don't work today anyway. The only "odd" side-effect that is possible is that some existing broken u-url property values may start suddenly "working".

In addition, if someone wants a non-relative-resolved "url" value from something like a data or input element, they can just use p-url instead of u-url, and that way still get the old behavior (no idea why you would want that, but just in case we're missing something).

I'm in favor of changing the u- parsing rule to always resolve URLs.

Another example of when you might want to use a <data> element instead of an <a> is to create a hidden link but not have the link be visible to screen readers or other consumers that are doing something with the HTML <a> semantic.

Supporting relative URL resolution on any element whose value came from a u- class seems consistent. It basically means the u- prefix tells the parser the value is a URL, whether that value comes from an <a href="" class="u-url"> or <data value="" class="u-url">, and should be resolved accordingly.

We now have a pull request, jekyll/minima#160, that depends on this newer behavior, so let's get at least one parser implementing this (I'll add it to the spec as provisional in the meantime). Once we have either approvals or no objections from other implementers, we can move forward quickly and make it official in the spec.

Since this greatly expands when relative URL resolution is done, this issue's resolution should depend on resolving #9 first.

If I’m reading both correctly, this section on the “microformats2-parsing-faq” page on the wiki deals with this same topic.

@bdesham, yes, and that FAQ item will need updating if the proposed change from this issue is accepted.

The argument made there is that URLs being “displayed and used as is” by a browser should not be normalised, so microformats parsers will match browser output. This issue argues that doing that is not what is expected from microformats parsers.

Upon reconsideration, I retract my suggestion in #10 (comment) that "this issue's resolution should depend on resolving #9 first", and commented on how to orthogonally resolve issue #9 (http://tantek.com/2018/107/t1).

As promised in #10 (comment), I’ve added PROPOSED text inline in the u-* parsing section per the proposal of this issue: http://microformats.org/wiki/index.php?title=microformats2-parsing&diff=66782&oldid=66724.

I see github.com/aaronpk’s agreement with this proposal, and would like to see at least one, preferably 2-3, more parser developer(s) explicitly agreeing as well.

We also need to see this proposed change prototyped in at least one parser to make sure it is implementable (seems like it) and to see if there are any unintended consequences.

(Originally published at: http://tantek.com/2018/107/t2/)

Additionally there is a compelling use-case for this proposal:

Permalink pages which do not link to themselves or otherwise display their own URL.

This proposal would enable the relatively (so to speak) minimal markup:

<data class="u-url" value=""></data>

To provide the u-url for the h-entry of such permalink pages, instead of having to provide an absolute URL in the value attribute.
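A quick check of why the empty value works, using Python's `urljoin` as a stand-in for a parser's resolution step. The page URL below is made up for illustration, and the sketch assumes the page has no <base> element changing the base URL.

```python
from urllib.parse import urljoin

# Assumed permalink URL of the page containing <data class="u-url" value="">.
page_url = "https://example.com/2018/04/a-permalink"

# An empty relative reference resolves to the base URL itself.
resolved = urljoin(page_url, "")
print(resolved)  # https://example.com/2018/04/a-permalink
```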

(Originally published at: http://tantek.com/2018/107/t3/)

I am definitely 👍 on this. Will free up some time to get a working implementation in the PHP parser.

I'm fully supportive of this. I've made the change in the go library (in a separate relurl branch for now) to see what tests will break, and the only one that does is microformats-v1/hcard/email. I'll prep a PR for the tests repo to fix this once this spec change goes in.

% go test .
--- FAIL: TestSuite (0.03s)
    --- FAIL: TestSuite/microformats-v1 (0.01s)
        --- FAIL: TestSuite/microformats-v1/hcard/email (0.00s)
                testsuite_test.go:130: Parse value differs:
                         {
                          items: [
                           {
                            properties: {
                             email: [
                              "mailto:john@example.com",
                        -     "john@example.com",
                        +     "http://example.com/john@example.com",
                              "mailto:john@example.com?subject=parser-test",
                        -     "john@example.com",
                        +     "http://example.com/john@example.com",
                             ],
                             name: [
                              "John Doe",
                             ],
                            },
                            type: [
                             "h-card",
                            ],
                           },
                          ],
                          rel-urls: {
                          },
                          rels: {
                          },
                         }
FAIL
FAIL    willnorris.com/go/microformats  0.036s

The fact that only one test broke also suggests that we should add a few additional test cases to cover this change.
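The diff above is what generic relative resolution would be expected to produce: a bare address has no scheme, so it is treated as a relative path. A small Python illustration, assuming the test document's base URL is http://example.com/ (inferred from the diff, not stated in it):

```python
from urllib.parse import urljoin

base = "http://example.com/"  # assumed base URL of the test document

results = {raw: urljoin(base, raw)
           for raw in ("mailto:john@example.com", "john@example.com")}

for raw, resolved in results.items():
    print(f"{raw} -> {resolved}")
# mailto:john@example.com -> mailto:john@example.com
# john@example.com -> http://example.com/john@example.com
```

Values that already carry a scheme, such as mailto:, pass through unchanged, so only the scheme-less addresses in the test are affected.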

This proposal would enable the relatively (so to speak) minimal markup:

<data class="u-url" value=""></data>

Even simpler, you could just have <data class="u-url">. Without a value attribute, it will go to text content parsing, which will still result in an empty string, which will be resolved the same.

This has two implementations now and as far as I can see no objections, and thus should be ready to be integrated into the spec.

PR available for mf2py: microformats/mf2py#139

Something else that was brought up: empty <a> elements will throw errors in accessibility reporting tools. Yet several sites use them for hidden permalinks today. That is something we can get rid of once <data> can be used!

With two parsers updated and the mf2py PR pending, I feel like it should be made permanent in the spec. If there are no further objections, I'll update the wiki, at the latest during IWC this coming weekend.

Resolution: proposal accepted.

No objections in above discussion, and positive opinions (👍) from several implementors on the proposal.

Proposal implementations in the mf2py and microformats Go parsers are sufficient to demonstrate implementability and interoperability (with updated test cases), all as noted/linked in the issue thread.

Editing specification accordingly.

(Originally published at: http://tantek.com/2018/358/t4/)