matrix-org/matrix-viewer

Use URL hash fragment anchor for message permalink, add `id` attribute of message to jump on it

bkil opened this issue · 7 comments

bkil commented

Include the Matrix event ID in the URI hash, ex:

https://archive.matrix.org/r/securemessagingapps:matrix.org/date/2023/05/30#$5cQZRtG9bsleXZI2x-s6wEDfeZ5B1nC_jEvOwpA-VdI

To make this work, we would also need to set the id attribute of each timeline message to the respective value (instead of the current data-event-id) so the browser will jump to it upon loading. You can use the :target CSS selector to highlight the matching message on the timeline with a different background and add a mark on the side as well.
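A minimal sketch of the proposal above, assuming hypothetical helper names (`permalinkForEvent`, `renderMessage`, `TARGET_HIGHLIGHT_CSS`) and a made-up `message` class and colors; none of these come from the matrix-viewer code base. It shows the event ID going into the fragment, the same value being used as the element `id`, and a `:target` rule for the highlight.

```typescript
// Hypothetical helpers for illustration only; not matrix-viewer's actual API.

/** Build a permalink whose fragment is the Matrix event ID. */
function permalinkForEvent(dayUrl: string, eventId: string): string {
  const url = new URL(dayUrl);
  // The URL API leaves `$`, `-` and `_` unescaped in the fragment.
  url.hash = eventId;
  return url.toString();
}

/** Escape text for use in an HTML attribute value or element body. */
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

/** Render one timeline message with the event ID as its `id` attribute. */
function renderMessage(eventId: string, body: string): string {
  const id = escapeHtml(eventId);
  return `<div class="message" id="${id}" data-event-id="${id}">${escapeHtml(body)}</div>`;
}

// Assumed styling: the browser applies this to whichever message matches the fragment.
const TARGET_HIGHLIGHT_CSS = `
.message:target {
  background: #fff8c5;
  border-left: 4px solid #d4a72c;
}`;

const examplePermalink = permalinkForEvent(
  "https://archive.matrix.org/r/securemessagingapps:matrix.org/date/2023/05/30",
  "$5cQZRtG9bsleXZI2x-s6wEDfeZ5B1nC_jEvOwpA-VdI",
);
console.log(examplePermalink);
console.log(renderMessage("$5cQZRtG9bsleXZI2x-s6wEDfeZ5B1nC_jEvOwpA-VdI", "hello"));
```

With this shape, scrolling and highlighting need no JavaScript in the browser: the fragment match does both.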

If the backend for some reason would also need to access the event ID (without JavaScript) to return messages for the given date, consider adding it to both the query and the hash.
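As a rough illustration of carrying the event ID in both places, the sketch below assumes a hypothetical helper `permalinkWithQueryAndHash` and uses the `at` query parameter name that appears later in this thread. The query part reaches the server; the fragment stays in the browser.

```typescript
// Hypothetical helper for illustration only.
function permalinkWithQueryAndHash(dayUrl: string, eventId: string): string {
  const url = new URL(dayUrl);
  // Sent to the server. URLSearchParams percent-encodes `$` as `%24`.
  url.searchParams.set("at", eventId);
  // Never sent to the server; the browser uses it to scroll to the element with this id.
  url.hash = eventId;
  return url.toString();
}

const bothPlaces = new URL(
  permalinkWithQueryAndHash(
    "https://archive.matrix.org/r/securemessagingapps:matrix.org/date/2023/05/30",
    "$5cQZRtG9bsleXZI2x-s6wEDfeZ5B1nC_jEvOwpA-VdI",
  ),
);
console.log(bothPlaces.searchParams.get("at")); // what the backend would read
console.log(bothPlaces.hash); // what the browser would jump to
```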

Former versions of HTML restricted the syntax of the ID, but since HTML5 it only has to be non-empty and must not contain whitespace, so it can contain basically anything else.
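The HTML5 rule can be written as a small check. This is a sketch with a hypothetical function name (`isValidHtmlId`); it tests only the two conditions mentioned here (non-empty, no ASCII whitespace) and not uniqueness within the page.

```typescript
// Hypothetical helper: checks the HTML5 constraints on an `id` value.
// ASCII whitespace in HTML is tab, line feed, form feed, carriage return and space.
function isValidHtmlId(id: string): boolean {
  return id.length > 0 && !/[\t\n\f\r ]/.test(id);
}

console.log(isValidHtmlId("$5cQZRtG9bsleXZI2x-s6wEDfeZ5B1nC_jEvOwpA-VdI"));
console.log(isValidHtmlId(""));
console.log(isValidHtmlId("two words"));
```

One caveat worth keeping in mind: a value starting with `$` is fine as an `id` and as a fragment, but it would need escaping (for example with `CSS.escape`) before being used in a CSS `#id` selector. The `:target` approach avoids that.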

@bkil What benefit are you trying to achieve with this? I assume you're after the permalink event scrolling into view even when JavaScript is disabled?

Please note, we're not specifically optimizing for the disabled JavaScript case but simpler and semantic is better in terms of search engines which we do care about. I don't think search engines care about scroll though 🤔

We do need to set the ?at=$abc query parameter so the server backend can set the continuation position as you're paginating backward and forward. The server also has to take the query parameter into account so that the server-side rendered HTML includes the selected event's metadata (URL previews), semantic attributes, styles, etc.

Duplicating the event ID in the hash and ?at=$abc query parameter seems like more hassle and noise than it's worth for the disabled JavaScript scroll benefit.

bkil commented

The way it is generated at present is actually inferior from an SEO standpoint. You now generate hundreds of pages per day (differentiated by the ID in the URI query), all containing the exact same content and somewhat interlinked. The only major differences are the invisible SEO metadata and the single hand-crafted class on the highlighted message that substitutes for :target.

Search engines have heuristics to detect such link farms and either penalize such results or downrank the whole domain for this.

If keeping the continuation token is unavoidable, it may be included as long as it remains the same across links pointing towards the same wall of messages

bkil commented

For inspiration, this is how IndieWeb generates their online archive (backed by a git repository and a bridge between Slack-IRC-Matrix) with excellent JS & noJS accessibility and optimized for SEO:

> You now generate hundreds of pages per day (differentiated by the ID in the URI query)

@bkil Ahh, that's a really interesting point (especially in terms of caching)! But this seemed to work out fine for Gitter with the same URL pattern for permalinks.

I don't think the Matrix Public Archive really qualifies as a link farm or spamdexing. Having a permalink for an item is pretty standard. You can even see this with Discourse or StackExchange sites.

As an interesting point of comparison, in the case of StackExchange questions/answers, they do duplicate the answer ID in the URL and the hash (I assume the hash is for scrolling): https://stackoverflow.com/a/482129/796832 -> https://stackoverflow.com/questions/184618/what-is-the-best-comment-in-source-code-you-have-ever-encountered/482129#482129

> If keeping the continuation token is unavoidable, it may be included as long as it remains the same across links pointing towards the same wall of messages

I'm not sure about the distinction you're trying to make here? Can you give an example?

bkil commented

I also know of blog engines from the 90s that generate a similar URL including a message ID in both the hash and the query. Although all such ranking algorithms are proprietary, I'd probably allow for including a tiny bit of context around each referenced message. However, including a whole day's worth of chat on each separate page would definitely not fly with me.

For tree-based or thread-based blog engines, this typically boils down to referring to a thread or subtree at a time, not the whole root every time.

In the search engines I've tried, results that are accessible through content-unique URLs are ranked higher. I.e., answers are not at the top, as they have been downranked by The Algorithm.

Your linked StackOverflow example also includes this crucial piece:

<link rel="canonical" href="https://stackoverflow.com/questions/184618/what-is-the-best-comment-in-source-code-you-have-ever-encountered" />
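For comparison, here is a sketch of how a canonical link could be derived for an archive page. The policy is an assumption made purely for illustration (drop the query and the fragment so every permalink into the same day page shares one canonical URL), and `canonicalLinkTag` is a hypothetical name. Whether the archive should do this is exactly what is being discussed here.

```typescript
// Hypothetical helper: builds a <link rel="canonical"> tag for a permalink.
// Assumed policy: the canonical URL is the page URL without query or fragment.
function canonicalLinkTag(permalink: string): string {
  const url = new URL(permalink);
  url.search = "";
  url.hash = "";
  // Escape the characters that matter inside a double-quoted attribute.
  const href = url.toString().replace(/&/g, "&amp;").replace(/"/g, "&quot;");
  return `<link rel="canonical" href="${href}" />`;
}

// `$abc` is a placeholder event ID.
console.log(
  canonicalLinkTag(
    "https://archive.matrix.org/r/securemessagingapps:matrix.org/date/2023/05/30?at=$abc#$abc",
  ),
);
```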

bkil commented

Drawbacks of link differentiation via the query pointing to the same page:

  • Worse client caching
  • Reduced privacy, because the server logs exactly which message each user is interested in, instead of giving them the benefit of the doubt by diluting this information across a batch of messages
  • Once following reply threads by clicking is implemented ( #235 #236 #247 and some more tasks ), a user could navigate a (custom, longish) timeline mostly offline through anchor links, whereas query links require a server round trip for every click
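The caching and round-trip points in the list above can be made concrete with a small sketch. It relies on the fact that browsers never send the fragment to the server; the helper name `requestedUrl` and the `$abc` / `$def` event IDs are placeholders.

```typescript
// Hypothetical helper: the URL a browser actually requests for a permalink.
// Fragments are never sent to the server, so they are dropped here.
function requestedUrl(permalink: string): string {
  const url = new URL(permalink);
  url.hash = "";
  return url.toString();
}

const dayPage = "https://archive.matrix.org/r/securemessagingapps:matrix.org/date/2023/05/30";

// Two permalinks into the same day, differentiated by fragment vs. by query.
const hashPermalinks = [`${dayPage}#$abc`, `${dayPage}#$def`];
const queryPermalinks = [`${dayPage}?at=$abc`, `${dayPage}?at=$def`];

const hashRequests = new Set(hashPermalinks.map(requestedUrl));
const queryRequests = new Set(queryPermalinks.map(requestedUrl));

console.log(hashRequests.size); // 1: both links hit the same cached page
console.log(queryRequests.size); // 2: each link is a separate request and cache entry
```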

Advantages:

  • Can load a new advertisement after each click
  • If message links only point to unique batches and are differentiated by anchors, search engines may lose the precise connection to a message (they usually ignore anchors during the crawl). Including a link to the batch along with a link to the individual message, as IndieWeb does, can mitigate this.

> Your linked StackOverflow example also includes this crucial piece:
>
> <link rel="canonical" href="https://stackoverflow.com/questions/184618/what-is-the-best-comment-in-source-code-you-have-ever-encountered" />

Please create a new separate issue about adding this (with the SO example) ⏩ -> #251


> For tree-based or thread-based blog engines, this typically boils down to referring to a thread or subtree at a time, not the whole root every time.

Reddit and Twitter are good examples of this, but they are slightly different use cases since they support infinite nested levels of threads. Both include the permalink ID in the URL for reference.

Reddit even has a ?context=3 query parameter to specify the depth of surrounding messages to show. For a Matrix room, the context for a given event is just the surrounding messages (whether that be in the main timeline or thread timeline) which is what we're already showing.
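To illustrate what a context depth would mean for a flat timeline, here is a sketch that selects a fixed number of messages on either side of a target event. The `TimelineEvent` shape and `contextWindow` function are hypothetical, and this is only loosely analogous to Reddit's parameter, which operates on a comment tree.

```typescript
// Hypothetical event shape; real Matrix events carry many more fields.
interface TimelineEvent {
  eventId: string;
  body: string;
}

/** Return the target event plus up to `depth` events before and after it. */
function contextWindow(
  timeline: TimelineEvent[],
  targetId: string,
  depth: number,
): TimelineEvent[] {
  const index = timeline.findIndex((e) => e.eventId === targetId);
  if (index === -1) {
    return [];
  }
  return timeline.slice(Math.max(0, index - depth), index + depth + 1);
}

// Placeholder timeline with event IDs $e0 .. $e9.
const sampleTimeline: TimelineEvent[] = Array.from({ length: 10 }, (_, i) => ({
  eventId: `$e${i}`,
  body: `message ${i}`,
}));

const around = contextWindow(sampleTimeline, "$e5", 3);
console.log(around.map((e) => e.eventId).join(" "));
```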

It's unclear what impact our current level of bulk surrounding messages has on SEO, but it's also something we haven't measured and not something I'm particularly worried about at this stage. Based on that experience with Gitter, I've seen plenty of relevant permalinks appear in Google. I'm leaning towards leaving things as-is.


In terms of the drawbacks you listed for using the ?at=$abc query parameter, we can't really get away from including it in the URL since we want URL previews to work well.

And in terms of following a reply chain without a page reload (as long as the messages are on the page), this isn't really relevant since we can still accommodate that with the Hydrogen client-side JS.

Caching seems like the most impactful benefit we could get from changing but also not a total deal-breaker in my opinion with how it currently works.