Slug transliteration
Closed this issue · 77 comments
Ex: https://chanphom.com/forums/luat-choi-chan-pro.29/ from "Luật chơi Chắn Pro"
Transliteration is possible for many languages, but very difficult or impossible for a few languages (like Japanese). It would be best if there were a way to enable/disable this function; or barring that, percent encoding of unicode might be preferable as a more universally applicable solution.
Currently, slugs are generated using only alphanumeric characters, replacing anything else with a hyphen. However, we should support some degree of transliteration so that titles in non-Latin scripts still get slugs. This is an area where I don't have much knowledge, and help would be appreciated.
What needs to be done:
- Work out a transliteration strategy (i.e. a library, or is there anything in PHP's standard library?) that supports a wide range of alphabets.
- Discuss the possibility of leaving unicode characters in slugs, for languages where transliteration is impossible. What are the problems with this, if any?
- Depending on the strategy we decide upon, consider implementing a mechanism that allows language packs to turn transliteration on/off.
- While we're here, we should also truncate long slugs to a maximum of 50 or so characters.
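As a rough illustration of the truncation item, here is a minimal PHP sketch. The function name, the default of 50, and the rule of cutting at the last hyphen are all assumptions made for the example, not anything taken from Flarum.

```php
<?php
// Hypothetical helper: cap a slug at $max characters, preferring to cut at a hyphen.
function truncate_slug(string $slug, int $max = 50): string
{
    if (mb_strlen($slug, 'UTF-8') <= $max) {
        return $slug;
    }

    $cut = mb_substr($slug, 0, $max, 'UTF-8');

    // If the cut landed in the middle of a word, back up to the last hyphen (if any).
    if (mb_substr($slug, $max, 1, 'UTF-8') !== '-') {
        $lastHyphen = mb_strrpos($cut, '-', 0, 'UTF-8');
        if ($lastHyphen !== false && $lastHyphen > 0) {
            $cut = mb_substr($cut, 0, $lastHyphen, 'UTF-8');
        }
    }

    return trim($cut, '-');
}
```

Counting characters with the mb_* functions rather than bytes keeps the limit meaningful if Unicode slugs are allowed later.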
Maybe you can use a library like this one?
https://github.com/ashtokalo/php-translit
In Spanish, the mod_rewrite replaces all Latin characters like ñ, accents, etc. with a hyphen. To improve SEO, it would be better to rewrite them as their equivalent characters, for example: español ---> espanol (instead of espa-ol), corazón ---> corazon (instead of coraz-n). It can be done with a simple replacement of characters.
]/', '/[-]+/', '/<[^>]*>/'); $repl = array('', '-', ''); $url = preg_replace ($find, $repl, $url); return $url; } ?>
Same could be said for Portuguese:
ã | â | á | à > a
ê | é | è | > e
í | ì | > i
õ | ô | ó | ò > o
ú | ù > u
ç > c
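The "simple replacement of characters" idea can be sketched as follows. This is only an illustration: the function name is made up, and the table holds just the Spanish and Portuguese characters listed above, so any other non-ASCII character still ends up as a hyphen.

```php
<?php
// Hypothetical helper: table-based replacement, then the usual hyphenation.
function slug_with_map(string $title): string
{
    // Only the characters mentioned in this thread; not a complete table.
    $map = [
        'ã' => 'a', 'â' => 'a', 'á' => 'a', 'à' => 'a',
        'ê' => 'e', 'é' => 'e', 'è' => 'e',
        'í' => 'i', 'ì' => 'i',
        'õ' => 'o', 'ô' => 'o', 'ó' => 'o', 'ò' => 'o',
        'ú' => 'u', 'ù' => 'u',
        'ç' => 'c', 'ñ' => 'n',
    ];

    // Lowercase first so the table only needs lowercase keys.
    $slug = strtr(mb_strtolower($title, 'UTF-8'), $map);

    // Anything that is still not a-z or 0-9 becomes a single hyphen.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);

    return trim($slug, '-');
}

echo slug_with_map('Corazón español'); // corazon-espanol
```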
As I mentioned above and in #557, transliteration isn't a complete solution. There are some languages that can't be transliterated very easily, or at all.
In the case of Japanese, as I mentioned in Stumbling block 6, it would take a lot of rather sophisticated processing to come up with reliable transliterations of words spelled using Chinese characters. And even the most sophisticated program will be reduced to guessing when it comes to things like names, which can use Chinese characters in nonstandard ways.
Japanese is clearly an extreme case, but even where the relationship between pronunciation and spelling tends to be more stable, there are still difficulties. To transliterate Chinese reliably, for example, you would need to provide a glossary of at least several thousand characters. So it's not always a matter of applying a few well-defined rules.
In regions where transliteration is impractical, there is a strong trend toward the use of unicode in URLs. Flarum will have to support that, or it will simply be irrelevant in those regions. At the same time, however, Flarum also needs to offer transliteration for regions that have adopted that approach.
My suggestion is:
Admins should be allowed to specify whether URLs should be transliterated or encoded. This could be implemented as an administrator setting, though it might be better still to have the question asked and answered during the installation process.
When an admin chooses the former, a library such as this one suggested by @FirestarterUA could be used to transliterate all slugs, including thread titles, tag names, and usernames. (Flarum may need to check all these items and return an error whenever any non-transliteratable text is entered. Or we could leave it up to admins to tell their users: "Don't use any Chinese characters ... or else!")
When an admin chooses the latter, all URLs are encoded appropriately, with only an absolute minimum of character replacement (e.g. hyphens in place of spaces) being performed.
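To make the "encoded" option concrete, here is a minimal sketch of what it might look like in PHP. Both function names are hypothetical; the only character replacement performed is whitespace to hyphens, and percent-encoding is left to the standard rawurlencode().

```php
<?php
// Hypothetical helper: keep the title's characters, only turn whitespace into hyphens.
function unicode_slug(string $title): string
{
    return preg_replace('/\s+/u', '-', trim($title));
}

// Hypothetical helper: the same slug, percent-encoded for use in a URL path.
function slug_for_url(string $title): string
{
    return rawurlencode(unicode_slug($title));
}

echo unicode_slug('Luật chơi Chắn Pro'), "\n"; // Luật-chơi-Chắn-Pro
echo slug_for_url('新聞動態'), "\n";            // %E6%96%B0%E8%81%9E%E5%8B%95%E6%85%8B
```

The slug would be stored in its readable form, and only encoded at the point where a link is written out.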
Why not using same approach as Wikipedia and allow use of unicode in slugs which is supported by modern browsers and also by part of Flarum's frontend? This way many character sets would be available.
Also, currently slugs for discussions are generated on the client which is not ideal. They should be generated on the server (and stored on the database like tag slugs are).
Why not using same approach as Wikipedia and allow use of unicode in slugs which is supported by modern browsers and also by part of Flarum's frontend?
I think that would be a great solution ... I'd just like to be sure there aren't any SEO implications for admins in regions where transliteration is the accepted approach.
We might also want to truncate the slug after a certain length.
I want to mention here that for the Georgian language slugs are not generated at all (from "რა კაი ფორუმი წამოვჭიმეთ!" I got the slug "--").
Also, the Wikipedia approach is best for slugs.
+1
@tobscure We need unicode slugs
This looks good for different languages: Cocur/Slugify. The only problem is that it needs the language to be fully spelled out: instead of "en" it needs "english", although that is probably an easy fix.
The other one I found which doesn't need a language, is Jbroadway/urlfix, although that one is more basic, I think.
Whichever is better ;)
Of the transliteration options mentioned, Slugify strikes me as the most worthy of consideration. It covers a wide range of languages out of the box, can easily be customized to cover more, and is flexible when it comes to integration.
As @franzliedke said, Stringy may also be an option, especially if it can also be employed for tasks other than transliteration. One cause for concern is that it only does slugification, not true transliteration; that is, it seems to work on a fixed ruleset:
Converts the string into an URL slug. This includes replacing non-ASCII characters with their closest ASCII equivalents, removing remaining non-ASCII and non-alphanumeric characters, and replacing whitespace with $replacement.
This may not provide the best transliterations for all languages; converting ä to a would not work in a language where ae is the more commonly used transliteration. A more language-specific solution would give better results vis-a-vis both SEFiness and human readability.
I'm wondering whether it would be possible to use Stringy, but insert language-specific rulesets (like the ones used by Slugify) when available. We could put the ruleset file right in the language pack, as we've done with Moment.js translations. When the admin sets the forum's slugification style to "transliteration" (as opposed to "UTF-8") Flarum would grab the ruleset for the forum's default language and slugify based on that. If the language pack is lacking a ruleset, it could fall back to standard Stringy slugification.
Would something like this be possible?
EDIT: It would be best to have Stringy treat the language-specific ruleset as overrides, so it can default to its own slugification rules when it encounters a character that's not covered in the ruleset being used. That would allow it to cope with situations involving characters not included in the ruleset for the default language ... such as a topic about Søren Kierkegaard in a French forum.
This solution would be best suited to single-language forums. Handling of thread titles (etc.) in more than one language would tend to be hit-and-miss. And in cases where a forum includes languages requiring different slugification methods ... Russian and Japanese, for example ... the admin will be forced to use UTF-8 slugs. The only way around that would be to make Flarum truly multilingual, i.e. assign a locale value to each thread.
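The override idea might look roughly like the sketch below. It does not use Stringy or Slugify at all: DEFAULT_RULES is a tiny made-up table standing in for a library's built-in rules, and the function name and the rulesets passed to it are assumptions for illustration only.

```php
<?php
// Hypothetical stand-in for a library's generic, language-agnostic rules.
const DEFAULT_RULES = [
    'ä' => 'a', 'ö' => 'o', 'ü' => 'u', 'ß' => 'ss', 'ø' => 'o', 'é' => 'e',
];

// Hypothetical helper: $languageRules would come from the forum's language pack.
function slugify_with_overrides(string $title, array $languageRules = []): string
{
    // Array union keeps the left-hand entries, so language-specific rules win
    // and everything they do not cover falls back to the defaults.
    $rules = $languageRules + DEFAULT_RULES;

    $slug = strtr(mb_strtolower($title, 'UTF-8'), $rules);
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);

    return trim($slug, '-');
}

$german = ['ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue'];
$french = ['é' => 'e'];

echo slugify_with_overrides('Bär', $german), "\n";               // baer
echo slugify_with_overrides('Bär'), "\n";                        // bar
echo slugify_with_overrides('Søren Kierkegaard', $french), "\n"; // soren-kierkegaard
```

The last call is the Kierkegaard case: the French ruleset has no entry for ø, so the default rule is used.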
As a Chinese speaker, I'd just want a simple option to disable slugs of posts. I don't want either transliteration or Unicode characters in the URLs. Personally I also prefer shorter URLs like example.com/d/12345 instead of example.com/d/12345-hello-world. Having Unicode Chinese characters in the URL will make it horribly long and messy like https://zh.wikipedia.org/wiki/Portal:%E6%96%B0%E8%81%9E%E5%8B%95%E6%85%8B when you copy the URL from the address bar of the browser (e.g. Chrome). That is not human readable, so such slugs will be useless. I think disabling transliteration is much easier to implement and more useful to Chinese users.
Safari and Firefox are able to copy the URL in human-readable format. When I open the URL you linked above and copy it from the Safari address bar, I get this:
https://zh.wikipedia.org/wiki/Portal:新聞動態
So this should probably be considered a deficiency of Chrome ... or of your OS, perhaps. That said, a third option to disable slugs altogether shouldn't be too hard to implement, and may be wanted by enough site admins that it would be worth adding.
Hello guys :) Have you heard of the PHP Intl Transliterator extension?
For example, you can use this snippet of code to transliterate any string to Latin characters (even Japanese characters, as far as I know)
<?php
$rules = 'Any-Latin; Latin-ASCII; [\u0080-\uffff] remove';
echo transliterator_transliterate($rules,'Какая-то строка, которая нуждается в транслитерации');
// Kakaa-to stroka, kotoraa nuzdaetsa v transliteracii
echo transliterator_transliterate($rules,'新聞動態');
// xin wen dong tai
echo transliterator_transliterate($rules,'რა კაი ფორუმი წამოვჭიმეთ');
// ra kai porumi tsamovchimet
You can find more info about these transliterator functions in the sources of the Yii 2 framework, for example.
Also, on the documentation page for the Intl extension there is a note from one of the PHP developers describing one possible way to turn a string into a correctly transliterated URL:
<?php
function slugify($string) {
    $string = transliterator_transliterate("Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC; [:Punctuation:] Remove; Lower();", $string);
    $string = preg_replace('/[-\s]+/', '-', $string);
    return trim($string, '-');
}
echo slugify("Я люблю PHP!"); // a-lublu-php
echo slugify('რა კაი ფორუმი წამოვჭიმეთ'); // ra-kʼai-porumi-tsʼamovchʼimet
echo slugify('新聞動態'); // xin-wen-dong-tai
?>
I think it needs to be tested on a number of strings to choose the more correct method :)
@believer-ufa Thanks for pointing it out, we'll take a look.
However, since this requires the intl extension, we probably have to use another approach (library).
@franzliedke, you already use the gd and mysql extensions. Why would using this extension be a problem? On any Linux OS it is resolved with a single command like sudo apt install php7.0-intl.
You most likely will not be able to get equally good transliteration from some other library, since the majority of these libraries are intended only for certain languages.
Well, you will probably agree that we can be reasonably certain that MySQL is installed everywhere. (And even if not, Flarum can not function without it.)
But yeah, I'm open to the idea. Does anybody know some place with PHP extension installation stats?
I don't quite understand you. The Flarum installation guide tells users that they need SSH access and PHP 5.5+ with the following extensions: mbstring, pdo_mysql, openssl, json, gd, dom, fileinfo. It's a common situation: you install some PHP extensions to be able to run a framework. You would just need to install one more extension to have correct transliteration in your forum :)
@believer-ufa Not every Flarum admin will have the access necessary to install the extension. One of the devs' goals is to keep Flarum easy to install on shared hosting plans. Every extension added can limit the number of providers that will be able to support Flarum. I think that's why @franzliedke is asking about extension installation stats; it's a decision that can't be made too casually.
Okay, but it's a really nice extension :) Look at the discussion on the Flarum forums; one of the participants is already convinced by this approach.
You can also write the code so that it does not require the Intl extension but uses it when it is available. I think that would be the right solution: it avoids problems with bad hosting and still solves this problem.
Maybe @believer-ufa's method is a better extension, regardless of who makes it. Then composer can check if the proper extension is available and refuse to install if not. Being so dependent on an additional php module, if it's not widely installed, may hurt Flarum's ability to be widespread more than lacking this feature.
jordanjay29, you can write code that uses Intl if it exists; if it does not, Flarum still works, just without nice, full-language URL transliteration. Read my comment above.
Well, not using the Intl extension does not mean we can't implement transliteration. There are enough libraries out there.
Still, I kinda like the idea of using Intl when it's available, and only falling back to another implementation if not.
Still, I kinda like the idea of using Intl when it's available, and only falling back to another implementation if not.
That sounds promising. 😀
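A minimal sketch of the "use Intl when available, otherwise fall back" idea, for illustration only. The function name and the small FALLBACK_MAP table are assumptions; a real fallback would more likely be one of the libraries mentioned in this thread. The only Intl call used is transliterator_transliterate() with the Any-Latin and Latin-ASCII transforms.

```php
<?php
// Hypothetical, deliberately tiny fallback table used when Intl is missing.
const FALLBACK_MAP = [
    'ñ' => 'n', 'ü' => 'u', 'Ü' => 'U', 'ç' => 'c', 'Ç' => 'C',
    'ş' => 's', 'Ş' => 'S', 'ğ' => 'g', 'Ğ' => 'G', 'ı' => 'i', 'İ' => 'I',
];

function slug_with_optional_intl(string $title): string
{
    $latin = false;

    // Prefer the Intl transliterator when the extension is loaded.
    if (function_exists('transliterator_transliterate')) {
        $latin = transliterator_transliterate('Any-Latin; Latin-ASCII', $title);
    }

    // No extension (or the call failed): use the simple table instead.
    if ($latin === false) {
        $latin = strtr($title, FALLBACK_MAP);
    }

    $slug = preg_replace('/[^a-z0-9]+/', '-', strtolower($latin));

    return trim($slug, '-');
}

echo slug_with_optional_intl('Türkçe Deneme'); // turkce-deneme
```

With this shape the forum keeps working on hosts without Intl; titles in scripts the table does not cover simply get a poorer slug there.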
We use Turkish characters in titles, but the resulting links do not look good for SEO.
Example:
Title: Türkçe Deneme Asğşiçü
Link: trkce-deneme-as
Turkish characters: İ ı ş ç ğ ü
How can I solve this problem?
(I'm sorry bad english.)
@hgtucel Please see my comments in your forum thread.
I'm here to add Greek on the table too, as I pointed out on the forum
Happy to help with the mapping if needed!
Referencing a new extension by @Avatar4eg that offers a potential solution.
@jordanjay29 Does not work for me. Avatar4eg/flarum-ext-transliterator#1
@jordanjay29 Edit: it works, but you have to manually fix the URLs of the old forum pages by renaming them twice; then the flarum-ext-transliterator extension does its job. For newly created pages, the URLs are OK.
@HLFH I'm not the extension author, please report this bug on the extension thread at Flarum.org, or on the author's github.
@jordanjay29 Already done. Avatar4eg/flarum-ext-transliterator#1
Let me support the idea which was proposed by @yihui: there should be an option to either disable slugs completely, or set them manually. Or, better, both of them.
Forcing everyone to use machine-transliterated slugs is a huge hurt, as many languages just cannot be romanized well enough, or, at least, unambiguously. For them the result is just a confusing meaningless mess of letters.
The library you proposed seems to do only the simplest table-based substitutions. Let me comment on your example:
Какая-то строка, которая нуждается в транслитерации
Kakaa-to stroka, kotoraa nuzdaetsa v transliteracii
Or maybe: kakaya, kotoraya, nuzhdaetsya. According to your nickname, you should know that Russian has a bunch of different transliteration schemes. Even the government cannot decide which one to use.
新聞動態
xin wen dong tai
But how about reading this in Japanese: shinbun dotai? Or maybe Korean reading? Unicode does not distinguish between Chinese, Japanese and Korean graphemes.
Even Latin-based scripts cannot be reliably transliterated.
Moreover, what if user wants title translation, not a transliteration in their URLs?
Unicode does not distinguish between Chinese, Japanese and Korean graphemes.
Even Latin-based scripts cannot be reliably transliterated.
Just so!
Moreover, what if user wants title translation, not a transliteration in their URLs?
That might be worth investigating as an idea for a third-party extension. For now, I think it would be sufficient if Flarum could offer a robust system to provide for both transliteration and unicode, with enough configuration options to allow admins in any region to tweak its behavior to their liking.
a robust system to provide for both transliteration and unicode
plus an option to disable slugs completely please... :)
plus an option to disable slugs completely please... :)
I don't see why that couldn't be added. Compared to everything else, it would be _easy._ 😄
Incidentally,
Having Unicode Chinese characters in the URL will make it horribly long and messy like when you copy the URL from the address bar of the browser (e.g. Chrome).
I don't experience this sort of thing when using Safari (though I have seen it when using Firefox). One would hope that the other browsers could get with the program and make it possible to copy and paste properly encoded URLs so the result would be human readable ... 🙄
EDIT: See my comment below.
Forcing everyone to use machine-transliterated slugs is a huge hurt, as many languages just cannot be romanized well enough, or, at least, unambiguously. For them the result is just a confusing meaningless mess of letters.
Interesting logic, but I believe you are making too much of an issue out of this. We just need URLs that carry some information about the conversation. After all, nothing terrible will happen if the URL is slightly incorrect. It is better to have at least something: it gives search engines additional information about the page, for better SEO.
On the other hand I'm not sure what search engines do with nonsensical information (such as from a wrong transliteration) in the URL. Thanks for bringing it up, @yihui and @firegurafiku!
Scratch that ... I just copied and pasted a Google URL with Safari and ended up with a string of very non-human-readable percent encodings in it. I had been thinking that Safari fixes percent-encoded URLs when copying to the clipboard, but that doesn't seem to be the case after all.
So the issue raised by @yihui is definitely something we need to think about.
I'm not a developer, but I want to share my opinion as a user and webmaster. Why not copy the slug method of WordPress (the most used CMS)?
WordPress uses lowercase Latin letters, without symbols or marks, and you have the possibility of using characters from other alphabets. I would also find it interesting to have the option of short URLs without the post title (as an option in the admin panel).
In any case, I want to voice my negative opinion of a method similar to Wikipedia's. I'm Spanish and my language uses a lot of symbols and marks, and the Wikipedia URLs are annoying when you want to share Wikipedia links.
I think the URL method should be simple, with complex transliteration added by extensions (WordPress has different plugins for that).
Neither transliterator_transliterate nor Slugify is suitable for the Persian language.
@sijad, if we talking about slugify, you can easily add you own rules for your language.
what if we use github issue like urls? (id only no slug no transliteration)
and then some plugins may change urls ....
what if we use github issue like urls? (id only no slug no transliteration)
and then some plugins may change urls ....
Those urls are not seo and human friendly. Your suggestion was discussed here: #1140 (comment)
if we talking about slugify, you can easily add you own rules for your language.
How about easy adding support for Chinese or Japanese?
Languages are hard and nobody should rely on automatic romanization. Instead, there should be options to disable slugs at all, or set them manually.
@believer-ufa in Persian, people usually do not use diacritics in texts, so Slugify is not an option. For Persian (and Arabic?), using Unicode plus a few filters (remove diacritics, non-alphanumerics, spaces, etc.) is the best option.
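The "Unicode plus a few filters" approach could be sketched like this. The function name is hypothetical and the filters are only the ones named above, expressed with PCRE Unicode properties; this is an illustration, not a tested Persian slugger.

```php
<?php
// Hypothetical helper: keep letters and digits of any script, filter the rest.
function unicode_filtered_slug(string $title): string
{
    // Drop nonspacing marks (this includes Arabic-script vowel marks such as fatha).
    $slug = preg_replace('/\p{Mn}+/u', '', $title);

    // Every run of characters that are neither letters nor digits becomes one hyphen.
    // Note: the zero-width non-joiner (U+200C) is neither, so it is treated as a
    // separator here; a real implementation may want to handle it differently.
    $slug = preg_replace('/[^\p{L}\p{N}]+/u', '-', $slug);

    return trim(mb_strtolower($slug, 'UTF-8'), '-');
}

echo unicode_filtered_slug('Hello, World!'); // hello-world
```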
I just made an improvement for this issue. I use something like the WordPress slug function. It can handle UTF-8 and more ;)
@franzliedke personally I use the intl extension a lot. It allows for easy implementation of monetary values as well. And it's easy to install as well.
Putting implementation aside, has any decision been made on whether to allow UTF characters in slugs (à la Wikipedia) or not?
@johannsa I think the consensus was to make it an option?
I'm one of those who think it would be better to use the same approach as WordPress because, from my experience (more than 6 years using WordPress), it is near to perfect.
The problem with using the intl extension can be shared hosting; maybe some hosts don't have that extension enabled by default.
@yagobski your commit is mostly incomplete:
- First, you copy-pasted WordPress code while ignoring functions like seems_utf8(), get_locale(), etc.
- WordPress code is under the GPL license: #1148 (comment)
@yagobski I believe the concern is that borrowing code from a GPL project will force Flarum under the GPL, and that's not a desired outcome.
Make sure any code you borrow is licensed freely (public domain) or with something compatible with MIT. GPL and other sharealikes/copyleft (any Creative Commons license with 'SA') are not compatible and will be rejected for inclusion into Flarum.
What would be the right choice when the generated slug is empty because the title uses non-friendly characters?
- Option 1: Use a random string, so the slug will be like 16-7s8eds5e68gd6se7d
- Option 2: Use only the discussion id (as @aethior suggested), so the slug will be like 16- (note that the - at the end is added by the current code on empty titles).
It would be nice to know which is the right option, because they need different code solutions.
@Zeokat I would prefer Option 2.
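For illustration, Option 2 without the dangling dash could be as small as the sketch below. The function name is made up, and the /d/{id}-{slug} path shape is simply copied from the URLs quoted in this thread.

```php
<?php
// Hypothetical helper: build a discussion path, dropping the slug when it is empty.
function discussion_path(int $id, string $slug): string
{
    // A slug made only of hyphens (e.g. "--") carries no information.
    $slug = trim($slug, '-');

    // Option 2: link by id alone instead of emitting "16-" or "16---".
    return $slug === '' ? "/d/{$id}" : "/d/{$id}-{$slug}";
}

echo discussion_path(16, '--'), "\n";          // /d/16
echo discussion_path(16, 'hello-world'), "\n"; // /d/16-hello-world
```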
Yes @franzliedke, that option will be our best choice. The problem is that Flarum always adds the character - after the discussion id. I'm trying to locate which files are involved in adding that last -, but without much luck.
So far I have only located these lines involved in the trailing dash:
- https://github.com/flarum/core/blob/master/views/index.blade.php#L11
- https://github.com/flarum/core/blob/master/js/forum/dist/app.js#L29873
- https://github.com/flarum/core/blob/master/js/forum/src/initializers/routes.js#L38
Handling the JavaScript part is the problem for me.
Hmm, the URL without slug is already understood: https://discuss.flarum.org/d/187.
That means that only the URL generation code has to be adapted.
@franzliedke Yes, URLs with only the discussion id are already understood, and they also give us some duplicated content, because both URLs return HTTP status 200 without any redirect (301) that search engines can understand. Anyway, that's another story.
I'm speaking about the lines of code that add the dash after discussion-id slug (for example, on empty slugs the autogenerated slug is https://discuss.flarum.org/d/5772-- , which seems a little ugly).
Anyway here we go: #1183
Is this still planned to land on core or should we use extensions (there are 2 iirc) to solve this?
Still planned, the ticket is still open. 😉
I know it's open, but tagged as "needs-discussion". Does it still need discussion, even after #1385 (unfortunately abandoned)? Or could we use the same approach used there (Illuminate\Support\Str::slug)?
Feel free to send a new PR that takes the changes from #1385 and applies them to the current code. The original author unfortunately did not react anymore.
Another option: https://github.com/sunrise-php/slugger
While this feature is still being discussed, just wanted to mention, that there is a working extension for transliteration for beta 8.1 supported by Friends of Flarum. It is actually a fork of this one. Thanks, people!
Symfony 4.4 offers translated slugs:
https://twitter.com/titouangalopin/status/1179436751477231617
But it requires extension intl if i'm not wrong 😑
Suggestion: put slug generator in the container with an interface for easier extension.
That way extensions like FoF Transliterator can extend it instead of listening for an event and overriding the value.
This way if another part of core uses slugs server side, it will also use the same logic.
Sadly, tags are another issue because they use the client-side slug() method in utils/string.
Okay, for anyone interested, we're finally making some progress here:
If #1975 is merged, we will have a basic transliteration implementation, based on what Laravel brings along. As discussed in detail in this issue, this is great for some, but not helpful for other languages, so more work still needs to be done.
As it doesn't make the current situation (auto-generating slugs) worse for anybody, as far as I can tell, but is an improvement for languages where transliteration makes sense (e.g. German), I think this is a solid improvement that we can make without tackling all the other things that could be done.
That doesn't mean we want to stop here, though. Improving how we cater to international audiences is very important to us (and also very eye-opening).
Based on my understanding of everything that was said in this issue, here is what I would propose as next steps / challenges. Once we agree on these, I would suggest to create separate subtickets that can be scheduled for different releases:
- Support for different slugging strategies, configurable via the admin panel - initially support the current strategy (internationalized transliteration) and the null strategy (no slugs at all)
- Enforce (and configure?) maximum length of slugs
- Allow language packs / extensions to provide custom strategies which admins can select
- An additional option for keeping Unicode characters in the slugs (research + admin setting)
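One possible shape for the steps above, sketched in plain PHP. Every name here (SlugStrategy, AsciiStrategy, NullStrategy, SlugManager) is hypothetical and none of it is Flarum's actual API; the point is only to show selectable strategies, a null strategy, and a length cap living behind one interface.

```php
<?php
// Hypothetical contract that core, language packs and extensions could implement.
interface SlugStrategy
{
    public function slug(string $title): string;
}

// Stand-in for the current behaviour: ASCII letters and digits only.
final class AsciiStrategy implements SlugStrategy
{
    public function slug(string $title): string
    {
        return trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($title)), '-');
    }
}

// The "null strategy": no slug at all, so URLs carry only the id.
final class NullStrategy implements SlugStrategy
{
    public function slug(string $title): string
    {
        return '';
    }
}

// Hypothetical registry; the selected name would come from an admin setting.
final class SlugManager
{
    private $strategies = [];
    private $maxLength;

    public function __construct(int $maxLength = 50)
    {
        $this->maxLength = $maxLength;
    }

    public function register(string $name, SlugStrategy $strategy)
    {
        $this->strategies[$name] = $strategy;
    }

    public function make(string $selected, string $title): string
    {
        if (!isset($this->strategies[$selected])) {
            throw new InvalidArgumentException("Unknown slug strategy: {$selected}");
        }

        $slug = $this->strategies[$selected]->slug($title);

        // Enforce the maximum length regardless of which strategy produced the slug.
        return trim(mb_substr($slug, 0, $this->maxLength, 'UTF-8'), '-');
    }
}
```

A language pack or extension would then call register() with its own implementation, and the admin panel would only need to store the chosen name.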
If we manage to make some progress on each of these, I would be very content. 😅
Of course, one could go above and beyond with support for language-specific slugging based on the (auto-detected?) language of a discussion, but I would say that's too much for core and clearly extension territory.
I have been waiting for this for years, and @franzliedke's plan seems good to me. I haven't tested Laravel's slugger, but if it does its work well, it will be at least one step forward 👍
What I want to add here is the question of what will happen with tag and username slugs, because they may also need to be transliterated (or not).
Closing this as solved by flarum/issue-archive#203, extensions can now introduce custom slug drivers to allow any approach imaginable.
If a custom attribute is used to store a new transliterated / modified slug, the Saving event can be used to set / update that attribute.