As designed Thread Tag Canonicalization Not Suitable For Every Language

ibnesayeed

Well-known member
We upgraded our forum to 1.5 yesterday. We like the built-in thread tagging system (one less add-on). However, the way tag URLs are constructed does not work well for the language of our forum. We run an Urdu forum, and the system generates strange, meaningless URLs for it.

I think the same logic that is used here is applied: romanize the text, then strip any characters that could not be romanized. If so, then in my opinion it's not very good. In Urdu, many letters have no Roman replacement defined, so they all get stripped, and the replacements that do exist do not always make sense. As a result we often get one- or two-letter canonical representations of fairly average-length words, and sometimes zero-length ones. I have observed that collisions are resolved by appending numbers and that blank ones use IDs instead, but all of this is inconsistent, unpredictable, and not very SEO friendly. Why not just use URL encoding instead?

Here are a few examples of the resulting canonicalization to illustrate the issue:

"اپڈیٹ" (update) and "ڈیٹا" (data) => "a"
"پگڑی" (Urdu word for turban) => ""
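
The behaviour described above can be sketched roughly as follows. This is only an illustration: the `ROMAN_MAP` table, the `-2` style suffix, and the function names are assumptions made for the example, not the forum software's actual code or mapping.

```python
import re

# Hypothetical romanization table, NOT the real one used by the software.
# Only alef has a Latin replacement here; every other letter is unmapped.
ROMAN_MAP = {"\u0627": "a"}  # ا -> a


def romanize(tag: str) -> str:
    """Replace mapped letters, then strip whatever is still non-ASCII."""
    mapped = "".join(ROMAN_MAP.get(ch, ch) for ch in tag.lower())
    # Any run of characters outside a-z / 0-9 collapses to one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", mapped).strip("-")


def url_version(tag: str, tag_id: int, taken: set) -> str:
    """Blank results fall back to the ID; collisions get a numeric suffix."""
    base = romanize(tag) or str(tag_id)
    candidate, n = base, 2
    while candidate in taken:
        candidate = f"{base}-{n}"
        n += 1
    taken.add(candidate)
    return candidate


taken = set()
print(url_version("\u0627\u067e\u0688\u06cc\u0679", 1, taken))  # اپڈیٹ -> a
print(url_version("\u0688\u06cc\u0679\u0627", 2, taken))        # ڈیٹا  -> a-2
print(url_version("\u067e\u06af\u0691\u06cc", 3, taken))        # پگڑی  -> 3
```

With a sparse table like this, distinct words collapse to the same stub, and which one gets the bare "a" depends only on which tag was created first.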
 
We don't use non-ASCII characters in URLs in a place where they are needed for identifiers due to potential to be modified or sent through with incorrect character sets, not to mention potentially making very long encoded URLs. (The words aren't directly in the URL; they are percent encoded if represented properly.)
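
To give a rough sense of the length concern, here is a minimal sketch using Python's standard `urllib.parse`. The helper name `percent_encoded` is made up for this example; it only shows how a short Urdu word grows once it is percent-encoded as UTF-8.

```python
from urllib.parse import quote, unquote


def percent_encoded(tag: str) -> str:
    """Percent-encode a tag as UTF-8, the form it takes inside a URL."""
    return quote(tag, safe="")


tag = "\u067e\u06af\u0691\u06cc"  # پگڑی, four letters
encoded = percent_encoded(tag)
print(encoded)                 # %D9%BE%DA%AF%DA%91%DB%8C
print(len(tag), len(encoded))  # 4 24

# The encoding is lossless as long as the bytes are decoded as UTF-8.
assert unquote(encoded) == tag
```

Each of these letters takes two bytes in UTF-8 and each byte becomes three ASCII characters, so the encoded form is six times as long as the original word.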

We attempt to convert as much as possible, though clearly it won't be possible with every language. We still ensure the URLs will always be unique and if desired, you can manually set the URL version of specific tags in the control panel.

Essentially, this is as the system has been designed and it isn't something we are planning on changing at this time.
 
We don't use non-ASCII characters in URLs in a place where they are needed for identifiers due to potential to be modified or sent through with incorrect character sets, not to mention potentially making very long encoded URLs. (The words aren't directly in the URL; they are percent encoded if represented properly.)
I completely understand the rationale behind it. However, why can't we have a URL scheme similar to that of many other content types, such as forums, threads, resources, and media, where the actual title is used along with the numeric ID as a suffix? I understand that having just the tag in the URL looks a lot nicer, but an additional number would be a good bet in places where the majority of tags are rendered meaningless when transliterated. At least an option to choose this alternative URL scheme would be good enough. Not having the essence of the tag in the URL may hurt users as well as SEO.
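
As a rough sketch of what I mean (the `tags/` prefix, the function names, and the exact `title.id` layout are my own assumptions for illustration, not an existing route in the software):

```python
import re
from urllib.parse import quote


def tag_path(title: str, tag_id: int) -> str:
    """Build a path where the title is decorative and the ID identifies."""
    return f"tags/{quote(title, safe='')}.{tag_id}/"


def tag_id_from_path(path: str):
    """Resolve a tag from the trailing numeric ID only; ignore the title."""
    match = re.search(r"\.(\d+)/?$", path)
    return int(match.group(1)) if match else None


path = tag_path("\u067e\u06af\u0691\u06cc", 42)  # پگڑی
print(path)                                      # tags/%D9%BE%DA%AF%DA%91%DB%8C.42/
print(tag_id_from_path(path))                    # 42
print(tag_id_from_path("tags/mangled-title.42/"))  # 42
```

Because lookup depends only on the number, a title part that gets altered or sent with the wrong character set still leads to the right tag.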
We attempt to convert as much as possible, though clearly it won't be possible with every language. We still ensure the URLs will always be unique and if desired, you can manually set the URL version of specific tags in the control panel.
Unfortunately, the transliterator used is proving to be very poor (for Urdu at least; Arabic and Persian are not very good either). I opened three pages at random and examined about 150 tags in the ACP, and I could not find a single one that reads the way it should be pronounced, not even close. For just two or three of them I could tell what they possibly meant. In many cases they were so far from the word that I could never have guessed what they were meant to read (probably because many letters were pruned). We have about 20K tags in our forum, so fixing them manually is out of the question. Even if we did some sort of magic to fix the existing ones, it would be an ongoing administrative overhead, especially because almost no new tag would come out correct on its own.
Essentially, this is as the system has been designed and it isn't something we are planning on changing at this time.
In my humble opinion, design decisions that hurt a significant user base can, if not changed early enough, become difficult to change later for legacy reasons.
Can this thread be moved to the suggestions forum, or would you prefer a new thread in suggestions?
 
An option is really going to be the only approach that can be taken. We're essentially past the point of being able to change it wholesale.

I would recommend posting it as a suggestion. If you have particular problems with the romanization process as well and they can be resolved (as in, there are rules to do it), then that's something we could consider expanding as well. (Though it's still a difficult process in general.)
 