ibnesayeed
Well-known member
We upgraded our forum to 1.5 yesterday. We like the built-in thread tagging system (one less add-on). However, the way tag URLs are constructed does not work well for the language of our forum. We run an Urdu forum, and the tags produce strange, meaningless URLs.
I think this is the same logic that is used here: romanize the text, then strip any characters that could not be romanized. If so, in my opinion it does not work well. Urdu has many letters with no Roman replacement defined, so they all get stripped, and the replacements that do exist do not always make sense. As a result we often get one- or two-letter canonical representations of average-length words, and sometimes a zero-length one. From what I have observed, collisions are resolved by appending numbers and blank results fall back to IDs, but this is all inconsistent, unpredictable and not very SEO friendly. Why not just use URL encoding instead?
Here are a few examples of what the current canonicalization produces, to illustrate the issue:
"اپڈیٹ" (update) and "ڈیٹا" (data) => "a"
"پگڑی" (Urdu word for turban) => ""
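To make the URL encoding suggestion concrete, here is a rough sketch in Python. It is not XenForo's code and I have not looked at how the tag URLs are actually built; the function names `tag_to_url_segment` and `url_segment_to_tag` are made up for illustration. It only shows that percent-encoding the UTF-8 bytes of a tag gives a non-empty, unique and reversible URL segment for each of the three words above.

```python
from urllib.parse import quote, unquote


def tag_to_url_segment(tag: str) -> str:
    """Percent-encode a tag so it can be used as one URL path segment.

    Hypothetical helper for illustration, not part of any forum software.
    """
    # Collapse runs of whitespace into single hyphens.
    normalized = "-".join(tag.split())
    # quote() encodes the string as UTF-8 and percent-escapes every byte
    # that is not an unreserved ASCII character. safe="" means "/" is
    # escaped too, so a tag can never spill into another path segment.
    return quote(normalized, safe="")


def url_segment_to_tag(segment: str) -> str:
    """Reverse the percent-encoding (hyphens are left as they are)."""
    return unquote(segment)


# The three example tags from this post.
tags = ["اپڈیٹ", "ڈیٹا", "پگڑی"]
segments = [tag_to_url_segment(t) for t in tags]

for tag, segment in zip(tags, segments):
    print(tag, "=>", segment)
```

Modern browsers generally display such percent-encoded paths as readable Urdu text in the address bar, so the URL stays meaningful to visitors, and no collision numbering or ID fallback is needed for these cases.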