Fixed Content title in URLs: Turn nonbreaking hyphens into normal hyphens instead of removing them

Steffen · May 21, 2018

XenForo currently completely removes nonbreaking hyphens (U+2011) from URLs. For example, the title "Cortex‑A72" is turned into the URL "cortexa72". It should be "cortex-a72".

Steffen · May 21, 2018

Same for nonbreaking spaces.

(It seems like you can only create such titles programmatically but not by using a browser because the input filterer seems to strip these characters. So maybe this is "not a bug".)

Edit: Or by using the importer.

Edit 2: Since the importer doesn't strip these characters and because they might be added using the API, I think it would make sense to handle them.

Diff:

diff --git a/src/XF/Mvc/Router.php b/src/XF/Mvc/Router.php
index 49b09a8a5..aea513f3f 100644
--- a/src/XF/Mvc/Router.php
+++ b/src/XF/Mvc/Router.php
@@ -488,6 +488,9 @@ class Router
         );
         $string = strtr($string, ['"' => '', "'" => '']);
 
+        // Non-breaking space and Non-breaking hyphen
+        $string = str_replace([' ', '‑'], ' ', $string);
+
         if ($romanize)
         {
             $string = preg_replace('/[^a-zA-Z0-9_ -]/', '', $string);

(I haven't used strtr because afaik it's not safe to be used with UTF-8 needles.)

(Alternatively, you could consider using $string = iconv('UTF-8', 'ASCII//TRANSLIT', $string); instead.)

Chris D · Jul 13, 2018

Steffen said:
XenForo currently completely removes nonbreaking hyphens (U+2011) from URLs. For example, the title "Cortex‑A72" is turned into the URL "cortexa72". It should be "cortex-a72".

I cannot actually reproduce this. I've just named a thread Cortex‑-A72, which is a non-breaking hyphen followed by a standard hyphen (if you press Ctrl+F and search for a normal hyphen, only one will be highlighted in that thread name). In fact, saving this post will retain it, which is further evidence that we aren't doing anything explicit to strip it out via the input filterer.

In my example, the URL string is coming out as threads/cortex‑-a72.12 (again, only the right one is a standard hyphen).

So, this leads me to believe it isn't actually a non-breaking hyphen you're referring to, but perhaps something else.

The input filterer itself does strip out a bunch of control characters, and indeed non-breaking spaces and other zero-width characters, including soft hyphens. But it doesn't look like we attempt to strip anything else there that would appear as a hyphen and then be stripped out of a URL.

So, might need a more solid reproduction case here. Even so, not entirely sure we'd make changes here but certainly interested in seeing a reproducible example.

Steffen · Jul 13, 2018

Maybe I'll be able to have a deeper look in the coming days but what you could try: Maybe this issue only exists if the option "Romanize titles in URLs" is enabled?
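
A minimal sketch of why the romanize option could matter here, assuming the character filter shown in the diff above is what runs when it is enabled. The pattern has no /u modifier, so each of the three UTF-8 bytes of U+2011 falls outside the allowed set and is removed. The function name is hypothetical and not part of XenForo.

```php
<?php
// Hypothetical helper: applies only the character filter from the romanize branch.
function sketchRomanizeFilter(string $string): string
{
    return preg_replace('/[^a-zA-Z0-9_ -]/', '', $string);
}

// "\xE2\x80\x91" is the UTF-8 encoding of U+2011 (non-breaking hyphen).
$title = "Cortex\xE2\x80\x91A72";

echo sketchRomanizeFilter($title), "\n"; // CortexA72
```

A regular hyphen (U+002D) is in the allowed set, so "Cortex-A72" passes through this step unchanged.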

Steffen · Jul 13, 2018

Btw, my "final" fix for this issue a few weeks ago was to add $string = iconv('UTF-8', 'ASCII//TRANSLIT', $string); after $string = utf8_romanize(utf8_deaccent($string));.
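
A rough sketch of that transliteration step on its own. The utf8_romanize() and utf8_deaccent() calls named above are not reproduced, and the wrapper function is hypothetical. What //TRANSLIT produces depends on the iconv implementation PHP was built against, so the sketch prints the results without claiming a specific output, and falls back to the input if the conversion fails.

```php
<?php
// Hypothetical wrapper around the iconv call quoted in the post.
function sketchTranslit(string $string): string
{
    // @ suppresses the notice/warning iconv raises when it cannot convert.
    $converted = @iconv('UTF-8', 'ASCII//TRANSLIT', $string);

    // Assumption for this sketch: keep the original string on failure.
    return $converted === false ? $string : $converted;
}

$samples = [
    "Cortex\xE2\x80\x91A72", // U+2011 non-breaking hyphen
    "Windows\xC2\xA010",     // U+00A0 non-breaking space
];

foreach ($samples as $sample) {
    // Output varies by platform, so inspect it rather than rely on it.
    var_dump(sketchTranslit($sample));
}
```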

Chris D · Jul 13, 2018

Ah, you have romanize enabled... yes, that would be it.

Steffen said:
Btw, my "final" fix for this issue a few weeks ago was to add $string = iconv('UTF-8', 'ASCII//TRANSLIT', $string); after $string = utf8_romanize(utf8_deaccent($string));.

That seems to still strip the non-breaking hyphen in my testing.

In that case, I'm committing this change:

Diff:

Index: src/XF/Mvc/Router.php
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
--- src/XF/Mvc/Router.php    (date 1531486827000)
+++ src/XF/Mvc/Router.php    (date 1531487582000)
@@ -490,6 +490,8 @@
 
         if ($romanize)
         {
+            // Convert non-breaking hyphen to hyphen
+            $string = str_replace('‑', '-', $string);
             $string = preg_replace('/[^a-zA-Z0-9_ -]/', '', $string);
         }

Do you have an example of how non-breaking spaces are problematic? In my testing, these are stripped, but IMO they should be, so I don't think we need to make any changes there.
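
The committed change can be tried in isolation with a sketch like the following. It writes the needle as a byte escape so the character is visible in source, and it covers only the two lines from the diff; the function name is hypothetical and any later steps in the router are left out.

```php
<?php
// Hypothetical helper mirroring the romanize branch after the change.
function sketchRomanizeWithFix(string $string): string
{
    // Convert non-breaking hyphen (U+2011, "\xE2\x80\x91" in UTF-8) to a hyphen.
    $string = str_replace("\xE2\x80\x91", '-', $string);

    // Same character filter as in the diff.
    return preg_replace('/[^a-zA-Z0-9_ -]/', '', $string);
}

echo sketchRomanizeWithFix("Cortex\xE2\x80\x91A72"), "\n"; // Cortex-A72
```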

Steffen · Jul 14, 2018

Using $string = str_replace('‑', '-', $string); works fine for non-breaking hyphens.

The advantage of using $string = iconv('UTF-8', 'ASCII//TRANSLIT', $string); is that it handles other characters, too.

For example, × → x and ² → 2 (and non-breaking spaces).

Chris D said:
Do you have an example of how non-breaking spaces are problematic? In my testing, these are stripped, but IMO they should be, so I don't think we need to make any changes there.

Non-breaking spaces are used when you want to prevent a line-break that would make a sentence / title harder to read. For example, you might want to prevent a line-break in "Windows 10" or "July 14th" (it would look strange if "Windows" was the last word on one line and "10" the first word on the next line). I agree that this isn't usually something that your average forum user does but our editors use non-breaking spaces when writing headlines (which are then used as comment thread titles). I don't think that non-breaking spaces should be stripped from URLs, they should be converted to normal spaces (and finally to hyphens).
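
As a sketch of that behaviour: convert the non-breaking space to a normal space before the character filter runs. The last two steps (joining words with hyphens and lowercasing) are assumptions based on the "cortex-a72" URL in the first post, not code taken from the router, and the function name is hypothetical.

```php
<?php
// Hypothetical helper: non-breaking space -> space -> hyphen.
function sketchSlugWithNbsp(string $title): string
{
    // U+00A0 is "\xC2\xA0" in UTF-8; treat it like a normal space.
    $title = str_replace("\xC2\xA0", ' ', $title);

    // Character filter from the romanize branch shown in the diffs.
    $title = preg_replace('/[^a-zA-Z0-9_ -]/', '', $title);

    // Assumed final steps: collapse runs of spaces/hyphens into one hyphen, lowercase.
    $title = preg_replace('/[ -]+/', '-', trim($title));

    return strtolower($title);
}

echo sketchSlugWithNbsp("Windows\xC2\xA010"), "\n"; // windows-10
```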

Chris D · Jul 14, 2018

I was wrong to claim nbsps should be stripped; I may have been reading my test data wrong. They should be treated the same as normal spaces, so that's no problem.

I've done some further testing and the ASCII translit seems to be working properly after all, so all sorted. Thanks.
