XenForo currently completely removes nonbreaking hyphens (U+2011) from URLs. For example, the title "Cortex‑A72" is turned into the URL "cortexa72". It should be "cortex-a72".
(It seems such titles can only be created programmatically, not through a browser, because the input filterer appears to strip these characters. So maybe this is "not a bug".)
Edit: Or by using the importer.
Edit 2: Since the importer doesn't strip these characters and because they might be added using the API, I think it would make sense to handle them.
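To illustrate the behaviour being requested, here is a minimal sketch that maps hyphen-like code points to an ASCII hyphen before the rest of the title is reduced to a URL slug. The function name sketch_title_to_slug and the stripping rules are assumptions made for this example; this is not XenForo's actual URL routine, and the list of mapped characters is only illustrative.

```php
<?php
// Hypothetical helper for illustration only; not part of XenForo.
function sketch_title_to_slug(string $title): string
{
    // Map hyphen-like code points to the ASCII hyphen-minus first,
    // so they survive the ASCII-only filtering below.
    $title = strtr($title, [
        "\u{2010}" => '-', // HYPHEN
        "\u{2011}" => '-', // NON-BREAKING HYPHEN
    ]);

    // Drop every byte that is not an ASCII letter, digit, space or hyphen.
    // Without the mapping above, U+2011 would be removed at this step.
    $title = preg_replace('/[^A-Za-z0-9 \-]+/', '', $title);

    // Collapse runs of spaces and hyphens into a single hyphen.
    $title = preg_replace('/[ \-]+/', '-', $title);

    return strtolower(trim($title, '-'));
}

echo sketch_title_to_slug("Cortex\u{2011}A72"), "\n"; // prints "cortex-a72"
```

With the strtr() step removed, the same input would come out as "cortexa72", which matches the behaviour described in the report.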
I cannot actually reproduce this. I've just named a thread Cortex‑-A72, which contains a non-breaking hyphen followed by a standard hyphen (if you press Ctrl+F and search for a normal hyphen, only one will be highlighted in that title). Saving this post retains the character as well, which is further evidence that the input filterer does nothing explicit to strip it out.
In my example, the URL string is coming out as threads/cortex‑-a72.12 (again, only the right one is a standard hyphen).
This leads me to believe that the character you're referring to isn't actually a non-breaking hyphen, but something else.
The input filterer does strip a number of control characters, as well as non-breaking spaces and zero-width characters, including soft hyphens. However, it doesn't look like we strip anything else there that would appear as a hyphen and then be removed from a URL.
So we might need a more solid reproduction case here. Even then, I'm not entirely sure we'd make changes, but I'm certainly interested in seeing a reproducible example.
I may be able to take a deeper look in the coming days, but here is something you could try: maybe this issue only occurs if the option "Romanize titles in URLs" is enabled?
Btw, my "final" fix for this issue a few weeks ago was to add $string = iconv('UTF-8', 'ASCII//TRANSLIT', $string); after $string = utf8_romanize(utf8_deaccent($string));.
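For readers who want to try the transliteration step in isolation, the following is a small sketch of that iconv() call wrapped in a helper. The name sketch_translit_to_ascii and the fallback to the original string are assumptions for this example. utf8_romanize() and utf8_deaccent() are the calls named above and are not reproduced here. Note that the result of //TRANSLIT depends on the iconv implementation and locale of the system, so the output for a given character may differ between servers.

```php
<?php
// Hypothetical wrapper for illustration only; not part of XenForo.
function sketch_translit_to_ascii(string $string): string
{
    // Ask iconv to approximate non-ASCII characters with ASCII ones.
    // The @ suppresses the notice PHP raises when the conversion fails.
    $converted = @iconv('UTF-8', 'ASCII//TRANSLIT', $string);

    // iconv() returns false on failure (for example, invalid UTF-8 input);
    // in that case keep the original string rather than losing the title.
    return $converted === false ? $string : $converted;
}

// In the fix described above, this step would run after
// $string = utf8_romanize(utf8_deaccent($string));
$result = sketch_translit_to_ascii("Cortex\u{2011}A72");
echo $result, "\n"; // exact output depends on the platform's iconv
```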
Do you have an example of how non-breaking spaces are problematic? In my testing, these are stripped, but IMO they should be, so I don't think we need to make any changes there.
Non-breaking spaces are used when you want to prevent a line break that would make a sentence or title harder to read. For example, you might want to prevent a line break in "Windows 10" or "July 14th" (it would look strange if "Windows" was the last word on one line and "10" the first word on the next line). I agree that this isn't something your average forum user usually does, but our editors use non-breaking spaces when writing headlines (which are then used as comment thread titles). I don't think that non-breaking spaces should be stripped from URLs; they should be converted to normal spaces (and finally to hyphens).
I was wrong to claim nbsps should be stripped; I may have been reading my test data wrong. They should be treated the same as normal spaces, so that's no problem.
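The agreed behaviour for non-breaking spaces can be sketched as follows: U+00A0 is first turned into a normal space, and spaces are then turned into hyphens. The function name sketch_nbsp_to_hyphen is an assumption for this example, and the code only shows the space handling, not a full URL routine.

```php
<?php
// Hypothetical helper for illustration only; not part of XenForo.
function sketch_nbsp_to_hyphen(string $title): string
{
    // Treat NO-BREAK SPACE (U+00A0) exactly like a normal space.
    $title = str_replace("\u{00A0}", ' ', $title);

    // Collapse runs of spaces into a single hyphen.
    $title = preg_replace('/ +/', '-', trim($title));

    return strtolower($title);
}

echo sketch_nbsp_to_hyphen("Windows\u{00A0}10"), "\n"; // prints "windows-10"
```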
I've done some further testing and the ASCII translit seems to be working properly after all, so all sorted. Thanks.