Not a bug Some Unicode characters not recognized

Hi!

I have this thread on my old SMF forum:
http://hablajapones.org/foro/preguntas-comentarios/kanjis-suplementarios/new/#new
upload_2015-1-2_0-13-26.webp

But when I create the same thread on XenForo, I can't post it.
upload_2015-1-2_0-12-31.webp

The characters are from here:
http://www.i18nguy.com/unicode/supplementary-test.html
upload_2015-1-2_0-12-49.webp


Something I noticed is that they are 4-byte UTF-8 characters. Since MySQL's utf8 character set only supports up to 3 bytes per character, I think XenForo does not support the whole 1-4 byte UTF-8 range, only the BMP (as I read in other threads).
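
To check whether a given string really contains such 4-byte characters, a minimal sketch could look like the following. The function name hasSupplementaryChars is hypothetical and not part of XenForo; the two sample strings are written as raw UTF-8 bytes so the snippet does not depend on the file encoding.

```php
<?php
// Hypothetical helper (not part of XenForo): returns true when a UTF-8
// string contains at least one code point above U+FFFF, i.e. one that
// needs 4 bytes in UTF-8 and does not fit in MySQL's 3-byte utf8 charset.
function hasSupplementaryChars($string)
{
    // The /u modifier makes PCRE treat the pattern and the subject as UTF-8.
    // preg_match() returns false on invalid UTF-8, which also yields false here.
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $string) === 1;
}

$bmpChar = "\xE6\xBC\xA2";     // U+6F22, 3 bytes in UTF-8
$extChar = "\xF0\xA0\x9C\x8E"; // U+2070E, 4 bytes in UTF-8

var_dump(strlen($bmpChar), hasSupplementaryChars($bmpChar));
var_dump(strlen($extChar), hasSupplementaryChars($extChar));
```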

I already converted the DB to utf8mb4 and inserted those characters directly into MySQL, and I can read and display those kanji in XenForo:
upload_2015-1-2_0-17-14.webp

So it seems that before you store the data in the database, you "clean" the input from the text fields.

1) Any plans for supporting utf8mb4 on MySQL 5.5.3+?
2) Is there an easy way for me to change something in XenForo that will allow me to post and save 4-byte UTF-8 characters, now that my database allows it?

Thanks!
 
@Omar Bazavilvazo
You can modify the class "XenForo_Input" and the following function:
PHP:
   /**
    * Cleans invalid characters out of a string, such as nulls, nbsp, \r, etc.
    * Characters may not strictly be invalid, but can cause confusion/bugs.
    *
    * @param string $string
    *
    * @return string
    */
   public static function cleanString($string)
   {
     // only cover the BMP as MySQL only supports that
     $string = preg_replace('/[\xF0-\xF7].../', '', $string);
     return strtr(strval($string), self::$_strClean);
   }
Regex info

Since there are fewer than 100 additional characters, you could leave the regex unmodified and just replace these characters with a placeholder holding their Unicode code point before the regex runs, then restore them once the regex has completed:

Example:
PHP:
  public static function cleanString($string)
   {
     $string = MyClass_Helper_ExtraHanzi::encodeExtraHanzi($string);    

     // only cover the BMP as MySQL only supports that
     $string = preg_replace('/[\xF0-\xF7].../', '', $string);

     $string = MyClass_Helper_ExtraHanzi::decodeExtraHanzi($string);    

     return strtr(strval($string), self::$_strClean);
   }

Then use this kind of helper:
PHP:
<?php

class MyClass_Helper_ExtraHanzi
{
   protected static $_extraHanziUnicodeTable = array(
     '2070E','20731','20779','20C53','20C78','20C96','20CCF','20CD5','20D15','20D7C',
     '20D7F','20E0E','20E0F','20E77','20E9D','20EA2','20ED7','20EF9','20EFA','20F2D',
     '20F2E','20F4C','20FB4','20FBC','20FEA','2105C','2106F','21075','21076','2107B',
     '210C1','210C9','211D9','220C7','227B5','22AD5','22B43','22BCA','22C51','22C55',
     '22CC2','22D08','22D4C','22D67','22EB3','23CB7','244D3','24DB8','24DEA','2512B',
     '26258','267CC','269F2','269FA','27A3E','2815D','28207','282E2','28CCA','28CCD',
     '28CD2','29D98');
  
   protected static $_extraHanziCharactersReplacementTable;
   protected static $_extraHanziCharactersCharsTable;

   public static function getExtraHanziRemplacementTable()
   {
     if(!self::$_extraHanziCharactersReplacementTable)
     {
       foreach(self::$_extraHanziUnicodeTable as $v)
       {
         self::$_extraHanziCharactersReplacementTable[] = '{u:'.$v.'}';
       }
     }
    
     return self::$_extraHanziCharactersReplacementTable;
   }
  
   public static function getExtraHanziCharsTable()
   {
     if(!self::$_extraHanziCharactersCharsTable)
     {
       foreach(self::$_extraHanziUnicodeTable as $v)
       {
         // pass the charset explicitly: before PHP 5.4 the default is ISO-8859-1
         self::$_extraHanziCharactersCharsTable[] = html_entity_decode("&#x{$v};", ENT_QUOTES, 'UTF-8');
       }
     }
    
     return self::$_extraHanziCharactersCharsTable;
   }

   public static function encodeExtraHanzi($string)
   {
     $extraHanziChars = self::getExtraHanziCharsTable();
     $extraReplacements = self::getExtraHanziRemplacementTable();

     return str_replace($extraHanziChars, $extraReplacements, $string);
   }


   public static function decodeExtraHanzi($string)
   {
     $extraHanziChars = self::getExtraHanziCharsTable();
     $extraReplacements = self::getExtraHanziRemplacementTable();

     return str_replace($extraReplacements, $extraHanziChars, $string);
   }  
}
 
As noted, XF only supports the BMP so removing characters outside it is expected. The above post does point to the area doing it (on new input).
 
@Omar Bazavilvazo utf8mb4 is MySQL's hack to work around the fact that their original utf8 encoding didn't actually support the full UTF-8 set. As far as I know, PHP's UTF-8 handling deals with 4-byte UTF-8 characters properly.

And since XF has to support old versions of MySQL (i.e. anything older than MySQL 5.5), I don't see this changing any time soon.
 
Yeah, no problem with PHP, but XenForo sanitizes/preprocesses inputs before inserting them into MySQL so as not to break it, so only characters of up to 3 bytes are stored.

Now that MySQL supports 4-byte characters too, my question is whether XenForo will process them as well someday.
 
As a potential Cantonese forum owner, this issue worries me, as about 1/5 of these characters are used in normal Cantonese conversations, with lots of them being interjections.

@cclaerhout , I would appreciate it if you made an addon for this workaround.
 
XenForo_Input isn't extendable without using something like @Yoskaldyr 's CMF_Core addon or direct code edits.

I've had a look: the XenForo_Input class can be extended using CMF, but the extended function (cleanString) doesn't process exactly the same data, because the original one is still called one step earlier (caller: XenForo_ControllerHelper_Editor::convertEditorHtmlToBbCode). Steps to reproduce: dump the $string variable in both the original and the extended functions, then write a message using one of these characters.

@Yoskaldyr
I have attached the files needed to reproduce these steps, in case you are interested.

@tyteen4a03
Do you know how many characters from the HKSCS are encoded in 4 bytes? (I couldn't find this information.)

About your question regarding an addon:
  1. I will not mess with SQL table encoding settings from an addon installer.
  2. If you want the mb4 characters to be stored directly in the db, you will have to do what @Omar Bazavilvazo did: identify which tables must be modified.
  3. If you don't want that, it's possible to replace the mb4 characters in the text with a placeholder carrying their Unicode code point. The system could work like a BB code: search for the placeholder and use its Unicode data to display the mb4 character. This kind of trick is easier than modifying all the needed tables, but it's less clean too: if you change forum software, you will also have to write a converter to decode the placeholders.
Characters conversion table: pros & cons
Whether the character conversion map above is a workable solution depends on the number of 4-byte characters. If, as in the example above, there are fewer than 100, it's not really a problem: a search and replace (using the PHP str_replace function) should not impact performance. With a table of several thousand entries, that would no longer be the case. If your db table encoding supports these characters, then why not: the conversion would only run on saving. But the placeholder solution would have to be dropped, since decoding the placeholders on the fly at display time would consume too many server resources.
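
As a possible alternative to a fixed table, the placeholders could in principle be generated for any 4-byte character with a regex callback, so the cost would no longer depend on the table size. The following is only a rough sketch under assumptions: all four function names are hypothetical (they are neither XenForo code nor part of the attached archive), the input is assumed to be valid UTF-8, and the placeholder format {u:XXXXX} is the same one used by the helper above.

```php
<?php
// Hypothetical helpers, not part of XenForo or of the attached add-on.

// preg_replace_callback() callback: 4-byte UTF-8 character -> "{u:XXXXX}".
function encodeSupplementaryCallback(array $match)
{
    $c = $match[0];
    // 4-byte UTF-8 layout: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    $codePoint = ((ord($c[0]) & 0x07) << 18)
               | ((ord($c[1]) & 0x3F) << 12)
               | ((ord($c[2]) & 0x3F) << 6)
               |  (ord($c[3]) & 0x3F);

    return '{u:' . strtoupper(dechex($codePoint)) . '}';
}

// preg_replace_callback() callback: "{u:XXXXX}" -> 4-byte UTF-8 character.
function decodeSupplementaryCallback(array $match)
{
    $codePoint = hexdec($match[1]);
    if ($codePoint < 0x10000 || $codePoint > 0x10FFFF) {
        return $match[0]; // not a supplementary code point: leave the text alone
    }

    return chr(0xF0 | ($codePoint >> 18))
         . chr(0x80 | (($codePoint >> 12) & 0x3F))
         . chr(0x80 | (($codePoint >> 6) & 0x3F))
         . chr(0x80 | ($codePoint & 0x3F));
}

function encodeSupplementary($string)
{
    $result = preg_replace_callback('/[\x{10000}-\x{10FFFF}]/u', 'encodeSupplementaryCallback', $string);
    // preg_replace_callback() returns null on invalid UTF-8: keep the input unchanged then.
    return $result === null ? $string : $result;
}

function decodeSupplementary($string)
{
    $result = preg_replace_callback('/\{u:([0-9A-F]{5,6})\}/', 'decodeSupplementaryCallback', $string);
    return $result === null ? $string : $result;
}
```

One known limitation of this sketch: a message in which a user literally types something like {u:2070E} would also be decoded, so a real implementation would need an escaping rule for the placeholder syntax.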

Another (bad) solution would be to delete the regex:
PHP:
$string = preg_replace('/[\xF0-\xF7].../', '', $string);
But this regex is there only for security reasons (with a simple explanation), so deleting it is not the greatest idea. It could be extended, but this would require knowing the range of the additional characters for a particular language, and this range should be kept narrow to avoid any exploit. This is why the character conversion table seems quite attractive.
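
To illustrate what "extending" the regex to a limited range could look like, here is a minimal sketch that strips every character outside the BMP except one whitelisted block. Everything in it is an assumption made for the example: the function name is hypothetical, and the whitelisted block (CJK Unified Ideographs Extension B, U+20000 to U+2A6DF, which contains the demo characters listed above) would have to be adapted to the language concerned. It also only makes sense if the tables have already been converted to utf8mb4.

```php
<?php
// Hypothetical variant of the cleaning step, not XenForo code.
// Assumption: only CJK Unified Ideographs Extension B (U+20000-U+2A6DF)
// should survive; every other code point above U+FFFF is removed.
function stripSupplementaryExceptExtB($string)
{
    // /u makes PCRE work on UTF-8 code points instead of single bytes.
    $result = preg_replace('/[\x{10000}-\x{1FFFF}\x{2A6E0}-\x{10FFFF}]/u', '', $string);

    if ($result === null) {
        // Invalid UTF-8 makes the /u pattern fail: fall back to the
        // original byte-wise regex, which removes all 4-byte sequences.
        $result = preg_replace('/[\xF0-\xF7].../', '', $string);
    }

    return $result;
}

$emoji = "\xF0\x9F\x98\x80"; // U+1F600, outside the whitelisted block
$hanzi = "\xF0\xA0\x9C\x8E"; // U+2070E, inside the whitelisted block

var_dump(stripSupplementaryExceptExtB("x{$emoji}y{$hanzi}z") === "xy{$hanzi}z");
```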


In the archive, the class for the character table is Sedo_ExtraHanzi_Helper_Characters; it just uses the Unicode code points of the characters from the link Omar Bazavilvazo gave above. To use it, see the class Sedo_ExtraHanzi_XenForo_Input, which is supposed to extend the original XenForo input class, but that doesn't work (see the problem description above). The complementary code can still be added manually to the original class, XenForo_Input.

Some other interesting documentation on the subject:
 

Attachments

Thanks @cclaerhout for the file!

I never thought this would lead to such a long, interesting discussion.

The characters I posted are just an example I grabbed to reproduce the error (or limitation). There are many more 4-byte Unicode characters, so it is not very practical to try to identify all of them.

Thanks for the link, I will check it later.

Still, an official answer from @Mike or someone from staff would be interesting too, to see whether they consider this a priority or not at all for the time being.
 
Bug reports are not the best medium for discussing suggestions, and a suggestion may already exist for this, specifically with regard to emoji support, I think.

We haven't exactly discussed this yet so whether it's something we would consider for XF2 I don't know. But support for it in XF1 is unlikely.
 
I think they're all 4byte?

For me, I have no existing data as I'll (probably) be opening a new forum.
Here's the official HKSCS-2008 mapping list to UTF-8. There are about 5000 entries (including basic symbols), and about 2700 seem to be outside the BMP (the ones whose code point has 5 hex digits). That's quite a lot. I can update the conversion table so you can check the performance yourself.

Edit:
@Omar Bazavilvazo
I've checked and all the demo characters from your link are included in the hkscs set.
 
I've had a look, the XenForo_Input class can be extended using CMF, but the extended function (cleanString) doesn't process exactly the same data: the original one is still called one step before (caller: XenForo_ControllerHelper_Editor::convertEditorHtmlToBbCode)
No, in the call stack XenForo_Input is a virtual proxy class extended from Sedo_ExtraHanzi_XenForo_Input and XFProxy_XenForo_Input (a copy of the original XenForo_Input).
 
Some of you might find the article in my latest post in this thread interesting.

I figured it might be better discussed there instead of here as it's not a bug report.
 