
Not a Bug Some Unicode characters not recognized

Discussion in 'Resolved Bug Reports' started by Omar Bazavilvazo, Jan 2, 2015.

  1. Hi!

    I have on my old SMF this thread:
    http://hablajapones.org/foro/preguntas-comentarios/kanjis-suplementarios/new/#new
    upload_2015-1-2_0-13-26.png

    But when I create the same thread on XenForo, I can't post it.
    upload_2015-1-2_0-12-31.png

    The characters are from here:
    http://www.i18nguy.com/unicode/supplementary-test.html
    upload_2015-1-2_0-12-49.png


    Something I noticed is that they are 4-byte UTF-8 characters. Since the MySQL utf8 character set only supports up to 3 bytes per character, I think XenForo does not support the whole 1-4 byte UTF-8 range, only the BMP (as I read in other threads).
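To illustrate the byte lengths involved, here is a minimal PHP sketch. The helper name `hasFourByteUtf8` is made up for this example and is not part of XenForo; the two sample characters are U+65E5 (inside the BMP) and U+2070E (a supplementary-plane ideograph).

```php
<?php
// Hypothetical helper (not a XenForo function): reports whether a string
// contains at least one 4-byte UTF-8 sequence.
function hasFourByteUtf8($string)
{
    // A 4-byte sequence starts with a lead byte in 0xF0-0xF7 and is
    // followed by three continuation bytes (0x80-0xBF). No /u flag is
    // used, so the pattern is matched byte by byte.
    return preg_match('/[\xF0-\xF7][\x80-\xBF]{3}/', $string) === 1;
}

$bmpKanji   = "\xE6\x97\xA5";      // U+65E5, encoded in 3 bytes
$extraKanji = "\xF0\xA0\x9C\x8E";  // U+2070E, encoded in 4 bytes

echo strlen($bmpKanji), "\n";            // 3
echo strlen($extraKanji), "\n";          // 4
var_dump(hasFourByteUtf8($bmpKanji));    // bool(false)
var_dump(hasFourByteUtf8($extraKanji));  // bool(true)
```

Anything outside the BMP (code points above U+FFFF) needs 4 bytes in UTF-8, which is exactly what MySQL's 3-byte utf8 character set cannot store.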

    I already converted the DB to utf8mb4 and inserted those characters directly into MySQL, and XenForo can read and display those kanji.
    upload_2015-1-2_0-17-14.png

    So it seems that you "clean" the input from the text fields before storing the data in the database.

    1) Any plans for supporting utf8mb4 on MySQL 5.5.3+?
    2) Is there an easy way for me to change something in XenForo so that I can post and save 4-byte UTF-8 characters, now that my database allows it?

    Thanks!
     
    tyteen4a03 likes this.
  2. cclaerhout

    cclaerhout Well-Known Member

    @Omar Bazavilvazo
    You can modify the class "XenForo_Input" and the following function:
    PHP:
    /**
     * Cleans invalid characters out of a string, such as nulls, nbsp, \r, etc.
     * Characters may not strictly be invalid, but can cause confusion/bugs.
     *
     * @param string $string
     *
     * @return string
     */
    public static function cleanString($string)
    {
        // only cover the BMP as MySQL only supports that
        $string = preg_replace('/[\xF0-\xF7].../', '', $string);
        return strtr(strval($string), self::$_strClean);
    }
    Regex info

    Since there are fewer than 100 additional characters, you could leave the regex unmodified and simply replace these characters with a placeholder holding their Unicode code point before the regex runs, then restore them once it has completed:

    Example:
    PHP:
    public static function cleanString($string)
    {
        // swap the known extra characters for placeholders
        $string = MyClass_Helper_ExtraHanzi::encodeExtraHanzi($string);

        // only cover the BMP as MySQL only supports that
        $string = preg_replace('/[\xF0-\xF7].../', '', $string);

        // restore the extra characters from their placeholders
        $string = MyClass_Helper_ExtraHanzi::decodeExtraHanzi($string);

        return strtr(strval($string), self::$_strClean);
    }
    Then use this kind of helper:
    PHP:
    <?php

    class MyClass_Helper_ExtraHanzi
    {
        // Code points (hex) of the supplementary characters to preserve
        protected static $_extraHanziUnicodeTable = array(
            '2070E','20731','20779','20C53','20C78','20C96','20CCF','20CD5','20D15','20D7C',
            '20D7F','20E0E','20E0F','20E77','20E9D','20EA2','20ED7','20EF9','20EFA','20F2D',
            '20F2E','20F4C','20FB4','20FBC','20FEA','2105C','2106F','21075','21076','2107B',
            '210C1','210C9','211D9','220C7','227B5','22AD5','22B43','22BCA','22C51','22C55',
            '22CC2','22D08','22D4C','22D67','22EB3','23CB7','244D3','24DB8','24DEA','2512B',
            '26258','267CC','269F2','269FA','27A3E','2815D','28207','282E2','28CCA','28CCD',
            '28CD2','29D98'
        );

        protected static $_extraHanziCharactersReplacementTable;
        protected static $_extraHanziCharactersCharsTable;

        // Placeholders such as {u:2070E}, in the same order as the table above
        public static function getExtraHanziRemplacementTable()
        {
            if (!self::$_extraHanziCharactersReplacementTable)
            {
                foreach (self::$_extraHanziUnicodeTable as $v)
                {
                    self::$_extraHanziCharactersReplacementTable[] = '{u:'.$v.'}';
                }
            }

            return self::$_extraHanziCharactersReplacementTable;
        }

        // The actual UTF-8 characters, in the same order as the table above
        public static function getExtraHanziCharsTable()
        {
            if (!self::$_extraHanziCharactersCharsTable)
            {
                foreach (self::$_extraHanziUnicodeTable as $v)
                {
                    // The charset is passed explicitly: before PHP 5.4 the default is ISO-8859-1
                    self::$_extraHanziCharactersCharsTable[] = html_entity_decode("&#x{$v};", ENT_NOQUOTES, 'UTF-8');
                }
            }

            return self::$_extraHanziCharactersCharsTable;
        }

        // Characters -> placeholders (run before the cleaning regex)
        public static function encodeExtraHanzi($string)
        {
            $extraHanziChars = self::getExtraHanziCharsTable();
            $extraReplacements = self::getExtraHanziRemplacementTable();

            return str_replace($extraHanziChars, $extraReplacements, $string);
        }

        // Placeholders -> characters (run after the cleaning regex)
        public static function decodeExtraHanzi($string)
        {
            $extraHanziChars = self::getExtraHanziCharsTable();
            $extraReplacements = self::getExtraHanziRemplacementTable();

            return str_replace($extraReplacements, $extraHanziChars, $string);
        }
    }
     
    Last edited: Jan 2, 2015
    Mr. Goodie2Shoes, tyteen4a03 and Xon like this.
  3. Mike

    Mike XenForo Developer Staff Member

    As noted, XF only supports the BMP so removing characters outside it is expected. The above post does point to the area doing it (on new input).
     
  4. @cclaerhout thanks for the information! Very helpful, now I know where to look.

    @Mike are there any plans for official full support of utf8mb4 and 4-byte characters in XF2, or is that not planned?
     
  5. Xon

    Xon Well-Known Member

    @Omar Bazavilvazo utf8mb4 is MySQL's hack to work around the fact that their original utf8 encoding didn't actually support the full UTF-8 set. As far as I know, PHP's UTF-8 implementation handles 4-byte UTF-8 characters properly.

    And since XF has to support old versions of MySQL (ie anything less than MySQL 5.5), I don't see this changing any time soon.
     
  6. Yeah, there is no problem with PHP, but XenForo sanitizes/preprocesses input before inserting it into MySQL so as not to break it, so only characters of up to 3 bytes are stored.

    Now that MySQL supports 4-byte characters too, my question is whether XenForo will process them as well someday.
     
    Last edited: Jan 3, 2015
  7. tyteen4a03

    tyteen4a03 Well-Known Member

    As a potential Cantonese forum owner, this issue worries me, as about 1/5 of these characters are used in normal Cantonese conversations, with lots of them being interjections.

    @cclaerhout , I would appreciate it if you made an addon for this workaround.
     
  8. Xon

    Xon Well-Known Member

    XenForo_Input isn't extendable without using something like @Yoskaldyr 's CMF_Core addon or direct code edits.
     
  9. tyteen4a03

    tyteen4a03 Well-Known Member

    Depending on that addon is fine for me.
     
  10. cclaerhout

    cclaerhout Well-Known Member

    I've had a look. The XenForo_Input class can be extended using CMF, but the extended function (cleanString) doesn't process exactly the same data: the original one is still called one step earlier (caller: XenForo_ControllerHelper_Editor::convertEditorHtmlToBbCode). Steps to reproduce: dump the $string variable in both the original and the extended function, then write a message using one of these characters.

    @Yoskaldyr
    I added as an attachment the needed files to reproduce the steps if you are interested.

    @tyteen4a03
    Do you know how many characters from the HKSCS are encoded in 4 bytes? (I couldn't find this information.)

    About your question of an addon:
    1. I will not mess with SQL table encoding settings from an add-on installer.
    2. If you want the mb4 characters to be stored directly in the DB, you will have to do what @Omar Bazavilvazo did: work out which tables must be modified and convert them.
    3. If you don't want that, it's possible to replace the mb4 characters in the text with a placeholder holding their Unicode code point. The system could work like a BB code: search for the placeholder and use its code point to display the mb4 character. This kind of trick is easier than modifying all the needed tables, but it's less clean too: if you change forum software, you will have to write something to decode the placeholders there as well.
    Characters conversion table: pros & cons
    Depending on the number of 4-byte characters, the character conversion map above may or may not be a workable solution. If, as in the example above, there are fewer than 100, it's not really a problem: a search and replace (using PHP's str_replace function) should not affect performance. With a table of several thousand entries, it would be a different story. If your DB table encoding supports the characters, then why not: the conversion only runs on save. But the placeholder solution has to be ruled out in that case, because decoding the placeholders on every page view would consume too many server resources.
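One way to sidestep the table-size concern for the placeholder approach might be to compute the `{u:XXXXX}` placeholder from the bytes instead of looking it up. The sketch below is only an illustration under that assumption: `encodeSupplementary` and `decodeSupplementary` are hypothetical names, not part of XenForo or the attached add-on, and whether this is fast enough on a real forum would need measuring.

```php
<?php
// Hypothetical helpers: convert every 4-byte UTF-8 character to a
// {u:HEX} placeholder and back, without a lookup table.

function encodeSupplementary($string)
{
    // Byte-wise match (no /u flag) of a 4-byte UTF-8 sequence.
    // Overlong or out-of-range sequences are not validated here.
    return preg_replace_callback(
        '/[\xF0-\xF4][\x80-\xBF]{3}/',
        function ($m) {
            $b = array_values(unpack('C4', $m[0]));
            // Reassemble the code point from the payload bits of each byte.
            $codePoint = (($b[0] & 0x07) << 18)
                       | (($b[1] & 0x3F) << 12)
                       | (($b[2] & 0x3F) << 6)
                       |  ($b[3] & 0x3F);
            return '{u:' . strtoupper(dechex($codePoint)) . '}';
        },
        $string
    );
}

function decodeSupplementary($string)
{
    // Only placeholders for U+10000..U+10FFFF are turned back into characters.
    return preg_replace_callback(
        '/\{u:(10[0-9A-F]{4}|[1-9A-F][0-9A-F]{4})\}/',
        function ($m) {
            return html_entity_decode('&#x' . $m[1] . ';', ENT_NOQUOTES, 'UTF-8');
        },
        $string
    );
}

$original = "A\xF0\xA0\x9C\x8EB";            // "A", U+2070E, "B"
$encoded  = encodeSupplementary($original);
echo $encoded, "\n";                          // A{u:2070E}B
var_dump(decodeSupplementary($encoded) === $original); // bool(true)
```

Note that running both functions around the cleaning regex would let every 4-byte character through, which amounts to removing the regex. The sketch only makes sense where the placeholders themselves are stored and decoded at display time. As with the table-based helper, a user who types a literal `{u:2070E}` would see it decoded too.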

    Another (bad) solution would be to delete the regex:
    PHP:
    $string = preg_replace('/[\xF0-\xF7].../', '', $string);
    But this regex is only there for security reasons (with a simple explanation), so deleting it is not a great idea. It could be extended, but that would require knowing the range of the additional characters for a particular language, and that range should be kept narrow to avoid any exploit. That makes the character conversion table look quite attractive.
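As a rough illustration of what "extending" the regex to a limited range could look like, here is a sketch that strips every supplementary character except one allowed block. The function name `stripSupplementaryExcept` is hypothetical, and the allowed range (CJK Unified Ideographs Extension B, U+20000 to U+2A6DF) is an assumption chosen for the example, not something XenForo does.

```php
<?php
// Hypothetical variant of the cleaning step, not XenForo's own code.
function stripSupplementaryExcept($string)
{
    // With the /u flag, PCRE matches code points instead of bytes.
    // The negative lookahead spares the allowed block; every other
    // code point in U+10000..U+10FFFF is removed.
    $result = preg_replace(
        '/(?![\x{20000}-\x{2A6DF}])[\x{10000}-\x{10FFFF}]/u',
        '',
        $string
    );

    // preg_replace() returns null if the subject is not valid UTF-8;
    // this sketch rejects such input entirely.
    return $result === null ? '' : $result;
}

$emoji = "\xF0\x9F\x98\x80";  // U+1F600, outside the allowed block
$hanzi = "\xF0\xA0\x9C\x8E";  // U+2070E, inside the allowed block

var_dump(stripSupplementaryExcept("a{$emoji}b") === "ab");         // bool(true)
var_dump(stripSupplementaryExcept("a{$hanzi}b") === "a{$hanzi}b"); // bool(true)
```

Unlike the original byte-based pattern, this version depends on the input being valid UTF-8, so it behaves differently on malformed input. The database still has to use utf8mb4 for the kept characters to be stored.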


    In the archive, the class holding the character table is Sedo_ExtraHanzi_Helper_Characters; it just uses the code points from the link Omar Bazavilvazo gave us above. To use it, see the class Sedo_ExtraHanzi_XenForo_Input, which is supposed to extend the original XenForo input class but doesn't work (see the problem description above). The complementary code can still be added manually to the original class, XenForo_Input.

    Some other interesting documentation on the subject:
     

    Attached Files:

    Last edited: Jan 4, 2015
    Xon likes this.
  11. tyteen4a03

    tyteen4a03 Well-Known Member

    I think they're all 4byte?

    For me, I have no existing data as I'll (probably) be opening a new forum.
     
  12. Thanks @cclaerhout for the file!

    I never thought this would lead to such a long, interesting discussion.

    The characters I posted are just an example I grabbed to reproduce the error (or limitation). There are many more 4-byte Unicode characters, so it is not very practical to try to identify all of them.

    Thanks for the link, I will check it later.

    Still, an official answer from @Mike or someone else on staff would be interesting, to see whether they consider this a priority or not at all for the time being.
     
  13. Chris D

    Chris D XenForo Developer Staff Member

    Bug reports are not the best medium for discussing suggestions, and a suggestion may already exist for this, specifically with regard to emoji support, I think.

    We haven't exactly discussed this yet so whether it's something we would consider for XF2 I don't know. But support for it in XF1 is unlikely.
     
    Dinh Thanh likes this.
  14. cclaerhout

    cclaerhout Well-Known Member

    Here's the official hkscs-2008 mapping list to utf8. There are about 5000 entries (including basic symbols), and about 2700 seem to be outside the BMP (the ones whose code point has 5 hex digits). That's quite a lot. I can update the conversion table so you can check the performance for yourself.

    Edit:
    @Omar Bazavilvazo
    I've checked and all the demo characters from your link are included in the hkscs set.
     
    Last edited: Jan 4, 2015
    Xon likes this.
  15. cclaerhout

    cclaerhout Well-Known Member

    @tyteen4a03
    Class updated with some of the hkscs 2008 chars (Github). Don't forget that your database encoding still must be modified. I'll let @Omar Bazavilvazo explain how he did it.
     
    Xon and tyteen4a03 like this.
  16. Yoskaldyr

    Yoskaldyr Well-Known Member

    No, in the call stack XenForo_Input is a virtual proxy class extended from Sedo_ExtraHanzi_XenForo_Input and XFProxy_XenForo_Input (a copy of the original XenForo_Input).
     
  17. RobinHood

    RobinHood Well-Known Member

    Some of you might find the article in my latest post in this thread interesting.

    I figured it might be better discussed there instead of here as it's not a bug report.
     
