1. This site uses cookies. By continuing to use this site, you are agreeing to our use of cookies. Learn More.

True UTF-8 support - MySQL utf8mb4

Discussion in 'XenForo Suggestions' started by Xon, Jan 4, 2015.

  1. Xon

    Xon Well-Known Member

    MySQL's utf8 data type only supports 3-byte characters, not the full 4 byte characters.

    MySQL only recently as of MySQL 5.5 now supports the full UTF-8 encoding set via the table coolation type utf8mb4.

    This means ios emoji icons (and likely other mobile devices who move to offer the same support) and non-English characters are impacted.

    After derailing a support ticket, and seeing it happening to another one;
    People involved in those last threads who may be interested: @RobinHood, @Tracy Perry, @cclaerhout, @Omar Bazavilvazo @tyteen4a03
     
    Dadparvar, EQnoble, rafass and 29 others like this.
  2. cclaerhout

    cclaerhout Well-Known Member

    If your SQL version is compatible with the utf8mb4 charset, here are the steps to follow to update your board:
    1. Backup your Database and perform a test at first on a local installation
    2. #SQL# Alter your SQL tables collation ; you can use the "utf8mb4_general_ci"
      If you only want to make a small test for posts, you will also have to alter any related tables ; ie:
      Code:
      ALTER TABLE xf_post CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
      ALTER TABLE xf_search_index CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci
      
    3. #PHP# Edit the class "XenForo_Application"
      Search for:
      PHP:
              'charset' => 'utf8'
      Replace with:
      PHP:
              'charset' => 'utf8mb4'
    4. #PHP# Edit the class "XenForo_Input"
      You will have to implement a characters conversion table in the function "cleanString" to safely bypass the below regex (ref):
      PHP:
      $string preg_replace('/[\xF0-\xF7].../'''$string);

      For those interested with the HKSCS to make the life of your Cantonese users easier, here's an helper ready to use. So to implement in the "cleanString" function, just proceed this way:
      PHP:
        public static function cleanString($string)
         {
           
      //Init the UTF8MB4 helper
           
      $extraHanzi = new Sedo_ExtraHanzi_Helper_Characters();
           
      //Encode characters using their unicode to bypass the safety regex
           
      $string $extraHanzi->encodeExtraHanzi($string);
           
           
      // only cover the BMP as MySQL only supports that
           
      $string preg_replace('/[\xF0-\xF7].../'''$string);

           
      //Get back characters under their original form
           
      $string $extraHanzi->decodeExtraHanzi($string);     

           return 
      strtr(strval($string), self::$_strClean);
         }
      Note that the helper can be extended easily. So if you want to support emoji, add their unicode (the ones with 5 characters) to the helper variable "$_extraHanziUnicodeTable" ; for example "1F603" to support the smiling face with open mouth.
    • Demo:
      demo.jpg
    Don't forget:
    If any script, even XenForo, revert your table collation to utf8_general_ci, all mb4 characters will be messed up and lost. So make backups.
     
    rafass, GliX, eva2000 and 5 others like this.
  3. Xon

    Xon Well-Known Member

    This is not a small change. At all.

    I echo @cclaerhout that people will need to backup their databases!

    As this change affects all user text input, basically any table which takes a variable length text string will probably also need updating.

    Essentially anything with entered text, user or admin could be effected.
     
    rafass, woi911, imthebest and 2 others like this.
  4. tyteen4a03

    tyteen4a03 Well-Known Member

    I truly hope XF2 will bring full UTF8 support. Thank god MySQL 5.1 is EOL.
     
    RhysR and Xon like this.
  5. Xon

    Xon Well-Known Member

    You need MySQL 5.5 for full UTF8 support, not 5.1 :(
     
  6. tyteen4a03

    tyteen4a03 Well-Known Member

    Which is why I mentioned that 5.1 has reached End Of Life.
     
  7. I think @tyteen4a03 means that thanks that MySQL 5.1 is EOL users will be forced to upgrade and MySQL 5.5 features will be more mainstream / used
     
  8. RobinHood

    RobinHood Well-Known Member

  9. arn

    arn Active Member

    has anyone actually made these changes on a production site?
     
    jeffwidman likes this.
  10. Tiki Tiki

    Tiki Tiki Active Member

    @arn, have you tried this?
     
  11. arn

    arn Active Member

    nope, never got the nerve to
     
  12. ErnieW

    ErnieW New Member

    I wish there was more emojis available. Hoping this will be in an upgrade soon.
     
  13. Daniel Hood

    Daniel Hood Well-Known Member

    This could be done with an add-on but it's nerve racking trusting customers to be responsible and make a back up of their database before installing.
     
    Xon and Ernest L. Defoe like this.
  14. Ernest L. Defoe

    Ernest L. Defoe Well-Known Member

    Yeah but that wouldn't be the addon developers problem if they specifically state to backup their database before installing. Then it's the users fault for not heeding the advice of the developer.
     
  15. Daniel Hood

    Daniel Hood Well-Known Member

    Looking at it, it'd still require making a direct edit to /library/XenForo/Input.php.

    Basically to make this an add-on all you have to do is make adapters that extend Zend/Db/Adapater/*.php. Make, for example:

    PHP:
    class UtfSupport_Zend_Db_Adapter_Mysqli extends Zend_Db_Adapter_Mysqli
    {
        public function 
    __construct($config)
        {
    // below this line until $config['charset'] was copied from the Zend_Db_Adapter_Mysqli class.
            
    if (!is_array($config)) {
                
    /*
                 * Convert Zend_Config argument to a plain array.
                 */
                
    if ($config instanceof Zend_Config) {
                    
    $config $config->toArray();
                } else {
                    
    /**
                     * @see Zend_Db_Adapter_Exception
                     */
                    
    require_once 'Zend/Db/Adapter/Exception.php';
                    throw new 
    Zend_Db_Adapter_Exception('Adapter parameters must be in an array or a Zend_Config object');
                }
            }
            
    $config['charset'] = 'utf8mb4';
            
    parent::__construct($config);
        }
    }
    Then in the installer convert the table charsets.

    Having to include the Sedo_ExtraHanzi_Helper_Characters helper that @cclaerhout, making people edit XenForo/Input.php directly, and trusting customers to back up their tables before doing this risky procedure all make me want to avoid this.
     
    Xon likes this.
  16. PunKeel

    PunKeel Member

    Hi there,

    I've done a simple addon translating the Input to the EmojiOne format, and then converting it back into smileys at render time ... Works well, without any impact on the database. Just saying ... ;)
     
  17. Dakota Storm

    Dakota Storm Well-Known Member

    couldn't the addon backup the tables? on the settings page you could have the admin enter the db details and do and automated backup of the tables before it lets them convert.
     
  18. Xon

    Xon Well-Known Member

    I'ld prefer an add-on didn't try to backup ~30gb of database. For a small forum this makes sense, but large ones? Nope.
     
    Daniel Hood and eva2000 like this.
  19. Tracy Perry

    Tracy Perry Well-Known Member

    What! You mean everyone doesn't have 2TB free on their server? ;)
     
  20. Biker

    Biker Well-Known Member

    Not so much having the space as it is having 3rd party code back up something that should be done from the CLI to begin with.
     
    Xon likes this.

Share This Page