• This site uses cookies. By continuing to use this site, you are agreeing to our use of cookies. Learn more.

Implemented True UTF-8 support - MySQL utf8mb4

Xon

Well-known member
#1
MySQL's utf8 data type only supports 3-byte characters, not the full 4 byte characters.

MySQL only recently as of MySQL 5.5 now supports the full UTF-8 encoding set via the table coolation type utf8mb4.

This means ios emoji icons (and likely other mobile devices who move to offer the same support) and non-English characters are impacted.

After derailing a support ticket, and seeing it happening to another one;
People involved in those last threads who may be interested: @RobinHood, @Tracy Perry, @cclaerhout, @Omar Bazavilvazo @tyteen4a03
 

cclaerhout

Well-known member
#2
If your SQL version is compatible with the utf8mb4 charset, here are the steps to follow to update your board:
  1. Backup your Database and perform a test at first on a local installation
  2. #SQL# Alter your SQL tables collation ; you can use the "utf8mb4_general_ci"
    If you only want to make a small test for posts, you will also have to alter any related tables ; ie:
    Code:
    ALTER TABLE xf_post CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
    ALTER TABLE xf_search_index CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci
  3. #PHP# Edit the class "XenForo_Application"
    Search for:
    PHP:
            'charset' => 'utf8'
    Replace with:
    PHP:
            'charset' => 'utf8mb4'
  4. #PHP# Edit the class "XenForo_Input"
    You will have to implement a characters conversion table in the function "cleanString" to safely bypass the below regex (ref):
    PHP:
    $string = preg_replace('/[\xF0-\xF7].../', '', $string);

    For those interested with the HKSCS to make the life of your Cantonese users easier, here's an helper ready to use. So to implement in the "cleanString" function, just proceed this way:
    PHP:
      public static function cleanString($string)
       {
         //Init the UTF8MB4 helper
         $extraHanzi = new Sedo_ExtraHanzi_Helper_Characters();
         //Encode characters using their unicode to bypass the safety regex
         $string = $extraHanzi->encodeExtraHanzi($string);
         
         // only cover the BMP as MySQL only supports that
         $string = preg_replace('/[\xF0-\xF7].../', '', $string);
    
         //Get back characters under their original form
         $string = $extraHanzi->decodeExtraHanzi($string);     
    
         return strtr(strval($string), self::$_strClean);
       }
    Note that the helper can be extended easily. So if you want to support emoji, add their unicode (the ones with 5 characters) to the helper variable "$_extraHanziUnicodeTable" ; for example "1F603" to support the smiling face with open mouth.
  • Demo:
    demo.jpg
Don't forget:
If any script, even XenForo, revert your table collation to utf8_general_ci, all mb4 characters will be messed up and lost. So make backups.
 

Xon

Well-known member
#3
This is not a small change. At all.

I echo @cclaerhout that people will need to backup their databases!

If your SQL version is compatible with the utf8mb4 charset, here are the steps to follow to update your board:
  1. Backup your Database and perform a test at first on a local installation
  2. #SQL# Alter your SQL tables collation ; you can use the "utf8mb4_general_ci"
    If you only want to make a small test for posts, you will also have to alter any related tables ; ie:
    Code:
    ALTER TABLE xf_post CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
    ALTER TABLE xf_search_index CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci
As this change affects all user text input, basically any table which takes a variable length text string will probably also need updating.

Essentially anything with entered text, user or admin could be effected.
 

Ernest L. Defoe

Well-known member
#14
This could be done with an add-on but it's nerve racking trusting customers to be responsible and make a back up of their database before installing.
Yeah but that wouldn't be the addon developers problem if they specifically state to backup their database before installing. Then it's the users fault for not heeding the advice of the developer.
 

Daniel Hood

Well-known member
#15
Looking at it, it'd still require making a direct edit to /library/XenForo/Input.php.

Basically to make this an add-on all you have to do is make adapters that extend Zend/Db/Adapater/*.php. Make, for example:

PHP:
class UtfSupport_Zend_Db_Adapter_Mysqli extends Zend_Db_Adapter_Mysqli
{
    public function __construct($config)
    {
// below this line until $config['charset'] was copied from the Zend_Db_Adapter_Mysqli class.
        if (!is_array($config)) {
            /*
             * Convert Zend_Config argument to a plain array.
             */
            if ($config instanceof Zend_Config) {
                $config = $config->toArray();
            } else {
                /**
                 * @see Zend_Db_Adapter_Exception
                 */
                require_once 'Zend/Db/Adapter/Exception.php';
                throw new Zend_Db_Adapter_Exception('Adapter parameters must be in an array or a Zend_Config object');
            }
        }
        $config['charset'] = 'utf8mb4';
        parent::__construct($config);
    }
}
Then in the installer convert the table charsets.

Having to include the Sedo_ExtraHanzi_Helper_Characters helper that @cclaerhout, making people edit XenForo/Input.php directly, and trusting customers to back up their tables before doing this risky procedure all make me want to avoid this.
 
#16
Hi there,

I've done a simple addon translating the Input to the EmojiOne format, and then converting it back into smileys at render time ... Works well, without any impact on the database. Just saying ... ;)
 

Dakota Storm

Well-known member
#17
Looking at it, it'd still require making a direct edit to /library/XenForo/Input.php.

Basically to make this an add-on all you have to do is make adapters that extend Zend/Db/Adapater/*.php. Make, for example:

PHP:
class UtfSupport_Zend_Db_Adapter_Mysqli extends Zend_Db_Adapter_Mysqli
{
    public function __construct($config)
    {
// below this line until $config['charset'] was copied from the Zend_Db_Adapter_Mysqli class.
        if (!is_array($config)) {
            /*
             * Convert Zend_Config argument to a plain array.
             */
            if ($config instanceof Zend_Config) {
                $config = $config->toArray();
            } else {
                /**
                 * @see Zend_Db_Adapter_Exception
                 */
                require_once 'Zend/Db/Adapter/Exception.php';
                throw new Zend_Db_Adapter_Exception('Adapter parameters must be in an array or a Zend_Config object');
            }
        }
        $config['charset'] = 'utf8mb4';
        parent::__construct($config);
    }
}
Then in the installer convert the table charsets.

Having to include the Sedo_ExtraHanzi_Helper_Characters helper that @cclaerhout, making people edit XenForo/Input.php directly, and trusting customers to back up their tables before doing this risky procedure all make me want to avoid this.
couldn't the addon backup the tables? on the settings page you could have the admin enter the db details and do and automated backup of the tables before it lets them convert.
 

Biker

Well-known member
#20
Not so much having the space as it is having 3rd party code back up something that should be done from the CLI to begin with.