converting to utf-8 issues

EasyTarget · Oct 12, 2011

I doing a conversion for a forum with multiple languages and I ran into an issue with the vBulletin import utility. Hopefully my solution will help others in the future.

The original database was created with latin1_swedish_ci collation. The foreign languages were stored this way with the end user having to change the encoding on their browser for things to display correctly. Well to make things more interesting some posts would have a combination of languages in their field.

XF's import utility couldn't handle the mismash of the way the data is stored.

The first thing I needed to do was to convert the field to utf8. In order to do this I used a simple sql query that will convert and insert the converted field values into a new table.

Code:

CONVERT(CONVERT(UNHEX(HEX(forum.title)) USING cp1251) USING utf8) as title,

What it does is translates the field value to hex and back, then converts it from the latin1 charset to cp1251, and then converts it from cp1251 to utf8.

The reason this works is that MySql stores field values as blobs and not string literals. Hence why in sql you can do cast and convert. The reason to convert to HEX then UNHEX the field value is that convert has problems wth several charsets in properly recognizing the end-byte for 4-byte characters.

The only issue with this is you need to know what the intermediate charset is. In my case I know that the it is Russian and cp1251.

When the selected field values converted in displaying correctly I then renamed the original table to table name_bak and the new table name to the original table name.

When trying to import XF's import utility then will try and convert values using the function
_convertToUtf8

There is a small issue because if a field value is already in utf8 format, it will try and re-convert the value. So I changed to the lines to the following.

PHP:

if (preg_match('/[\x80-\xff]/', $string))

//changed to

if (preg_match('/[\x80-\xff]/', $string) && !mb_check_encoding($string, 'UTF-8'))

rationale behind this is that if the field is already utf8 encoded, we skip this process.

The other change I made was

PHP:

return preg_replace('/[\xF0-\xF7].../', '', $string);
//changed to
return ($string);

because there is no need to strip out these chars.

converting to utf-8 issues

EasyTarget

Well-known member

Similar threads

We value your privacy