The original database was created with latin1_swedish_ci collation. The foreign languages were stored this way with the end user having to change the encoding on their browser for things to display correctly. Well to make things more interesting some posts would have a combination of languages in their field.
XF's import utility couldn't handle the mismash of the way the data is stored.
The first thing I needed to do was to convert the field to utf8. In order to do this I used a simple sql query that will convert and insert the converted field values into a new table.
CONVERT(CONVERT(UNHEX(HEX(forum.title)) USING cp1251) USING utf8) as title,
The reason this works is that MySql stores field values as blobs and not string literals. Hence why in sql you can do cast and convert. The reason to convert to HEX then UNHEX the field value is that convert has problems wth several charsets in properly recognizing the end-byte for 4-byte characters.
The only issue with this is you need to know what the intermediate charset is. In my case I know that the it is Russian and cp1251.
When the selected field values converted in displaying correctly I then renamed the original table to table name_bak and the new table name to the original table name.
When trying to import XF's import utility then will try and convert values using the function
There is a small issue because if a field value is already in utf8 format, it will try and re-convert the value. So I changed to the lines to the following.
if (preg_match('/[\x80-\xff]/', $string)) //changed to if (preg_match('/[\x80-\xff]/', $string) && !mb_check_encoding($string, 'UTF-8'))
The other change I made was
return preg_replace('/[\xF0-\xF7].../', '', $string); //changed to return ($string);