XF 1.1 Character encoding issues after vBulletin 3.8 import

Sidane

Active member
Apologies if there is an obvious answer to this elsewhere; I didn't find one after a brief search.

I'm in the process of prepping my site to migrate to XenForo from vBulletin 3.8. On my local test server (OS X 10.8.3, Apache 2.2, MySQL 5.1) I've done a full import of the data but am having character encoding issues all over the place.

Example:

Text in vBulletin

André Villas-Boas

At 33 years old, being one of the youngest ever coaches in Porto's history he was given the task of bringing the league title back to Porto. Not only has he done that, he’s done a lot more: He won the Portuguese Super Cup, the Portuguese League, the Portugues Cup and the Europa League. As a cherry on top of a cake, the Portuguese League was won in flawless fashion, with 27 wins, 3 draws and no defeats. An unbeatable run that had only happened once (Benfica in the 70’s) in the 77 years of history of the league. Best attack, best defense. As much as he tries, he can’t avoid the comparisons with Mourinho, who also won the three major competitions on his first season. Villas-Boas bettered it though, with the flawless league performance, and a Super Cup as an extra. The only competition Porto didn’t win was the League Cup, a relatively new cup where Porto suffered the first defeat of the season.

After importing to XenForo

André Villas-Boas

At 33 years old, being one of the youngest ever coaches in Porto's history he was given the task of bringing the league title back to Porto. Not only has he done that, he’s done a lot more: He won the Portuguese Super Cup, the Portuguese League, the Portugues Cup and the Europa League. As a cherry on top of a cake, the Portuguese League was won in flawless fashion, with 27 wins, 3 draws and no defeats. An unbeatable run that had only happened once (Benfica in the 70’s) in the 77 years of history of the league. Best attack, best defense. As much as he tries, he can’t avoid the comparisons with Mourinho, who also won the three major competitions on his first season. Villas-Boas bettered it though, with the flawless league performance, and a Super Cup as an extra. The only competition Porto didn’t win was the League Cup, a relatively new cup where Porto suffered the first defeat of the season.

My vBulletin installation is a standard one. Some character set queries on the VB database:

Code:
show variables like "character_set_database";
latin1

Code:
show variables like "collation_database";
latin1_swedish_ci

Code:
SELECT charset FROM language; 
ISO-8859-1

On the new XenForo database:

Code:
show variables like "character_set_database";
utf8

Code:
show variables like "collation_database";
utf8_general_ci

When I ran the XenForo importer, I didn't specify a value in the Force Character Set field as there is no encoding specified in my VB config.php.

Any help on why this is happening and a possible solution? I will be doing a full reimport again but it takes about 16 hours and I want to make sure that there will be no encoding issues.

Thanks in advance! :)
 
I've set up a fresh instance of XenForo and imported all users with 'Force Character Set' set to 'utf8'.

The following user on the live vBulletin site has an À in his username, see http://www.redcafe.net/members/privateserve%C0%3F/

After this fresh import, the À is appearing as Ã:

[Attachment: username_encoding.png]


So no joy there :( Any other ideas what could be wrong?
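As a side check, the percent-encoded profile URL quoted above hints at which encoding the live site serves. The following is a minimal Python sketch, not part of vBulletin or XenForo; the variable names are illustrative only. It decodes the path segment under both encodings and shows how À would be percent-encoded in each.

```python
from urllib.parse import quote, unquote

# Path segment taken from the profile URL quoted above.
segment = "privateserve%C0%3F"

# Decoded as ISO-8859-1, the single byte 0xC0 is the character À.
as_latin1 = unquote(segment, encoding="latin-1")
print(as_latin1)  # privateserveÀ?

# Decoded as UTF-8, 0xC0 is not a valid byte, so it is replaced with U+FFFD.
as_utf8 = unquote(segment, encoding="utf-8", errors="replace")
print(as_utf8)

# For comparison: how À is percent-encoded under each encoding.
utf8_form = quote("À", encoding="utf-8")
latin1_form = quote("À", encoding="latin-1")
print(utf8_form)    # %C3%80
print(latin1_form)  # %C0
```

A single `%C0` byte is consistent with the live site storing and serving the name as latin1 rather than UTF-8.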
 
I was about to ask for a copy of your db but then I saw 12 million posts. :o

The conversion function does rely on certain PHP extensions:

library/XenForo/Importer/Abstract.php

Rich (BB code):
	/**
	 * Convert the given text to valid UTF-8
	 *
	 * @param string $string
	 * @param boolean $entities Convert &lt; (and other) entities back to < characters
	 *
	 * @return string
	 */
	protected function _convertToUtf8($string, $entities = null)
	{
		// note: assumes charset is ascii compatible
		if (preg_match('/[\x80-\xff]/', $string))
		{
			if (function_exists('iconv'))
			{
				$string = @iconv($this->_charset, 'utf-8//IGNORE', $string);
			}
			else if (function_exists('mb_convert_encoding'))
			{
				$string = mb_convert_encoding($string, 'utf-8', $this->_charset);
			}
		}

		$string = utf8_unhtml($string, $entities);
		$string = preg_replace('/[\xF0-\xF7].../', '', $string);
		$string = preg_replace('/[\xF8-\xFB]..../', '', $string);
		return $string;
	}
}

Those two functions come from these extensions:

http://us3.php.net/manual/en/iconv.installation.php
http://us3.php.net/manual/en/mbstring.installation.php

If both are missing, the conversion is skipped and the text is left unconverted. You can check this in your PHP configuration, or debug those functions to make sure they work on your server.

Otherwise I can take a look if you give me access to your server.
 
I would try forcing the connection character set to latin1. The text is being "double converted": the data handed to XF is already UTF-8, but because the database settings say it is latin1, it gets converted to UTF-8 a second time. Whenever you see "simple" accented characters turning into two characters, it's almost always this.

If you're doing everything on the same server as vB, you shouldn't have to force the character set unless you are forcing one in vB's config.php. If you're importing on a different server, your MySQL config may differ, so all bets are off and you may need to add (or remove) something there.
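The double conversion described here can be reproduced outside the importer. This is a minimal Python sketch, not the XenForo code path; `double_convert` is a hypothetical helper, and it assumes MySQL's latin1 behaves like Windows-1252 (cp1252) for the characters involved.

```python
def double_convert(text: str) -> str:
    """Simulate converting already-UTF-8 data from latin1 to UTF-8 again."""
    utf8_bytes = text.encode("utf-8")   # the data is already UTF-8
    # Misreading those bytes as latin1 (cp1252) turns every byte into
    # its own character, which is then stored as UTF-8.
    return utf8_bytes.decode("cp1252")


print(double_convert("André"))  # AndrÃ©
print(double_convert("he’s"))   # he’s
```

The output matches the damage in the imported post: é (two UTF-8 bytes) becomes two characters, and the curly apostrophe (three UTF-8 bytes) becomes three.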
 
Both functions exist:

Code:
php -r "if (function_exists('iconv')) { echo 'yes'; } else { echo 'no'; }"
yes


Code:
php -r "if (function_exists('mb_convert_encoding')) { echo 'yes'; } else { echo 'no'; }"
yes

Thanks, will give that a try.
 
Success! Setting Force Character Set to latin1 did the trick.

Thanks Mike!
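
Before committing to another long import, a few imported strings can be spot-checked for the double-conversion pattern. The following is a minimal Python sketch under the assumption that the damage is UTF-8 misread as cp1252; `repair_double_encoding` and `looks_double_encoded` are hypothetical helpers, not part of XenForo.

```python
def repair_double_encoding(text: str) -> str:
    """Undo one round of UTF-8-read-as-cp1252 damage; return text unchanged if it does not apply."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or the bytes are not valid UTF-8:
        # the text was not damaged in this particular way.
        return text


def looks_double_encoded(text: str) -> bool:
    """True if reversing the misread yields a different, valid string."""
    return repair_double_encoding(text) != text


print(looks_double_encoded("AndrÃ©"))  # True
print(looks_double_encoded("André"))   # False
```

This is a heuristic only: text that legitimately contains a sequence such as "Ã©" would also be flagged, so it is better suited to sampling a few rows than to rewriting a database.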
 