
XF 1.1 Character encoding issues after vBulletin 3.8 import

Sidane

Active member
#1
Apologies if there is an obvious answer to this elsewhere; I didn't find one after a brief search.

I'm in the process of prepping my site to migrate to XenForo from vBulletin 3.8. On my local test server (OS X 10.8.3, Apache 2.2, MySQL 5.1) I've done a full import of the data but am having character encoding issues all over the place.

Example:

Text in vBulletin

André Villas-Boas

At 33 years old, being one of the youngest ever coaches in Porto's history he was given the task of bringing the league title back to Porto. Not only has he done that, he’s done a lot more: He won the Portuguese Super Cup, the Portuguese League, the Portugues Cup and the Europa League. As a cherry on top of a cake, the Portuguese League was won in flawless fashion, with 27 wins, 3 draws and no defeats. An unbeatable run that had only happened once (Benfica in the 70’s) in the 77 years of history of the league. Best attack, best defense. As much as he tries, he can’t avoid the comparisons with Mourinho, who also won the three major competitions on his first season. Villas-Boas bettered it though, with the flawless league performance, and a Super Cup as an extra. The only competition Porto didn’t win was the League Cup, a relatively new cup where Porto suffered the first defeat of the season.
After importing to XenForo

André Villas-Boas

At 33 years old, being one of the youngest ever coaches in Porto's history he was given the task of bringing the league title back to Porto. Not only has he done that, he’s done a lot more: He won the Portuguese Super Cup, the Portuguese League, the Portugues Cup and the Europa League. As a cherry on top of a cake, the Portuguese League was won in flawless fashion, with 27 wins, 3 draws and no defeats. An unbeatable run that had only happened once (Benfica in the 70’s) in the 77 years of history of the league. Best attack, best defense. As much as he tries, he can’t avoid the comparisons with Mourinho, who also won the three major competitions on his first season. Villas-Boas bettered it though, with the flawless league performance, and a Super Cup as an extra. The only competition Porto didn’t win was the League Cup, a relatively new cup where Porto suffered the first defeat of the season.
My vBulletin installation is a standard one. Some character set queries on the VB database:

Code:
show variables like "character_set_database";
latin1
Code:
show variables like "collation_database";
latin1_swedish_ci
Code:
SELECT charset FROM language; 
ISO-8859-1
On the new XenForo database:

Code:
show variables like "character_set_database";
utf8
Code:
show variables like "collation_database";
utf8_general_ci
When I ran the XenForo importer, I didn't specify a value in the Force Character Set field as there is no encoding specified in my VB config.php.

Any help on why this is happening, and a possible solution? I will be doing a full reimport, but it takes about 16 hours and I want to make sure beforehand that there will be no encoding issues.

Thanks in advance! :)
 

Jake Bunce

XenForo moderator
Staff member
#4
It looks like the data is already utf8. The collations in your database may be incorrect.

If the import works with utf8 then you know that was the problem.
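
One way to test the "already utf8" theory before another 16-hour import is to look at the raw bytes of a few affected posts. The following is a minimal sketch, not part of XenForo or vBulletin: `looksLikeUtf8()` is a hypothetical helper, and the two sample strings stand in for values you would read from the vB database yourself.

```php
<?php
// Hypothetical helper (not part of XenForo or vBulletin): reports whether a
// byte string contains non-ASCII bytes that form well-formed UTF-8.
function looksLikeUtf8(string $bytes): bool
{
    // Pure ASCII is identical in latin1 and UTF-8, so it proves nothing.
    if (!preg_match('/[\x80-\xff]/', $bytes)) {
        return false;
    }
    // mb_check_encoding() is true only if every sequence is valid UTF-8.
    return mb_check_encoding($bytes, 'UTF-8');
}

$utf8Sample   = "Andr\xC3\xA9"; // "André" encoded as UTF-8
$latin1Sample = "Andr\xE9";     // "André" encoded as ISO-8859-1

var_dump(looksLikeUtf8($utf8Sample));   // bool(true)
var_dump(looksLikeUtf8($latin1Sample)); // bool(false)
```

If real post text from the latin1 tables comes back `true`, the bytes are most likely UTF-8 stored under a latin1 label.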
 

Jake Bunce

XenForo moderator
Staff member
#7
I was about to ask for a copy of your db but then I saw 12 million posts. :eek:

The conversion function does rely on certain PHP extensions:

library/XenForo/Importer/Abstract.php

Code:
	/**
	 * Convert the given text to valid UTF-8
	 *
	 * @param string $string
	 * @param boolean $entities Convert &lt; (and other) entities back to < characters
	 *
	 * @return string
	 */
	protected function _convertToUtf8($string, $entities = null)
	{
		// note: assumes charset is ascii compatible
		if (preg_match('/[\x80-\xff]/', $string))
		{
			if (function_exists('iconv'))
			{
				$string = @iconv($this->_charset, 'utf-8//IGNORE', $string);
			}
			else if (function_exists('mb_convert_encoding'))
			{
				$string = mb_convert_encoding($string, 'utf-8', $this->_charset);
			}
		}

		$string = utf8_unhtml($string, $entities);
		$string = preg_replace('/[\xF0-\xF7].../', '', $string);
		$string = preg_replace('/[\xF8-\xFB]..../', '', $string);
		return $string;
	}
}
Those two functions come from these extensions:

http://us3.php.net/manual/en/iconv.installation.php
http://us3.php.net/manual/en/mbstring.installation.php

If both are missing then it would fail to convert. This is something you can check in your PHP configuration, or debug those functions to make sure they are working on your server.

Otherwise I can take a look if you give me access to your server.
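
As a rough way to "debug those functions", the sketch below goes one step beyond `function_exists()` and checks that each available extension really converts a known latin1 byte into the expected UTF-8 bytes. The variable names are illustrative only, and it assumes the local iconv build accepts the same `//IGNORE` suffix the importer uses.

```php
<?php
// One latin1 byte: 0xE9 is "é" in ISO-8859-1.
$latin1 = "\xE9";
// The same character in UTF-8 is the two bytes 0xC3 0xA9.
$expected = "\xC3\xA9";

$results = [];
if (function_exists('iconv')) {
    // Same target string the importer passes to iconv().
    $results['iconv'] = iconv('ISO-8859-1', 'UTF-8//IGNORE', $latin1) === $expected;
}
if (function_exists('mb_convert_encoding')) {
    $results['mbstring'] = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1') === $expected;
}

foreach ($results as $extension => $ok) {
    echo $extension, ': ', $ok ? 'ok' : 'FAILED', "\n";
}
if (!$results) {
    echo "neither extension is available\n";
}
```

Run it with the same PHP binary and configuration that serves the importer, since the CLI and the web server can load different extensions.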
 

Mike

XenForo developer
Staff member
#9
I would actually try forcing the connection character set to latin1. The text is being "double converted": the data being given to XF is already in UTF-8, but because the settings in the DB say it's latin1, it gets converted to UTF-8 a second time. Whenever you see "simple" accented characters turning into two characters (é becoming Ã©), it's almost always this.

If you're doing everything on the same server as vB, you shouldn't have to force the character set unless you also force it in vB's config.php. If you're doing it on a different server, your MySQL config may be different, so all bets are off and you may need to add (or remove) something there.
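
The double conversion can be reproduced outside the importer. This is a minimal sketch under two assumptions: the stored bytes are UTF-8, and MySQL's latin1 is treated as Windows-1252 (which is what turns ’ into â€™ in the example post). It is not the importer's code, and the variable names are made up for illustration.

```php
<?php
// Bytes as they sit in the vB tables: "André he’s", already encoded as UTF-8.
$stored = "Andr\xC3\xA9 he\xE2\x80\x99s";

// The mistake: treat each byte as a Windows-1252 character and encode to UTF-8 again.
$doubleEncoded = mb_convert_encoding($stored, 'UTF-8', 'Windows-1252');
echo $doubleEncoded, "\n"; // AndrÃ© heâ€™s

// Reversing that single step restores the original bytes.
$repaired = mb_convert_encoding($doubleEncoded, 'Windows-1252', 'UTF-8');
var_dump($repaired === $stored); // bool(true)
```

The output matches the garbled text in the first post, which is consistent with the data being UTF-8 that was converted once more on the way in.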
 

Sidane

Active member
#10
Both functions exist:

Code:
php -r "if (function_exists('iconv')) { echo 'yes'; } else { echo 'no'; }"
yes

Code:
php -r "if (function_exists('mb_convert_encoding')) { echo 'yes'; } else { echo 'no'; }"
yes

Thanks Mike, I'll try forcing the character set to latin1.
 

Sidane

Active member
#11
Success! Setting Force Character Set to latin1 did the trick.

Thanks Mike!