XF 1.1 Character encoding issues after vBulletin 3.8 import

Sidane

Active member
Apologies if there is an obvious answer to this elsewhere; I didn't find one after a brief search.

I'm in the process of prepping my site to migrate to XenForo from vBulletin 3.8. On my local test server (OS X 10.8.3, Apache 2.2, MySQL 5.1) I've done a full import of the data but am having character encoding issues all over the place.

Example:

Text in vBulletin

André Villas-Boas

At 33 years old, being one of the youngest ever coaches in Porto's history he was given the task of bringing the league title back to Porto. Not only has he done that, he’s done a lot more: He won the Portuguese Super Cup, the Portuguese League, the Portugues Cup and the Europa League. As a cherry on top of a cake, the Portuguese League was won in flawless fashion, with 27 wins, 3 draws and no defeats. An unbeatable run that had only happened once (Benfica in the 70’s) in the 77 years of history of the league. Best attack, best defense. As much as he tries, he can’t avoid the comparisons with Mourinho, who also won the three major competitions on his first season. Villas-Boas bettered it though, with the flawless league performance, and a Super Cup as an extra. The only competition Porto didn’t win was the League Cup, a relatively new cup where Porto suffered the first defeat of the season.
After importing to XenForo

André Villas-Boas

At 33 years old, being one of the youngest ever coaches in Porto's history he was given the task of bringing the league title back to Porto. Not only has he done that, he’s done a lot more: He won the Portuguese Super Cup, the Portuguese League, the Portugues Cup and the Europa League. As a cherry on top of a cake, the Portuguese League was won in flawless fashion, with 27 wins, 3 draws and no defeats. An unbeatable run that had only happened once (Benfica in the 70’s) in the 77 years of history of the league. Best attack, best defense. As much as he tries, he can’t avoid the comparisons with Mourinho, who also won the three major competitions on his first season. Villas-Boas bettered it though, with the flawless league performance, and a Super Cup as an extra. The only competition Porto didn’t win was the League Cup, a relatively new cup where Porto suffered the first defeat of the season.
My vBulletin installation is a standard one. Some character set queries on the VB database:

Code:
show variables like "character_set_database";
latin1
Code:
show variables like "collation_database";
latin1_swedish_ci
Code:
SELECT charset FROM language; 
ISO-8859-1
On the new XenForo database:

Code:
show variables like "character_set_database";
utf8
Code:
show variables like "collation_database";
utf8_general_ci
When I ran the XenForo importer, I didn't specify a value in the Force Character Set field as there is no encoding specified in my VB config.php.

Any help on why this is happening and a possible solution? I will be doing a full reimport again, but it takes about 16 hours and I want to make sure that there will be no encoding issues.

Thanks in advance! :)
 

Jake Bunce

XenForo moderator
Staff member
Sidane said:
When I ran the XenForo importer, I didn't specify a value in the Force Character Set field as there is no encoding specified in my VB config.php.

Try reinstalling XF and specifying utf8 for the charset during the import.
 

Sidane

Active member
Thanks, but does that value not represent the charset for the existing vBulletin database, i.e. latin1?
 

Jake Bunce

XenForo moderator
Staff member
It looks like the data is already utf8. The collations in your database may be incorrect.

If the import works with utf8 then you know that was the problem.
 

Sidane

Active member
I've set up a fresh instance of XenForo and imported all users with 'Force Character Set' set to 'utf8'.

The following user on the live vBulletin site has an À in his username, see http://www.redcafe.net/members/privateserve%C0%3F/

After this fresh import, the À is appearing as Ã:



So no joy there :( Any other ideas what could be wrong?
 

Jake Bunce

XenForo moderator
Staff member
I was about to ask for a copy of your db but then I saw 12 million posts. :eek:

The conversion function does rely on certain PHP extensions:

library/XenForo/Importer/Abstract.php

Rich (BB code):
	/**
	 * Convert the given text to valid UTF-8
	 *
	 * @param string $string
	 * @param boolean $entities Convert &lt; (and other) entities back to < characters
	 *
	 * @return string
	 */
	protected function _convertToUtf8($string, $entities = null)
	{
		// note: assumes charset is ascii compatible
		if (preg_match('/[\x80-\xff]/', $string))
		{
			if (function_exists('iconv'))
			{
				$string = @iconv($this->_charset, 'utf-8//IGNORE', $string);
			}
			else if (function_exists('mb_convert_encoding'))
			{
				$string = mb_convert_encoding($string, 'utf-8', $this->_charset);
			}
		}

		$string = utf8_unhtml($string, $entities);
		$string = preg_replace('/[\xF0-\xF7].../', '', $string);
		$string = preg_replace('/[\xF8-\xFB]..../', '', $string);
		return $string;
	}
}
Those two functions come from these extensions:

http://us3.php.net/manual/en/iconv.installation.php
http://us3.php.net/manual/en/mbstring.installation.php

If both are missing, the conversion is silently skipped and the text passes through unconverted. You can check this in your PHP configuration, or debug those functions to make sure they are working on your server.

Otherwise I can take a look if you give me access to your server.
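
For readers who would rather not trace the PHP, the conversion step quoted above can be approximated in Python. This is a rough sketch, not XenForo's code: `convert_to_utf8` and `source_charset` are made-up names standing in for `_convertToUtf8` and `$this->_charset`, the entity handling done by `utf8_unhtml` is left out, and `errors="ignore"` only approximates iconv's `//IGNORE`.

```python
import re

# Any byte outside the ASCII range means the text needs converting.
HIGH_BYTE = re.compile(rb"[\x80-\xff]")
# A 4-byte UTF-8 sequence: lead byte F0-F7 followed by three more bytes.
FOUR_BYTE_SEQ = re.compile(rb"[\xf0-\xf7]...", re.DOTALL)


def convert_to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Approximate the importer's conversion step (entity handling omitted)."""
    if HIGH_BYTE.search(raw):
        # Decode from the charset the importer believes the source uses,
        # dropping undecodable bytes, then re-encode as UTF-8.
        raw = raw.decode(source_charset, errors="ignore").encode("utf-8")
    # Strip 4-byte UTF-8 sequences, mirroring the first preg_replace call.
    return FOUR_BYTE_SEQ.sub(b"", raw)


if __name__ == "__main__":
    # 0xE9 is "é" in latin1; the result is its two-byte UTF-8 form.
    print(convert_to_utf8(b"Andr\xe9", "latin-1"))
```

The result is only correct when `source_charset` matches what the bytes really are, which is why the charset the importer is told to use matters so much.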
 

AlexT

Well-known member
FWIW, the code Jake cited is a good place to throw an exception if neither function exists.
 

Mike

XenForo developer
Staff member
I would actually try forcing the connection character set to latin1. The text is being "double converted": the data being given to XF is already in UTF-8, but because the settings in the DB say it's coming from latin1, it gets converted to UTF-8 a second time. Whenever you see "simple" accented characters turning into two characters (é becoming Ã©), it's almost always this.

If you're doing everything on the same server as vB, you shouldn't have to force the character set unless you are also forcing it in vB's config.php. If you're doing it on a different server, your MySQL config may be different, so all bets are off and you may need to add (or remove) something there.
 

Sidane

Active member
Both functions exist:

Code:
php -r "if (function_exists('iconv')) { echo 'yes'; } else { echo 'no'; }"
yes

Code:
php -r "if (function_exists('mb_convert_encoding')) { echo 'yes'; } else { echo 'no'; }"
yes

Mike said:
I would actually try forcing the connection character set to latin1. It's actually being "double converted". The data being given to XF is already in UTF-8, but because the settings in the DB think it's coming from latin1, it's converting that to UTF-8. Whenever you see "simple" accented characters going to 2 bytes, it's almost always this.

Thanks, will give that a try.
 

Sidane

Active member
Success! Setting Force Character Set to latin1 did the trick.

Thanks Mike!
 

JoseFebus

Member
Hi Sidane,

I hope you are doing great!

How were you able to force the character set?

Best Regards
 

Jeremy

Well-known member
Re-importing without re-installing will cause duplicated content.
 

JoseFebus

Member
I noticed the collation of the vB tables is latin1_swedish_ci. Should I use "latin1_swedish_ci" when importing?
 