XF 1.5 vb3.8 import - messed up the character set in live import - latin1 to utf8

Travis-Mc

Member
I tested and imported one vb3.8 site to xenforo a few months ago and it went well. Then a couple of weeks ago I did another forum, but instead of putting latin1_swedish_ci in the Force Character Set field, I accidentally put $config['Mysqli'['latin1_swedish_ci'] from my notes from previous testing.

It didn't work and it took 3 weeks for me to find a problem on the live site.

So now I have the same problem many have run into during testing where some characters such as é now show up as é. The problem is I've been live for 3 weeks so I can't go back.

I thought about finding every character in latin1 that is different from utf8 (I'm guessing there are less than 150), then search the database for all affected columns. Then do a query for each column like this with all of the replacements (not just the three below):

Code:
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'á','á');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'ä','ä');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'é','é');
....

I thought there had to be a better way. After spending hours reading blogs and stackoverflow I'm hoping I found one.

What I did was created a copy of the live site to a test site.

In the database for the test site, I converted the column custom_fields in the table xf_user_profile from mediumblob to mediumtext, since I found out that that contents of the blob won't be converted with the next steps because they are already stored as binary.

Then I did a database dump on the test db forcing the character set to latin1 like this:
Code:
mysqldump -u username -p --opt --quote-names --skip-set-charset --default-character-set=latin1  database > fix_char.sql

Then import the backup to the test db with utf8
Code:
mysql -u username -p --default-character-set=utf8 database < fix_char.sql

Then convert the column custom_fields in the table xf_user_profile back to mediumblob.

I need to do a lot more looking around and testing on the test site, but it seems to have worked. Is there anything I'm missing? Has anyone done this before with success? Anything I should worry about or look into before trying the fix on the live site?

Is there a reason that xf_user_profile is stored as a binary that I might be messing up with this process?

Thanks in advance!
 
Is there a reason that xf_user_profile is stored as a binary that I might be messing up with this process?
If there are values that are changed, you run the risk of breaking the serialized data if the byte length changes. If this were actually a Latin-1 (ISO-8859-1/Windows-1252) to UTF-8 conversion, anything outside 7 bit ASCII would be changed and it would break this field. The data in it, while possibly being readable (but not always)., is definitely a binary format. However, since you're converting 2 bytes to 2 bytes, you'll probably be ok but I really can't guarantee it.
 
Well I found a problem on the test site. Someone used é in a post since converting to xenforo and the é and everything after it is missing (all on the test site). Any ideas on how to fix my problem since the mysql dump and import didn't seem to work?
 
Any sort of non-targeted DB editing is going to have side effects. You may need to use the query based approach and just manually edit specific "strange" cases outside of that.
 
Bummer.

Why does this happen anyway? During the import it says "If you are setting a character set in vBulletin via config.php, you should enter that value here." I had nothing set in config.php. The server vb3.8 was on was set to UTF8 and the all of the tables in the database were set to utf8_general_ci. The post table and pagetext field were both set to utf8_general_ci. If everything is utf8_general_ci, why do I need to enter latin1 to get the import to work correctly?
 
Is the server you're importing from different from the one the data originally existed on? The collation of the actual tables isn't the particularly relevant part here -- it's actually the connection character set. That's what the force character set option changes. It forces latin1 out. Unless you were actually using UTF-8 as the final output character set previously (as in, what was in the HTML source), that connection character set was probably latin1.
 
Yes the importing server is a different server than the data originally existed on. The connection character set was whatever vb3.8 is set to by default. So it seems as though I was storing latin1 data in utf8 tables. Thanks for the help and explanations.
 
Top Bottom