Fixed  unserialize offset error importing PMs from utf8/vb3

mlx

Well-known member
Our source database is using utf8_general_ci.

Our source vBulletin installation is also using mysqli and
Code:
$config['Mysqli']['charset'] = 'utf8';

Looking at vb3_pmtext:touserarray one row clearly says
Code:
a:1:{i:1284;s:23:"Hypnotische Schildkröte";}

While the importer somehow comes up with:
Code:
a:1:{i:1284;s:23:"Hypnotische Schildkröte";

and errors out with a unserialize offset error due to that additional character.

To continue beta testing we've helped ourselves by adding
Code:
$pmText['touserarray'] = iconv("UTF-8", "ISO-8859-1", $pmText['touserarray']);
to the importer.

For some reason the importer seems to handle everything else just fine (posts, threads, forums, etc.)

So I'm not sure why this ugly fix is required there? Does it handle a source database in utf8 that's been using mysqli charset utf8 correctly?

Or is there indeed something wrong with our database that I'm missing?

P.S. this is somewhat already discussed here: http://xenforo.com/community/threads/hung-up-on-importing-pms.4904/

If you want to merge these thread feel free to do so. I just thought I better post a new thread with all these technical details. Hope you don't mind.
 
Another thing. While it fixed most rows we still end up adding some local hotfixes like
Code:
$pmText['touserarray'] = str_replace('s:7:"Mañana"', 's:6:"Mañana"', $pmText['touserarray']);

I guess we might have messed it up converting from latin1 to utf8 a few years ago. You know ... these few rows actually have a broken serialized array in the database unlike the other rows I've been talking about above. So it's possible that we've just converted the data to utf8, probably reducing the length by 1 character, without updating that serialized array accordingly.

But still it might be another thing to consider for the importer. I know it's messy, but I guess I might not be alone with these issues.
 
What character set was the data being imported from? ISO-8859-1?

The charset line you posted indicates that it should be in UTF-8, but the first serialized data seems to have a count that indicates it's ISO-8859-1 (hence your conversion code). If it is ISO-8859-1, the charset config line confuses me...

The importer DB connection is supposed to use the "binary" charset, so that nothing is converted as it's read out of the database.
 
I believe I have this fixed now - the importer gives you the option to force a particular character set when reading from the source DB. By default, it's nothing (which is probably the norm), but it looks like you could force utf8.

It's intentionally not catching any unserialize calls that fail yet. Once I have more confidence that the only time that happens is when the source data is broken, I'll add that to add resiliency.
 
That sounds great. Thanks a lot.

And yeah, it's definitely all utf8 now. Some very old data used to be latin1 and we converted it a few years ago. We've been using iconv there. It worked fine for post titles, messages, etc. I guess it might have messed up that serialized PM data though. We didn't notice that back then.
 
We finally did another test import.

Using beta 3 and entering utf8 for the import charset.

Not a single error!

Thanks again!
 
Top Bottom