1. This site uses cookies. By continuing to use this site, you are agreeing to our use of cookies. Learn More.

Fixed unserialize offset error importing PMs from utf8/vb3

Discussion in 'Resolved Bug Reports' started by mlx, Oct 11, 2010.

  1. mlx

    mlx Well-Known Member

    Our source database is using utf8_general_ci.

    Our source vBulletin installation is also using mysqli and
    $config['Mysqli']['charset'] = 'utf8';
    Looking at vb3_pmtext:touserarray one row clearly says
    a:1:{i:1284;s:23:"Hypnotische Schildkröte";}
    While the importer somehow comes up with:
    a:1:{i:1284;s:23:"Hypnotische Schildkröte";
    and errors out with a unserialize offset error due to that additional character.

    To continue beta testing we've helped ourselves by adding
    $pmText['touserarray'] = iconv("UTF-8", "ISO-8859-1", $pmText['touserarray']);
    to the importer.

    For some reason the importer seems to handle everything else just fine (posts, threads, forums, etc.)

    So I'm not sure why this ugly fix is required there? Does it handle a source database in utf8 that's been using mysqli charset utf8 correctly?

    Or is there indeed something wrong with our database that I'm missing?

    P.S. this is somewhat already discussed here: http://xenforo.com/community/threads/hung-up-on-importing-pms.4904/

    If you want to merge these thread feel free to do so. I just thought I better post a new thread with all these technical details. Hope you don't mind.
  2. mlx

    mlx Well-Known Member

    Another thing. While it fixed most rows we still end up adding some local hotfixes like
    $pmText['touserarray'] = str_replace('s:7:"Mañana"', 's:6:"Mañana"', $pmText['touserarray']);
    I guess we might have messed it up converting from latin1 to utf8 a few years ago. You know ... these few rows actually have a broken serialized array in the database unlike the other rows I've been talking about above. So it's possible that we've just converted the data to utf8, probably reducing the length by 1 character, without updating that serialized array accordingly.

    But still it might be another thing to consider for the importer. I know it's messy, but I guess I might not be alone with these issues.
  3. Mike

    Mike XenForo Developer Staff Member

    What character set was the data being imported from? ISO-8859-1?

    The charset line you posted indicates that it should be in UTF-8, but the first serialized data seems to have a count that indicates it's ISO-8859-1 (hence your conversion code). If it is ISO-8859-1, the charset config line confuses me...

    The importer DB connection is supposed to use the "binary" charset, so that nothing is converted as it's read out of the database.
  4. Mike

    Mike XenForo Developer Staff Member

    I believe I have this fixed now - the importer gives you the option to force a particular character set when reading from the source DB. By default, it's nothing (which is probably the norm), but it looks like you could force utf8.

    It's intentionally not catching any unserialize calls that fail yet. Once I have more confidence that the only time that happens is when the source data is broken, I'll add that to add resiliency.
    mlx likes this.
  5. mlx

    mlx Well-Known Member

    That sounds great. Thanks a lot.

    And yeah, it's definitely all utf8 now. Some very old data used to be latin1 and we converted it a few years ago. We've been using iconv there. It worked fine for post titles, messages, etc. I guess it might have messed up that serialized PM data though. We didn't notice that back then.
  6. mlx

    mlx Well-Known Member

    We finally did another test import.

    Using beta 3 and entering utf8 for the import charset.

    Not a single error!

    Thanks again!

Share This Page