Cannot reproduce: Certain HTML Unicode entities (&#...;) truncate posts on VB3 import

sheel

Active member
VB3 has a habit of storing characters that are not in its character set as &#number; in the database (at least when UTF-8 has not been rolled out everywhere, which is a huge problem for existing data because of PHP serialization abuse, and possibly depending on other configuration too).
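
To illustrate how such entities end up in the data, here is a minimal Python sketch. It is not VB3's actual code: the helper name `store_as_legacy` and the choice of `latin-1` as the legacy charset are assumptions made for the example. It only shows the general effect of replacing characters that don't fit the charset with numeric entities.

```python
def store_as_legacy(text: str, charset: str = "latin-1") -> bytes:
    """Encode text for a legacy (non-UTF-8) column.

    Characters the charset cannot represent are replaced by decimal
    numeric entities (&#number;) instead of raising an error.
    The helper name and default charset are hypothetical.
    """
    return text.encode(charset, errors="xmlcharrefreplace")


stored = store_as_legacy("He said \u201ehello\u201c")
print(stored)  # b'He said &#8222;hello&#8220;'
```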

The problem is that certain characters in this format cause the imported post in XF to end right before that character. No garbage, just truncated.

Examples are 8222 and 8220 (0x201e and 0x201c), which are fancy quotation marks, one for the beginning of the quoted part and one for the end. "&#8220;" in VB's db doesn't become the character, but cuts the text off.
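
For reference, this is roughly what I would expect to happen to those entities. A minimal Python sketch, not the XF importer's code; the function name `decode_numeric_entities` is made up for the example. It turns decimal entities into the characters they stand for and leaves everything else, including the text after them, untouched.

```python
import re

# Matches decimal numeric entities such as &#8222;
_NUMERIC_ENTITY = re.compile(r"&#(\d+);")


def decode_numeric_entities(text: str) -> str:
    """Replace &#number; with the matching Unicode character.

    Entities with an out-of-range code point are left as-is.
    Hypothetical helper, for illustration only.
    """
    def _replace(match: re.Match) -> str:
        try:
            return chr(int(match.group(1)))
        except (ValueError, OverflowError):
            return match.group(0)

    return _NUMERIC_ENTITY.sub(_replace, text)


post = "before &#8222;quoted&#8220; after"
print(decode_numeric_entities(post))  # before „quoted“ after
```

The point is that the text after the entity ("after" in the sample) survives the conversion.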

(Maybe somewhat related: https://xenforo.com/community/threads/truncated-posts-after-ipb-3-4-import.104750/ )

...

If a user with the same problem sees this: tell me if you want a description of how to correct the bad posts after the importer broke them (of course the VB database is needed too; it's not fixable if you deleted it).
 
The linked bug isn't related -- that was specific to HTML manipulation.

I can't reproduce this. I wrote a test script to convert this from Windows-1252:
Code:
aaa \xA9 &#8220;test&#8222; \xA7 bbb
Which converts to:
Code:
aaa © “test„ § bbb
This works properly, as does inserting it into the DB and retrieving the value.

We've seen some situations where the initial character set conversion stops upon reaching an invalid character. We set the //IGNORE flag for this, but there have been a few cases where iconv doesn't honor it. Unfortunately, we can't really work around a broken iconv config like that. (This is why I added the direct Windows-1252 characters; the conversion isn't attempted if the entire data fits in 7-bit ASCII.)
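
As a rough illustration of that conversion step, here is a Python analogue. This is not the importer's PHP code and it uses Python's codecs rather than iconv; the function name `to_utf8` and the `cp1252` default are assumptions for the sketch. It skips pure 7-bit ASCII input and drops invalid bytes, which is the behavior //IGNORE is supposed to give.

```python
def to_utf8(raw: bytes, source_charset: str = "cp1252") -> bytes:
    """Convert raw bytes from a legacy charset to UTF-8.

    Pure 7-bit ASCII input is returned unchanged, so no conversion
    is attempted. Bytes that are invalid in the source charset are
    dropped rather than ending the conversion early.
    Hypothetical helper, for illustration only.
    """
    if raw.isascii():
        return raw
    return raw.decode(source_charset, errors="ignore").encode("utf-8")


# 0x81 is undefined in Python's cp1252 codec, so it is dropped;
# the text after it is kept.
sample = b"aaa \xa9 \x81 &#8220;test&#8222; \xa7 bbb"
print(to_utf8(sample).decode("utf-8"))
```

A broken conversion would instead return only the part of the string before the invalid byte, which matches the truncation described in the report.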

Can you provide the original string that triggered this (as an attached file) and the original character set we're converting from?
 