Fixed  Import fails on posts with special characters from MS Word.

Baron

Member
When doing an import last night with Beta 2, it failed about three-quarters of the way through. After checking the server error log, I was able to track down the post that was causing this. It appears that the user copied and pasted a research report originally written in MS Word, and the importer crashes when it encounters the code for a square-looking bullet point character. When I deleted this character, the import continued without a problem. Here's an excerpt from the source of the post:


􀂙 Company reported a $0.01 for 1Q07, one cent lower than our
estimates and consensus of $0.02. We are maintaining a Buy Rating,
and lowering our FY07 estimates from $0.25 to $0.21 and FY08
estimates from $0.30 to $0.27. We are maintaining a target price of $3 or
11x 2008 projections.
 
I can guess, but do you know the specific error message? It should be logged in the server error log section of the admin CP. The importer should be stripping out 4-byte UTF-8 characters, though maybe that is bugged.
 
Zend_Db_Statement_Mysqli_Exception: Mysqli statement execute error : Incorrect string value: '\xF4\x80\x82\x99 C...' for column 'message' at row 1 - library/Zend/Db/Statement/Mysqli.php:214
 
Looks like the fix for stripping out 4-byte UTF-8 characters was incorrect (my fault!). It appears to actually be stripping 5-byte UTF-8 characters (with a slight mistake), which aren't allowed by the RFC anyway. (As a note, MySQL's utf8 character set only supports UTF-8 characters of up to 3 bytes, which covers the BMP.)
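
For anyone wondering why the old pattern missed this character, here is a rough sketch that inspects the bytes from the error message above. The helper names (utf8SequenceLength, decodeFourByte) are made up for illustration and are not part of XenForo; the type hints assume PHP 7 or later.

```php
<?php
// Bytes copied from the "Incorrect string value" error message.
$bytes = "\xF4\x80\x82\x99";

// Hypothetical helper: length of the UTF-8 sequence implied by the lead byte,
// following the ranges RFC 3629 allows. Returns 0 for anything else.
function utf8SequenceLength(string $char): int
{
    $lead = ord($char[0]);
    if ($lead < 0x80) return 1;                    // ASCII
    if ($lead >= 0xC2 && $lead <= 0xDF) return 2;
    if ($lead >= 0xE0 && $lead <= 0xEF) return 3;  // most MySQL's utf8 will store
    if ($lead >= 0xF0 && $lead <= 0xF4) return 4;  // outside the BMP
    return 0;                                      // continuation or invalid (e.g. F8-FB)
}

// Hypothetical helper: decode a 4-byte sequence into its code point.
function decodeFourByte(string $char): int
{
    return ((ord($char[0]) & 0x07) << 18)
         | ((ord($char[1]) & 0x3F) << 12)
         | ((ord($char[2]) & 0x3F) << 6)
         |  (ord($char[3]) & 0x3F);
}

$length    = utf8SequenceLength($bytes);                // 4
$codePoint = sprintf('U+%06X', decodeFourByte($bytes)); // U+100099
```

So the lead byte is \xF4, which falls in the 4-byte range \xF0-\xF4 and not in \xF8-\xFB, and the character sits well above the BMP.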

To anyone having this issue, try this as a fix. In library/XenForo/Importer/vBulletin.php, change the following line:
Code:
return preg_replace('/[\xF8-\xFB].../', '', $string);
to:
Code:
return preg_replace('/[\xF0-\xF4].../', '', $string);
This line is effectively the last line in the file. I believe that should fix this error.
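
If you want to sanity-check the idea outside the importer, here is a minimal standalone sketch. It is not the actual XenForo code: the function name stripFourByteUtf8 is hypothetical, and the pattern is a slightly stricter variant of the replacement line above that only accepts continuation bytes (\x80-\xBF) after the lead byte instead of any three bytes.

```php
<?php
// Hypothetical helper, not the real method in library/XenForo/Importer/vBulletin.php.
function stripFourByteUtf8(string $string): string
{
    // No /u modifier, so the pattern works on raw bytes:
    // a lead byte F0-F4 followed by exactly three continuation bytes.
    $result = preg_replace('/[\xF0-\xF4][\x80-\xBF]{3}/', '', $string);

    // preg_replace() returns null on failure; fall back to the input in that case.
    return $result === null ? $string : $result;
}

// The start of the failing post, using the bytes from the error message.
$clean = stripFourByteUtf8("\xF4\x80\x82\x99 Company reported");
// $clean is now " Company reported"
```

Characters of 1 to 3 bytes pass through untouched, so normal accented text and symbols like the euro sign are not affected.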
 