Fixed vBulletin importers - unicode entities are not decoded

Here are all the areas where I've found this issue:
  • xf_conversation_data
    • username
    • last_message_username
    • recipients
  • xf_conversation_message
    • username
  • xf_conversation_user
    • last_message_username
  • xf_edit_history
    • old_text
  • xf_post
    • username
    • message
  • xf_profile_post
    • message
  • xf_search_index (this might be possible to ignore since you should probably rebuild search index afterwards)
    • title
    • message
  • xf_thread
    • title
    • username
    • last_post_username
  • xf_user
    • username
  • xf_user_profile
    • signature

I'm going to be doing some experiments with this since I'm hoping to get DBTech imported today and this is a bit of a blocking issue :)


Fillip
 
I think the issue is that XF/Import/Data/EntityEmulator uses "htmlspecialchars_decode" instead of "html_entity_decode" to decode HTML-encoded strings. "htmlspecialchars_decode" doesn't decode all HTML-encoded characters.

PHP manual on htmlspecialchars_decode said:
The converted entities are: &amp;, &quot; (when ENT_NOQUOTES is not set), &#039; (when ENT_QUOTES is set), &lt; and &gt;.
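
To illustrate the difference, here is a minimal sketch. The sample string is made up for this example and is not taken from the importer; it only assumes a stored value that mixes a numeric entity with a special-character entity.

```php
<?php
// Made-up sample value: the euro sign is a numeric entity,
// the ampersand is one of the five "special character" entities.
$stored = 'Price: 5 &#8364; &amp; up';

// htmlspecialchars_decode() only converts &amp;, &quot;, &#039;, &lt; and &gt;.
$special = htmlspecialchars_decode($stored, ENT_QUOTES);

// html_entity_decode() converts all named and numeric entities.
$all = html_entity_decode($stored, ENT_QUOTES, 'UTF-8');

echo $special, PHP_EOL; // Price: 5 &#8364; & up
echo $all, PHP_EOL;     // Price: 5 € & up
```

The flags and charset are passed explicitly because the default flags of both functions differ between PHP versions.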
 
I have (probably) found the issue. htmlspecialchars_decode doesn't do what we think it does.

[Image: AXYI5H9.png]


I'll explore for solutions and post back if anything promising turns up.

Edit: I WOULD HAVE GOTTEN AWAY WITH IT IF IT WASN'T FOR THAT MEDDLING STEFFEN! :P


Fillip
 
@Steffen I can confirm that html_entity_decode works for me. I'm going to use $value = html_entity_decode(htmlspecialchars_decode(strval($value)), ENT_QUOTES, 'UTF-8'); in place of both instances of $value = htmlspecialchars_decode(strval($value)); and see how that works out.

I tested making a new post and it appears as if XF2 wants both single and double quotes un-entity'd so the above code is probably correct.

Hold my tea, I'm going in.


Fillip
 
After the upgrade, the following areas still exhibit the issue reported in the OP:
  • xf_conversation_message
    • message
  • xf_edit_history
    • old_text
  • xf_post
    • message
  • xf_user_profile
    • signature
I'll be applying the changes to /src/addons/XFI/Import/Importer/vBulletin.php (and other files where applicable) and will post back with a diff once I've successfully imported everything without any more issues.


Fillip
 
I'm going to use $value = html_entity_decode(htmlspecialchars_decode(strval($value)), ENT_QUOTES, 'UTF-8'); [...]
This seems redundant. "htmlspecialchars_decode" is a subset of "html_entity_decode". You don't need both.

Using both will probably be fine for 99% of your posts, but if you have a programming forum where someone intentionally wrote "&amp;" (for example while discussing encoding issues like we do here ^^) and vBulletin stored this as "&amp;amp;" in its database, then XenForo would now pick this up not as "&amp;" but as "&".
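
A minimal sketch of the double-decoding pitfall, assuming a user typed the literal text &amp; and the old database stored it as &amp;amp; (the sample string is invented for illustration):

```php
<?php
// Assumed stored form of a post in which the user literally typed "&amp;".
$stored = 'Write &amp;amp; to show an ampersand entity';

// Decoding once restores what the user typed.
$once = html_entity_decode($stored, ENT_QUOTES, 'UTF-8');

// Chaining both functions decodes twice and destroys the literal entity.
$twice = html_entity_decode(htmlspecialchars_decode($stored), ENT_QUOTES, 'UTF-8');

echo $once, PHP_EOL;  // Write &amp; to show an ampersand entity
echo $twice, PHP_EOL; // Write & to show an ampersand entity
```

A single html_entity_decode call scans the string once and does not re-decode its own output, which is why dropping the inner htmlspecialchars_decode keeps the user's text intact.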
 
Attached are the diff files for src/XF/Import/Data/EntityEmulator.php and src/addons/XFI/Import/Importer/vBulletin.php

I have confirmed that no instance of &# exists in the database in any column in any table other than intentional entities (e.g. posts where we are explaining the difference between a character and an entity, so the closing ; was omitted on purpose), or entities that were broken in vBulletin so the fault is not in XF2.
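
A check like this could be sketched as follows. The helper name findNumericEntities and the sample rows are hypothetical and not part of XenForo or the importer; a real check would read the values from the database columns listed earlier in the thread.

```php
<?php
/**
 * Hypothetical helper: returns the complete numeric entities
 * (e.g. &#8364; or &#x20AC;) still present in a string.
 * Unterminated ones such as "&#8364" are deliberately not matched.
 */
function findNumericEntities($text)
{
    preg_match_all('/&#(?:[0-9]+|x[0-9a-f]+);/i', $text, $matches);
    return $matches[0];
}

// Invented sample values standing in for imported column contents.
$rows = [
    'clean'   => 'Price: 5 € & up',
    'missed'  => 'Price: 5 &#8364; &#x26; up',
    'example' => 'The entity &#8364 (semicolon omitted on purpose)',
];

foreach ($rows as $label => $text) {
    // Lists any leftover entities per row; empty for clean rows.
    echo $label, ': ', implode(' ', findNumericEntities($text)), PHP_EOL;
}
```

Requiring the closing semicolon means posts that show an entity with the ; omitted on purpose are not flagged.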

I implemented @Steffen's change above but have not tested it; I'm going to proceed as if it works.


Fillip
 


I'm wondering why you have to HTML-decode BBCode-enabled database columns (post messages, conversation messages, signatures, and edit history texts). I didn't have to HTML-decode them. But maybe that's because my vB4 forum was already using UTF-8. Is your old vBulletin forum still using ISO-8859-*?
 
I'm wondering why you have to HTML-decode BBCode-enabled database columns (post messages, conversation messages, signatures, and edit history texts). I didn't have to HTML-decode them. But maybe that's because my vB4 forum was already using UTF-8. Is your old vBulletin forum still using ISO-8859-*?
Yes it is. Database collation is latin1_swedish_ci (old MySQL default) and language charset is ISO-8859-1.


Fillip
 
Out of interest (and it might be helpful for the XenForo devs): in your post/conversation/… messages, are only non-ASCII characters like &#8364; HTML-encoded? Or are & (&amp;), < (&lt;), > (&gt;), " (&quot;), and ' (&#039;) HTML-encoded, too?
 