Fixed vBulletin importers - unicode entities are not decoded

Slavik · Mar 19, 2018

As title:

DragonByte Tech · Mar 19, 2018

Possible duplicate of https://xenforo.com/community/threads/vb4-import-double-html-encoding-of-thread-titles.144295/ - I don't know if @Kier gave the final set of diffs for testing but it certainly sounds like the same issue.

Fillip

Chris D · Mar 19, 2018

We asked @Slavik to report this. The import was done after the release of Importers 1.0.1.

Kier · Apr 5, 2018

@Slavik was this a thread title? Kinda need to know where this appeared to be able to do anything with it.

Slavik · Apr 5, 2018

Kier said:
@Slavik was this a thread title? Kinda need to know where this appeared to be able to do anything with it.

Thread message / post

DragonByte Tech · Apr 6, 2018

Here's all the areas I've found this issue:

xf_conversation_data
- username
- last_message_username
- recipients
xf_conversation_message
- username
xf_conversation_user
- last_message_username
xf_edit_history
- old_text
xf_post
- username
- message
xf_profile_post
- message
xf_search_index (this might be possible to ignore since you should probably rebuild search index afterwards)
- title
- message
xf_thread
- title
- username
- last_post_username
xf_user
- username
xf_user_profile
- signature

I'm going to be doing some experiments with this since I'm hoping to get DBTech imported today and this is a bit of a blocking issue

Fillip

Steffen · Apr 6, 2018

I think the issue is that XF/Import/Data/EntityEmulator uses "htmlspecialchars_decode" instead of "html_entity_decode" to decode HTML-encoded strings. "htmlspecialchars_decode" doesn't decode all HTML-encoded characters.

PHP manual on htmlspecialchars_decode said:
The converted entities are: &amp;amp;, &amp;quot; (when ENT_NOQUOTES is not set), &amp;#039; (when ENT_QUOTES is set), &amp;lt; and &amp;gt;.
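To make the difference concrete, here is a minimal sketch comparing the two functions. The sample string is invented for illustration and is not taken from any real import; it only assumes that the old database holds a numeric entity alongside the usual special-character entities.

```php
<?php
// Hypothetical value as it might sit in an old ISO-8859-1 vBulletin table
// (illustrative only, not taken from the thread).
$stored = 'Price: &#8364;5 &amp; &quot;free&quot; shipping';

// Decodes only the special-character entities listed in the PHP manual.
$special = htmlspecialchars_decode($stored, ENT_QUOTES);

// Decodes all named and numeric entities into UTF-8 characters.
$full = html_entity_decode($stored, ENT_QUOTES, 'UTF-8');

echo $special, PHP_EOL; // the numeric entity &#8364; is left untouched
echo $full, PHP_EOL;    // the numeric entity becomes the euro sign
```

With htmlspecialchars_decode the numeric entity survives the import, which matches the symptom reported in the first post.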

DragonByte Tech · Apr 6, 2018

I have (probably) found the issue. htmlspecialchars_decode doesn't do what we think it does.

I'll explore for solutions and post back if anything promising turns up.

Edit: I WOULD HAVE GOTTEN AWAY WITH IT IF IT WASN'T FOR THAT MEDDLING STEFFEN!

Fillip

Steffen · Apr 6, 2018

Sorry.

DragonByte Tech · Apr 6, 2018

@Steffen can confirm that html_entity_decode works for me. I'm going to use $value = html_entity_decode(htmlspecialchars_decode(strval($value)), ENT_QUOTES, 'UTF-8'); in place of both instances of $value = htmlspecialchars_decode(strval($value)); and see how that works out.

I tested making a new post and it appears as if XF2 wants both single and double quotes un-entity'd so the above code is probably correct.

Hold my tea, I'm going in.

Fillip

DragonByte Tech · Apr 6, 2018

After the upgrade, the following areas still exhibit the issue reported in the OP:

xf_conversation_message
- message
xf_edit_history
- old_text
xf_post
- message
xf_user_profile
- signature

I'll be applying the changes to /src/addons/XFI/Import/Importer/vBulletin.php (and other files where applicable) and post back with a diff once I've successfully imported everything without any more issues.

Fillip

Steffen · Apr 6, 2018

DragonByte Tech said:
I'm going to use $value = html_entity_decode(htmlspecialchars_decode(strval($value)), ENT_QUOTES, 'UTF-8'); [...]

This seems redundant. "htmlspecialchars_decode" is a subset of "html_entity_decode". You don't need both.

Using both will probably be fine for 99% of your posts, but if you have a programming forum where someone intentionally wrote "&amp;amp;" (for example while discussing encoding issues like we do here ^^) and vBulletin stored this as "&amp;amp;amp;" in its database, then this would now be picked up by XenForo not as "&amp;amp;" but as "&amp;".
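A small sketch of that double-decoding effect, under the assumption that vBulletin escaped the ampersand of a literally typed entity when saving the post (the sample text is made up):

```php
<?php
// Assumed stored form of a post in which the user typed the literal text "&amp;".
$stored = 'Use &amp;amp; to write an ampersand in HTML';

// Single pass: restores exactly what the user typed.
$once = html_entity_decode($stored, ENT_QUOTES, 'UTF-8');

// Two passes, as in the nested call quoted above: the literal entity is lost.
$twice = html_entity_decode(htmlspecialchars_decode($stored), ENT_QUOTES, 'UTF-8');

echo $once, PHP_EOL;  // Use &amp; to write an ampersand in HTML
echo $twice, PHP_EOL; // Use & to write an ampersand in HTML
```

A single html_entity_decode call avoids this, since each function makes only one decoding pass over the string.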

DragonByte Tech · Apr 6, 2018

Attached are the diff files for src/XF/Import/Data/EntityEmulator.php and src/addons/XFI/Import/Importer/vBulletin.php

I have confirmed that no instance of &# exists in the database in any column in any table other than intentional entities (e.g. posts where we are explaining the difference between a character and an entity, so the closing ; was omitted on purpose), or entities that were broken in vBulletin so the fault is not in XF2.

I implemented @Steffen's change above but have not tested it; I'm going to proceed as if it works.

Fillip

Steffen · Apr 6, 2018

I'm wondering why you have to HTML-decode BBCode-enabled database columns (post messages, conversation messages, signatures, and edit history texts). I didn't have to HTML-decode them. But maybe that's because my vB4 forum was already using UTF-8. Is your old vBulletin forum still using ISO-8859-*?

DragonByte Tech · Apr 6, 2018

Steffen said:
I'm wondering why you have to HTML-decode BBCode-enabled database columns (post messages, conversation messages, signatures, and edit history texts). I didn't have to HTML-decode them. But maybe that's because my vB4 forum was already using UTF-8. Is your old vBulletin forum still using ISO-8859-*?

Yes it is. Database collation is latin1_swedish_ci (old MySQL default) and language charset is ISO-8859-1.

Fillip

Steffen · Apr 7, 2018

Out of interest (and it might be helpful for the XenForo devs): in your post/conversation/… messages, are only non-ASCII characters HTML-encoded, like € → &amp;#8364;? Or are & → &amp;amp;, < → &amp;lt;, > → &amp;gt;, " → &amp;quot;, and ' → &amp;#039; HTML-encoded, too?

DragonByte Tech · Apr 7, 2018

I'll have a look once the import has completed, I'm in the middle of the actual import just now

Fillip

Chris D · Aug 27, 2018

These changes seem reasonable. Fixed for the next XF2 / XFI releases.


