Fixed vB4 Import: Double HTML-encoding of thread titles

Steffen · Mar 15, 2018

This was working fine in XenForo 2.0.2 but now fails in XenForo 2.0.3 with the Importers add-on. " is turned into " and so on.

Chris D · Mar 16, 2018

We're not actually seeing that here with XF 2.0.3 and Importers 1.0.0.

Is this happening for all threads or only some? Do you have a more solid example of an original thread title and how it appears after import?

We don't believe we made any relevant changes here.

Steffen · Mar 16, 2018

I'm not sure why I noticed the issue in XenForo 2.0.3 + XFI for the first time. I, too, cannot see which code change could have caused this issue. Maybe this issue was there before and I just didn't notice it?

vBulletin HTML-encodes some things before storing them in the database (for example usernames, thread titles and post titles but not post messages). Is there some code in the importer where this encoding is undone? I've just checked that usernames with special HTML characters aren't imported correctly, too: They are imported "as is" and therefore encoded a second time when XenForo displays them. For example, the username A&B is stored in vb_user as A&B and the importer just puts the very same pre-encoded value into xf_user.

Steffen · Mar 16, 2018

Untested theory: Since XenForo 2.0.3 + XFI, in XF/Import/Data/EntityEmulator::set, the method XF\Import\DataManager::convertToUtf8 (which for some reason calls utf8_unhtml) is only called for invalidly encoded strings (because of the added !preg_match('/./u', $value)). So validly encoded strings aren't HTML-decoded anymore.

But this call to utf8_unhtml feels completely wrong. It needs to be called where appropriate and not on every string (otherwise an intentionally written "&" in a programming forum will be converted to "&").

Probably incomplete list:

vb_forum: title, description
vb_pmtext: title, fromusername
vb_post: title, username
vb_thread: title, postusername, lastposter, notes
vb_user: username

Chris D · Mar 16, 2018

You're absolutely correct that it is happening there, due to that change. I noticed it a while ago. Incidentally, the nuts and bolts of the convertToUtf8 method haven't changed since it was written for XF1, and I don't recall it being problematic.

I'm tempted to remove it from there, but just call unhtml once we've got a valid string inside EntityEmulator::set.

Steffen · Mar 16, 2018

I think moving the utf8_unhtml call to EntityEmulator::set would restore things to the XenForo 2.0.2 state.

This would be better than what we've got in XenForo 2.0.3 but it'd still be broken: For example, post messages (and probably private message texts) are not stored HTML-encoded in vBulletin. So all intentional occurences of e.g. "&" (programming forum or something similar) in a post message would be turned into "&" if you call utf8_unhtml on everything. I think the appropriate place to call utf8_unhtml is on specific strings in the step* methods of the vBulletin importer class.

(Furthermore, I assume most forum software that you'll want to provide an importer for won't store HTML-encoded strings in the database like vBulletin does.)

Chris D · Mar 16, 2018

Yeah I think you're right here.

As it turns out, my claim about it doing the same in XF1 was probably incorrect.

As an example:

PHP:

$post['pagetext'] = $this->_convertToUtf8($post['pagetext']);

There's no second argument there, so the post content isn't unhtml'd.

Whereas something like username:

PHP:

'username' => $this->_convertToUtf8($post['username'], true),

The second argument is true, so we would unhtml it.

There's a logic error potentially in XF2 right now which basically means that unhtml will always be run with the $entities arg switched on (which is actually what is converting the entities).

Does the old logic above seem to be correct? For example, messages will not store HTML entities in vBulletin (but a user could as you say write them into code blocks etc.)? But titles etc. are (always?) converted to include HTML entities?

Steffen · Mar 16, 2018

Yes, applying unhtml on values from specific database table columns is the correct solution.

vBulletin stores post message texts (and I presume private message texts, in general everything that can use BBCode) as entered by the user / without HTML entities. The importer must not call "unhtml" on these values.

Basically everything else (especially the examples mentioned in #4) is stored with HTML entities. The importer has to decode ("unhtml") these values before putting them into the XenForo database.

Kier · Mar 16, 2018

I'm thinking that we should remove the utf8_unhtml call from convertToUtf8(), and instead add it as an option to EntityEmulator::set(), so we would end up with

PHP:

     public function set($field, $value, array $options = [])
    {
        $options = array_replace([
            'convertUtf8' => true,
            'forceConstraint' => true,
            'unHtml' => false
        ], $options);

Steffen · Mar 16, 2018

Yep, that sounds like a good approach!

Kier · Mar 16, 2018

(and in that way, we can selectively apply unHtml to specific fields on demand)

Kier · Mar 16, 2018

Okay, step one...

Diff:

Index: src/XF/Import/Data/EntityEmulator.php
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
--- src/XF/Import/Data/EntityEmulator.php    (revision 2751d0d4532870fd9dd4b0483268ec991c93e5e0)
+++ src/XF/Import/Data/EntityEmulator.php    (date 1521212333000)
@@ -27,6 +27,21 @@
 
     protected $oldId = null;
 
+    /**
+     * Convert text to UTF8 on set()
+     */
+    const CONVERT_UTF8 = 'convertUtf8';
+
+    /**
+     * Force type constraints on set()
+     */
+    const FORCE_CONSTRAINT = 'forceConstraint';
+
+    /**
+     * Convert HTML entities to raw values on set()
+     */
+    const UNHTML_ENTITIES = 'unHtmlEntities';
+
     public function __construct(
         AbstractData $handler, \XF\Mvc\Entity\Structure $structure, \XF\Mvc\Entity\ValueFormatter $valueFormatter
     )
@@ -50,8 +65,9 @@
     public function set($field, $value, array $options = [])
     {
         $options = array_replace([
-            'convertUtf8' => true,
-            'forceConstraint' => true
+            self::CONVERT_UTF8     => true,
+            self::FORCE_CONSTRAINT => true,
+            self::UNHTML_ENTITIES  => false
         ], $options);
 
         $columns = $this->structure->columns;
@@ -83,11 +99,11 @@
         $vf = $this->valueFormatter;
         $originalValue = $value;
 
-        if ($options['convertUtf8'])
+        if ($options[self::CONVERT_UTF8])
         {
             if ((is_string($value) && !preg_match('/./u', $value)) || (is_object($value) && is_callable([$value, '__toString'])))
             {
-                $value = $this->handler->convertToUtf8(strval($value));
+                $value = $this->handler->convertToUtf8(strval($value), null, $options[self::UNHTML_ENTITIES]);
             }
 
             try
@@ -112,7 +128,7 @@
             throw new \InvalidArgumentException($e->getMessage() . " [$field]", $e->getCode(), $e);
         }
 
-        if (!$vf->applyValueConstraints($value, $column['type'], $column, $error, $options['forceConstraint']))
+        if (!$vf->applyValueConstraints($value, $column['type'], $column, $error, $options[self::FORCE_CONSTRAINT]))
         {
             throw new \InvalidArgumentException("Constraint error for $field: " . $error);
         }

Steffen · Mar 16, 2018

Remember that "convertToUtf8" is only called for badly encoded strings (because of !preg_match('/./u', $value)) which for some reason was added in 2.0.3).

I think the unhtml call needs to be moved outside of "convertToUtf8".

Kier · Mar 16, 2018

Steffen said:
Remember that "convertToUtf8" is only called for badly encoded strings (because of !preg_match('/./u', $value)) which for some reason was added in 2.0.3).

I think the unhtml call needs to be moved outside of "convertToUtf8".

I'm going to leave it in there, because if we want to unhtml something that we are also utf8-converting, we would use the entities translation system from utf8_unhtml(), whereas if there is no utf8 conversion, we'd use htmlspecialchars_decode()

Kier · Mar 16, 2018

In EntityEmulator::set()...

PHP:

        if ($options[self::CONVERT_UTF8])
        {
            if ((is_string($value) && !preg_match('/./u', $value)) || (is_object($value) && is_callable([$value, '__toString'])))
            {
                $value = $this->handler->convertToUtf8(strval($value), null, $options[self::UNHTML_ENTITIES]);
            }

            try
            {
                $value = $vf->castValueToType($value, $column['type'], $column);
            }
            catch (\Exception $e)
            {
                if (is_string($originalValue) && !preg_match('/./u', $originalValue))
                {
                    $value = utf8_bad_replace($originalValue);
                }
            }
        }
        else if ($options[self::UNHTML_ENTITIES] && (is_string($value) || (is_object($value) && is_callable([$value, '__toString']))))
        {
            $value = htmlspecialchars_decode($value);
        }

Kier · Mar 16, 2018

hmmm no, that's no good is it?

Kier · Mar 16, 2018

Trying again...

PHP:

        if ($options[self::CONVERT_UTF8])
        {
            if ($this->isStringy($value))
            {
                if (!$this->isValidUtf8($value))
                {
                    $value = $this->handler->convertToUtf8(strval($value), null, $options[self::UNHTML_ENTITIES]);
                }
                else if ($options[self::UNHTML_ENTITIES])
                {
                    $value = htmlspecialchars_decode(strval($value));
                }
            }

            try
            {
                $value = $vf->castValueToType($value, $column['type'], $column);
            }
            catch (\Exception $e)
            {
                if (is_string($originalValue) && !$this->isValidUtf8($originalValue))
                {
                    $value = utf8_bad_replace($originalValue);
                }
            }
        }
        else if ($options[self::UNHTML_ENTITIES] && $this->isStringy($value))
        {
            $value = htmlspecialchars_decode(strval($value));
        }

Kier · Mar 16, 2018

Just need to go through the blog fields now and I think we can call this one fixed.

DragonByte Tech · Mar 16, 2018

Steffen said:
Remember that "convertToUtf8" is only called for badly encoded strings (because of !preg_match('/./u', $value)) which for some reason was added in 2.0.3).

"for some reason" = https://xenforo.com/community/threa...-xf-import-data-entityemulator-php-93.143897/

Fillip

Steffen · Mar 17, 2018

Thank you!

Fixed vB4 Import: Double HTML-encoding of thread titles

Well-known member

XenForo developer

Well-known member

Well-known member

XenForo developer

Well-known member

XenForo developer

Well-known member

XenForo developer

Well-known member

XenForo developer

XenForo developer

Well-known member

XenForo developer

XenForo developer

XenForo developer

XenForo developer

XenForo developer

Well-known member

Well-known member

Similar threads

We value your privacy