Fixed vB4 Import: Double HTML-encoding of thread titles

Steffen

Well-known member
This was working fine in XenForo 2.0.2 but now fails in XenForo 2.0.3 with the Importers add-on. " is turned into " and so on.
 
We're not actually seeing that here with XF 2.0.3 and Importers 1.0.0.

Is this happening for all threads or only some? Do you have a more solid example of an original thread title and how it appears after import?

We don't believe we made any relevant changes here.
 
I'm not sure why I noticed the issue in XenForo 2.0.3 + XFI for the first time. I, too, cannot see which code change could have caused this issue. Maybe this issue was there before and I just didn't notice it?

vBulletin HTML-encodes some things before storing them in the database (for example usernames, thread titles and post titles but not post messages). Is there some code in the importer where this encoding is undone? I've just checked that usernames with special HTML characters aren't imported correctly, too: They are imported "as is" and therefore encoded a second time when XenForo displays them. For example, the username A&B is stored in vb_user as A&B and the importer just puts the very same pre-encoded value into xf_user.
 
Untested theory: Since XenForo 2.0.3 + XFI, in XF/Import/Data/EntityEmulator::set, the method XF\Import\DataManager::convertToUtf8 (which for some reason calls utf8_unhtml) is only called for invalidly encoded strings (because of the added !preg_match('/./u', $value)). So validly encoded strings aren't HTML-decoded anymore.

But this call to utf8_unhtml feels completely wrong. It needs to be called where appropriate and not on every string (otherwise an intentionally written "&" in a programming forum will be converted to "&").

Probably incomplete list:
  • vb_forum: title, description
  • vb_pmtext: title, fromusername
  • vb_post: title, username
  • vb_thread: title, postusername, lastposter, notes
  • vb_user: username
 
Last edited:
You're absolutely correct that it is happening there, due to that change. I noticed it a while ago. Incidentally, the nuts and bolts of the convertToUtf8 method haven't changed since it was written for XF1, and I don't recall it being problematic.

I'm tempted to remove it from there, but just call unhtml once we've got a valid string inside EntityEmulator::set.
 
I think moving the utf8_unhtml call to EntityEmulator::set would restore things to the XenForo 2.0.2 state.

This would be better than what we've got in XenForo 2.0.3 but it'd still be broken: For example, post messages (and probably private message texts) are not stored HTML-encoded in vBulletin. So all intentional occurences of e.g. "&" (programming forum or something similar) in a post message would be turned into "&" if you call utf8_unhtml on everything. I think the appropriate place to call utf8_unhtml is on specific strings in the step* methods of the vBulletin importer class.

(Furthermore, I assume most forum software that you'll want to provide an importer for won't store HTML-encoded strings in the database like vBulletin does.)
 
Last edited:
Yeah I think you're right here.

As it turns out, my claim about it doing the same in XF1 was probably incorrect.

As an example:
PHP:
$post['pagetext'] = $this->_convertToUtf8($post['pagetext']);
There's no second argument there, so the post content isn't unhtml'd.

Whereas something like username:
PHP:
'username' => $this->_convertToUtf8($post['username'], true),
The second argument is true, so we would unhtml it.

There's a logic error potentially in XF2 right now which basically means that unhtml will always be run with the $entities arg switched on (which is actually what is converting the entities).

Does the old logic above seem to be correct? For example, messages will not store HTML entities in vBulletin (but a user could as you say write them into code blocks etc.)? But titles etc. are (always?) converted to include HTML entities?
 
Yes, applying unhtml on values from specific database table columns is the correct solution.

vBulletin stores post message texts (and I presume private message texts, in general everything that can use BBCode) as entered by the user / without HTML entities. The importer must not call "unhtml" on these values.

Basically everything else (especially the examples mentioned in #4) is stored with HTML entities. The importer has to decode ("unhtml") these values before putting them into the XenForo database.
 
I'm thinking that we should remove the utf8_unhtml call from convertToUtf8(), and instead add it as an option to EntityEmulator::set(), so we would end up with
PHP:
     public function set($field, $value, array $options = [])
    {
        $options = array_replace([
            'convertUtf8' => true,
            'forceConstraint' => true,
            'unHtml' => false
        ], $options);
 
Okay, step one...
Diff:
Index: src/XF/Import/Data/EntityEmulator.php
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
--- src/XF/Import/Data/EntityEmulator.php    (revision 2751d0d4532870fd9dd4b0483268ec991c93e5e0)
+++ src/XF/Import/Data/EntityEmulator.php    (date 1521212333000)
@@ -27,6 +27,21 @@
 
     protected $oldId = null;
 
+    /**
+     * Convert text to UTF8 on set()
+     */
+    const CONVERT_UTF8 = 'convertUtf8';
+
+    /**
+     * Force type constraints on set()
+     */
+    const FORCE_CONSTRAINT = 'forceConstraint';
+
+    /**
+     * Convert HTML entities to raw values on set()
+     */
+    const UNHTML_ENTITIES = 'unHtmlEntities';
+
     public function __construct(
         AbstractData $handler, \XF\Mvc\Entity\Structure $structure, \XF\Mvc\Entity\ValueFormatter $valueFormatter
     )
@@ -50,8 +65,9 @@
     public function set($field, $value, array $options = [])
     {
         $options = array_replace([
-            'convertUtf8' => true,
-            'forceConstraint' => true
+            self::CONVERT_UTF8     => true,
+            self::FORCE_CONSTRAINT => true,
+            self::UNHTML_ENTITIES  => false
         ], $options);
 
         $columns = $this->structure->columns;
@@ -83,11 +99,11 @@
         $vf = $this->valueFormatter;
         $originalValue = $value;
 
-        if ($options['convertUtf8'])
+        if ($options[self::CONVERT_UTF8])
         {
             if ((is_string($value) && !preg_match('/./u', $value)) || (is_object($value) && is_callable([$value, '__toString'])))
             {
-                $value = $this->handler->convertToUtf8(strval($value));
+                $value = $this->handler->convertToUtf8(strval($value), null, $options[self::UNHTML_ENTITIES]);
             }
 
             try
@@ -112,7 +128,7 @@
             throw new \InvalidArgumentException($e->getMessage() . " [$field]", $e->getCode(), $e);
         }
 
-        if (!$vf->applyValueConstraints($value, $column['type'], $column, $error, $options['forceConstraint']))
+        if (!$vf->applyValueConstraints($value, $column['type'], $column, $error, $options[self::FORCE_CONSTRAINT]))
         {
             throw new \InvalidArgumentException("Constraint error for $field: " . $error);
         }
 
Remember that "convertToUtf8" is only called for badly encoded strings (because of !preg_match('/./u', $value)) which for some reason was added in 2.0.3).

I think the unhtml call needs to be moved outside of "convertToUtf8".
 
Remember that "convertToUtf8" is only called for badly encoded strings (because of !preg_match('/./u', $value)) which for some reason was added in 2.0.3).

I think the unhtml call needs to be moved outside of "convertToUtf8".
I'm going to leave it in there, because if we want to unhtml something that we are also utf8-converting, we would use the entities translation system from utf8_unhtml(), whereas if there is no utf8 conversion, we'd use htmlspecialchars_decode()
 
In EntityEmulator::set()...
PHP:
        if ($options[self::CONVERT_UTF8])
        {
            if ((is_string($value) && !preg_match('/./u', $value)) || (is_object($value) && is_callable([$value, '__toString'])))
            {
                $value = $this->handler->convertToUtf8(strval($value), null, $options[self::UNHTML_ENTITIES]);
            }

            try
            {
                $value = $vf->castValueToType($value, $column['type'], $column);
            }
            catch (\Exception $e)
            {
                if (is_string($originalValue) && !preg_match('/./u', $originalValue))
                {
                    $value = utf8_bad_replace($originalValue);
                }
            }
        }
        else if ($options[self::UNHTML_ENTITIES] && (is_string($value) || (is_object($value) && is_callable([$value, '__toString']))))
        {
            $value = htmlspecialchars_decode($value);
        }
 
Trying again...
PHP:
        if ($options[self::CONVERT_UTF8])
        {
            if ($this->isStringy($value))
            {
                if (!$this->isValidUtf8($value))
                {
                    $value = $this->handler->convertToUtf8(strval($value), null, $options[self::UNHTML_ENTITIES]);
                }
                else if ($options[self::UNHTML_ENTITIES])
                {
                    $value = htmlspecialchars_decode(strval($value));
                }
            }

            try
            {
                $value = $vf->castValueToType($value, $column['type'], $column);
            }
            catch (\Exception $e)
            {
                if (is_string($originalValue) && !$this->isValidUtf8($originalValue))
                {
                    $value = utf8_bad_replace($originalValue);
                }
            }
        }
        else if ($options[self::UNHTML_ENTITIES] && $this->isStringy($value))
        {
            $value = htmlspecialchars_decode(strval($value));
        }
 
Top Bottom