Fixed vB4 Import: InvalidArgumentException: Received invalid UTF-8 for string column [username] src/XF/Import/Data/EntityEmulator.php:93

Affected version
2.0.2

DragonByte Tech

Well-known member
#1
  • InvalidArgumentException: Received invalid UTF-8 for string column [username]

  • src/XF/Import/Data/EntityEmulator.php:93
  • Generated by: Unknown account

  • Mar 7, 2018 at 1:28 PM
Stack trace
#0 src/XF/Import/Data/User.php(123): XF\Import\Data\EntityEmulator->set('username', 'Heo Huy\xC3\xA1\xC2\xBB\xEF\xBF\xBEn...', Array)
#1 src/XF/Import/Data/AbstractData.php(292): XF\Import\Data\User->set('username', 'Heo Huy\xE1\xBB\x81n Tho...')
#2 src/XF/Import/Importer/vBulletin.php(1030): XF\Import\Data\AbstractData->__set('username', 'Heo Huy\xE1\xBB\x81n Tho...')
#3 src/XF/Import/Importer/vBulletin.php(988): XF\Import\Importer\vBulletin->setupImportUser(Array, Object(XF\Import\StepState), Array)
#4 src/XF/Import/Runner.php(160): XF\Import\Importer\vBulletin->stepUsers(Object(XF\Import\StepState), Array, 8)
#5 src/XF/Import/Runner.php(74): XF\Import\Runner->runStep('users', Object(XF\Import\StepState), 8)
#6 src/XF/Cli/Command/Import.php(66): XF\Import\Runner->run()
#7 src/vendor/symfony/console/Command/Command.php(242): XF\Cli\Command\Import->execute(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#8 src/vendor/symfony/console/Application.php(843): Symfony\Component\Console\Command\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#9 src/vendor/symfony/console/Application.php(194): Symfony\Component\Console\Application->doRunCommand(Object(XF\Cli\Command\Import), Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#10 src/vendor/symfony/console/Application.php(117): Symfony\Component\Console\Application->doRun(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#11 src/XF/Cli/Runner.php(63): Symfony\Component\Console\Application->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#12 cmd.php(15): XF\Cli\Runner->run()
#13 {main}
I am trying to import from an ISO-8859-1 charset (language) / latin1_swedish_ci (database) to utf8mb4.
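
For illustration, the two username values in frames #1 and #0 of the stack trace are consistent with a string that was already UTF-8 being converted again as if it were single-byte text. A minimal sketch of that effect, assuming strict ISO-8859-1 as the source charset (latin1ToUtf8() is a hypothetical helper written for this example, not XenForo's converter):

```php
<?php
// Hypothetical helper: treat every input byte as one ISO-8859-1 character
// and emit its UTF-8 encoding. Pure PHP, no extensions required.
function latin1ToUtf8(string $bytes): string
{
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++)
    {
        $c = ord($bytes[$i]);
        if ($c < 0x80)
        {
            // ASCII bytes are identical in both encodings.
            $out .= chr($c);
        }
        else
        {
            // U+0080..U+00FF always become a two-byte UTF-8 sequence.
            $out .= chr(0xC0 | ($c >> 6)) . chr(0x80 | ($c & 0x3F));
        }
    }
    return $out;
}

// The three bytes \xE1\xBB\x81 seen in frame #1 are already valid UTF-8.
$alreadyUtf8 = "\xE1\xBB\x81";

// Converting them again turns each byte into its own character.
echo bin2hex(latin1ToUtf8($alreadyUtf8)), "\n"; // c3a1c2bbc281
```

The first four bytes match frame #0 of the trace (\xC3\xA1\xC2\xBB). The final character differs there (\xEF\xBF\xBE), so the importer's converter evidently handles byte 0x81 differently from this sketch.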

Fillip
 

Chris D

XenForo developer
Staff member
#2
We did receive another report of this after we fixed a similar bug in 2.0.1:

Downloaded 2.0.1, had the same problem today:

InvalidArgumentException: Received invalid UTF-8 for string column [username] in src/XF/Import/Data/EntityEmulator.php at line 79.

Importing ~140,000 users from vBulletin 4. Character set on the user table is Latin1; running a script to check, about 200 of the usernames do not match UTF-8 via mb_check_encoding().

Tried import with ISO-8859-1, Latin1, and UTF-8 character sets, failed each time with the above.
Is there any way that you can send me a dump of the usernames, and I'll see if I can manage to reproduce this? We had a test case for this and the changes did fix it, but it's possible there are other situations that could trigger it.
I don't believe that user ever got back to us, though, so if you can provide the details Mike asked for, it will aid in troubleshooting.
 

DragonByte Tech

Well-known member
#4
I'm looking into this myself, as continued work on eCommerce depends on this import at least completing, and I've found two problematic users in particular: in the dump I sent you, look at user IDs 622 and 888 specifically.

If I am reading this dump correctly, 622 is correctly encoded, but 888 was somehow UTF-8 encoded already.

I tested this hypothesis by changing src/XF/Import/Importer/vBulletin.php roughly around line 1030 to the following:

PHP:
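        $options = [];
        // If the username already validates as UTF-8, skip the charset
        // conversion so that it is not encoded a second time.
        if (preg_match('/./u', $user['username']))
        {
            $options['convertUtf8'] = false;
        }

        // username
        $import->set('username', $user['username'], $options);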
That worked.
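
As a side note on the check used above: preg_match() with the u modifier returns false when the subject is not valid UTF-8, 0 when the subject is valid but nothing matches, and 1 on a match. A small sketch of those cases (the function names are made up for this example; only preg_match() itself is a PHP built-in):

```php
<?php
// The check from the snippet above: truthy only for valid UTF-8 that
// contains at least one non-newline character.
function matchesDotU(string $bytes)
{
    return preg_match('/./u', $bytes);
}

// Variant with an empty pattern: true for any valid UTF-8, including ''.
function isValidUtf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

var_dump(matchesDotU("Heo Huy\xE1\xBB\x81n")); // int(1): valid UTF-8
var_dump(matchesDotU("caf\xE9"));               // bool(false): lone Latin-1 byte
var_dump(matchesDotU(""));                      // int(0): valid, but nothing to match
var_dump(isValidUtf8(""));                      // bool(true)
```

Because `.` needs one non-newline character, '/./u' also reports 0 for empty strings, so a negated check treats them the same as invalid input. The empty pattern '//u' avoids that if only validity matters.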

However, the same issue occurs on the custom field "website" for another user, so I would say a more permanent fix would be to change the relevant code in src/XF/Import/Data/EntityEmulator.php to:

PHP:
        if ($options['convertUtf8'])
        {
            if ((is_string($value) && !preg_match('/./u', $value)) || (is_object($value) && is_callable([$value, '__toString'])))
            {
                $value = $this->handler->convertToUtf8(strval($value));
            }
        }
That resolved the error for user names, and imported both usernames of 622 and 888 correctly. However, it shifted the error to
InvalidArgumentException: Received invalid UTF-8 for string column [about] in src/XF/Import/Data/EntityEmulator.php at line 93
I'll continue investigating.


Fillip
 

DragonByte Tech

Well-known member
#5
After some experiments, the problem in this particular case is that the string is aaaalmost valid UTF-8 but not valid enough to satisfy the /./u regexp.

String in question:
د/ابراهيم فقي:يمكن للرجل أن يخفق مرات عديدة, لكنه لا يصب�* فاشلا إلا �*ين يبدأ في لوم الآخرين على إخفاقه علم النفس, Global Positioning System العمل التطوعي, البرمجة لا ت�*زن
(Hopefully this is not a rude message in Arabic...)

In other words, I believe that relying on the /./u regexp is sub-optimal. I'll continue to experiment.

EDIT: Changing the relevant block in EntityEmulator.php to
PHP:
        if ($options['convertUtf8'])
        {
            if (is_string($value))
            {
                $value = utf8_bad_replace($value);
            }
            
            if ((is_string($value) && !preg_match('/./u', $value)) || (is_object($value) && is_callable([$value, '__toString'])))
            {
                $value = $this->handler->convertToUtf8(strval($value));
            }
        }
seems to have corrected the string.

I'll continue the import and see if this causes any further issues.


Fillip
 

DragonByte Tech

Well-known member
#6
Turns out, that wasn't the correct fix either, as it killed valid UTF-8 because the string wasn't UTF-8 encoded.

This, however, appears to be a valid fix:
Diff:
diff -r /Users/filliph/Downloads/xenforo_2.0.2_1921634965_full/upload/src/XF/Import/Data/EntityEmulator.php /Users/filliph/Downloads/EntityEmulator.php
76a77,78
>         $vf = $this->valueFormatter;
>         $originalValue = $value;
79c82
<             if (is_string($value) || (is_object($value) && is_callable([$value, '__toString'])))
---
>             if ((is_string($value) && !preg_match('/./u', $value)) || (is_object($value) && is_callable([$value, '__toString'])))
82a87,89
>             try
>             {
>                 $value = $vf->castValueToType($value, $column['type'], $column);
84,86c91,98
<
<         $vf = $this->valueFormatter;
<
---
>             catch (\Exception $e)
>             {
>                 if (is_string($originalValue) && !preg_match('/./u', $originalValue))
>                 {
>                     $value = utf8_bad_replace($originalValue);
>                 }
>             }
>         }
If casting fails, I run utf8_bad_replace on the original value and pass the result on to the rest of the function. I've tested this with the two users above, as well as the Arabic string above, and it appears to store all three of these as valid strings in the database.
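
To illustrate why the replacement only works as a last resort: a byte-level cleanup turns every byte that is not part of a well-formed UTF-8 sequence into a placeholder, so a plain Latin-1 string loses its accented characters if it is cleaned before being converted. The sketch below uses a hypothetical stand-in, replaceBadUtf8(), which is an assumption about the general behaviour of such a function and not the actual utf8_bad_replace implementation:

```php
<?php
// Hypothetical stand-in for a "bad UTF-8 replace" helper: keeps well-formed
// UTF-8 sequences and swaps every other byte for $replacement.
function replaceBadUtf8(string $bytes, string $replacement = '?'): string
{
    // Group 1 matches one well-formed UTF-8 sequence; the trailing "." (with
    // the s flag, no u flag) matches any single leftover byte.
    $pattern = '/(
          [\x00-\x7F]
        | [\xC2-\xDF][\x80-\xBF]
        | \xE0[\xA0-\xBF][\x80-\xBF]
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
        | \xED[\x80-\x9F][\x80-\xBF]
        | \xF0[\x90-\xBF][\x80-\xBF]{2}
        | [\xF1-\xF3][\x80-\xBF]{3}
        | \xF4[\x80-\x8F][\x80-\xBF]{2}
    ) | . /xs';

    return preg_replace_callback($pattern, function (array $m) use ($replacement)
    {
        // Group 1 is only set when a well-formed sequence matched.
        return (isset($m[1]) && $m[1] !== '') ? $m[1] : $replacement;
    }, $bytes);
}

// Valid UTF-8 passes through untouched.
echo bin2hex(replaceBadUtf8("\xE1\xBB\x81")), "\n"; // e1bb81

// A Latin-1 encoded "cafe" with an accented e (byte \xE9) loses that character.
echo replaceBadUtf8("caf\xE9"), "\n"; // caf?
```

Running the cleanup only after conversion has failed, as in the diff above, keeps this loss limited to strings that could not be stored otherwise.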


Fillip
 

Fethi.dz

Active member
#7
Hello Fillip,

I know it’s hard to work with a language that you don’t know, it’s like walking with closed eyes :X3:

Anyway, I can tell you the above sentence doesn’t contain any rude words :p but I noticed that every letter ح is replaced with �*, which is something you might need to look at.

It should be like this:
د/ابراهيم فقي:يمكن للرجل أن يخفق مرات عديدة, لكنه لا يصبح فاشلا إلا حين يبدأ في لوم الآخرين على إخفاقه علم النفس, Global Positioning System العمل التطوعي, البرمجة لا تحزن
Good luck (y)
 

DragonByte Tech

Well-known member
#8
Hehe, thanks :D

I think the problem was that somehow, the message was saved with UTF-8 characters even though vBulletin 4 and the database don't support UTF-8. I'm not sure if there's anything we could do to change that, to be honest 🤔

In most cases, if an Arabic speaking site was running vB4, they would probably be using UTF-8 in the database already, or at least not latin1_swedish_ci like we are @ DBTech.


Fillip
 

Chris D

XenForo developer
Staff member
#9
Would you mind posting the changed block of code? The diff is quite confusing and won't import because of the paths etc. I think I've reproduced what the diff is saying manually, but I just want to make sure.
 

DragonByte Tech

Well-known member
#10
This is the entire set function from EntityEmulator.

PHP:
public function set($field, $value, array $options = [])
{
    $options = array_replace([
        'convertUtf8' => true,
        'forceConstraint' => true
    ], $options);

    $columns = $this->structure->columns;
    $column = $columns[$field];

    if (isset($columns[$field]))
    {
        if (is_null($value) && empty($column['nullable']))
        {
            $value = $this->getValidEmptyValue($column['type']);
        }
        else if ($column['type'] == Entity::STR && empty($column['nullable']))
        {
            // TODO: this does mean we can't have leading whitespace at all, but perhaps that's not a bad thing
            $value = ltrim(strval($value));
        }
    }
    else
    {
        throw new \InvalidArgumentException("Unknown column '$field'");
    }
    
    $vf = $this->valueFormatter;
    $originalValue = $value;
    
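    if ($options['convertUtf8'])
    {
        // Only convert strings that do not already validate as UTF-8.
        // Objects with __toString() are always converted.
        if ((is_string($value) && !preg_match('/./u', $value)) || (is_object($value) && is_callable([$value, '__toString'])))
        {
            $value = $this->handler->convertToUtf8(strval($value));
        }

        try
        {
            // Casting validates the value, so a bad conversion throws here.
            $value = $vf->castValueToType($value, $column['type'], $column);
        }
        catch (\Exception $e)
        {
            // Fall back to the original string with its invalid bytes replaced.
            if (is_string($originalValue) && !preg_match('/./u', $originalValue))
            {
                $value = utf8_bad_replace($originalValue);
            }
        }
    }
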
    
    try
    {
        $value = $vf->castValueToType($value, $column['type'], $column);
    }
    catch (\Exception $e)
    {
        throw new \InvalidArgumentException($e->getMessage() . " [$field]", $e->getCode(), $e);
    }

    if (!$vf->applyValueConstraints($value, $column['type'], $column, $error, $options['forceConstraint']))
    {
        throw new \InvalidArgumentException("Constraint error for $field: " . $error);
    }

    $this->entityData[$field] = $value;

    return true;
}
The main change is inside if ($options['convertUtf8']) but the $vf assignment was also moved up to accommodate the change.
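
Stripped of the XenForo specifics, the flow of the changed block can be sketched as a standalone function. This is a simplification under stated assumptions: normalizeImportedString(), isValidUtf8() and latin1ToUtf8() are hypothetical names, the converter and the cleanup are passed in as callables, and validity is checked directly with preg_match() where the real code relies on castValueToType() throwing.

```php
<?php
// True for any valid UTF-8 string, including the empty string.
function isValidUtf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

// Hypothetical converter: strict ISO-8859-1 to UTF-8, byte by byte.
function latin1ToUtf8(string $bytes): string
{
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++)
    {
        $c = ord($bytes[$i]);
        $out .= $c < 0x80
            ? chr($c)
            : chr(0xC0 | ($c >> 6)) . chr(0x80 | ($c & 0x3F));
    }
    return $out;
}

// Sketch of the three-step flow; $convert and $scrub stand in for the
// importer's charset converter and for a bad-byte replacement helper.
function normalizeImportedString(string $value, callable $convert, callable $scrub): string
{
    // 1. Already valid UTF-8: keep it, so it is not encoded twice.
    if (isValidUtf8($value))
    {
        return $value;
    }

    // 2. Otherwise convert from the source charset.
    $converted = $convert($value);
    if (isValidUtf8($converted))
    {
        return $converted;
    }

    // 3. Conversion did not help: clean the original bytes as a last resort.
    return $scrub($value);
}

// Crude cleanup for the example only: replace every non-ASCII byte with '?'.
$scrub = function (string $bytes): string
{
    return preg_replace('/[\x80-\xFF]/', '?', $bytes);
};

echo bin2hex(normalizeImportedString("caf\xE9", 'latin1ToUtf8', $scrub)), "\n"; // 636166c3a9
```

Passing the converter and cleanup as callables keeps the sketch independent of any particular charset library; in the real function those roles are played by convertToUtf8() and utf8_bad_replace().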


Fillip
 