Fixed Indexing malformed UTF-8

Rasmus Vind

Well-known member
Affected version
2
This is going to be a very low priority bug report. The search system works great and this shouldn't scare anyone away from using it.

You get the following error:
Code:
XFES\Elasticsearch\BulkRequestException: Elasticsearch indexing error: Elasticsearch bulk action error (first error: [resource-27292] failed to parse, document is empty) src/addons/XFES/Elasticsearch/Api.php:408 

Generated by: Ralle Nov 7, 2018 at 6:10 PM 

Stack trace

#0 src/addons/XFES/Elasticsearch/Api.php(171): XFES\Elasticsearch\Api->bulkRequest('{"index":{"_ind...')
#1 src/addons/XFES/Search/Source/Elasticsearch.php(82): XFES\Elasticsearch\Api->indexBulk(Array)
#2 src/addons/XFES/Search/Source/Elasticsearch.php(57): XFES\Search\Source\Elasticsearch->flushBulkIndexing()
#3 src/XF/Search/Search.php(40): XFES\Search\Source\Elasticsearch->index(Object(XF\Search\IndexRecord))
#4 src/XF/Search/Search.php(59): XF\Search\Search->index('resource', Object(VindIT\Repository\Entity\Resource))
#5 src/XF/Search/Search.php(85): XF\Search\Search->indexEntities('resource', Object(XF\Mvc\Entity\ArrayCollection))
#6 src/XF/Job/SearchRebuild.php(57): XF\Search\Search->indexRange('resource', 27203, '1000')
#7 src/XF/Job/Manager.php(241): XF\Job\SearchRebuild->run(8)
#8 src/XF/Job/Manager.php(187): XF\Job\Manager->runJobInternal(Array, 8)
#9 src/XF/Job/Manager.php(103): XF\Job\Manager->runJobEntry(Array, 8)
#10 src/XF/Admin/Controller/Tools.php(120): XF\Job\Manager->runByIds(Array, 8)
#11 src/XF/Mvc/Dispatcher.php(249): XF\Admin\Controller\Tools->actionRunJob(Object(XF\Mvc\ParameterBag))
#12 src/XF/Mvc/Dispatcher.php(88): XF\Mvc\Dispatcher->dispatchClass('XF:Tools', 'RunJob', 'html', Object(XF\Mvc\ParameterBag), 'tools', Object(XF\Admin\Controller\Tools), NULL)
#13 src/XF/Mvc/Dispatcher.php(41): XF\Mvc\Dispatcher->dispatchLoop(Object(XF\Mvc\RouteMatch))
#14 src/XF/App.php(1931): XF\Mvc\Dispatcher->run()
#15 src/XF.php(329): XF\App->run()
#16 admin.php(13): XF::runApp('XF\\Admin\\App')
#17 {main}
This happens if you try to index malformed UTF-8. In that case the call to json_encode returns false instead of valid JSON, resulting in an invalid request to ES:
Code:
[...]
    {"index":{"_index":"xenforo_full","_type":"resource","_id":27291}}\n
    {"title":"IceTrollWolfrider by FrancK IceTrollWolfrider by FrancK Unit Troll FrancK Truth Troll","message":"","date":1243380978,"user":0,"discussion_id":129188,"node":530,"bundle":129188,"resource_type":"Warcraft3_Model","review_state":"substandard","tag":[10,121],"author":["franck","truth_troll"],"animation":["portrait","portrait_2","portrait_3","portrait_talk","portrait_talk_2"],"has_team_color":true,"has_team_glow":true,"hidden":true}\n
    {"index":{"_index":"xenforo_full","_type":"resource","_id":27292}}\n
    \n
    {"index":{"_index":"xenforo_full","_type":"resource","_id":27293}}\n
    {"title":"Phenelect - The Godslayer Phenelect - The Godslayer Hero Unit Creep Human Undead Direfury","message":"","date":1243387273,"user":188037,"discussion_id":129194,"node":530,"bundle":129194,"resource_type":"Warcraft3_Model","review_state":"substandard","tag":[9,10,105,112,122],"author":["direfury"],"animation":["attack","attack_-_1","attack_-_3","death","dissipate","spell","spell_channel","stand","stand_victory","walk","stand_-_2"],"has_team_color":true,"has_team_glow":true}\n
    {"index":{"_index":"xenforo_full","_type":"resource","_id":27294}}\n
[...]
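
To make the failure mode concrete, here is a minimal sketch of how a `false` return from `json_encode` could turn into the empty document line shown above. The record contents and the way the bulk body is concatenated are assumptions for illustration; this is not the actual XFES code.

```php
<?php
// "\xB1" on its own is not a valid UTF-8 sequence.
$doc = ['title' => "Troll \xB1 Wolfrider", 'message' => ''];

$docJson = json_encode($doc);      // false, not a JSON string
$error   = json_last_error();      // JSON_ERROR_UTF8
$message = json_last_error_msg();  // human-readable description of the error

// Assumed concatenation, for illustration only.
$action = json_encode(['index' => ['_index' => 'xenforo_full', '_type' => 'resource', '_id' => 27292]]);
$body   = $action . "\n" . $docJson . "\n";   // false is cast to ""

var_dump($docJson === false, $error === JSON_ERROR_UTF8);
echo $body;
```

Because `false` is cast to an empty string during concatenation, the action line is followed by a blank line, which Elasticsearch would reject as an empty document.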

I shouldn't provide malformed UTF-8 in the first place, but I still feel the handling inside XFES is inelegant. One way to handle it would be to simply skip indexing the record when the call to json_encode fails. The reason I end up providing invalid UTF-8 is that I parse binary files and validate the extracted strings poorly.
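
The "skip the record" idea could look roughly like the sketch below. `buildBulkBody` is a hypothetical helper written for this post, not a function from XFES, and the record data is made up.

```php
<?php
/**
 * Hypothetical helper: builds an Elasticsearch bulk body and skips any
 * record that cannot be JSON-encoded. Not the actual XFES code.
 *
 * @param array $records map of document ID => document fields
 * @return array [bulk body string, list of skipped IDs]
 */
function buildBulkBody(string $index, string $type, array $records): array
{
    $body = '';
    $skipped = [];

    foreach ($records as $id => $doc) {
        $docJson = json_encode($doc);
        if ($docJson === false) {
            // Encoding failed (e.g. malformed UTF-8): skip instead of
            // emitting an empty document line.
            $skipped[] = $id;
            continue;
        }

        $action = json_encode(['index' => ['_index' => $index, '_type' => $type, '_id' => $id]]);
        $body .= $action . "\n" . $docJson . "\n";
    }

    return [$body, $skipped];
}

[$body, $skipped] = buildBulkBody('xenforo_full', 'resource', [
    27291 => ['title' => 'IceTrollWolfrider by FrancK'],
    27292 => ['title' => "broken \xB1 title"],
    27293 => ['title' => 'Phenelect - The Godslayer'],
]);
```

Returning the skipped IDs lets the caller log them, so a bad record does not abort the whole rebuild but also does not vanish silently.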
 
Thank you for reporting this issue. The issue is now resolved and we are aiming to include the fix in a future XFES release (2.0.2).

Change log:
Handle situations where non-UTF-8 data is passed to Elasticsearch more gracefully.
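
The change log does not say how the data is handled. Purely as an illustration of one possible approach, and not a description of the actual fix, PHP 7.2 and later can substitute invalid byte sequences with U+FFFD during encoding, so the record is still indexed:

```php
<?php
// Assumes PHP >= 7.2, where JSON_INVALID_UTF8_SUBSTITUTE is available.
$doc = ['title' => "Troll \xB1 Wolfrider"];

// Invalid bytes are replaced with U+FFFD instead of making the call fail.
$docJson = json_encode($doc, JSON_INVALID_UTF8_SUBSTITUTE);

// Round-trip to confirm the result is valid JSON.
$decoded = json_decode($docJson, true);
var_dump(is_string($docJson), $decoded['title']);
```

`JSON_INVALID_UTF8_IGNORE` is the sibling flag that drops the invalid bytes rather than replacing them.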
Any changes made as a result of this issue being resolved may not be rolled out here until later.
 