Fixed Search query tokenizer turns query "term1 - term2" into "term1 -term2" (i.e. it negates term2)

Steffen

Well-known member
Affected version
2.0.10
The search query tokenizer of XenForo turns the query "term1 - term2" into "term1 -term2": it removes the whitespace in front of "term2" and therefore negates it. I think this is not intuitive, and it doesn't match the behaviour of Elasticsearch's "simple_query_string" tokenizer either.

This can be fixed as follows:

Diff:
--- a/src/XF/Search/Source/AbstractSource.php
+++ b/src/XF/Search/Source/AbstractSource.php
@@ -151,7 +151,6 @@ abstract class AbstractSource
         preg_match_all('/
             (?<=[' . $splitRange .'\-\+\|]|^)
             (?P<modifier>\-|\+|\||)
-            [' . $splitRange .']*
             (?P<term>"(?P<quoteTerm>[^"]+)"|[^' . $splitRange .'\-\+\|]+)
         /ix', $keywords, $matches, PREG_SET_ORDER);

I'm not 100% sure whether this could have unintended side-effects. Maybe whitespace should be allowed after "|" but not after "+" or "-".
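To make the effect of the one-line removal easier to see, here is a rough Python port of the pattern from the diff, built once with and once without the whitespace-skipping line. It is only a sketch: `SPLIT` stands in for XenForo's `$splitRange`, whose real contents are not shown in this thread, so plain whitespace is assumed. The names `build_pattern` and `tokens` are made up for the illustration. Python's `re` rejects lookbehinds whose alternatives differ in width, so `(?<=[...]|^)` is written as `(?:(?<=[...])|^)`, which matches at the same positions.

```python
import re

# Assumption: $splitRange is treated as plain whitespace here.
# The real value used by XenForo is not shown in the thread.
SPLIT = r"\s"


def build_pattern(skip_whitespace_after_modifier: bool) -> "re.Pattern[str]":
    """Port of the preg_match_all pattern, with or without the removed line."""
    skipped = f"[{SPLIT}]*" if skip_whitespace_after_modifier else ""
    return re.compile(
        rf'(?:(?<=[{SPLIT}\-+|])|^)'      # start of input, or after a split/modifier char
        rf'(?P<modifier>-|\+|\||)'        # optional -, + or | modifier
        rf'{skipped}'                     # the line the diff removes
        rf'(?P<term>"(?P<quoteTerm>[^"]+)"|[^{SPLIT}\-+|]+)',
        re.IGNORECASE,
    )


def tokens(pattern: "re.Pattern[str]", keywords: str) -> list:
    """Return (modifier, term) pairs in the order they are matched."""
    return [(m.group("modifier"), m.group("term")) for m in pattern.finditer(keywords)]


original = build_pattern(True)
patched = build_pattern(False)

print(tokens(original, "term1 - term2"))  # term2 picks up the "-" modifier
print(tokens(patched, "term1 - term2"))   # the lone "-" is skipped
print(tokens(patched, "term1 -term2"))    # explicit negation still works
```

Under these assumptions the patched pattern simply ignores a modifier that is followed by whitespace, while "-term2" is still negated.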

PS: When using Elasticsearch, shouldn't XenForo just pass the raw query string to Elasticsearch and let its "simple_query_string" feature handle the query tokenization?
 
You need to filter queries before they hit simple_query_string; it lies about not throwing exceptions on bad inputs.

Even then, various versions barf in different ways depending on search terms so you need to sanitize inputs anyway.
 
Thank you for reporting this issue. The issue is now resolved and we are aiming to include that in a future XF release (2.0.12).

Change log:
Adjust how we parse search query modifiers to be more strict. (- and + require whitespace before and none after, | requires whitespace on both sides. Don't parse doubled-up modifiers.)
Any changes made as a result of this issue being resolved may not be rolled out here until later.
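The rules in the change log can be read roughly as in the Python sketch below. This is not XenForo's actual implementation, which is not shown in this thread. The function name `parse_modifiers` is hypothetical, quoted phrases are left out for brevity, and it is an assumption that a doubled-up modifier such as "--term" falls back to a plain, unmodified term.

```python
MODIFIERS = "+-|"


def parse_modifiers(query: str) -> list:
    """Split a query into (modifier, term) pairs using the stricter rules.

    Assumptions (not confirmed by the thread): terms are separated by
    whitespace, quoted phrases are not handled, and stray or doubled
    modifier characters are stripped rather than applied.
    """
    result = []
    pending_or = False
    for chunk in query.split():
        if chunk == "|":
            # "|" counts only when it has whitespace on both sides.
            pending_or = True
            continue
        modifier = ""
        # "-" and "+" count only when directly attached to the term
        # and not doubled up ("--term", "+-term").
        if len(chunk) > 1 and chunk[0] in "+-" and chunk[1] not in MODIFIERS:
            modifier, chunk = chunk[0], chunk[1:]
        term = chunk.strip(MODIFIERS)
        if not term:
            # A lone "-" or "+" surrounded by whitespace is ignored.
            continue
        if pending_or and not modifier:
            modifier = "|"
        pending_or = False
        result.append((modifier, term))
    return result


print(parse_modifiers("term1 - term2"))
print(parse_modifiers("term1 -term2"))
print(parse_modifiers("term1 | term2"))
print(parse_modifiers("term1 --term2"))
```

With this reading, the query from the original report ("term1 - term2") yields two plain terms, and only "term1 -term2" negates the second term.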
 