As designed spam phrases: won't match 微信

rebelde · Jun 28, 2015

I tried to block 微信 in the Spam Phrases, but I couldn't get it to block the posts.

The work-around is to make it a regular expression:
/微信/u

Additional documentation suggestion:
Also (as if you didn't have enough things to do!), I recommend that you create more extensive documentation and link to it from the adminCP, especially for Unicode matching. This Regex match seems to work without the /u, but others require it. It was not easy to figure this out.

Chris D · Jun 28, 2015

This seems to work as expected for me:

Each spam phrase that isn't already a regex is normalised into one, e.g. the regex that runs on that example above is:

Code:

#(?<=\W|^)(微信)(?=\W|$)#iu

Note the Unicode modifier (`u`) at the end.

Are you certain that you tested with a user whose messages are actually checked for spam? For example, moderators/admins and users who have exceeded certain criteria will not have their messages checked for spam.

rebelde · Jun 28, 2015

I'm very surprised that you can't replicate this. It happens on both my test forum and my active forums.

Yes, using the same user with 6 posts (my limit is 10), I edit a post.

If it has 微信, it does not match - edit allowed
If it is /微信/u it matches and blocks - edit blocked.

I change it back and forth, over and over and get the same results.

Chris, I can let you into my test forums if it helps.

Chris D · Jun 29, 2015

Checking it on your test forum may be useful.

Submit a ticket from your customer area with details and I will take a quick look.

rebelde · Jun 30, 2015

Ticket submitted. Thanks.

Chris D · Jul 1, 2015

Just an update on this.

I found a specific reproduction case.

I found that something like:

test 微信 test test blah

Would work fine and the message would be rejected accordingly.

But:

微信2

Would cause the match to fail and the post to be allowed.

I haven't looked into the specifics yet.

Mike · Jul 1, 2015

If you block "test", it won't match "test2" as it looks for non-word characters afterwards. In CJK languages this is potentially problematic, but I'm not aware of a definitive way to have it work as expected in both cases.
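The behaviour described here can be reproduced outside XenForo. Below is a minimal Python sketch, not XenForo's actual code: `phrase_to_pattern` is a hypothetical helper, and because Python's `re` rejects the variable-width lookbehind `(?<=\W|^)`, the sketch uses the equivalent negative lookarounds `(?<!\w)` and `(?!\w)` instead.

```python
import re

def phrase_to_pattern(phrase: str) -> re.Pattern:
    # Hypothetical helper: escape the phrase and require that it is not
    # directly preceded or followed by a word character. Python str patterns
    # are Unicode-aware by default, so no separate "u" flag is needed.
    return re.compile(r"(?<!\w)(" + re.escape(phrase) + r")(?!\w)", re.IGNORECASE)

pattern = phrase_to_pattern("微信")

# Surrounded by spaces (non-word characters): the phrase is found.
print(bool(pattern.search("test 微信 test test blah")))  # True

# Followed by a digit, which counts as a word character: no match.
print(bool(pattern.search("微信2")))  # False

# Running Chinese text has no spaces, so the neighbouring characters
# are word characters as well: no match.
print(bool(pattern.search("加我微信吧")))  # False
```

The last case is the CJK problem in a nutshell: a rule that works for space-separated languages never sees a "word boundary" inside an unspaced sentence.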

rebelde · Jul 1, 2015

Here are a few ideas. The last one (#3) is the easiest:

1. You could test for CJK characters first. If the phrase has CJK, then match the string instead of the word.
We currently use this to catch any Korean due to Korean spam: /[가-힣]/u
You could probably expand that to match Chinese and Japanese.

2. Or just give an error or warning when somebody enters non-regex CJK into the Spam Phrases: "Only use regular expressions for CJK phrases."
3. Just change the text in the AdminCP: "Only use regular expressions for CJK phrases."

Additional text that can reduce confusion: "Use regular expressions to match a string of characters. Phrases without regular expressions will only match exact words."
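Idea 1 above could be sketched roughly as follows. This is a Python illustration under stated assumptions, not XenForo code: `is_spam` and `CJK_CHARS` are hypothetical names, and the character ranges (Hangul syllables, CJK Unified Ideographs, Hiragana, Katakana) are a deliberately incomplete approximation of "CJK".

```python
import re

# Assumed approximation of "CJK": Hangul syllables (U+AC00-U+D7A3, i.e. 가-힣),
# CJK Unified Ideographs, Hiragana and Katakana. Extension blocks are left out.
CJK_CHARS = re.compile(r"[\uac00-\ud7a3\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff]")

def is_spam(phrase: str, message: str) -> bool:
    """Hypothetical check: substring match for CJK phrases, whole-word otherwise."""
    if CJK_CHARS.search(phrase):
        # CJK text has no spaces between words, so match the raw string.
        return phrase.lower() in message.lower()
    # Non-CJK phrases keep the whole-word behaviour.
    word = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return re.search(word, message, re.IGNORECASE) is not None

print(is_spam("微信", "微信2"))         # True
print(is_spam("微信", "加我微信吧"))     # True
print(is_spam("test", "test2"))        # False
print(is_spam("test", "a test here"))  # True
```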

Mike · Jul 6, 2015

You don't actually have to use a regex -- simply using * around the words is sufficient. The system is basically the same as censoring in this regard (in terms of word detection).
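A rough Python sketch of how `*` wildcards could change the outcome. The thread does not say what `*` expands to inside XenForo, so the `\w*` expansion below is an assumption, and `wildcard_phrase_to_pattern` is a hypothetical name.

```python
import re

def wildcard_phrase_to_pattern(phrase: str) -> re.Pattern:
    # Escape the phrase, then turn each escaped "*" into "\w*" (an assumed
    # expansion; the real one may differ) and keep the whole-word lookarounds.
    body = re.escape(phrase).replace(r"\*", r"\w*")
    return re.compile(r"(?<!\w)(" + body + r")(?!\w)", re.IGNORECASE)

plain = wildcard_phrase_to_pattern("微信")
starred = wildcard_phrase_to_pattern("*微信*")

# Print, for each sample, whether the plain and the starred phrase match.
for text in ["微信2", "加我微信吧"]:
    print(text, bool(plain.search(text)), bool(starred.search(text)))
```

Under that assumption, both samples slip past the plain phrase and are caught by the starred one, because the wildcards absorb the neighbouring word characters.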

Based on this, I don't think anything is going to explicitly be changed here. I believe this is the only time we've actually had a question surrounding this (regarding CJK) in either censoring or spam prevention.

cclaerhout · Jul 6, 2015

@rebelde

If you want to target any sentence containing the word 微信, you can try this:
Code:
```
/(?<!["「«])\b(?:\w+)?微信(?:\w+)?\b/iu
```
This regex still allows your members to quote those characters, i.e.:
Code:
```
This word "微信"
```
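
To see how the quote exemption plays out, here is a Python port of the pattern above, offered as an illustration only. Python's `re` is not PCRE, so edge cases may differ from what XenForo evaluates, and the sample strings are made up for the demonstration.

```python
import re

# Port of the pattern above. Python str patterns are Unicode-aware by default,
# so only the case-insensitive flag is passed explicitly.
pattern = re.compile(r'(?<!["「«])\b(?:\w+)?微信(?:\w+)?\b', re.IGNORECASE)

samples = [
    "加我微信吧",         # unquoted, inside running text
    "微信2",             # unquoted, followed by a digit
    'This word "微信"',  # directly preceded by a quote character
]
for text in samples:
    print(text, "->", "blocked" if pattern.search(text) else "allowed")
```

In this port the first two samples are blocked and the quoted one is allowed, since the lookbehind rejects a match that starts right after `"`, `「` or `«`.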

If you want to target only CJK sentences containing the word 微信, you can try this:

Code:

/\b(?:[0-9\p{Han}\p{Bopomofo}\p{Hiragana}\p{Katakana}\p{Hangul}０-９Ａ-ｚ]+)?微信(?:[0-9\p{Han}\p{Bopomofo}\p{Hiragana}\p{Katakana}\p{Hangul}０-９Ａ-ｚ]+)?\b/iu

Note that this regex will still let through this kind of text:

Code:

微信abc

If you want to block all CJK words (and full-width characters) on your board, you can try this:
Code:
```
/\b(?:\w+?)?[\p{Han}\p{Bopomofo}\p{Hiragana}\p{Katakana}\p{Hangul}０-９Ａ-ｚ]+(?:\w+?)?\b/iu
```

rebelde · Jul 6, 2015

Thanks Cédric (and everybody). We like CJK on our board; we just need to keep out the spammers. The Chinese ones mention QQ/微信, so we are blocking that. We haven't found an easy pattern for the Korean ones, so we match against the Unicode range: /[가-힣]/u. I would test your \p{Hangul} regex, but what we have is working well.

We allow all this, but it is all sent to moderation to see if it is spam. It works great, as long as I can remember how the non-regex matches work...
