As designed spam phrases: won't match 微信

rebelde

Active member
I tried to block 微信 in the Spam Phrases, but I couldn't get it to block the posts.

The work-around is to make it a regular expression:
/微信/u

Additional documentation suggestion:
Also (as if you didn't have enough things to do!), I recommend that you create more extensive documentation and link to it from the adminCP, especially for Unicode matching. This Regex match seems to work without the /u, but others require it. It was not easy to figure this out.
 
This seems to work as expected for me:

upload_2015-6-28_16-17-0.webp

upload_2015-6-28_16-17-24.webp

Each spam phrase that isn't already a regex is normalised into one, e.g. the regex that runs on that example above is:

Code:
#(?<=\W|^)(微信)(?=\W|$)#iu

Note the Unicode modifier at the end.

Are you certain that your testing was done with a user whose messages are actually checked for spam? For example, moderators, admins, and users who have exceeded certain criteria will not have their messages checked for spam.
 
I'm very surprised that you can't replicate this. It happens on both my test forum and my active forums.

Yes, using the same user with 6 posts (my limit is 10), I edit a post.

If it has 微信, it does not match - edit allowed
If it is /微信/u it matches and blocks - edit blocked.

I change it back and forth, over and over and get the same results.

Chris, I can let you into my test forums if it helps.
 
Checking it on your test forum may be useful.

Submit a ticket from your customer area with details and I will take a quick look.
 
Just an update on this.

I found a specific reproduction case.

I found that something like:

test 微信 test test blah

Would work fine and the message would be rejected accordingly.

But:


Would cause the match to fail and the post be allowed.

I haven't looked into the specifics yet.
 
If you block "test", it won't match "test2" as it looks for non-word characters afterwards. In CJK languages this is potentially problematic, but I'm not aware of a definitive way to have it work as expected in both cases.
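To make the word-boundary behaviour concrete, here is a minimal Python sketch that approximates the normalised pattern quoted earlier in the thread. It is not the forum software's own code: `normalise_phrase` is a hypothetical helper, the sample text `加我微信号` is made up for illustration, and the lookbehind is written as `(?:(?<=\W)|^)` because Python's `re` only accepts fixed-width lookbehind (the meaning is the same as `(?<=\W|^)`).

```python
import re

def normalise_phrase(phrase: str) -> re.Pattern:
    """Hypothetical helper: wrap a plain phrase in whole-word boundaries,
    mirroring the #(?<=\\W|^)(...)(?=\\W|$)#iu pattern quoted in the thread."""
    return re.compile(
        r"(?:(?<=\W)|^)(" + re.escape(phrase) + r")(?=\W|$)",
        re.IGNORECASE,
    )

word_match = normalise_phrase("微信")
substring_match = re.compile("微信")  # rough equivalent of the /微信/u work-around

spaced = "test 微信 test test blah"  # example taken from the thread
embedded = "加我微信号"               # made-up text: phrase surrounded by other CJK characters

print(bool(word_match.search(spaced)))         # True: spaces on both sides are non-word characters
print(bool(word_match.search(embedded)))       # False: CJK neighbours count as word characters
print(bool(substring_match.search(embedded)))  # True: no boundary requirement
print(bool(normalise_phrase("test").search("test2")))  # False: "2" is a word character
```

Because CJK characters are treated as word characters, a phrase that sits directly next to other CJK text never sees the non-word character the pattern demands, which would explain why the plain phrase and the explicit regex behave differently.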
 
Here are a few ideas. The last one (#3) is the easiest:

1. You could test for CJK characters first. If the phrase has CJK, then match the string instead of the word.
We currently use this to catch any Korean due to Korean spam: /[가-힣]/u
You could probably expand that to match Chinese and Japanese.

2. Or just give an error or warning when somebody enters non-regex CJK into the Spam Phrases: "Only use regular expressions for CJK phrases."
3. Just change the text in the AdminCP: "Only use regular expressions for CJK phrases."

Additional text that can reduce confusion: "Use regular expressions to match a string of characters. Phrases without regular expressions will only match exact words."
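Idea #1 could look roughly like the Python sketch below. This is only an illustration of the approach, not how the forum software works: `build_phrase_pattern` and `is_spam` are hypothetical names, and the character ranges are an approximate, non-exhaustive stand-in for "contains CJK".

```python
import re

# Assumption: a rough CJK test covering Hiragana/Katakana, CJK Unified
# Ideographs, and Hangul syllables. It is not an exhaustive list of blocks.
CJK_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7a3]")

def build_phrase_pattern(phrase: str) -> re.Pattern:
    """Hypothetical helper: substring match for CJK phrases, whole-word otherwise."""
    escaped = re.escape(phrase)
    if CJK_CHARS.search(phrase):
        # CJK text is not space-delimited, so match the phrase anywhere.
        return re.compile(escaped, re.IGNORECASE)
    # Non-CJK phrases keep the existing whole-word behaviour.
    return re.compile(r"(?<!\w)" + escaped + r"(?!\w)", re.IGNORECASE)

def is_spam(message: str, phrases: list[str]) -> bool:
    """Return True if any configured phrase matches the message."""
    return any(build_phrase_pattern(p).search(message) for p in phrases)

print(is_spam("加我微信号", ["微信"]))        # True: CJK phrase matched as a substring
print(is_spam("this is test2", ["test"]))   # False: "test" still needs word boundaries
print(is_spam("this is a test", ["test"]))  # True
```

The trade-off is that a CJK phrase will also match inside longer, unrelated words, which is the same risk as writing the regex by hand.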
 
You don't actually have to use a regex -- simply using * around the words is sufficient. The system is basically the same as censoring in this regard (in terms of word detection).

Based on this, I don't think anything is going to explicitly be changed here. I believe this is the only time we've actually had a question surrounding this (regarding CJK) in either censoring or spam prevention.
 
@rebelde
  • If you want to target any sentence containing the word 微信, you can try this:
    Code:
    /(?<!["「«])\b(?:\w+)?微信(?:\w+)?\b/iu
    => this regex will still allow your members to quote those characters, e.g.:
    Code:
    This word "微信"
  • If you want to target any CJK sentence containing the word 微信, you can try this:
    Code:
    /\b(?:[0-9\p{Han}\p{Bopomofo}\p{Hiragana}\p{Katakana}\p{Hangul}0-9A-z]+)?微信(?:[0-9\p{Han}\p{Bopomofo}\p{Hiragana}\p{Katakana}\p{Hangul}0-9A-z]+)?\b/iu
    This regex will accept this kind of text though:
    Code:
    微信abc
  • If you want to block all CJK words on your board (and full-width characters), you can try this:
    Code:
    /\b(?:\w+?)?[\p{Han}\p{Bopomofo}\p{Hiragana}\p{Katakana}\p{Hangul}0-9A-z]+(?:\w+?)?\b/iu
 
Thanks Cédric (and everybody). We like CJK on our board; we just need to keep out the spammers. The Chinese ones mention QQ/微信, so we are blocking that. We haven't found an easy pattern for the Korean ones, so we match against the Unicode range: /[가-힣]/u. I would test your {Hangul} regex, but what we have is working well.

We allow all this, but it is all sent to moderation to see if it is spam. It works great, as long as I can remember how the non-regex matches work...
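For anyone wondering what the /[가-힣]/u range covers, here is a small Python sketch of the same check. `contains_hangul` is a hypothetical name and the sample strings are made up. The range runs from U+AC00 to U+D7A3, the precomposed Hangul syllables, so standalone jamo such as ㅋ are not matched.

```python
import re

# Same range as /[가-힣]/u: precomposed Hangul syllables, U+AC00 to U+D7A3.
HANGUL_SYLLABLES = re.compile("[가-힣]")

def contains_hangul(text: str) -> bool:
    """Hypothetical helper: True if the text has at least one Hangul syllable."""
    return HANGUL_SYLLABLES.search(text) is not None

print(contains_hangul("안녕하세요"))   # True
print(contains_hangul("微信"))        # False: Chinese characters are outside the range
print(contains_hangul("ㅋㅋㅋ"))       # False: compatibility jamo are not syllable blocks
```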
 