XF 2.2 Spam filtering based on character sets

Wildcat Media

We already have some detection set up to catch any URLs in posts from new members.

Is there a way to grab non-English characters? We see spam from Korea, China, etc. that sometimes doesn't use a URL, and they'll sometimes post an attachment with information (like an Instagram ID).

I wondered if there was some way to detect those characters using regex. One forum I'm helping out in has about 20-25 spam posts in a row, obviously a spambot, posted just over one minute apart to bypass the flood control.
 
You would just use regex in the same way you do for the English alphabet.

Code:
/(다|모|아|카|지|노|ざ|경|륜|두|파|칭|코|ぉ)/ui
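
For a list of single characters like this, a character class does the same job as the alternation and is a bit easier to extend. Here is a minimal sketch in Python rather than the PCRE engine XenForo uses, so treat it as a way to try the idea out, not as the filter itself. The function name `looks_like_spam` and the sample strings are made up for illustration.

```python
import re

# Same characters as the alternation above, written as a character class.
# Python 3 strings are Unicode already, so there is no /u flag to set here.
SPAM_CHARS = re.compile(r"[다모아카지노ざ경륜두파칭코ぉ]")


def looks_like_spam(text: str) -> bool:
    """Return True if the text contains any of the listed characters."""
    return SPAM_CHARS.search(text) is not None


print(looks_like_spam("카지노 추천"))    # contains 카, 지 and 노
print(looks_like_spam("Hello, world"))  # none of the listed characters
```

The /i flag makes no difference for these characters since they have no upper or lower case.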
 
Is there a way to include the entire character set or at least the bulk of it, using a range? I know I could specify ^[a-zA-Z0-9]+$ (I think you get the idea--my syntax isn't perfect) if I wanted to include the alphanumeric characters shown. Not knowing the other languages, I wouldn't even know if there is such a thing, or we'd have to do it hit-or-miss by picking out random characters from each language and hoping some of them worked.
 
This will grab any character outside the first 256 code points (ASCII plus the extended Latin-1 range):

/[^\000-\xff]/ui
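
To see what that negated range does and doesn't catch, here is a rough Python equivalent. It is only a sketch for experimenting: Python's `re` is not the PCRE engine XenForo uses, and `has_non_latin1` is a hypothetical helper name.

```python
import re

# Matches any single character whose code point is above U+00FF,
# i.e. anything outside ASCII and the Latin-1 supplement.
NON_LATIN1 = re.compile(r"[^\x00-\xff]")


def has_non_latin1(text: str) -> bool:
    """Return True if the text contains a character above U+00FF."""
    return NON_LATIN1.search(text) is not None


print(has_non_latin1("안녕하세요"))   # Hangul is far above U+00FF
print(has_non_latin1("café"))        # é is U+00E9, still inside the range
print(has_non_latin1("Thanks 😁"))   # emoji are above U+00FF too
```

Accented Western European letters stay inside the allowed range, but emoji do not, so a post containing one would be caught by this pattern on its own.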

But... I wonder: if I trap characters outside the extended ASCII set, are emoji affected by that? They can be let through by adding an emoji range to the negated class, e.g. /[^\x00-\xff\x{1F601}-\x{1f699}]/ui, provided I find out the correct range(s) of hex values for common emoji. Otherwise, a valid post with an emoji will still get trapped in the spam filter. (I don't need to allow all of them, but the most common would be good enough.)

A work in progress here... 1F601 = 😁.
 
OK, this bit of regex craziness seems to let emoji and some other common symbols pass, but grabs other characters which may indicate spam:

/[^\x00-\xff\x{1f300}-\x{1f9ff}\x{2116}-\x{215e}\x{2600}-\x{27ff}]/ui
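
For anyone who wants to poke at this outside the forum, here is a rough Python translation of the same ranges. Python's `re` writes code points as `\uXXXX` / `\UXXXXXXXX` instead of PCRE's `\x{...}`, and the helper name `is_suspicious` and the sample strings are just for illustration.

```python
import re

# Flag any character that is NOT in one of the allowed ranges:
#   U+0000-U+00FF    ASCII + Latin-1
#   U+1F300-U+1F9FF  most pictographic emoji
#   U+2116-U+215E    numero sign through the vulgar fractions
#   U+2600-U+27FF    misc symbols, dingbats, some arrows
SUSPICIOUS = re.compile(
    r"[^\x00-\xff\U0001F300-\U0001F9FF\u2116-\u215e\u2600-\u27ff]"
)


def is_suspicious(text: str) -> bool:
    """Return True if the text has a character outside the allowed ranges."""
    return SUSPICIOUS.search(text) is not None


print(is_suspicious("Nice shot \U0001F601"))  # emoji inside the allowed range
print(is_suspicious("\u2116 5 \u2600"))       # numero sign and sun symbol
print(is_suspicious("카지노"))                 # Hangul is not in any allowed range
print(is_suspicious("don\u2019t"))            # curly apostrophe U+2019
print(is_suspicious("\u2764\ufe0f"))          # heart + variation selector U+FE0F
```

Two things this turns up that may be worth testing on a real forum before relying on the pattern: curly quotes such as U+2019 sit outside every allowed range, and so does the variation selector U+FE0F that some devices append to symbols like the heart, so both get flagged.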

This site is great for testing regex:


(When testing with a lot of different characters, make your suffix /uig to catch all instances, not just the first one. But in XF's spam filter, you'll likely want /ui.)
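
The first-match versus all-matches difference looks like this in Python, where it is a choice of function rather than a flag. This is only an illustration of the idea, with a made-up sample string.

```python
import re

PATTERN = re.compile(r"[^\x00-\xff]")
sample = "abc 다 def 모"

# Like /ui: stop at the first character that matches.
first = PATTERN.search(sample)

# Like /uig in a tester: collect every matching character.
all_hits = PATTERN.findall(sample)

print(first.group())
print(all_hits)
```

For a spam filter, one hit is enough to flag the post, which is why the first-match form is all that's needed there.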

 