XF 2.2 Spam filtering based on character sets

Wildcat Media

Well-known member
We already have some detection set up to catch any URLs posted in posts from new members.

Is there a way to grab non-English characters? We see spam from Korea, China, etc. that sometimes doesn't use a URL, and they'll sometimes post an attachment with information (like an Instagram ID).

I wondered if there was some way to detect those characters using regex. One forum I'm helping out in has about 20-25 spam posts in a row, obviously a spambot, posted just over one minute apart to bypass the flood control.
 

Brogan

XenForo moderator
Staff member
You would just use regex in the same way you do for the English alphabet.

Code:
/(다|모|아|카|지|노|ざ|경|륜|두|파|칭|코|ぉ)/ui
 

Wildcat Media

Well-known member
Is there a way to include the entire character set or at least the bulk of it, using a range? I know I could specify ^[a-zA-Z0-9]+$ (I think you get the idea--my syntax isn't perfect) if I wanted to include the alphanumeric characters shown. Not knowing the other languages, I wouldn't even know if there is such a thing, or we'd have to do it hit-or-miss by picking out random characters from each language and hoping some of them worked.
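For anyone wondering the same thing: ranges do work, because each script occupies contiguous blocks of Unicode code points. Below is a rough sketch in Python rather than XF's PHP regex, so the escapes are written \uXXXX instead of \x{XXXX}. The name contains_cjk and the choice of four blocks are illustrative assumptions only, and the blocks shown are not a complete list for these languages. PCRE with the /u modifier also accepts script properties such as \p{Hangul} and \p{Han}, which Python's built-in re module does not.

```python
import re

# Assumed selection of Unicode blocks; other blocks exist for these scripts.
CJK_BLOCKS = re.compile(
    "["
    r"\u3040-\u309F"  # Hiragana
    r"\u30A0-\u30FF"  # Katakana
    r"\u4E00-\u9FFF"  # CJK Unified Ideographs
    r"\uAC00-\uD7A3"  # Hangul Syllables
    "]"
)


def contains_cjk(text):
    """Return True if text has at least one character from the blocks above."""
    return CJK_BLOCKS.search(text) is not None


if __name__ == "__main__":
    for sample in ["카지노", "ざぉ", "Hello there"]:
        print(sample, contains_cjk(sample))
```

The equivalent PCRE character class would use \x{3040}-\x{309F} and so on inside the brackets.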
 

Wildcat Media

Well-known member
This will match any character outside the first 256 code points (ASCII plus the extended Latin-1 range):

/[^\000-\xff]/ui

But if I trap everything outside the extended ASCII set, emoji will get caught too, since their code points are far above \xff. They can be let through by adding their range to the negated class, e.g. /[^\x00-\xff\x{1F601}-\x{1f699}]/ui, provided I find out the correct range(s) of hex values for common emoji. Otherwise, a valid post with an emoji will still get trapped in the spam filter. (I don't need to allow all of them, but the most common would be good enough.)

A work in progress here... 1F601 = 😁.
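A quick way to check the emoji question is to run the negated class over a few sample strings. This is only a sketch in Python, not XF's filter: Python's re module writes the pattern as [^\x00-\xff] just like PCRE, and the helper name flagged_chars is made up for the illustration.

```python
import re

# Matches any single character whose code point is above U+00FF.
NON_LATIN1 = re.compile(r"[^\x00-\xff]")


def flagged_chars(text):
    """Return every character in text that falls outside U+0000 to U+00FF."""
    return NON_LATIN1.findall(text)


if __name__ == "__main__":
    print(flagged_chars("Hello, café!"))  # accented Latin-1 letters pass
    print(flagged_chars("카지노"))         # Hangul is flagged
    print(flagged_chars("Nice one 😁"))   # the emoji is flagged as well
```

So a plain "everything above \xff" rule does catch emoji unless their ranges are added to the class.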
 

Wildcat Media

Well-known member
OK, this bit of regex craziness seems to let emoji and some other common symbols pass, but grabs other characters which may indicate spam:

/[^\x00-\xff\x{1f300}-\x{1f9ff}\x{2116}-\x{215e}\x{2600}-\x{27ff}]/ui

This site is great for testing regex:


(When testing with a lot of different characters, make your suffix /uig to catch all instances, not just the first one. The g flag is a feature of the tester, not a PHP PCRE modifier, so in XF's spam filter you'll want /ui.)
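For testing offline, the combined pattern can be sketched in Python as well. This is a translation, not what XF runs: Python's re uses \uXXXX and \UXXXXXXXX where PCRE uses \x{...}, and the helper name looks_suspicious is hypothetical.

```python
import re

# Same allowed ranges as the PCRE pattern above, in Python escape syntax.
NOT_ALLOWED = re.compile(
    "[^"
    r"\x00-\xff"              # ASCII plus Latin-1
    r"\U0001F300-\U0001F9FF"  # main emoji ranges
    r"\u2116-\u215E"          # numero sign, trademark, fractions, etc.
    r"\u2600-\u27FF"          # misc symbols, dingbats and neighbouring blocks
    "]"
)


def looks_suspicious(text):
    """Return True if text contains any character outside the allowed ranges."""
    return NOT_ALLOWED.search(text) is not None


if __name__ == "__main__":
    samples = [
        "Great post 😁 \u2600",  # emoji and a sun symbol
        "바카라 사이트",           # Korean text
        "It\u2019s fine",        # curly apostrophe, U+2019
    ]
    for sample in samples:
        print(repr(sample), looks_suspicious(sample))
```

One thing this makes visible: typographic punctuation such as the curly apostrophe (U+2019) sits outside all four allowed ranges, so posts pasted from a word processor may get trapped unless the general punctuation range is added too.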
