XF 2.1 Link/BBCode processing in a post, and spam detection

Wildcat Media

Well-known member
We have been using a regex in "Spam Phrases" to catch any link pasted into the site where the string starts with https:// or http://, and it has worked well: /^https?:\/\/\S+\n/si

I found a new angle. If someone pastes in a link like www.xenforo.com, without the https:// or http:// prefix, it still ends up in the post as a hyperlink. Here's one I have found twice already (the BBCode view):

This is me favorite vapeshop [URL='http://www.aquavape.co.uk']www.aquavape.co.uk[/URL]

I tried a test post and in fact, I can do that here if I type in www.xenforo.com -- it will automatically morph into a hyperlink.

I would like to eliminate this source of spam. Now that they've detected how some of us are preventing spam, we need to mitigate this. There are two ideas I came up with:
  1. Can we somehow disable this automatic hyperlinking when it is not preceded by http or https? (I'm not sure if it's a function of the editor, or something in XenForo that processes it.)
  2. Does XenForo process the Spam Phrases after BBCode is generated? I'm thinking that if this is the case, could we filter on the [url= BBCode to trap the spam?
Or, any other ideas?
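
To make the gap concrete, here is a minimal sketch that runs the existing Spam Phrases pattern against a few sample posts with Python's `re` module. It assumes the pattern behaves the same way as in XenForo's PCRE matching (`s` mapped to `re.S`, `i` to `re.I`) and that the filter sees the raw post text. The helper name `is_flagged` and the sample strings are made up for this illustration.

```python
import re

# The Spam Phrases entry /^https?:\/\/\S+\n/si, translated to Python flags.
# Assumption: the filter searches the raw post text with this pattern.
HTTP_LINK = re.compile(r"^https?://\S+\n", re.S | re.I)


def is_flagged(post_text: str) -> bool:
    """Return True if the pattern matches the post text."""
    return HTTP_LINK.search(post_text) is not None


# Hypothetical sample posts.
samples = {
    "scheme link on first line": "https://www.example.com/page\nCheck this out",
    "bare www link": "www.xenforo.com\nCheck this out",
    "scheme link after some text": "Check this out https://www.example.com/page\n",
}

for label, text in samples.items():
    print(f"{label}: {is_flagged(text)}")
```

Only the first sample is flagged. The bare www link has no scheme to match, and because the pattern has no `m` modifier, `^` matches only at the very start of the text, so a link preceded by other words is missed as well.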
 
You can add multiple strings to the spam phrases to catch other instances.

For example:
Code:
/\[url=("|')?([^"'\]]+)("|')?\].*\[url\]\2\[/si
/\[url=("|')?([^"'\]]+)("|')?\].*\[url=("|')?\2("|')?\]/si
/^[a-z0-9-]+\.(com|net|org)\/\S+\n/si
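
As a rough guide to what each of these three entries targets, the sketch below runs them through Python's `re` module, assuming the `/si` modifiers map to `re.S | re.I` and that the filter sees raw BBCode. The pattern labels, the helper `matching_patterns`, and the `spam.example` URLs are invented for the example; the last sample is the spam post quoted earlier in the thread.

```python
import re

FLAGS = re.S | re.I  # Python equivalents of the /si modifiers

# The three suggested entries, without the surrounding / delimiters.
PATTERNS = {
    # Same URL once as [url=X] and again as [url]X[/url].
    "attr-then-plain": re.compile(
        r"""\[url=("|')?([^"'\]]+)("|')?\].*\[url\]\2\[""", FLAGS),
    # Same URL appearing twice as [url=X].
    "attr-twice": re.compile(
        r"""\[url=("|')?([^"'\]]+)("|')?\].*\[url=("|')?\2("|')?\]""", FLAGS),
    # Post starting with a bare domain that has a path, e.g. example.com/x.
    "bare-domain-path": re.compile(
        r"^[a-z0-9-]+\.(com|net|org)\/\S+\n", FLAGS),
}


def matching_patterns(post_text):
    """Return the labels of every pattern that matches the post text."""
    return [name for name, rx in PATTERNS.items() if rx.search(post_text)]


samples = [
    "[url='http://spam.example/']buy[/url] or [url]http://spam.example/[/url]",
    "[url=http://spam.example/]a[/url] [url=http://spam.example/]b[/url]",
    "spam-example.com/deals\nGreat prices",
    "This is me favorite vapeshop "
    "[URL='http://www.aquavape.co.uk']www.aquavape.co.uk[/URL]",
]

for text in samples:
    print(matching_patterns(text), "<-", repr(text))
```

Each of the first three samples triggers exactly one pattern. The last sample triggers none: the first two patterns need the same URL to appear twice, and the third needs a single-label domain followed by a path at the very start of the post.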
 
That's an interesting batch of regex. ;) The third one makes sense (and I would add quite a few TLDs to that list). The first two seem to capture the [url=] in posts. Those will give me a few options to try in our spam filtering. Thanks much!
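
One way to keep a longer TLD list manageable is to generate the third pattern rather than edit it by hand. This is only a sketch: the TLD list is a made-up starting point, `build_bare_domain_pattern` is a hypothetical helper, and the result keeps the original pattern's structure (one label before the TLD, a path after it, anchored to the start of the post).

```python
import re

# Hypothetical TLD list for illustration; adjust to the spam you actually see.
TLDS = ["com", "net", "org", "info", "biz", "xyz", "co.uk"]


def build_bare_domain_pattern(tlds):
    """Build the body of the third pattern with a longer TLD alternation."""
    alternation = "|".join(re.escape(tld) for tld in tlds)
    return rf"^[a-z0-9-]+\.({alternation})\/\S+\n"


body = build_bare_domain_pattern(TLDS)
print(f"/{body}/si")  # the form that would be pasted into Spam Phrases

# Quick check with Python's re, assuming /si maps to re.S | re.I.
checker = re.compile(body, re.S | re.I)
for text in ["example.info/page\n", "aquavape.co.uk/deals\n", "www.aquavape.co.uk\n"]:
    print(repr(text), bool(checker.search(text)))
```

The first two test strings match. The third does not, because it has a `www.` label in front of the domain and no path, which this pattern shape does not allow for.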
 
Code:
/\[url=("|')?([^"'\]]+)("|')?\].*\[url\]\2\[/si
/\[url=("|')?([^"'\]]+)("|')?\].*\[url=("|')?\2("|')?\]/si
/^[a-z0-9-]+\.(com|net|org)\/\S+\n/si
/^https?:\/\/\S+\n/si

So I have this in Spam phrases, and "Maximum messages to check for spam" is set to 50. What happens with this configuration? Does any post containing a link, from any member with fewer than 50 posts, go to moderation?
 
Thanks. Is there any configuration that would moderate posts containing particular keywords without any post limits? Censor does not work here because it replaces keywords.
 
I am sorry to take a bit more of your time. Just one more query on the same topic: what does "Submit content without approval" in the group permissions do? If I disable it for a user, do all of their posts go to moderation? Or is this connected to the node's moderation settings as well?

Thanks a lot!
 
It allows members to bypass the moderation queue.

We use it here for verified license holders who may otherwise be caught by the spam phrases when posting the first few messages.
 
This is weird. I haven't seen any post land in moderation since I added these to the spam phrases box, so I decided to make a test account and test with it. I posted the following content (to test all three types of links):

Code:
[url=https://google.com/]Google[/url]
https://google.com
www.google.com

And the post was published instantly. Didn't land in moderation. I wonder what I am doing wrong. :confused:
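
For what it's worth, the four patterns can be checked against this exact test post outside XenForo. The sketch below does that with Python's `re` module, assuming `/si` maps to `re.S | re.I`, that the filter sees the post as raw BBCode, and that each line ends with a newline. The names `SPAM_PHRASES`, `TEST_POST`, and `any_match` are invented for the example.

```python
import re

FLAGS = re.S | re.I  # Python equivalents of the /si modifiers

# The four Spam Phrases entries, without the surrounding / delimiters.
SPAM_PHRASES = [
    r"""\[url=("|')?([^"'\]]+)("|')?\].*\[url\]\2\[""",
    r"""\[url=("|')?([^"'\]]+)("|')?\].*\[url=("|')?\2("|')?\]""",
    r"^[a-z0-9-]+\.(com|net|org)\/\S+\n",
    r"^https?:\/\/\S+\n",
]

# Assumption: this is the raw text the filter would see.
TEST_POST = (
    "[url=https://google.com/]Google[/url]\n"
    "https://google.com\n"
    "www.google.com\n"
)


def any_match(text, flags=FLAGS):
    """Return True if at least one spam phrase matches the text."""
    return any(re.search(pattern, text, flags) for pattern in SPAM_PHRASES)


print(any_match(TEST_POST))                # as configured
print(any_match(TEST_POST, FLAGS | re.M))  # with ^ allowed at each line start
```

As configured, nothing matches: the two `[url=` patterns need the same URL to appear twice, and the two `^` patterns only look at the very start of the post, which here is `[`. With multiline matching (`re.M` in Python, the `m` modifier in PCRE) the last pattern does match the second line, though whether adding `m` behaves the same in Spam Phrases is untested here.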
 
I am kind of baffled now. Shouldn't it work if I just add [url to the spam phrases? It seems to but OP says it doesn't! Was this behavior modified in 2.2? Thanks!
 
I was revisiting this today and came to the same conclusion: the above regex is not catching any example starting with [url. However, I was playing around at Regex 101 and came up with this:

/\[url(.*)\]/is

And that seems to work... as regex. Untested in XF 2.2 though. The other examples above did not work at Regex 101. Essentially it will catch [url followed by a closing ], so it seems to catch any combination of URL and quotes. Maybe someone with more regex knowledge than myself can improve on it.
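
Here is a small sketch of how that pattern behaves, again using Python's `re` module as a stand-in for PCRE (`/is` mapped to `re.I | re.S`). The `TIGHT` variant is only a suggestion of mine, not something tested in XenForo: it stops at the first closing bracket instead of running to the last one.

```python
import re

FLAGS = re.I | re.S  # Python equivalents of the /is modifiers

GREEDY = re.compile(r"\[url(.*)\]", FLAGS)   # the pattern from the post
TIGHT = re.compile(r"\[url[^\]]*\]", FLAGS)  # hypothetical narrower variant

post = "[url=https://google.com/]Google[/url]\nhttps://google.com\nwww.google.com"

# Both flag the post; they differ only in how much text the match covers.
print(GREEDY.search(post).group(0))
print(TIGHT.search(post).group(0))

# Neither pattern sees a plain-text link that has no BBCode around it.
print(GREEDY.search("www.google.com"))
print(TIGHT.search("www.google.com"))
```

The greedy version matches from `[url` through the last `]` in the post (here the end of `[/url]`), while the tight version matches only the opening tag. Either is enough to flag a post that contains a `[url` tag, but a bare www.google.com would only be caught if the filter runs after the link has been converted to BBCode, which is still an open question in this thread.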

Likewise, /^https?:\/\/\S+\n/si still seems to work.

So I also might guess that the spam filtering is done after the post is parsed with BBCode? I mean, some spammers out there may just dump in text containing a link, but a few others might include the BBCode in their post text, knowing that most forums will render it.
 