Regex to prevent URLs in custom user fields

Fred.

Well-known member
Hi,

I'm trying to prevent users and bots from entering URLs, or even better hidden URLs, in custom user fields. The solution seems to be a regex, but I don't understand regex. Can someone help me with that?

Thanks in advance :)
 
I have a very basic one which will disallow http

Code:
^((?!http).)*$

Not sure about https though. I also don't really understand regex that well, so hopefully someone helpful who does will come along with something better.
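
For what it's worth, a quick check suggests the pattern above already covers https, because "https" contains "http". Here is a minimal sketch in Python, used only as a convenient way to try the pattern (the forum software's own regex engine may behave differently, and the sample values are made up):

```python
import re

# The pattern from the post: no character may be the start of "http".
NO_HTTP = re.compile(r"^((?!http).)*$")

def is_allowed(value: str) -> bool:
    """Return True when the value contains no lowercase 'http' substring."""
    return NO_HTTP.match(value) is not None

print(is_allowed("Netherlands"))               # True
print(is_allowed("http://spam.example"))       # False
print(is_allowed("see https://spam.example"))  # False
print(is_allowed("HTTP://spam.example"))       # True (the match is case sensitive)
```

The last line shows the obvious gap: uppercase variants slip through unless the match is made case insensitive.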

Worth trying this

Code:
^(http|https)://
 
Thanks, but they are smart... That's not going to be enough to stop them. I need something more advanced.
 
I spent a little while looking into a regex for you. It's really not that easy (for me anyway).

Firstly, it has to not match if the value contains certain strings (for example "www", "http", ".com"). This in itself seems fairly non-regular (negative lookaheads).
It also has to do this for multiple strings. I was hoping @Mike the regex wizard would help you.

This is what I have come up with so far:

^(?!.*(www|http|\.com)).*$

you can test it here:
https://regex101.com/r/loPLBF/1

The way it is written makes it fairly easy to build upon for your own ideas

You might want to make it case insensitive, plus whatever else the regex experts come up with (I am by no means one).

There is a whole bunch of regex ideas for URL detection here
https://mathiasbynens.be/demo/url-regex
 
For spam, I guess the most common url extensions are

.net, .ru, .org, .com, .biz

so the regex can easily be extended to:

Code:
^(?!.*(www|http|\.com|\.net|\.org|\.ru|\.biz)).*$

I'm not currently sure how you make a negative lookahead case insensitive (usually you can stick an i on the end of the enclosing delimiters, but that won't work here).

I guess the hacky way would be to stick in all the case variations, until someone comes up with a better regex solution
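
On the case question: in PCRE and in Python, case insensitivity can be switched on either with a flag outside the pattern or with an inline (?i) at the very start, and both also apply inside a lookahead. Below is a rough Python sketch, assuming the field value is validated by matching the whole pattern. The names BLOCKED and is_allowed are made up for illustration, and the dots are escaped so that .com only matches a literal dot:

```python
import re

# Alternatives to block; "\." means a literal dot rather than "any character".
BLOCKED = r"www|http|\.com|\.net|\.org|\.ru|\.biz"

# Option 1: pass the flag separately.
by_flag = re.compile(rf"^(?!.*({BLOCKED})).*$", re.IGNORECASE)

# Option 2: inline flag at the very start, for places that only accept a bare pattern.
inline = re.compile(rf"(?i)^(?!.*({BLOCKED})).*$")

def is_allowed(value: str, pattern=by_flag) -> bool:
    """Return True when none of the blocked strings appear, in any letter case."""
    return pattern.match(value) is not None

for sample in ["Belgium", "welcome", "Visit SPAM.COM", "WwW.spam.example"]:
    print(sample, is_allowed(sample), is_allowed(sample, inline))
```

With the dot escaped, a harmless word such as "welcome" is accepted. An unescaped .com would reject it, because "lcom" counts as "any character followed by com".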
 
Thanks guys, I know this is not easy. I could use that but it's far from perfect and easy to get around.
Ah, I didn't think about lower and uppercase. Then I have to add .com|.COM|.Com|.COm|.cOM|.coM just for one TLD.
I was messing around with it and found out that (.*) should prevent lower and uppercase, but it doesn't in this case o_O
I still hope someone comes up with a better regex. I believe @Jake Bunce and @EQnoble are also very good with regex. :D
I'm surprised no one else is using this to prevent spam.
 
I would argue that you'd probably be better using a callback function. Regexes work better at positive matching than negative. You could write this much more clearly using a callback as you can reject a positive match for whatever URL definition you want to use. It'd likely be much clearer.

That said, with a simple list, this may work:
Code:
^(?>(?i)(?!www|http|\.com|\.net|\.org|\.biz|\.ru).)*$
Essentially, before it reads each character it ensures that it's not about to read a disallowed string. The "interesting" bits are:
  • The once-only subpattern flag, (?>...). This single-character approach is brutal for performance. Note that this still doesn't handle "giant" strings because of the recursion this method triggers. It did handle up to 88000 characters in my test. (Without the once-only flag, it basically crashes way earlier.)
  • The (?i) part enables case insensitivity for that part of the match.
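
To illustrate the callback idea mentioned at the top of this post, here is a loose sketch in Python rather than in the forum software's own callback API, which isn't shown in this thread. The function name, the return shape and the error text are all made up for illustration. It does a positive match for anything URL-like and rejects the value with a message the user can act on:

```python
import re
from typing import Tuple

# Hypothetical URL-ish detector: a scheme, a "www." prefix, or a common TLD.
URL_LIKE = re.compile(
    r"https?://|www\.|\.(?:com|net|org|biz|ru)\b",
    re.IGNORECASE,
)

def validate_field(value: str) -> Tuple[bool, str]:
    """Hypothetical callback: returns (is_valid, error_message)."""
    found = URL_LIKE.search(value)
    if found:
        # Tell the user exactly which part of the value was rejected.
        return False, f"Links are not allowed in this field (found '{found.group(0)}')."
    return True, ""

print(validate_field("Germany"))
print(validate_field("Check WWW.spam.example"))
```

Because this is a positive search, there is no per-character lookahead, and the definition of "URL" can be swapped out without rewriting a negated pattern.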
 
The problem with this approach is that it does not inform users what they are doing wrong. They get a vague error about fields that they do not understand. Therefore a valid user does not know how to fix it.
Another problem is that this does not only relate to new signups, but also affects valid users who edit their profile and find they cannot update their account for an unknown reason, even if their profile fields already had an internal URL (which happens a lot).
 
Isn't the point that you don't want them to know what they are doing wrong?

I have a custom field called "Country of Residence"

It is required. But spammers try to put a URL there as opposed to a country. I just want them to be disallowed, sad, and not registered as opposed to knowing what they did wrong.

This is really to deter quite amateurish human spammers, of which we had a lot and it has worked quite well.
 
Example:
Someone has been a contributing member of your forum for 5 years. In one of his profile fields there is a link to a thread he likes. This link has been there the entire time.
Then you add the above regex.
This valid member edits his profile and encounters a vague error which he doesn't understand. He can't save his profile anymore and doesn't know what to change.
The only thing he can do is leave your website. You lost a valid member.
 
But why would someone put a link in a field called "Country of Residence"?
 
I assume that is not your only field that spammers and legitimate members can add URLs to.

I use this regex only for fields that should not have a URL. I presumed that was the point of this thread.

I did it because I noticed spammers just put URLs wherever they can.
 
You are right that this could be used for fields that would never have a URL from valid users, like the location field.
However, it's not a solution for any other field that can get URLs from valid users, like the 'about' field.
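
That split could be expressed as a per-field setting. Here is a small Python sketch with hypothetical field identifiers and a deliberately crude URL test, just to show the idea of checking only the fields where a link never makes sense:

```python
import re

# Crude URL test for illustration only: a scheme or a "www." prefix.
URL_LIKE = re.compile(r"https?://|www\.", re.IGNORECASE)

# Hypothetical field identifiers; only these fields are checked for links.
NO_URL_FIELDS = {"location", "country_of_residence"}

def field_is_valid(field_id: str, value: str) -> bool:
    """Reject URL-like values only in fields that should never contain a link."""
    if field_id not in NO_URL_FIELDS:
        return True  # e.g. an "about" field may legitimately contain links
    return URL_LIKE.search(value) is None

print(field_is_valid("location", "Amsterdam"))
print(field_is_valid("location", "http://spam.example"))
print(field_is_valid("about", "My favourite thread: https://forum.example/t/1"))
```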
 
Exactly. I did it as a very primitive thing to catch them out.
 