Custom 404 Page by Siropu 1.2.0
It could be a bot/proxy that doesn't send any IP information.

How could that be? I'd have assumed it is unavoidable to send the source IP, given how TCP/IP works. Apart from that, the IPs are logged in the server's web.log when you grep it for the URL noted in the 404 add-on's log. So they are there.

Bots can manipulate the headers used to get the user IP so they are not reliable.

I'm using XF's getIp() method with $allowProxied set to true to get the IP, so unless convertIpStringToBinary (which is used to get the value stored for the IP) fails, what you see in the 404 logs is what you get.
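A minimal Python sketch (not XF's actual code) may illustrate why a proxy-aware IP lookup is unreliable: anything a client writes into X-Forwarded-For ends up as "the IP". The function name `client_ip` and its arguments are hypothetical, and the validation step only roughly mirrors what a check like convertIpStringToBinary would do:

```python
import ipaddress

def client_ip(remote_addr, forwarded_for=None, allow_proxied=True):
    """Illustrative analogue of a proxy-aware IP lookup like XF's getIp():
    prefer the X-Forwarded-For header when allowed, falling back to the
    socket address. A bot can put any string into that header, which is
    why the logged IP cannot be trusted."""
    if allow_proxied and forwarded_for:
        # Left-most entry is conventionally the original client
        candidate = forwarded_for.split(",")[0].strip()
        try:
            # Reject garbage, roughly what an IP-to-binary conversion does
            return str(ipaddress.ip_address(candidate))
        except ValueError:
            pass  # spoofed or garbled header, fall through to socket address
    return remote_addr

# A bot can claim to be anyone:
print(client_ip("203.0.113.7", forwarded_for="10.0.0.1"))   # -> 10.0.0.1
print(client_ip("203.0.113.7", forwarded_for="not-an-ip"))  # -> 203.0.113.7
```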


Had a further look into it, and apart from malicious bots, the Bing bot is also affected. E.g., I had this call in the log that ended in a 404:

(Screenshot attachment: Bildschirmfoto 2025-04-08 um 17.58.05.webp; the two log lines are reproduced below)
upper one:
msnbot-40-77-167-149.search.msn.com - - [08/Apr/2025:15:15:17 +0200] "GET /tags/start/ HTTP/1.1" 404 9472 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36" 315 10202

The URL "/tags/start/" exists (there is a tag "start"). It seems to be accessible only when you are logged in (which bingbot is not). If I call the URL when not logged in, I do indeed get a 404; a bit strange, I would have expected a 401 here. Clearly not your fault, rather a bug in XF that may negatively impact SEO ranking (as Bing will this way collect hundreds of 404s on my forum).


second one:
msnbot-40-77-167-76.search.msn.com - - [08/Apr/2025:14:26:41 +0200] "GET /threads/Threadurl.1338/Picture-Name.jpg HTTP/1.1" 404 9505 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36" 796 14614

There are hundreds of those. The requested picture does exist (it is embedded in the thread), but as full-size pictures are only accessible to logged-in users, the bot won't get the full-resolution picture. It would however have a different URL than the one the bot requested anyway; no idea where it got that from.

Anyway: This makes the log pretty useless, as it is cluttered with hundreds of entries caused by Bing that can only be identified manually by grepping through the web.log.
 
This makes the log pretty useless, as it is cluttered with hundreds of entries caused by Bing that can only be identified manually by grepping through the web.log.
Do you have any suggestions on how to improve it? Maybe identify legit bots and do not log those?
 
Do you have any suggestions on how to improve it? Maybe identify legit bots and do not log those?
Could be an idea. A low-hanging fruit would be to add the user agent to the log, possibly abbreviated, and in a next step to offer the option to filter one or a number of them in or out.

The reason is that one potentially wants to see whether indexing bots get a 404, as this might be important for SEO ranking, and maybe some of those 404s are fixable.

In a perfect world one would have clusters like bot/no known bot and, within the bots cluster, a classification into groups like wanted, unwanted/ignorable, or perhaps something like "search engine indexing bot". If one could create and name such clusters individually and add user agents to them manually, that would be even better.
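A rough Python sketch of that clustering idea, with entirely made-up cluster names and substring patterns (a real implementation would need a maintained bot list), could look like this:

```python
# Hypothetical user-defined clusters: names and patterns are illustrative only.
BOT_CLUSTERS = {
    "search-indexing": ["bingbot", "googlebot", "duckduckbot"],
    "ignorable":       ["ahrefsbot", "semrushbot"],
}

def classify(user_agent: str) -> str:
    """Assign a user agent to the first cluster whose pattern it contains,
    or to 'no-known-bot' if nothing matches."""
    ua = user_agent.lower()
    for cluster, needles in BOT_CLUSTERS.items():
        if any(needle in ua for needle in needles):
            return cluster
    return "no-known-bot"

ua = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
      "bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76")
print(classify(ua))  # -> search-indexing
```

The 404 log could then show (or filter on) the cluster name next to each entry instead of the full user agent string.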

Could be complicated, but on the other hand it might be an interesting idea to somehow interact with "Known Bots" by @Sim

I think the most important thing is to know on the spot what source a request resulting in a 404 has, in order to identify whether and how one wants to deal with that 404. Everything on top of that is a comfort feature that makes life easier and the tool more useful.
 
Could be complicated, but on the other hand it might be an interesting idea to somehow interact with "Known Bots" by @Sim

The good news is that my KnownBots add-on simply extends the core functionality, which already flags user sessions when it detects they are a bot. So if you want to indicate that somehow, the information is already there in the user session.
 
This URL appears as a 404 in the records, but it is not listed on Google. How does that happen? Does this add-on also count bots? I don't understand.
It counts XF's 404 (page not found) responses, so it clearly counts bots as well. You can easily cross-check with your web server log and will almost certainly find the call there (including the IP and user agent it came from).
 