Known Bots

Known Bots 6.0.3

The lists can be used to block by user agent at the web server. Those two places are just good sources of user agent info.
Yeah, I just... pictured in my mind a hundred guys copying the .txt files and adding them to their robots.txt file, thinking that'll work.
 
What user agent do those bots use?

To be clear: this addon does not and cannot identify robots based on IP address - it only ever considers the user agent. I won't be changing that functionality in this addon.

If you have bots which are masking their user agent, then you should just block them at the web server level - I have a simple deny list in nginx to which I add bad IP addresses.
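
For anyone wondering what that amounts to in practice, here's a rough sketch of the idea in Python - the real thing is just nginx deny rules (not shown here), and the IPs and user agent strings below are made-up examples:

Code:
# Illustrative only - deny by IP address or by user agent substring.
DENIED_IPS = {"203.0.113.7", "198.51.100.22"}      # example addresses
DENIED_UA_SUBSTRINGS = ["MJ12bot", "AhrefsBot"]    # example user agents

def should_block(remote_ip: str, user_agent: str) -> bool:
    """Return True if the request should be rejected outright."""
    if remote_ip in DENIED_IPS:
        return True
    ua = user_agent.lower()
    return any(s.lower() in ua for s in DENIED_UA_SUBSTRINGS)

print(should_block("203.0.113.7", "Mozilla/5.0"))                      # True
print(should_block("192.0.2.1", "Mozilla/5.0 (compatible; MJ12bot)"))  # True
print(should_block("192.0.2.1", "Mozilla/5.0"))                        # False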
These bots are from Huawei's new search engine called "Aspiegel".

Name: AspiegelBot
User Agent Key: AspiegelBot
User Agents: Mozilla/5.0 (compatible; AspiegelBot)
Web Address: https://aspiegel.com/about
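
Not the addon's actual internals - just a sketch of how an entry like this could be represented and matched against its user agent key:

Code:
# Hypothetical representation of the entry above; detection here is a
# simple case-insensitive substring match on the user agent key.
ASPIEGEL = {
    "name": "AspiegelBot",
    "user_agent_key": "AspiegelBot",
    "web_address": "https://aspiegel.com/about",
}

def matches(entry: dict, user_agent: str) -> bool:
    return entry["user_agent_key"].lower() in user_agent.lower()

print(matches(ASPIEGEL, "Mozilla/5.0 (compatible; AspiegelBot)"))  # True
print(matches(ASPIEGEL, "Mozilla/5.0 (Windows NT 10.0; Win64)"))   # False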
 
We have been filtering through our user agent logs and identifying bots that are scraping our website, in an effort to get a bigger picture of who complies with our robots.txt and who doesn't.

I have found a few more spiders that don't appear to be in your list:

Source: TikTok Parent Company - New Chinese Search Engine
Name: Bytespider (by Bytedance)
User Agent Key: Bytespider and Bytespider;bytespider@bytedance.com
User Agents: "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.3754.1902 Mobile Safari/537.36; Bytespider",
"Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.4454.1745 Mobile Safari/537.36; Bytespider",
"Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.7597.1164 Mobile Safari/537.36; Bytespider;bytespider@bytedance.com",
"Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2988.1545 Mobile Safari/537.36; Bytespider",
"Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.4141.1682 Mobile Safari/537.36; Bytespider",
"Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.3478.1649 Mobile Safari/537.36; Bytespider",
"Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.5267.1259 Mobile Safari/537.36; Bytespider",
"Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.7990.1979 Mobile Safari/537.36; Bytespider",
"Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.2268.1523 Mobile Safari/537.36; Bytespider",
"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2576.1836 Mobile Safari/537.36; Bytespider",
"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.9681.1227 Mobile Safari/537.36; Bytespider",
"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.6023.1635 Mobile Safari/537.36; Bytespider",
"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.4944.1981 Mobile Safari/537.36; Bytespider",
"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.3613.1739 Mobile Safari/537.36; Bytespider",
"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.4022.1033 Mobile Safari/537.36; Bytespider",
"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.3248.1547 Mobile Safari/537.36; Bytespider",
"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.5527.1507 Mobile Safari/537.36; Bytespider",
"Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.5216.1326 Mobile Safari/537.36; Bytespider",
"Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.9038.1080 Mobile Safari/537.36; Bytespider"
Web Address: https://bytedance.com/

I have also noticed the Semrush bot is falling through the filter when user agents are being identified. I have not looked extensively into your code to see whether you are wildcarding around the agent string, but if not, I would suggest adding the variants of the Semrush user agent, as it is still appearing in our guests list.

Some of the other Semrush UAs are as follows (a quick check against them is sketched after the list):
  • SemrushBot-SA
  • SemrushBot-BA
  • SemrushBot-BL
  • SemrushBot-SI
  • SemrushBot-SWA
  • SemrushBot-CT
  • SemrushBot-BM
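
As a quick sanity check (illustrative only), a single case-insensitive "SemrushBot" pattern already covers every variant above, so no per-variant entries should be needed:

Code:
import re

SEMRUSH = re.compile(r"semrushbot", re.IGNORECASE)

variants = [
    "SemrushBot-SA", "SemrushBot-BA", "SemrushBot-BL", "SemrushBot-SI",
    "SemrushBot-SWA", "SemrushBot-CT", "SemrushBot-BM",
]
print(all(SEMRUSH.search(v) for v in variants))  # True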
Great work, by the way! Thank you for supporting the community - hope you are having a great Easter (if that is your thing)!

Edit: Realized I forgot to list a resource that could be useful in identifying robots that may be missing (https://github.com/monperrus/crawler-user-agents/blob/master/crawler-user-agents.json)
 
I have found a few more spiders that don't appear to be in your list:

Name: Bytespider (by Bytedance)
Web Address: https://bytedance.com/

Thanks @VersoBit - I have seen this bot in my web server logs recently too; it has been added to the list.

I have also noticed the Semrush bot is falling through the filter when user agents are being identified ... I would suggest adding the variants of the Semrush user agent (SemrushBot-SA, -BA, -BL, -SI, -SWA, -CT, -BM).

SemrushBot should be detected correctly - all of the user agent strings I've found in my own logs for them are correctly detected, as far as I can tell.

I've added a new tool in v2.7.0 of the addon to help with troubleshooting bot detection - you simply paste in a user agent string and it will tell you which bot it detects (if any):

[Screenshot: pasting a user agent string into the new bot detection test tool in the admin UI]

Edit: Realized I forgot to list a resource that could be useful in identifying robots that may be missing (https://github.com/monperrus/crawler-user-agents/blob/master/crawler-user-agents.json)

FYI - I've decided not to bulk-add bots based on some list which may be full of dead bots, but instead will continue to add bots as they are detected in web server logs.

To this end, I have it on my todo list to build a tool which helps identify bots and logs them to the database for analysis - I plan on making it easy for other admins to send this data to me so I can include new bot definitions in this addon.

My algorithm will be pretty basic - for every session created:
  1. check whether the system detects the user agent as a bot, if so ignore it (already a known bot)
  2. for user agents not already detected as a bot, do a simple match to see if one of the strings "bot" | "spider" | "crawl" is found, if not, ignore it (probably not a bot)
  3. for user agents containing one of those strings that aren't already detected as bots, log them to the database for analysis
  4. provide a UI for admins to copy and paste the list of user agents into a private message to send to me
  5. allow old user agents to be purged from the system occasionally
Theoretically it should only be writing to the database for bots, not humans - so performance shouldn't be adversely affected. It will only write new user agent strings to the database - ones it hasn't already logged and that we don't already recognise as bots - so I expect the volume will not be high. The search strings aren't guaranteed to find everything, but it's much better than nothing.
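
A rough sketch of that flow in Python - the callables passed in are placeholders for the addon's own detection, cache lookup and database write, not real APIs:

Code:
import re

KEYWORDS = re.compile(r"bot|spider|crawl", re.IGNORECASE)

def on_session_created(user_agent, is_known_bot, already_logged, log_for_analysis):
    if is_known_bot(user_agent):
        return                    # step 1: already a known bot - ignore
    if not KEYWORDS.search(user_agent):
        return                    # step 2: probably not a bot - ignore
    if already_logged(user_agent):
        return                    # only new strings hit the database
    log_for_analysis(user_agent)  # step 3: store for later analysis

# Tiny demo with stand-in callables:
seen = set()
on_session_created(
    "Mozilla/5.0 (compatible; ExampleCrawler/1.0)",
    is_known_bot=lambda ua: "AspiegelBot" in ua,
    already_logged=lambda ua: ua in seen,
    log_for_analysis=seen.add,
)
print(seen)  # {'Mozilla/5.0 (compatible; ExampleCrawler/1.0)'}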
 
Code:
3.124.4.30 - - [17/May/2020:22:09:18 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
3.234.221.231 - - [17/May/2020:22:09:18 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
18.141.137.82 - - [17/May/2020:22:10:11 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
54.198.215.110 - - [17/May/2020:22:10:12 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
46.137.192.104 - - [17/May/2020:22:10:20 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
18.156.198.249 - - [17/May/2020:22:10:23 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
54.211.213.171 - - [17/May/2020:22:11:03 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
54.93.236.46 - - [17/May/2020:22:11:04 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
18.141.137.82 - - [17/May/2020:22:11:10 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
52.221.192.72 - - [17/May/2020:22:11:16 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
35.158.137.29 - - [17/May/2020:22:12:03 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
3.235.79.241 - - [17/May/2020:22:12:07 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
54.251.78.22 - - [17/May/2020:22:12:11 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
52.221.192.72 - - [17/May/2020:22:12:14 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
18.141.174.37 - - [17/May/2020:22:13:08 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
18.156.163.52 - - [17/May/2020:22:13:08 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
13.229.207.209 - - [17/May/2020:22:13:15 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
34.238.137.89 - - [17/May/2020:22:13:23 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
3.229.124.106 - - [17/May/2020:22:14:08 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
54.151.207.202 - - [17/May/2020:22:14:09 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
13.229.207.209 - - [17/May/2020:22:14:14 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
3.123.0.61 - - [17/May/2020:22:14:20 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
18.196.203.61 - - [17/May/2020:22:15:09 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
54.157.246.127 - - [17/May/2020:22:15:10 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
18.141.159.13 - - [17/May/2020:22:15:11 -0500] "GET / HTTP/1.1" 200 15441 "-" "HetrixTools Uptime Monitoring Bot. https://hetrix.tools/uptime-monitoring-bot.html X-Middleton/1"
There you go :)
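
If it helps anyone doing the same kind of log filtering, here's a rough way to pull the bot-looking user agents out of an access log like the one above (illustrative only; assumes combined log format where the user agent is the final quoted field):

Code:
import re
from collections import Counter

UA_FIELD = re.compile(r'"([^"]*)"\s*$')
KEYWORDS = re.compile(r"bot|spider|crawl", re.IGNORECASE)

def bot_user_agents(log_path: str) -> Counter:
    """Count user agents mentioning bot/crawl/spider in an access log."""
    counts = Counter()
    with open(log_path) as fh:
        for line in fh:
            m = UA_FIELD.search(line.strip())
            if m and KEYWORDS.search(m.group(1)):
                counts[m.group(1)] += 1
    return counts

# e.g. bot_user_agents("access.log").most_common(10)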
 
My algorithm will be pretty basic - for every session created:
  1. check whether the system detects the user agent as a bot, if so ignore it (already a known bot)
  2. for user agents not already detected as a bot, do a simple match to see if one of the strings "bot" | "spider" | "crawl" is found, if not, ignore it (probably not a bot)
  3. for user agents containing one of those strings that aren't already detected as bots, log them to the database for analysis
  4. provide a UI for admins to copy and paste the list of user agents into a private message to send to me
  5. allow old user agents to be purged from the system occasionally

I was going to suggest you do something similar to this :) (here is a relevant source for your idea of checking bot|spider|crawl: https://webmasters.stackexchange.co...t-in-any-regular-browser-contain-bot-or-crawl)

The only thing I would change is to make #4 into something like an admin option: "Allow automatic reporting of suspected robots to plugin developer".
This reporting would then be an async job, run at certain intervals, which posts the list of recently suspected bots to your own API.

Then what I would like to see is the plugin itself doing a nightly pull of the known-bots list from your API, so that forum owners don't have to update the plugin just to get the latest known-bot list.
i.e. I should only have to update a plugin when there is new feature/bugs fixed ;)
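
Something along these lines, perhaps - the endpoint and payload here are completely hypothetical, just to show the shape of the nightly-pull idea:

Code:
import json
import urllib.request

# Hypothetical feed URL - not an API that actually exists.
FEED_URL = "https://example.com/known-bots/definitions.json"

def pull_known_bots(url=FEED_URL):
    """Fetch the latest bot definitions as a list of dicts."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))

# A scheduled (cron) job would call pull_known_bots() nightly and merge
# the result into the locally cached definitions.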
 
Also, I would actually want another option as well:
- "Aggressive mode" where any user agent containing "bot|crawl|spider" is automatically tagged as a robot.
 
Is there no way to add additional bot detection strings? Even if you didn't want to take the time to add something to the UI, perhaps a flat file that users could edit, which would then be merged into the settings but wouldn't be overwritten by future updates?
 
The bot detection strings are hard coded for performance reasons.

With a bit of care, we could build a user-updateable system which does not add any performance impact, but I don't have time to do so right now.

For now, I'll continue to manually add them and release new versions with the latest updates.

So if you have identified bots, please post them here and I'll add them.

Everyone benefits from new bots being added to the list.
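
For anyone curious what such a user-updateable layer might look like, here's a purely hypothetical sketch (not something the addon does today - the extra-patterns file is invented for illustration):

Code:
from pathlib import Path

BUILT_IN_KEYS = ["AspiegelBot", "Bytespider", "SemrushBot"]  # examples only
EXTRA_FILE = Path("extra_bot_keys.txt")  # hypothetical user-editable overlay

def load_detection_keys():
    """Built-in keys plus any extra strings from the flat file."""
    keys = list(BUILT_IN_KEYS)
    if EXTRA_FILE.exists():
        keys += [
            line.strip()
            for line in EXTRA_FILE.read_text().splitlines()
            if line.strip() and not line.startswith("#")
        ]
    return keys

def detect_bot(user_agent):
    ua = user_agent.lower()
    for key in load_detection_keys():
        if key.lower() in ua:
            return key
    return None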
 
With a bit of care, we could build a user-updateable system which does not add any performance impact, but I don't have time to do so right now.
With the license you've provided it under, I don't think anyone can complain.

Thanks for the add on! :)
 
Sim updated Known Bots with a new update entry:

v3.1.0 major update

v3.0.0 update (unreleased)

Major new feature: add generic bot detection

User agent strings are scanned for the keywords "bot", "crawl", or "spider" - any user agents not already detected as a bot that contain one of these strings are stored in the cache and made visible through the admin UI, with the option to have this information emailed on a weekly basis.

new bots: AccompanyBot; PostmanRuntime

v3.1.0 update (took less than 5 minutes after installing it on several of my...

Read the rest of this update entry...
 