Block unwanted and spammy bots with robots.txt and speed up your website

Your website might be fast right now, but one day that could change. One day a spammy bot might stop by your website and decide to terrorize you with requests that will slow down your website or even break it. It's the reality for many website owners, and most of them don't even know that it is happening. 

In this article, we will explore the robots.txt file and see how we can block unwanted and spammy bots. You will get a general understanding of how the robots.txt file works and why it's a good idea to use it on your website.

What is a robots.txt file?

In short, a robots.txt file is a set of instructions for bots roaming around the internet. It is not HTML but plain text built from simple directives such as "Allow" and "Disallow". To make the rules of your website more specific, you can name a user agent, which identifies a specific crawler. The rules you give that user agent can allow or disallow certain routes, optionally using wildcards.

Here you can read more about what a robots.txt file is.

The robots.txt file contains the rules of your website. Good bots are likely to follow these rules, while bad bots won't. In a later section of this article, we go through what makes a bot good or bad.

Here is a simple example of what a robots.txt file could look like. We disallow all user agents and then allow Googlebot to crawl our website:

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
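
To see how a rule-following crawler might read the example above, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name is_allowed, the bot name SomeOtherBot, and the example.com URL are assumptions made for illustration; real crawlers use their own parsers and may treat edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The example rules from above, copied verbatim.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Googlebot has its own group, so it is allowed everywhere.
googlebot_ok = is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/blog/")

# Any other agent falls back to the "*" group and is refused.
other_ok = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/blog/")

print(googlebot_ok, other_ok)
```

Nothing enforces this check. A bot that never asks the question simply ignores the file, which is why robots.txt only works on bots that choose to behave.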

Why do we need a robots.txt file?

The robots.txt file contains a set of rules for your website and states which user agents each rule applies to. Without this set of rules, bots have no way of knowing how you want them to interact with your website or which routes you don't want to be indexed.

There are two types of bots out there. The good bots crawl your website to show it in search results, and the bad bots are spammy and might try to brute-force certain functions on your website.

Bots crawl your website to index it.

Most bots on the internet have the simple purpose of crawling every website to show them in search results on search engines such as Google, Bing, Yahoo, and DuckDuckGo. They find your website somewhere on the internet and go through every page to check if your content is worth displaying in search results.

Some bots do not handle redirects and errors well, and some run brute-force attacks.

Not all bots visit your website just to crawl it. Some scan your website for vulnerabilities that could let others access your database or break your website completely. Some of these "bad" bots use methods like brute-force attacks to guess usernames and passwords. Not every bot out there comes in good faith, and those are the bots that we would like to discourage from crawling our website. Bots can also come in good faith, but many of them don't handle redirects or error responses such as 403, 404, and 500 properly. This can lead to spammy requests that will slow down your website significantly.

This is how you add the robots.txt file to your website.

Adding a robots.txt file to your website is very easy. You start by creating an empty text file named robots.txt. After that, you add the URL of your sitemap.xml, if you have one, and insert the rules below it. To add this file to a static HTML website, you simply place it in the root of your project. Once you have added the file, you should be able to reach it at this path: yourwebsite.com/robots.txt. If you can open it from your browser, it means the robots can too.
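
The steps above can be sketched as a small script for a static site. This is only an illustration under stated assumptions: the function names write_robots_txt and read_sitemaps are hypothetical, the /private/ rule is a made-up example, and site_maps() requires Python 3.8 or newer.

```python
from pathlib import Path
from typing import List, Optional
from urllib.robotparser import RobotFileParser


def write_robots_txt(site_root: str, rules: str,
                     sitemap_url: Optional[str] = None) -> Path:
    """Write robots.txt into the site root.

    The Sitemap line (if any) goes first and the rules go below it.
    """
    lines = []
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
        lines.append("")  # blank line between the sitemap and the rules
    lines.append(rules.strip())

    target = Path(site_root) / "robots.txt"
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target


def read_sitemaps(robots_path: Path) -> List[str]:
    """Parse the written file and return the sitemap URLs it declares."""
    parser = RobotFileParser()
    parser.parse(robots_path.read_text(encoding="utf-8").splitlines())
    return parser.site_maps() or []
```

Calling write_robots_txt("public", "User-agent: *\nDisallow: /private/", "https://yourwebsite.com/sitemap.xml") would create public/robots.txt, assuming a folder named public is the root that your host serves.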

Here is how you can block unwanted bots using your robots.txt file.

The reason you are reading this article is that you want to know how you can block unwanted bots so that they don't slow down your website, and here it comes. Blocking unwanted bots is easy, and it only requires you to write a few lines of text. In our first example, you saw how to do that, but instead of simply disallowing every bot except a few, we encourage you to go through the list below and specifically add those you would want to block. Please don't copy/paste the whole section. Go through the user agents and pick out the ones you want to block and which ones you would like to crawl your website.

Using the robots.txt file to block unwanted bots does not guarantee that they stay out, but hopefully, they will follow the rules.

User-agent: AhrefsBot
User-agent: AhrefsSiteAudit
User-agent: adbeat_bot
User-agent: Alexibot
User-agent: AppEngine
User-agent: Aqua_Products
User-agent: archive.org_bot
User-agent: archive
User-agent: asterias
User-agent: b2w/0.1
User-agent: BackDoorBot/1.0
User-agent: BecomeBot
User-agent: BlekkoBot
User-agent: Blexbot
User-agent: BlowFish/1.0
User-agent: Bookmark search tool
User-agent: BotALot
User-agent: BuiltBotTough
User-agent: Bullseye/1.0
User-agent: BunnySlippers
User-agent: CCBot
User-agent: CheeseBot
User-agent: CherryPicker
User-agent: CherryPickerElite/1.0
User-agent: CherryPickerSE/1.0
User-agent: chroot
User-agent: Copernic
User-agent: CopyRightCheck
User-agent: cosmos
User-agent: Crescent
User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
User-agent: DittoSpyder
User-agent: dotbot
User-agent: dumbot
User-agent: EmailCollector
User-agent: EmailSiphon
User-agent: EmailWolf
User-agent: Enterprise_Search
User-agent: Enterprise_Search/1.0
User-agent: EroCrawler
User-agent: es
User-agent: exabot
User-agent: ExtractorPro
User-agent: FairAd Client
User-agent: Flaming AttackBot
User-agent: Foobot
User-agent: Gaisbot
User-agent: GetRight/4.2
User-agent: gigabot
User-agent: grub
User-agent: grub-client
User-agent: Go-http-client
User-agent: Harvest/1.5
User-agent: Hatena Antenna
User-agent: hloader
User-agent: http://www.SearchEngineWorld.com bot
User-agent: http://www.WebmasterWorld.com bot
User-agent: httplib
User-agent: humanlinks
User-agent: ia_archiver
User-agent: ia_archiver/1.6
User-agent: InfoNaviRobot
User-agent: Iron33/1.0.2
User-agent: JamesBOT
User-agent: JennyBot
User-agent: Jetbot
User-agent: Jetbot/1.0
User-agent: Jorgee
User-agent: Kenjin Spider
User-agent: Keyword Density/0.9
User-agent: larbin
User-agent: LexiBot
User-agent: libWeb/clsHTTP
User-agent: LinkextractorPro
User-agent: LinkpadBot
User-agent: LinkScan/8.1a Unix
User-agent: LinkWalker
User-agent: LNSpiderguy
User-agent: looksmart
User-agent: lwp-trivial
User-agent: lwp-trivial/1.34
User-agent: Mata Hari
User-agent: Megalodon
User-agent: Microsoft URL Control
User-agent: Microsoft URL Control - 5.01.4511
User-agent: Microsoft URL Control - 6.00.8169
User-agent: MIIxpc
User-agent: MIIxpc/4.2
User-agent: Mister PiX
User-agent: MJ12bot
User-agent: moget
User-agent: moget/2.1
User-agent: mozilla
User-agent: Mozilla
User-agent: mozilla/3
User-agent: mozilla/4
User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 2000)
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 95)
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 98)
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows NT)
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows XP)
User-agent: mozilla/5
User-agent: MSIECrawler
User-agent: naver
User-agent: NerdyBot
User-agent: NetAnts
User-agent: NetMechanic
User-agent: NICErsPRO
User-agent: Nutch
User-agent: Offline Explorer
User-agent: Openbot
User-agent: Openfind
User-agent: Openfind data gathere
User-agent: Oracle Ultra Search
User-agent: PerMan
User-agent: ProPowerBot/2.14
User-agent: ProWebWalker
User-agent: psbot
User-agent: Python-urllib
User-agent: QueryN Metasearch
User-agent: Radiation Retriever 1.1
User-agent: RepoMonkey
User-agent: RepoMonkey Bait & Tackle/v1.01
User-agent: RMA
User-agent: rogerbot
User-agent: scooter
User-agent: Screaming Frog SEO Spider
User-agent: searchpreview
User-agent: SEMrushBot
User-agent: SemrushBot
User-agent: SemrushBot-SA
User-agent: SEOkicks-Robot
User-agent: SiteSnagger
User-agent: sootle
User-agent: SpankBot
User-agent: spanner
User-agent: spbot
User-agent: Stanford
User-agent: Stanford Comp Sci
User-agent: Stanford CompClub
User-agent: Stanford CompSciClub
User-agent: Stanford Spiderboys
User-agent: SurveyBot
User-agent: SurveyBot_IgnoreIP
User-agent: suzuran
User-agent: Szukacz/1.4
User-agent: Teleport
User-agent: TeleportPro
User-agent: Telesoft
User-agent: Teoma
User-agent: The Intraformant
User-agent: TheNomad
User-agent: toCrawl/UrlDispatcher
User-agent: True_Robot
User-agent: True_Robot/1.0
User-agent: turingos
User-agent: Typhoeus
User-agent: URL Control
User-agent: URL_Spider_Pro
User-agent: URLy Warning
User-agent: VCI
User-agent: VCI WebViewer VCI WebViewer Win32
User-agent: Web Image Collector
User-agent: WebAuto
User-agent: WebBandit
User-agent: WebBandit/3.50
User-agent: WebCopier
User-agent: WebEnhancer
User-agent: WebmasterWorld Extractor
User-agent: WebmasterWorldForumBot
User-agent: WebSauger
User-agent: Website Quester
User-agent: Webster Pro
User-agent: WebStripper
User-agent: WebVac
User-agent: WebZip
User-agent: WebZip/4.0
User-agent: Wget
User-agent: Wget/1.5.3
User-agent: Wget/1.6
User-agent: WWW-Collector-E
User-agent: Xenu's
User-agent: Xenu's Link Sleuth 1.1c
User-agent: Zeus
User-agent: Zeus 32297 Webster Pro V2.9 Win32
User-agent: Zeus Link Scout
Disallow: /
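
If you hand-pick a few user agents from the list above, a short script can build the block for you and check it before you publish it. This is a minimal sketch: the function name build_block_group and the three selected agents are assumptions chosen for illustration, and the check uses Python's standard urllib.robotparser, which may not match every crawler's own parser.

```python
from urllib.robotparser import RobotFileParser


def build_block_group(user_agents):
    """Return one robots.txt group that disallows everything for the given agents."""
    lines = [f"User-agent: {agent}" for agent in user_agents]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


# Hypothetical hand-picked selection from the list above.
SELECTED = ["AhrefsBot", "SemrushBot", "MJ12bot"]

robots_txt = build_block_group(SELECTED)

# Sanity check: a selected bot is refused, an unlisted bot is not affected.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

blocked = parser.can_fetch("MJ12bot", "https://yourwebsite.com/")
allowed = parser.can_fetch("Googlebot", "https://yourwebsite.com/")

print(robots_txt)
print("MJ12bot may crawl:", blocked)
print("Googlebot may crawl:", allowed)
```

Because there is no "User-agent: *" group in this output, any bot you did not list remains free to crawl the site.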

Wrapping up.

By reading this article, you should now have a better general understanding of what the robots.txt file is and how it works. You know how to write the file to block specific bots, and you know that there is no certainty it will work.

We humans have laws and hope everyone will follow them, but there is no certainty that everyone will. The same applies to the robots.txt file, and hopefully, you will successfully block unwanted bots from crawling your website. A more certain way of blocking unwanted bots is to use your website's firewall to block them by IP address, but that would be a cat-and-mouse game.