
Block unwanted and spammy bots with robots.txt and speed up your website
Your website might be fast right now, but one day that could change. One day a spammy bot might stop by and hammer your website with requests that slow it down or even break it. This is the reality for many website owners, and most of them don't even know it is happening.
In this article, we will explore the robots.txt file and see how we can block unwanted and spammy bots. You will get a general understanding of how the robots.txt file works and why it's a good idea to use it on your website.
What is a robots.txt file?
In short: a robots.txt file is a set of instructions for the bots roaming around the internet. It is not written in HTML; instead, it contains simple directives such as "Allow" and "Disallow". To make the rules of your website more specific, you can address a particular user agent, which refers to a specific crawler, and tell it, for example, to stay away from certain routes or wildcard patterns.
The robots.txt file contains the rules of your website. Good bots are likely to follow these rules, while bad bots won't. In a later section of this article, we go through what characterizes a good bot and what characterizes a bad bot.
Here is a simple example of what a robots.txt file could look like. We disallow all user agents and then allow Googlebot to crawl our website:
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
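You can also scope the rules to individual routes or wildcard patterns, as mentioned above. Here is a small sketch where /private/ and the PDF pattern are just placeholders for whatever you don't want crawled; note that wildcard matching with * and $ is supported by major crawlers such as Googlebot, but not by every bot:

User-agent: *
Disallow: /private/
Disallow: /*.pdf$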
Why do we need a robots.txt file?
The robots.txt file contains a set of rules for your website and states which user agents they apply to. Without these rules, bots have no way of knowing how you want them to interact with your website, or whether there are routes you don't want indexed.
There are two types of bots out there. The good ones crawl your website to show it in search results, while the bad ones are spammy and might try to brute-force certain functions on your website.
Bots crawl your website to index it.
Most bots on the internet have the simple purpose of crawling every website to show it in search results on search engines such as Google, Bing, Yahoo, and DuckDuckGo. They find your website somewhere on the internet and go through every page to check whether your content is worth displaying in search results.
Bots don't understand error responses, and some perform brute-force attacks.
Not all bots visit your website just to crawl it. Some scan your website for vulnerabilities that can let others get access to your database or break your website completely. Some of these "bad" bots use methods like brute-force attacks to guess usernames and passwords. Not every bot out there comes in good faith, and those are the bots we would like to discourage from crawling our website. Even bots that do come in good faith often don't handle error responses such as 403, 404, and 500 gracefully. This can lead to spammy requests that slow down your website significantly.
This is how you add the robots.txt file to your website.
Adding a robots.txt file to your website is very easy. Start by creating an empty text file named robots.txt. Add the URL of your sitemap.xml if you have one, and insert your rules below it, as shown in the sketch below. On a static HTML website, you simply place the file in the root of your project. Once you have added the file, you should be able to reach it at yourwebsite.com/robots.txt. If you can open it in your browser, the robots can too.
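As a minimal sketch, the finished file could look like this. The sitemap URL and the /admin/ route are placeholders; replace them with your own:

Sitemap: https://yourwebsite.com/sitemap.xml

User-agent: *
Disallow: /admin/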
Here is how you can block unwanted bots using your robots.txt file.
The reason you are reading this article is that you want to know how to block unwanted bots so that they don't slow down your website, so here it comes. Blocking unwanted bots is easy and only requires a few lines of text. Our first example showed how to disallow every bot except a few, but instead we encourage you to go through the list below and add only the bots you actually want to block. Please don't copy/paste the whole section. Go through the user agents, pick out the ones you want to block, and decide which ones you would like to crawl your website (a short example of blocking just a couple of them follows the list).
Using robots.txt to block unwanted bots does not guarantee they will stay out, but hopefully they will follow the rules.
User-agent: AhrefsBot
User-agent: AhrefsSiteAudit
User-agent: adbeat_bot
User-agent: Alexibot
User-agent: AppEngine
User-agent: Aqua_Products
User-agent: archive.org_bot
User-agent: archive
User-agent: asterias
User-agent: b2w/0.1
User-agent: BackDoorBot/1.0
User-agent: BecomeBot
User-agent: BlekkoBot
User-agent: Blexbot
User-agent: BlowFish/1.0
User-agent: Bookmark search tool
User-agent: BotALot
User-agent: BuiltBotTough
User-agent: Bullseye/1.0
User-agent: BunnySlippers
User-agent: CCBot
User-agent: CheeseBot
User-agent: CherryPicker
User-agent: CherryPickerElite/1.0
User-agent: CherryPickerSE/1.0
User-agent: chroot
User-agent: Copernic
User-agent: CopyRightCheck
User-agent: cosmos
User-agent: Crescent
User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
User-agent: DittoSpyder
User-agent: dotbot
User-agent: dumbot
User-agent: EmailCollector
User-agent: EmailSiphon
User-agent: EmailWolf
User-agent: Enterprise_Search
User-agent: Enterprise_Search/1.0
User-agent: EroCrawler
User-agent: es
User-agent: exabot
User-agent: ExtractorPro
User-agent: FairAd Client
User-agent: Flaming AttackBot
User-agent: Foobot
User-agent: Gaisbot
User-agent: GetRight/4.2
User-agent: gigabot
User-agent: grub
User-agent: grub-client
User-agent: Go-http-client
User-agent: Harvest/1.5
User-agent: Hatena Antenna
User-agent: hloader
User-agent: http://www.SearchEngineWorld.com bot
User-agent: http://www.WebmasterWorld.com bot
User-agent: httplib
User-agent: humanlinks
User-agent: ia_archiver
User-agent: ia_archiver/1.6
User-agent: InfoNaviRobot
User-agent: Iron33/1.0.2
User-agent: JamesBOT
User-agent: JennyBot
User-agent: Jetbot
User-agent: Jetbot/1.0
User-agent: Jorgee
User-agent: Kenjin Spider
User-agent: Keyword Density/0.9
User-agent: larbin
User-agent: LexiBot
User-agent: libWeb/clsHTTP
User-agent: LinkextractorPro
User-agent: LinkpadBot
User-agent: LinkScan/8.1a Unix
User-agent: LinkWalker
User-agent: LNSpiderguy
User-agent: looksmart
User-agent: lwp-trivial
User-agent: lwp-trivial/1.34
User-agent: Mata Hari
User-agent: Megalodon
User-agent: Microsoft URL Control
User-agent: Microsoft URL Control - 5.01.4511
User-agent: Microsoft URL Control - 6.00.8169
User-agent: MIIxpc
User-agent: MIIxpc/4.2
User-agent: Mister PiX
User-agent: MJ12bot
User-agent: moget
User-agent: moget/2.1
User-agent: mozilla
User-agent: Mozilla
User-agent: mozilla/3
User-agent: mozilla/4
User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 2000)
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 95)
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 98)
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows NT)
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows XP)
User-agent: mozilla/5
User-agent: MSIECrawler
User-agent: naver
User-agent: NerdyBot
User-agent: NetAnts
User-agent: NetMechanic
User-agent: NICErsPRO
User-agent: Nutch
User-agent: Offline Explorer
User-agent: Openbot
User-agent: Openfind
User-agent: Openfind data gathere
User-agent: Oracle Ultra Search
User-agent: PerMan
User-agent: ProPowerBot/2.14
User-agent: ProWebWalker
User-agent: psbot
User-agent: Python-urllib
User-agent: QueryN Metasearch
User-agent: Radiation Retriever 1.1
User-agent: RepoMonkey
User-agent: RepoMonkey Bait & Tackle/v1.01
User-agent: RMA
User-agent: rogerbot
User-agent: scooter
User-agent: Screaming Frog SEO Spider
User-agent: searchpreview
User-agent: SEMrushBot
User-agent: SemrushBot
User-agent: SemrushBot-SA
User-agent: SEOkicks-Robot
User-agent: SiteSnagger
User-agent: sootle
User-agent: SpankBot
User-agent: spanner
User-agent: spbot
User-agent: Stanford
User-agent: Stanford Comp Sci
User-agent: Stanford CompClub
User-agent: Stanford CompSciClub
User-agent: Stanford Spiderboys
User-agent: SurveyBot
User-agent: SurveyBot_IgnoreIP
User-agent: suzuran
User-agent: Szukacz/1.4
User-agent: Teleport
User-agent: TeleportPro
User-agent: Telesoft
User-agent: Teoma
User-agent: The Intraformant
User-agent: TheNomad
User-agent: toCrawl/UrlDispatcher
User-agent: True_Robot
User-agent: True_Robot/1.0
User-agent: turingos
User-agent: Typhoeus
User-agent: URL Control
User-agent: URL_Spider_Pro
User-agent: URLy Warning
User-agent: VCI
User-agent: VCI WebViewer VCI WebViewer Win32
User-agent: Web Image Collector
User-agent: WebAuto
User-agent: WebBandit
User-agent: WebBandit/3.50
User-agent: WebCopier
User-agent: WebEnhancer
User-agent: WebmasterWorld Extractor
User-agent: WebmasterWorldForumBot
User-agent: WebSauger
User-agent: Website Quester
User-agent: Webster Pro
User-agent: WebStripper
User-agent: WebVac
User-agent: WebZip
User-agent: WebZip/4.0
User-agent: Wget
User-agent: Wget/1.5.3
User-agent: Wget/1.6
User-agent: WWW-Collector-E
User-agent: Xenu's
User-agent: Xenu's Link Sleuth 1.1c
User-agent: Zeus
User-agent: Zeus 32297 Webster Pro V2.9 Win32
User-agent: Zeus Link Scout
Disallow: /
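For example, if you only wanted to block MJ12bot and AhrefsBot from the list above while leaving every other bot alone, the group could look like this (swap in whichever user agents you picked):

User-agent: MJ12bot
User-agent: AhrefsBot
Disallow: /

Bots that don't match any group fall back to the default behaviour, which is to allow crawling.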
Wrapping up.
By reading this article, you should now have a better general understanding of what the robots.txt file is and how it works. You know how to write the file to block specific bots, and you know that there is no certainty it will work.
Just as we humans have laws and hope everyone will follow them, there is no certainty that everyone will. The same applies to the robots.txt file, and hopefully you will successfully block unwanted bots from crawling your website. A more certain way of blocking unwanted bots is to use your website's firewall to block their IP addresses, but that becomes a cat-and-mouse game.
Author
Michael Andersen
Michael Andersen is the author of Sustainable Web Design In 20 Lessons and the founder of Sustainable WWW (World-wide-web), an organization teaching sustainable practices. With a passion for web design and the environment, Michael solves puzzles to make the internet more sustainable.
