As any WordPress site operator knows, managing robots is a never-ending task. Foreign search spiders, security scanners, SEO crawlers, and more visit your site, sometimes pretending to be legitimate visitors or even GoogleBot, and all of them are cause for concern. In this article, we will cover ways to secure your site against the latest robots and keep it running at top speed.
* Sorting Traffic Types
The first step is to identify which traffic you would like to allow and which to deny. For example, does your audience include visitors from around the world? If so, you may want to be indexed by Yandex or Baidu. If not, you can restrict bots to Google, Bing, and others popular with your target audience. Beyond search engine spiders, many companies, including Apple, Google, and Microsoft, run AI spiders that collect information to train the next generation of AI. Do you want your site used as part of that? For projects publishing open source libraries and documentation, it might be good to have your information incorporated. If you are the author of important industry research, you may not want it included without licensing.
The robots.txt, ai.txt, and similar files give you control over the folders and types of media you would like to make accessible to legitimate crawlers. We will cover more about these later.
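As a quick preview, a minimal robots.txt along those lines might look like the sketch below. It turns away Yandex and Baidu while leaving every other crawler alone; the user-agent tokens you block should match your own audience decisions.

# Turn away search engines your audience does not use
User-agent: Yandex
Disallow: /

User-agent: Baiduspider
Disallow: /

# Everyone else may crawl the whole site
User-agent: *
Disallow: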
* Legitimate vs Rogue Visitors
Once you have identified your target audience and allowable spiders, the next step is to begin filtering. Legitimate spiders from corporations such as Google or Microsoft will obey the robots.txt and ai.txt files mentioned above. Unfortunately, rogue scrapers will not, and that is where plugins such as this one help trap them:
https://wordpress.org/plugins/blackhole-bad-bots/
Plugins like this set traps for rogue spiders to fall into, such as declaring a directory named pictures or research-papers off limits. Legitimate robots obeying robots.txt will avoid these folders, while rogue scrapers will not, allowing your security system to identify them. They can then be blocked and filtered by IP, user-agent, and more.
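Conceptually, the trap amounts to a robots.txt rule like the one below. The /blackhole/ path is only a placeholder here; the plugin manages its own trap URL, so treat this as a sketch of the idea rather than the plugin's exact configuration.

User-agent: *
Disallow: /blackhole/

A well-behaved crawler reads that rule and never requests anything under /blackhole/, while a rogue scraper that ignores robots.txt follows a hidden link into the trap and immediately identifies itself.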
For other PHP-based websites, https://perishablepress.com/blackhole-bad-bots/ can provide a similar type of protection. The trap code is edited into index.php and any other PHP files that need protection.
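The rough idea behind that edit can be sketched in a few lines of PHP. This is not the Perishable Press code itself; the /blackhole/ trap path and the banned-ips.txt file name are placeholders, and a production version would also need logging and whitelisting of known good bots.

<?php
// blackhole-trap.php - simplified sketch of a blackhole-style bot trap.
// The /blackhole/ path and banned-ips.txt file are placeholder names.

$banFile   = __DIR__ . '/banned-ips.txt';   // plain-text list of trapped IPs
$visitorIp = $_SERVER['REMOTE_ADDR'] ?? '';
$requested = $_SERVER['REQUEST_URI'] ?? '';

// Load the list of IPs that previously fell into the trap.
$banned = is_file($banFile)
    ? (file($banFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: [])
    : [];

// The trap: robots.txt tells legitimate crawlers to stay out of /blackhole/,
// so any visitor requesting that path is recorded as a rogue scraper.
if (strpos($requested, '/blackhole/') === 0 && !in_array($visitorIp, $banned, true)) {
    file_put_contents($banFile, $visitorIp . PHP_EOL, FILE_APPEND | LOCK_EX);
    $banned[] = $visitorIp;
}

// Refuse service to anything already on the list.
if (in_array($visitorIp, $banned, true)) {
    http_response_code(403);
    exit('Access denied.');
}

Requiring this file near the top of index.php (for example, require __DIR__ . '/blackhole-trap.php';) applies the check to every request that page handles.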
* AI Scrapers and Your Site
Search engine spiders have been around for a long time, but they were generally welcomed because search traffic sent visitors to your website. The era of AI isn't quite the same: more and more searches are answered directly on Google without a click-through to your website, and AI summaries built from the content of your pages are being incorporated even on smaller search engines like Brave and DuckDuckGo.
What can you do? Using robots.txt and ai.txt together, you can control what these spiders are allowed to use and where they are allowed to go on your site. You may want to allow access to open source contributions while restricting access to industry research or videos. Note: this only works for legitimate spiders. For abusive spiders that ignore these files, the blackhole trap is still needed.
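To make that concrete, a robots.txt entry along these lines would let Google's AI training systems use your documentation while keeping them out of your research and video folders. The paths are examples only; substitute the directories that actually hold your open source material and your restricted content.

User-agent: Google-Extended
Allow: /docs/
Disallow: /research/
Disallow: /videos/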
* The robots.txt and ai.txt
These are plain text .txt files placed in the public_html docroot of your website. The following are examples that can be used to block all AI scraping and AI training spiders. Remove any types or spiders you would like to allow.
https://raw.githubusercontent.com/jackmcconnell/no-ai-scrapers/refs/heads/main/ai.txt
https://perishablepress.com/ultimate-ai-block-list/#block-ai-via-robots
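Each entry in those lists follows the standard robots.txt pattern: one or more User-agent lines naming a crawler, followed by a rule blocking it from the whole site. For example, the entry for OpenAI's GPTBot looks like this:

User-agent: GPTBot
Disallow: /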
To allow Google's AI training crawlers to access your website, for instance, you would start by removing the following four lines:
User-agent: Google Bard AI
User-agent: Google-CloudVertexBot
User-agent: Google-Extended
User-agent: GoogleOther
Note: GoogleBot, the search engine spider for Google Search, is not blocked by these entries; only AI training crawlers are.
For more thorough filtering, you can incorporate additional security measures, which we will cover in future posts. Until then, keep an eye on our blog for highlights on security and optimization updates!