Sat. Apr 20th, 2024

Google today announced that it has posted a Request for Comments to the Internet Engineering Task Force to formalize the Robots Exclusion Protocol (REP) specification, which has served as an informal internet standard for 25 years.

Google wrote on its blog:

“Together with the original author of the protocol, webmasters, and other search engines, we’ve documented how the REP is used on the modern web, and submitted it to the IETF. The proposed REP draft reflects over 20 years of real-world experience of relying on robots.txt rules, used both by Googlebot and other major crawlers, as well as about half a billion websites that rely on REP.”

In spite of its prevalence, REP never became an official Internet standard, and developers have interpreted the “ambiguous de-facto” protocol “somewhat differently over the years.” To eliminate this ambiguity, Google has documented how the REP is used on the modern web and submitted it to the Internet Engineering Task Force (IETF) for review.

Some of the important rules to be updated include:

  • Any URI-based transfer protocol can use robots.txt. It is no longer limited to HTTP and can be used for FTP or CoAP as well.
  • Developers must parse at least the first 500 kibibytes of a robots.txt file.
  • A new maximum caching time of 24 hours, or the cache directive value if one is available, which gives website owners the flexibility to update their robots.txt whenever they want.
  • When a robots.txt file becomes inaccessible due to server failures, known disallowed pages are not crawled for a reasonably long period of time.
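
To make the last three rules concrete, here is a minimal Python sketch of how a crawler might apply them. It is an illustration under stated assumptions, not code from Google or from the draft: every name in it (parseable_prefix, cache_lifetime, CachedRobots, still_blocked) is hypothetical, rules are reduced to simple path prefixes, and grace_seconds is left as a parameter because the rule only speaks of a "reasonably long period of time."

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_PARSE_BYTES = 500 * 1024          # 500 kibibytes
DEFAULT_CACHE_SECONDS = 24 * 60 * 60  # 24 hours


def parseable_prefix(body: bytes) -> bytes:
    """Return the leading part of a robots.txt body that must at least be parsed."""
    return body[:MAX_PARSE_BYTES]


def cache_lifetime(max_age: Optional[int]) -> int:
    """Use the cache directive value if the server sent one, else 24 hours."""
    if max_age is not None and max_age >= 0:
        return max_age
    return DEFAULT_CACHE_SECONDS


@dataclass
class CachedRobots:
    """Last successfully fetched rules (hypothetical, simplified to path prefixes)."""
    disallowed_prefixes: List[str]
    fetched_at: float  # seconds since the epoch


def still_blocked(path: str, cached: CachedRobots, now: float,
                  grace_seconds: float) -> bool:
    """While robots.txt is unreachable, keep honoring known disallow rules.

    grace_seconds is an assumption standing in for the unspecified
    "reasonably long period". Once it has elapsed, this sketch simply stops
    enforcing the stale rules; what a crawler should do then is not covered here.
    """
    if now - cached.fetched_at > grace_seconds:
        return False
    return any(path.startswith(prefix) for prefix in cached.disallowed_prefixes)
```

For example, cache_lifetime(None) falls back to 86,400 seconds, while cache_lifetime(3600) honors the one-hour value from the server. Keeping the grace period as a caller-supplied parameter avoids baking a guess about "reasonably long" into the sketch.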

As Google puts it, “we learned a lot about how webmasters write robots.txt files and corner cases that we had to cover for, and added what we learned over the years also to the internet draft when it made sense.”

Google has also open-sourced the library it uses to parse robots.txt files. As the GitHub repo puts it, “The library is released open-source to help developers build tools that better reflect Google’s robots.txt parsing and matching.”
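
As a rough illustration of what robots.txt parsing and matching involves, the sketch below uses Python's standard-library urllib.robotparser module. This is not Google's library, so its matching behavior may differ from Googlebot's. The bot name, rules, and URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Example rules (made up for illustration).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether a hypothetical crawler called "ExampleBot" may fetch each URL.
private_ok = parser.can_fetch("ExampleBot", "https://example.com/private/report.html")
index_ok = parser.can_fetch("ExampleBot", "https://example.com/index.html")

print(private_ok, index_ok)
```

Here the first URL falls under the Disallow rule and the second under the Allow rule. Differences between parsers in cases less clear-cut than this one are the kind of ambiguity the proposed standard aims to remove.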
