What is OpenAI’s latest web crawler: the GPTBot?

Since its release in November 2022, OpenAI’s ChatGPT has been revolutionising the way we search. The latest update to the technology came in August 2023, when OpenAI announced GPTBot, a web crawler that will scrape websites to add to its training data and help improve its future models.

Of course, this has sparked controversy. Allowing the GPTBot to crawl your site could boost AI models significantly and even potentially help you get traffic in future, but many are concerned about copyright, compensation and privacy issues. What can we make of this latest news from OpenAI and what should we do about it? This article will dive into the facts and help you figure out what to do for your site.

What is the GPTBot?

GPTBot is OpenAI’s new web crawler. Its goal is to collect publicly available data and information that will be used (according to OpenAI) to train future AI models and enhance their abilities – improving, for example, the results of a prompt in ChatGPT. In this way, GPTBot isn’t dissimilar to Google’s crawlers and fetchers, which also crawl the web to build a searchable index for Google’s search engine.

So, the GPTBot will be active on your site, scraping information to enhance its models. However, GPTBot won’t be able to access everything. OpenAI claims that the crawled pages are filtered “to remove sources that require paywall access, are known to gather personally identifiable information (PII) or have text that violates our policies”. In spite of these restrictions, you still may not want the bot using your information – especially since many critics argue that using data in this way without citation or compensation shouldn’t be permitted.

How to block or restrict the GPTBot

To restrict the bot, you first have to find it. You can identify GPTBot by its user agent token and its full user-agent string:

User agent token: GPTBot

Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
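If you want to see whether the bot is already visiting your site, one option is to look for the token in the User-Agent header of incoming requests or in your server logs. The snippet below is a minimal sketch in Python: the function names `is_gptbot` and `gptbot_version` are hypothetical helpers written for this illustration, not part of any OpenAI tooling. Bear in mind that a User-Agent header can be spoofed, so a match only tells you what a request claims to be.

```python
import re
from typing import Optional

# Matches the "GPTBot" token, optionally followed by "/<version>",
# as it appears in the full user-agent string quoted above.
GPTBOT_PATTERN = re.compile(r"\bGPTBot(?:/(?P<version>[\d.]+))?", re.IGNORECASE)


def is_gptbot(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent value claims to be GPTBot."""
    return GPTBOT_PATTERN.search(user_agent or "") is not None


def gptbot_version(user_agent: Optional[str]) -> Optional[str]:
    """Return the advertised GPTBot version, or None if absent."""
    match = GPTBOT_PATTERN.search(user_agent or "")
    return match.group("version") if match else None


if __name__ == "__main__":
    ua = (
        "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
        "compatible; GPTBot/1.0; +https://openai.com/gptbot)"
    )
    print(is_gptbot(ua), gptbot_version(ua))
```

The same check could be run over each line of an access log to count how often the crawler visits.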

To stop the GPTBot from crawling your site, just edit your robots.txt file by adding the following:

User-agent: GPTBot

Disallow: /
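If you manage robots.txt with a script rather than by hand, the edit can be automated. The following is a rough sketch in Python, assuming the file contents are already loaded as a string; `add_gptbot_block` is a hypothetical helper name. It appends the two-line GPTBot group only when no GPTBot group exists yet, so running it twice does not duplicate the rule.

```python
GPTBOT_BLOCK = "User-agent: GPTBot\nDisallow: /\n"


def has_gptbot_group(robots_txt: str) -> bool:
    """Return True if any User-agent line already names GPTBot."""
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split "key: value".
        key, _, value = line.split("#", 1)[0].partition(":")
        if key.strip().lower() == "user-agent" and value.strip().lower() == "gptbot":
            return True
    return False


def add_gptbot_block(robots_txt: str) -> str:
    """Append a disallow-all group for GPTBot unless one is already present."""
    if has_gptbot_group(robots_txt):
        # Leave existing GPTBot rules untouched for manual review.
        return robots_txt
    base = robots_txt.rstrip("\n")
    if not base:
        return GPTBOT_BLOCK
    # A blank line separates the new group from the rules before it.
    return base + "\n\n" + GPTBOT_BLOCK
```

Rules for other crawlers, such as a general `User-agent: *` group, are left exactly as they were.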

If you don’t want to block the bot completely, and would rather allow it access to specific parts of your site only, add these GPTBot rules to your robots.txt file instead:

User-agent: GPTBot

Allow: /directory-1/

Disallow: /directory-2/
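Before deploying either set of rules, it can help to check how a parser reads them. The sketch below uses Python's standard-library `urllib.robotparser` to test the full block and the partial rules against a few sample URLs. The `example.com` domain, the sample paths and the `is_allowed` helper are placeholders chosen for this illustration.

```python
from urllib.robotparser import RobotFileParser

# Rule set that blocks GPTBot from the whole site.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

# Rule set that allows one directory and blocks another.
PARTIAL = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/directory-1/page.html", "/directory-2/page.html", "/blog/"):
        url = "https://example.com" + path
        print(path, is_allowed(PARTIAL, "GPTBot", url))
```

Note that any path not matched by a rule stays crawlable by default, so the partial rules only keep GPTBot out of `/directory-2/`. Python's parser applies the first rule that matches, whereas some crawlers prefer the most specific matching rule; for non-overlapping rules like these the outcome should be the same, but it is worth testing more complex files with care.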

What are the pros and cons of the GPTBot?

A major advantage of allowing the GPTBot is that it will be able to access more accurate, reliable and up-to-date information to use in ChatGPT. In allowing the bot to crawl your site, you are contributing your expertise to that pool of information and improving the answers provided in AI search. If it works as planned, it will boost AI models, bolster their data pools and allow you and your site to shape and influence how future AI models will be trained.

On the other hand, as of right now there are very few immediate benefits to allowing GPTBot to crawl your site. While Google drives traffic to the sites it crawls, showing them in the search results as relevant and reliable information sources for a query, ChatGPT does not cite its sources. Instead, it simply summarises the data it finds from a range of different sources. This means that, for now, you won’t get any more organic traffic or clicks from letting the GPTBot crawl your site.

There are, however, future benefits. OpenAI claims that it will cite its sources in the future – and so in blocking it from accessing your content now you may be excluding yourself from this future traffic source. However, even here it’s important to note that we don’t yet know how OpenAI will present your content, and whether what you say could be taken out of context or used to argue a counterpoint to your actual perspective.

It is true that upcoming benefits may end up outweighing current concerns. Blocking or restricting the GPTBot could be a proactive way of protecting your content (especially if you have client assets that you need to protect). However, allowing the bot may prove useful for SEO in future, and could be worth enabling for that reason.

Should I let the GPTBot crawl my site?

It is clear that there are strong arguments on both sides. Controversy and concern are going head-to-head with developments and potential.

Although the concerns around citation, copyright and plagiarism are certainly valid, it would be rash to make a yes or no decision without having enough information on this new technology. AI may well become a new discovery and traffic source in the coming months and, in blocking these crawlers, you could be depriving yourself of these benefits. It is also already the place where many new users start their search journey, and you don’t want to be left behind or give your competition the edge by blocking it immediately. For now, consider simply restricting the bot: block it from your most sensitive pages to protect important data, and allow it to crawl some of your content.

The best thing you can do is give careful thought to your specific needs and priorities. For a personal consultation on how AI can work for your site and strategy, check out our AI Solutions service today or speak to our experts now. We’re happy to help.

Article originally posted by our Partner Agency Peak Ace.