ChatGPT came for your data. Is it coming for your traffic next?

February 2023

ChatGPT seems to have caught the interest of the online world, as new uses for the conversational AI are discovered almost constantly. One application that holds the potential to bring significant disruption is the integration of ChatGPT into search technology.

Hailed as a possible “Google Killer”, ChatGPT-powered search is being seen by many as a welcome contender to break Google’s near-monopoly in search. Whilst I’d personally welcome the return of true competition in search, I am concerned that publishers may be welcoming a Trojan horse through the gates with these large language models.

Can ChatGPT really replace search?

Whilst I have a healthy dose of scepticism about the accuracy of the current model, there are times when having ChatGPT open as a research assistant whilst I write is already a huge time-saver. For informational searches, where recency of data and absolute accuracy are not critical, it is already a useful tool. For more transactional searches it is currently far less practical, but things move fast in the world of large language models.

Another challenge to conversational AI taking market share from keyword-based search is one of computational power. Large language models like ChatGPT require a significant amount of computational power to interpret a request and form a reply, beyond what a keyword-based search like Google requires. This could become a limiting factor as the adoption of these systems grows.

Despite these challenges, it seems increasingly likely that we’ll see ChatGPT capabilities added to Microsoft’s Bing search engine before long. At the time of writing, Microsoft has committed to invest an eye-watering $10 billion in OpenAI, the company behind ChatGPT, and clearly expects a return on that investment. The tech rumour mill sees a ChatGPT-powered Bing as a key part of that, with some optimistic predictions that this could be released as early as March 2023.

What would a ChatGPT-powered Bing look like?

No one outside of Microsoft and OpenAI really knows what the end product of their collaboration will look like, but there is enough speculation around the topic that I may as well join in. Personally, I suspect that the goal is more about on-device and in-application assistants than a service where users visit bing.com and tell ChatGPT what they’re looking for.

I anticipate that the integration of GPT into search technology will serve as an interim solution. This may take the form of Google-style rich snippets, where the results from the conversational AI are displayed alongside traditional search results. The conversational AI could also offer extra help in response to specific search queries, much like a virtual assistant. For those who remember the old Microsoft Office assistant “Clippy,” this integration could be seen as a more advanced version of that technology.

The comparison with Google’s rich snippets is a particularly interesting one to me, as it highlights what I see as the big risk that conversational AI poses to publishers: it’s going to steal your content and then use that to steal your traffic.

Beware of Geeks bearing gifts

When search engines look to enhance basic search results it rarely ends well for publishers. Those rich results are great for end users, as they can answer their queries straight from the search result page. Reviews, recipes, FAQ answers, definitions, products, places, TV & movies – the subjects of rich results are already huge and each one takes traffic away from publishers.

Studies into these so-called “zero-click searches” estimate that anything from 25% to 50% of searches are already answered on the results page, without the user ever visiting the source of the information. After all, why click and load a page if you can already see what you were looking for?

Large language models, like ChatGPT, have the potential to take another very significant chunk out of publisher traffic, and of course, they are doing it by ingesting publisher data. That means that every publisher putting unique original information, views, analysis, or opinions out on the web is helping to train the tools that will later take their traffic.

Let’s consider a specific example to demonstrate the issue. I’ve recently been discussing these issues with a publisher of automotive content. Some of their most popular content is original reviews of new vehicles, which they send specialist journalists to test and write about at considerable expense. The resulting content is a blend of facts, opinions, data, and each writer’s own style, all of which are ingested to feed these language models. Ask ChatGPT to discuss those cars and that expensive, original first-hand experience forms part of its response.

Luckily for this particular publisher, ChatGPT’s responses are currently quite poor. The data used to train ChatGPT cuts off in 2021, so the current version has limited knowledge of anything that happened after that (including updates to cars and new models). It’s a fast-moving space, though, and when you consider that ChatGPT could not even do basic addition a few weeks ago, it isn’t safe to assume that any failings will be long-term.

How serious an issue is this?

In truth, we don’t yet know. Some will argue that Google has been using our own content against us for years, so Bing joining in won’t make much difference. I suspect that, in the short term at least, they are right.

However, this is a trend that is currently only moving in one direction. Google is certainly not sitting by and watching the conversational AI space develop without them either. They’ve been investing in their own large language model, LaMDA, for years, and news of a possible GPT-based search competitor has apparently put the company on “code red” to release their own solution. The possibility of a step change in the number of zero-click searches from Google may be a more immediate concern for publishers.

What can publishers do about this?

I hate to be the bearer of bad news, but the practical answer is “nothing”. These large language models take their data from multiple sources. There is no robots.txt directive to stop these systems from retrieving and training themselves on your content and no meta-tag to stop them spinning and regurgitating your IP.

Blocking CCBot, the spider used for collecting data for Common Crawl, has been suggested as a solution by some individuals. However, this approach has limited effectiveness. Although Common Crawl is used as one of the data sources for training ChatGPT, it is not the sole source, so blocking CCBot may provide some improvement but cannot be considered a reliable method.

Furthermore, OpenAI has yet to publish the complete dataset that was used for training ChatGPT (it did publish details of the smaller GPT-3 dataset, but ChatGPT contains additional sources). As a result, it is unclear whether blocking CCBot will have a significant impact.
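For publishers who want to try blocking CCBot anyway, here is a minimal sketch of what that looks like. The robots.txt rules below target the user agent name CCBot, and the short Python script uses the standard library’s robots.txt parser to check how a compliant crawler would read them. The helper name is_allowed and the example.com URL are my own placeholders for illustration. Bear in mind that robots.txt is only advisory and, as discussed above, this only covers Common Crawl, not every source a language model may be trained on.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt that asks Common Crawl's spider (CCBot) to stay away
# from the whole site. Crawlers not named here are unaffected.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url.

    is_allowed is a hypothetical helper name used only for this sketch.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URL for illustration only.
    url = "https://example.com/reviews/new-car"
    for agent in ("CCBot", "Googlebot"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, url))
```

Running a check like this before deploying the file is a cheap way to confirm that the rule blocks only the crawler you intended and leaves ordinary search engine spiders alone.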

My suspicion is that robots.txt or meta-tag directives will come in time, but probably only once lawyers have been dragged into the fight. AI image generation systems Midjourney and Stable Diffusion have already been targeted by lawsuits claiming that AI art tools violate copyright laws. AI art tools work in a similar way to ChatGPT in that they scrape other people’s content without permission to use it as training data. Whilst this first lawsuit might not have the clout to go the distance, I can see a few busy years ahead for IP lawyers.

For now, the best advice for publishers is the same advice that I have given in response to so many other questions: Connect with your audience, build loyalty, and build less reliance on traffic from big tech.
