AI Companies vs. The Open Web

Robb Knight recently posted about how Perplexity is lying about its user agent. I wrote a short comment on Hacker News about this, and the responses spurred some interesting debate. So I wanted to expand on the tensions between what users want, what publishers need, and how the tradeoffs affect the ecosystem of the open web.

Robb Knight’s post is the latest in a string of stories about content creators upset that their content is used by AI companies without permission or compensation. Most of these stories have focused on AI companies’ use of content to train their models.

What people often miss is that AI companies use content from the open web in two distinct ways:

  1. To train their models
  2. To answer user questions in real time, e.g. via RAG (retrieval-augmented generation)

Use case 1: Model training

When content is used to train a model like GPT-4, Claude, or Llama, it is likely part of a very large dataset, and each individual piece of content has very little influence over the final trained neural network. I consider this fair use, but that is being debated (and litigated). Regardless, some content publishers have recognized that

  1. their content is more valuable than average web content in the training data; these are the so-called “high-value tokens”. Presumably, in value or quality, an article in the Financial Times > a Reddit post > a 4chan post.
  2. they have the negotiating leverage and legal budget to get AI companies to pay them for using their content.

These are large publishers like Axel Springer or the Associated Press, as well as sites like Reddit and Stack Overflow, which technically do not own the copyright to the content (it belongs to the users who posted it) but are able to leverage their position as platforms to extract compensation from AI companies.

Where does that leave the little guy – independent publishers like HouseFresh or Diffen? Without a class action or consortium, it is difficult for them to get compensated for the value of their tokens because

  1. even if they have great content, it is not must-have. An AI company could remove an individual indie publisher from its training data and still be fine.
  2. the cost of compensating small publishers is too high – both the operational cost of doling out money to hundreds of thousands of publishers and the exercise of valuing their content.

I believe companies like Raptive (previously known as AdThrive) are trying to negotiate something here on behalf of their publishers, but I’m not holding my breath.

Use case 2: Crawl initiated to answer a specific user question

This use case is when generative AI answers a user’s question by fetching, say, the top 20 search results, combing through their content, and distilling it into a short summarized answer. This is called RAG (retrieval-augmented generation). Perplexity and Google’s AI Overviews in Search fall into this category.
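To make that concrete, here is a minimal sketch of a RAG-style answering pipeline. It illustrates the general technique, not Perplexity’s or Google’s actual implementation: get_top_result_urls is a stand-in for a search index, and the model name and prompt are placeholders I introduced for illustration.

```python
# Minimal RAG sketch: fetch pages relevant to a query, then ask an LLM to
# answer using only that fetched content. Illustrative only -- not how
# Perplexity or Google actually implement it.
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def get_top_result_urls(query: str, n: int = 20) -> list[str]:
    """Stand-in for a search API call that returns the top n result URLs."""
    # A real system would query a search index here.
    return ["https://example.com/"][:n]


def fetch_page_text(url: str) -> str:
    """Fetch a page and strip it down to its visible text."""
    html = requests.get(url, timeout=10).text
    return BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)


def answer_with_rag(query: str) -> str:
    pages = [fetch_page_text(u) for u in get_top_result_urls(query)]
    context = "\n\n".join(page[:2000] for page in pages)  # crude truncation
    prompt = (
        "Answer the question using only the sources below.\n\n"
        f"Question: {query}\n\nSources:\n{context}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

The step that matters for publishers is the page fetch: content is retrieved and summarized at question time, entirely outside any model-training pipeline, so this is a separate decision from the training question in use case 1.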

I am hypocritical about Perplexity: as a user, I love it; as a publisher, I want to block it.

When I have somewhat complex questions, Perplexity shines at doing the research for me and leading me straight to the answer. Without it, I would have to search on Google, visit three or four different web pages, and infer the answer myself. Perplexity can visit more web pages and summarize them to answer my specific question. It saves me a lot of time, and I have found the answers to be generally trustworthy. Unlike with ChatGPT, I have not seen many hallucinations with Perplexity. So to reiterate: as a user, I love Perplexity.

As a web publisher, I find it unacceptable that Perplexity hides its user agent and does not give publishers a way to opt out. If given the option, I would probably opt Diffen out of serving such requests from Perplexity or Google AI Overviews; a sketch of what opting out looks like in practice is below.
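Opting out is an honor-system mechanism: a publisher disallows the crawler’s declared user-agent token in robots.txt, or rejects matching requests at the server. The sketch below shows the server-side variant as a hypothetical Flask middleware; the user-agent tokens listed are my assumption of commonly declared AI crawlers, so verify them against each vendor’s documentation. A crawler that hides or misreports its user agent defeats both approaches, which is exactly the complaint in Robb Knight’s post.

```python
# Hedged sketch: reject requests from self-declared AI crawlers at the
# application layer. This only works for bots that honestly identify
# themselves in the User-Agent header.
from flask import Flask, abort, request

app = Flask(__name__)

# Assumed tokens for declared AI crawlers; check each vendor's current
# documentation before relying on this list.
BLOCKED_AGENTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot")


@app.before_request
def block_declared_ai_crawlers():
    user_agent = request.headers.get("User-Agent", "")
    if any(token in user_agent for token in BLOCKED_AGENTS):
        abort(403)


@app.route("/")
def index():
    return "Hello, human (or honest crawler)."
```

So why would I want to opt out in the first place?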

The original Google search was built on an implicit quid pro quo between Google and publishers: publishers would let Google crawl, index, and cache their content, and in return Google would surface this content in its search results. Only the title and a short snippet would be surfaced, and publishers controlled both. Publishers would get traffic from Google if they had good content.

Over time, Google reneged on its end of the deal. It started showing longer and longer featured snippets and answers in search so that it could keep traffic on Google properties. Google has been stealing content and traffic from publishers for years now. Exhibit A: CelebrityNetWorth content shown in featured snippets on Google reduced traffic to their site. In 2020, two-thirds of all searches on Google led to zero clicks. So the frog has been boiling for years, and this new challenge with Perplexity and AI Overviews in Google Search is just another temperature ratchet.

As a publisher, I think the value of both Perplexity and AI Overviews to my business is currently low: I do not believe most people click on the citations, so the incremental traffic they deliver will be minimal.

Giving away your content in return for this low probability of getting traffic is not a tradeoff worth making. You might argue that the alternative is getting zero traffic from these sources, and low > zero. That is true, but I do not feel right feeding the beast that is eventually going to eat me. The reason Google is so powerful is that small publishers do not have a direct relationship with users. Google has aggregated all demand (users) and commoditized all suppliers (content providers). Classic aggregation theory. So it does not feel right to cooperate with Google and Perplexity as they habituate users to trusting their AI overviews and never clicking out. This only entrenches their position as aggregators.

Whither the open web?

Some publishers monetize via ads or affiliate links. For them, losing traffic is an existential risk. But what is often missed is that other businesses also lose when Google/Perplexity steal their traffic.

Say you are a SaaS provider with a bunch of Help docs on your website. Your docs not only help existing customers figure out how to use your features; they may also be found by potential customers looking for a solution to the problem you solve. E.g., a user is wondering how to import their brokerage transactions into their tax software. You are a tax software provider that offers this functionality and you have an article explaining it. You could acquire this user if they find (via the 10-blue-links version of Google search) that your software has this feature while the one they are currently using does not.

Another obvious example is “software_name pricing plans” search queries. Google/Perplexity could give the user the answer here, but if the user does not come to your website, you can’t offer them a free trial, can’t run A/B tests on your pricing, and can’t retarget them with ads. In short, you can’t convert traffic that never lands on your site. To get around this, you will have to run ads for such keywords so they show up before, or instead of, the organic answer. That suits Google just fine.

There is also value in having your existing customers come to your site rather than just read the answer on Google. You can see which Help articles are most popular, and use that information to figure out which parts of your user interface are not intuitive. Or you could analyze searches made on your Help site to get ideas for new features to build or new Help articles to write – something like the toy sketch below.
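As a toy illustration (the log file name and its one-query-per-line format are assumptions I made for the example), counting the most common Help-site searches is trivial, but the signal only exists if users actually land on your site:

```python
# Toy sketch: surface the most common Help-site search queries from a log.
# The file name and one-query-per-line format are assumed for illustration.
from collections import Counter
from pathlib import Path


def top_help_searches(log_path: str, n: int = 20) -> list[tuple[str, int]]:
    """Return the n most frequent queries in a one-query-per-line log file."""
    lines = Path(log_path).read_text(encoding="utf-8").splitlines()
    queries = (line.strip().lower() for line in lines if line.strip())
    return Counter(queries).most_common(n)


if __name__ == "__main__":
    for query, count in top_help_searches("help_search_queries.log"):
        print(f"{count:6d}  {query}")
```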

Finally, returning to the small web publishers relying on ads or affiliate income: they are now default dead. Writing content, attracting an audience via organic search, and monetizing via ads is no longer a business model on which independent publishers can earn a decent livelihood. So what are the implications?

  1. More and more content will be AI-generated. To keep the business model viable, people will dramatically lower the cost of writing content by getting a machine to do it. When the marginal cost of creating content goes to essentially zero, it is possible that whatever little traffic you do get from aggregators like Google is still positive ROI for you. But then who creates net-new content for an LLM to summarize?
  2. A handful of large media publishers will still create content and be paid for it. Six companies control 90% of media, and 16 companies control 90% of Google search results. As independent bloggers and small publishers get squeezed out, this oligopoly is only going to get further entrenched, which is not great for the open web.
  3. People will try to spam the platforms (e.g. Reddit) that are the current winners of Google’s HCU (the “helpful” content update, a euphemism for updates that reward large brands and destroyed indie publishers even when they had great content). You will be able to trust Reddit content less and less in the coming years.

The Internet has been a great democratizing force that allowed anyone to become a publisher. For the first couple of decades of Google’s existence, great content did win, even if it was written by a blogger in a basement. But the open web is now headed toward a future where independent voices get a small fraction of the distribution they used to garner, which will discourage them from publishing, which will lower their reach even further.

We will consume our information from AI agents that rely on content from a handful of large media companies.

References