OpenAI Offers News Publishers $1m to Train its LLMs Using Their Content

January 6, 2024

OpenAI is reportedly offering news publishers as little as $1 million to license their content for training its large language models (LLMs).

The company is also reportedly negotiating with about a dozen other publishers as it seeks to avoid copyright infringement lawsuits.

This comes amid a series of complaints from media organizations and artists accusing AI firms of copyright infringement. They allege that AI firms use published archives of news articles to train their LLMs without the publishers’ knowledge.

Too small an amount

A Silicon Angle report notes that while the amount may seem small given the rise of OpenAI’s ChatGPT, it ultimately comes down to the terms of the agreement between the two parties.

According to The Information, the amount is too little even for small news publishers, which may hamper OpenAI’s efforts to strike deals.

Last December, OpenAI was reported to have struck a deal with Axel Springer, the German publishing firm behind media brands like Politico and Business Insider.

Although the finer details of the deal remain sketchy, it is believed to be worth in the tens of millions, according to executives cited by The Information.

More AI firms follow suit

Other AI firms are reportedly also attempting to negotiate a good deal with news publishers to use their articles to train LLMs.

Apple, for example, which is scrambling to catch up with OpenAI and Google in the generative AI field, is also trying to strike a deal with news publishers, according to an executive cited by The Information.

The company is also reportedly offering news publishers more money than OpenAI because it wants the right to use their content “more widely” than its rivals do.

Sources close to the developments have indicated that Apple wants to be able to use the content for “future AI products in a way the company deems necessary.”

The company has reportedly approached the publishers behind outlets like NBC News, Vogue, The New Yorker, The Daily Beast, and Better Homes and Gardens, floating multiyear deals worth at least $50 million.

No free meals in AI

LLMs are pre-trained on vast amounts of data, but that data, it seems, is not free. There is a price tag on everything, including the data used to train LLMs. Recently, media organizations like The New York Times, Reuters, CNN, and Vox Media blocked OpenAI and Microsoft Corp. from accessing their data.

Last December, The New York Times sued OpenAI and Microsoft, alleging that the two tech giants used its copyrighted content to train their models.

That’s not all. Reddit Inc. has also gone after companies that use its content to train their LLMs. Popular writers have teamed up as well, launching litigation against AI companies that used their books to train LLMs.

According to Silicon Angle, “training LLMs is going to be very expensive.”

Beyond data cost

The cost of LLM training goes beyond acquiring data. According to Forbes, “it requires thousands of Graphics Processing Units, or GPUs, offering the parallel processing power needed to handle the massive datasets these models learn from.”

GPUs alone cost millions of dollars. Forbes gives a technical overview of OpenAI’s GPT-3 language model and estimates that each training run needs at least $5 million worth of GPUs. Because a model generally needs more than one training run, the total cost climbs further.