Anthropic’s next-generation AI model Claude 3 Opus has taken pole position on the Chatbot Arena leaderboard, pushing OpenAI’s GPT-4 into second place.
This is the first time a Claude model has topped the Chatbot Arena list since the leaderboard launched last year, and all three Claude 3 versions now rank in the top 10.
Claude 3 models make a mark
The LMSYS Chatbot Arena rankings show Claude 3 Sonnet in joint fourth position with Gemini Pro, while Claude 3 Haiku, which was launched this year, ranked sixth alongside an earlier version of GPT-4.
Although Claude 3 Haiku may not be as intelligent as Sonnet or Opus, it is faster and significantly cheaper, and the arena results show it is “as good as the much larger models on blind tests.”
“Claude 3 Haiku has impressed all, even reaching GPT-4 level by our user preference! Its speed, capabilities and context length are unmatched now in the market,” explained LMSYS.
According to Tom’s Guide, what makes Haiku even more impressive is that it’s a “local size model comparable to Gemini Nano.” It can read and process information-dense research papers in less than three seconds.
The model is achieving these results without the trillion-plus-parameter scale of Opus or any of the GPT-4-class models.
[Arena Update]
70K+ new Arena votes🗳️ are in!
Claude-3 Haiku has impressed all, even reaching GPT-4 level by our user preference! Its speed, capabilities & context length are unmatched now in the market🔥
Congrats @AnthropicAI on the incredible Claude-3 launch!
— lmsys.org (@lmsysorg) March 26, 2024
Could this be a short-lived success?
Despite being pushed down to second place, OpenAI still dominated the top 10 on the list, with four versions of GPT-4.
According to Tom’s Guide, GPT-4 in its various forms has held the top spot “for so long that any other model coming close to its benchmarks is known as a GPT-4-class model.”
With a “markedly different” GPT-5 expected sometime this year, Anthropic might not hold that position for long, as the gap in scores between Claude 3 Opus and GPT-4 is narrow.
Although OpenAI has remained tight-lipped about when GPT-5 will actually be released, the market highly anticipates its launch. The model is reportedly undergoing “rigorous safety testing,” including simulated attacks, which are crucial steps before release.
The LMSYS Chatbot Arena
Unlike other forms of AI benchmarking, this ranking relies on human votes: people blind-rank the outputs of two different models responding to the same prompt.
The Chatbot Arena is run by LMSYS and features a host of large language models (LLMs) pitted against one another in “anonymous randomized battles.”
It first launched last May and has collected more than 400,000 votes from users on AI models from Google, Anthropic and OpenAI.
“LMSYS Chatbot Arena is a crowdsourced open platform for LLM evals. We’ve collected over 400,000 human preference votes to rank LLMs with the Elo ranking system,” said LMSYS.
The Elo system is best known from games like chess, where it evaluates the relative skill of players. In this case, the rating is applied to the chatbot, “not the human using the model.”
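To illustrate how such pairwise votes translate into ratings, here is a minimal sketch of the standard Elo update rule in Python. The function names, the starting rating of 1000 and the K-factor of 32 are illustrative assumptions borrowed from chess conventions, not LMSYS’s actual parameters.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update_elo(r_a: float, r_b: float, a_wins: float, k: float = 32) -> tuple[float, float]:
    """Update both ratings after one head-to-head vote.

    a_wins is 1.0 if model A wins the blind vote, 0.0 if model B wins,
    and 0.5 for a tie. k controls how far a single vote moves a rating.
    """
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (a_wins - e_a)
    r_b_new = r_b + k * ((1.0 - a_wins) - (1.0 - e_a))
    return r_a_new, r_b_new


# Example: two models start at 1000; model A wins one blind vote.
print(update_elo(1000, 1000, 1.0))  # -> (1016.0, 984.0)
```

Each blind vote nudges the winner’s rating up and the loser’s down, with larger swings when a lower-rated model beats a higher-rated one; over hundreds of thousands of votes, the ratings settle into a stable ranking.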
The shortcomings
The Chatbot Arena ranking is not without its faults. According to Tom’s Guide, it does not include every model, or every version of the models it does include, and users sometimes have bad experiences, such as GPT-4 failing to load. It can also favor models with live internet access, for instance Google Gemini Pro.
While models from French AI startup Mistral and Chinese firms like Alibaba, along with open-source models, have recently claimed top spots in the arena, it still misses some high-profile models, such as Google’s Gemini Pro 1.5.