ChatGPT, the AI phenomenon that made a dazzling entrance in November 2022, has exhibited a decline in performance over recent months, a Stanford and UC Berkeley study reveals.
Once celebrated for its exceptional math and coding abilities, the AI chatbot has shown a significant drop in proficiency, prompting experts to delve into potential causes for this perplexing dip.
OpenAI’s efforts towards a safer chatbot may have triggered the downgrade
The decline could result from OpenAI's diligent efforts to make the system safer. The apparent regression, the researchers argue, might be an unintended consequence of measures intended to stop the AI bot from responding to dangerous queries. As a result, the chatbot's once crisp and direct responses have become lengthy and indirect.
We evaluated #ChatGPT's behavior over time and found substantial diffs in its responses to the *same questions* between the June version of GPT4 and GPT3.5 and the March versions. The newer versions got worse on some tasks. w/ Lingjiao Chen @matei_zaharia https://t.co/TGeN4T18Fd https://t.co/36mjnejERy pic.twitter.com/FEiqrUVbg6
— James Zou (@james_y_zou) July 19, 2023
Significantly, the study was more than merely theoretical: it systematically evaluated different ChatGPT versions against stringent benchmarks, focusing on the bot's competence in math, coding, and visual reasoning tasks.
The results were most striking on a math challenge that required identifying prime numbers. In March, the bot scored a remarkable 97.6% accuracy, but by June this had plummeted to a mere 2.4%.
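To make the benchmark concrete, here is a minimal sketch of how such an evaluation can be scored. This is an illustrative harness, not the study's actual code: the function names and the stand-in answers are assumptions for demonstration.

```python
# Illustrative sketch (not the study's harness): scoring a model's
# yes/no answers on an "is this number prime?" task against ground truth.

def is_prime(n: int) -> bool:
    """Ground-truth primality check by trial division."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def score_answers(questions, model_answers):
    """Compare hypothetical model answers ("yes"/"no") to ground
    truth and return accuracy as a percentage."""
    correct = 0
    for n, answer in zip(questions, model_answers):
        expected = "yes" if is_prime(n) else "no"
        if answer.strip().lower() == expected:
            correct += 1
    return 100.0 * correct / len(questions)

# Stand-in answers: a model that always says "yes" still scores well
# on a prime-heavy test set, a pitfall such benchmarks must control for.
questions = [7, 10, 13, 15, 17]
always_yes = ["yes"] * len(questions)
print(score_answers(questions, always_yes))  # 3 of 5 are prime -> 60.0
```

The degenerate always-"yes" baseline above illustrates why a sharp accuracy swing on this kind of task can reflect a change in answering style as much as a change in reasoning ability.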
The bot's software coding capabilities dropped similarly: the percentage of directly executable generations shrank from 52% in March to a worrying 10% in June.
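A "directly executable" generation is one that runs as-is, without a human stripping surrounding prose or fixing syntax first. The check below is a simplified sketch of that idea under my own assumptions, not the study's methodology.

```python
# Illustrative sketch (assumption, not the study's code): counting how
# many generated snippets run as-is without raising an error.

def directly_executable(snippet: str) -> bool:
    """Return True if the snippet compiles and executes cleanly."""
    try:
        exec(compile(snippet, "<generated>", "exec"), {})
        return True
    except Exception:
        return False

generations = [
    "print(sum(range(10)))",         # runs cleanly
    "def f(:\n    return 1",         # syntax error
    "Here is the code:\nprint(1)",   # leading prose breaks the parse
]
rate = 100.0 * sum(directly_executable(g) for g in generations) / len(generations)
print(f"{rate:.0f}% directly executable")  # 33% directly executable
```

As the third snippet shows, a model that starts wrapping correct code in explanatory text or formatting would fail this metric even if the code itself is fine, which is one way a style change can register as a capability drop.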
Not all areas saw a dramatic decline, however: the bot's performance on visual reasoning tasks, though diminished, did not drop drastically.
Community involvement and continuous benchmarking may be the key
The study’s findings resonated with the user community, who expressed frustration over the declining quality of ChatGPT’s responses. Experts have highlighted the need for continuous benchmarking and community involvement in open-source models like Meta’s LLaMA to identify and rectify issues early on.
AI expert Santiago Valderrama has proposed the possibility that a “cheaper and faster” mix of models might have replaced the original ChatGPT.
GPT-4 is getting worse over time, not better.
Many people have reported noticing a significant degradation in the quality of the model responses, but so far, it was all anecdotal.
But now we know.
At least one study shows how the June version of GPT-4 is objectively worse than… pic.twitter.com/whhELYY6M4
— Santiago (@svpino) July 19, 2023
Meanwhile, Dr. Jim Fan speculates that a focus on safety improvements might have come at the cost of usefulness, citing the introduction of warnings and disclaimers as factors that could have "dumbed down" the model.
Despite the concerns, the study also points the way toward improvement. More rigorous testing, user feedback, and a balance of safety with functionality could restore ChatGPT to its former glory. The AI giant's future looks challenging yet full of opportunity.
One supporter of the OpenAI approach, currently drafting an academic paper on the topic, commends the company on Twitter for its commitment to quality responses. They argue that GPT-4 responds in kind to the quality of the user prompt, thereby making it a personalized tool for all.
In conclusion, while the road to balancing safety and functionality might be a tough one, it is manageable. OpenAI, with its robust approach and community involvement, might bring back the ChatGPT we all marvelled at.