A team of academics, researchers, and engineers in the United Arab Emirates (UAE) unveiled a new generative AI chatbot called ‘Jais’ to cater specifically to people who speak the Arabic language around the world, according to CNN.
The team argues that Arabic, the sixth most spoken language in the world with around 272 million speakers, has been “underrepresented in mainstream AI.” They are hoping to end the dominance of English in training AI systems known as large language models (LLMs).
Jais AI chatbot focuses on Middle East
The language issue in AI is a worldwide concern. Japan recently ditched English as the country builds its own version of ChatGPT. Researchers said while OpenAI’s chatbot excels in English, it often falls short in Japanese “due to differences in the alphabet system, limited data, and other factors.”
Jais is named after a mountain in the UAE, the CNN report says, and can perform tasks on command, such as writing poems, just like ChatGPT or Google’s Bard, but on a limited scale. The model has 13 billion parameters, far fewer than the roughly 175 billion attributed to ChatGPT 3.5. Parameter count is a measure of a large language model’s size, not its accuracy.
There are plans to expand Jais to 30 billion parameters and enable it to read images and graphs rather than just text, according to Timothy Baldwin, a professor of natural language processing at Abu Dhabi’s Mohamed bin Zayed University of Artificial Intelligence (MBZUAI).
The university worked with Silicon Valley’s Cerebras Systems and Inception, a subsidiary of UAE-based AI firm G42, to create Jais. Baldwin said while rival LLMs like Meta’s LLaMA and OpenAI’s GPT can understand Arabic, they are predominantly trained on online English data.
For Jais, the training involved a combination of both English and Arabic datasets, but with a deliberate focus on content from the Middle East, where Arabic is widely spoken and written.
Baldwin said such a focus allows the AI chatbot to go beyond “what anyone else has been able to achieve for Arabic.”
According to MBZUAI, Jais’ unique training helps the chatbot “understand cultural nuances and dialects,” making it more useful across a wide range of industries. Developers released the model to the public as open source, meaning anyone can customize it.
Switching between dialects
Baldwin told CNN that Jais’ diverse training data will enable it to switch between Modern Standard Arabic, which is used for official documents and formal writing, and the local dialects that are usually used on blogs or social media.
“There’s certainly room for improvement there, but the focus has been more on the robustness in terms of being able to understand if we do have more informal inputs to the model,” he said.
Like other generative AI chatbots, Jais is built to resist prompts that create “toxic or harmful” answers, Baldwin said, and will not respond to queries that “lead to self-harm or are suggestive of addiction.” Topics such as homosexuality are out of bounds, in line with Muslim beliefs.
According to Mohammed Soliman, director of strategic technologies and the cyber security program at the Middle East Institute in Washington, DC, Latin alphabet-based languages like English dominate the internet, meaning datasets are the largest in those languages.
“Making access to AI tools exclusive to those who can speak specific languages could prevent disadvantaged cross-sections of societies from reaping the benefits of AI,” he said.
“[These LLMs] lack awareness of other cultures, adversely affecting the user experience for people of diverse backgrounds,” Soliman added, as reported by CNN.
The UAE has made significant strides in developing generative AI systems. In 2017, it became the first country in the world to appoint a minister of AI. It also reportedly boasts the region’s largest generative AI model, Falcon, which was released by Abu Dhabi’s Advanced Technology Research Council and the Technology Innovation Institute (TII) in March.