Local AI model is melting pot for African languages

By Simnikiwe Mzekandaba, IT in government editor
Johannesburg, 01 Aug 2024
Lelapa AI CEO and co-founder Pelonomi Moiloa. (Photograph by Strike A Pose Studios)

South African artificial intelligence (AI) research and product lab Lelapa AI has introduced a large language model (LLM) that initially supports five African languages.

The InkubaLM natural language processing (NLP) model has been designed to support and enhance low-resource African languages – Swahili, Yoruba, isiXhosa, Hausa and isiZulu – with approximately 364 million speakers.

According to Lelapa AI, InkubaLM (Dung Beetle Language Model) is a robust, compact model created to serve African communities, without requiring extensive resources. It is accompanied by two datasets: Inkuba-Mono and Inkuba-Instruct.

Inkuba-Mono is a monolingual dataset collected from open source repositories in five African languages, along with English and French data, to pre-train InkubaLM models.

“We collected open source datasets for five African languages from repositories on Hugging Face, Github and Zenodo. After pre-processing, researchers used 1.9 billion tokens of data to train the InkubaLM models,” says the start-up.

Lelapa AI notes Inkuba-Instruct currently provides tools for translation, transcription and natural language processing.

“The instruction dataset focused on machine translation, sentiment analysis, named entity recognition, parts of speech tagging, question answering and news topic classification. For each task, we covered five African languages: Hausa, Swahili, Zulu, Yoruba and Xhosa.”
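As a rough illustration of the coverage described in the quote, the sketch below enumerates the six tasks across the five languages and formats one instruction-style record as a training prompt. The field names (`instruction`, `input`, `output`), the prompt template and the placeholder text are assumptions made for illustration; they are not taken from the published Inkuba-Instruct schema.

```python
from itertools import product

# Tasks and languages as listed in the quote above.
TASKS = [
    "machine translation",
    "sentiment analysis",
    "named entity recognition",
    "parts of speech tagging",
    "question answering",
    "news topic classification",
]
LANGUAGES = ["Hausa", "Swahili", "Zulu", "Yoruba", "Xhosa"]

# Every (task, language) pair the instruction data is said to cover.
coverage = list(product(TASKS, LANGUAGES))


def format_record(record):
    """Turn one instruction record into a single prompt string.

    The record layout and template are hypothetical, chosen only to show
    the general shape of instruction-tuning data.
    """
    return (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Input:\n{record['input']}\n\n"
        f"### Response:\n{record['output']}"
    )


# Placeholder record: no real dataset content is reproduced here.
example = {
    "instruction": "Classify the sentiment of the following Swahili sentence.",
    "input": "<Swahili sentence>",
    "output": "<label>",
}

if __name__ == "__main__":
    print(len(coverage), "task-language combinations")
    print(format_record(example))
```

Six tasks across five languages gives 30 task-language combinations, each of which would need its own set of records in a layout along these lines.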

Lelapa AI credits Microsoft’s AI For Good Lab for providing the compute credits its researchers used to train the InkubaLM model.

“As AI practitioners, we are committed to forging an inclusive future through the power of AI. No one should have to assimilate to a culture outside of their own in order to access cutting-edge technology,” comments Pelonomi Moiloa, CEO and co-founder of Lelapa AI.

“While AI holds the promise of global prosperity, the challenge lies in the resources required for large models, which are often out of reach for the majority of the world. Open source models have attempted to bridge this gap, but much more can be done to make models cost-effective, accessible and locally relevant.”

With the release of natural language processing tools like ChatGPT, calls have grown louder to dismantle some of the barriers to developing such AI tools for African audiences.

This is spurred by the ethical challenges that have arisen as a result of AI’s rapid advancement. Notable among these are bias in algorithms, privacy infringements and, more recently, language, as it’s feared that AI’s rapid advancement could leave much of Africa’s 1.2 billion population behind.

The African continent is a multilingual melting pot, with as many as 3 000 languages spoken. For example, South Africa has 12 official languages, but only one in 10 South Africans speaks English at home – the language that dominates the internet. Throughout the rest of the continent, languages include Arabic, French Creole, Shona, Swahili and Swati, to name a few.

For Lelapa AI, the Inkuba release aims to enhance language model capabilities for African languages through two key initiatives.

“First, InkubaLM is introduced as a new model that can be further trained and developed to improve functionality in a variety of tasks for the languages in question. Second, the Inkuba datasets are available to enhance the performance of existing models.

“Given that conventional large language models perform poorly with these languages, Inkuba provides NLP practitioners with effective options to achieve robust functionality for the five targeted languages.

“InkubaLM is an autoregressive model trained to predict the next token, so it can be used for a variety of tasks, such as text generation. It can also be used as a base to perform any downstream NLP tasks using zero-shot or few-shot learning.

“The Inkuba-Mono dataset can be utilised to train language models to perform tasks that require monolingual datasets. The Inkuba-Instruct dataset can be utilised to instruct and fine-tune any language model for the five African languages of interest.”
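To make the terms in the quotes concrete, here is a minimal sketch of autoregressive next-token prediction and of a few-shot prompt. It is a toy stand-in, not InkubaLM: the "model" is a simple table of bigram counts, and the function names and prompt layout are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for each token, which tokens follow it in the training text."""
    follows = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        follows[prev][nxt] += 1
    return follows


def generate(follows, start, max_new_tokens=5):
    """Greedy autoregressive decoding.

    At each step, predict the most frequent next token given the last
    token produced so far, append it, and repeat.
    """
    out = [start]
    for _ in range(max_new_tokens):
        candidates = follows.get(out[-1])
        if not candidates:
            break  # the toy model has never seen anything follow this token
        out.append(candidates.most_common(1)[0][0])
    return out


def few_shot_prompt(examples, query):
    """Build a few-shot prompt: solved examples followed by an open query.

    A next-token model is then asked to continue the text after the final
    "Label:", which is how a base model can be steered without fine-tuning.
    """
    blocks = [f"Input: {x}\nLabel: {y}" for x, y in examples]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    follows = train_bigram("to be or not to be".split())
    print(generate(follows, "to", max_new_tokens=4))
    print(few_shot_prompt([("good", "positive")], "bad"))
```

A zero-shot prompt is the same construction with an empty list of examples; a real language model replaces the bigram table with a neural network but keeps the same predict-append-repeat loop.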

Started in 2022, Lelapa AI was founded out of the need to address how AI can be used for solutions and applications through an African lens. It develops speech recognition tools for African languages.

The start-up’s founding members include Moiloa, Jade Abbott, Vukosi Marivate, Benjamin Rosman, Pravesh Ranchod and George Konidaris − all with backgrounds in academia, research, data science and engineering.
