
Generative AI can drive inclusive progress, if we fill critical gaps. Our Digital Donors Exchange discussed how we can make this happen.


GenAI continues to be a hot topic, both inside and outside typical technology circles. And yet the impacts of generative AI on society, both positive and negative, remain uncertain.

One point, however, is already abundantly clear: the data used to train large generative AI (GenAI) models is primarily in English and mainly collected from sources that are not representative of the world’s diversity. Without thoughtful and intentional action to diversify the languages and people represented in training data, AI may fail to benefit under-served populations or, even worse, exacerbate existing inequalities.

Happily, we are seeing a flurry of innovations to make GenAI relevant to those who could benefit from it most, both by gathering new data and training agile AI models to reflect different contexts and the different ways that people communicate. 

Recognizing this fast-moving pace of innovation, we brought together our Digital Donors Exchange (DDX) community to discuss cutting-edge ways to make GenAI work in low-resource contexts. We kicked off this discussion with two speakers working on the frontlines of building GenAI tools to tackle real-world problems: Daniel Wilson from XRI Global and Nikhil Kumar of Sesame LLM.

With a virtual room filled with donors and practitioners at the forefront of contextualizing GenAI models, the conversation centered around three key questions.

1. What are the key considerations when contextualizing GenAI?

Inclusion: With the right design, AI can bridge the gap between digital and physical assistance. This starts with ensuring that GenAI services are voice-enabled: not only is this a requirement for reaching semi-literate and illiterate communities, but it is also the preferred method globally for interacting with digital services – and a must for under-resourced communities. This ties closely to the efforts underway to ensure that GenAI services understand a wide variety of languages, dialects, and even colloquialisms.

Training: Gaps in global connectivity and access have resulted in a digital skills gap. For example, of the world’s 20 countries with the weakest digital skills, 12 are in Africa, and only 11% of Africa’s tertiary education graduates have formal digital training. This is a clear obstacle to effectively customizing GenAI models and use cases. Participants discussed the need to expand current efforts to strengthen capacity by introducing data and LLMs to both technical and non-technical groups. Effective training, in such a quickly evolving field, should include topics like problem identification, dataset structuring, involving technical and non-technical groups across the project lifecycle, curating data sources, training effective and unbiased models, and fine-tuning.

Regulation: The discussion also covered the need to customize regulatory approaches to GenAI. Participants put forth ideas including regulating GenAI models as digital public goods, similar to Linux or Python, as they can be integrated into various applications and augment human problem-solving capabilities. Following this line of reasoning, some suggested that GenAI should be regulated at the application level, rather than the model level. In other words, as LLMs are increasingly used in software development, it will be important to address security within client-facing use cases and applications. The argument is that while the models themselves will be increasingly impossible to monitor, the applications are the most effective point of governance.

2. Does every country need its own GenAI model?

The short answer? No. Or, rather, it depends. Before trying to create an LLM from scratch, countries should consider what they hope to achieve for specific groups of people before moving forward with a targeted approach. Most of these problems – providing financial advice or accurate health diagnoses, for example – don’t need the superintelligence of an LLM. Rather, they can be better addressed with smaller, fit-for-purpose models.

It’s possible that just as computing evolved from massive mainframes to ubiquitous smartphones, LLMs will diversify into numerous specialized applications, not dependent on a single supermodel. Following this analogy, every country doesn’t need its own app on the smartphone, but the global population does need a large diversity of apps that work for different people with unique needs and capabilities.  

3. What are different approaches to creating contextualized GenAI models?

The discussion covered three options for creating context-specific GenAI models: fine-tuning, pre-training, and synthetic data. Fine-tuning entails taking an existing model and continuing to train it on a smaller amount of data in the target language. This doesn’t update the entire model; rather, it focuses on the last few layers and steers the model’s output.
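The idea of updating only the last few layers can be sketched with a toy model. This is an illustration only – the class and function names here are hypothetical, and real fine-tuning would use a framework such as PyTorch or Hugging Face Transformers, where freezing works by disabling gradient updates on parameters.

```python
# Toy sketch of the fine-tuning recipe described above: leave the bulk
# of a pre-trained model untouched and update only the final layers.
# All names here are illustrative, not from a real library.

class Layer:
    def __init__(self, name):
        self.name = name
        self.trainable = True  # all layers start out trainable

def freeze_early_layers(layers, num_trainable=2):
    """Freeze every layer except the final `num_trainable` ones, so
    only the output-shaping layers would receive gradient updates."""
    for layer in layers[:-num_trainable]:
        layer.trainable = False
    return layers

# A 12-block "model", mirroring the layered structure of a transformer.
model = [Layer(f"block_{i}") for i in range(12)]
freeze_early_layers(model, num_trainable=2)

trainable = [layer.name for layer in model if layer.trainable]
print(trainable)  # only the last two blocks remain trainable
```

In a real framework the same effect is achieved by setting parameters' gradient flags to off for the frozen layers before training begins.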

However, many are finding that fine-tuning is insufficient to overcome the extensive bias within existing LLMs. Thus, they are turning to pre-training, which starts by building a new foundation for the new model, so that it understands the relevant language and context at its core. Unfortunately, this approach is costly and challenging. To pre-train an LLM, one would ideally have 500 million tokens (the fundamental unit of analysis for the algorithms behind AI – in natural language processing, tokens are words or sub-words). This rules out most of the languages on the planet based on existing available data – for example, Afrikaans, a medium-resource language, makes up just 0.1% of the internet.
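To make the 500-million-token threshold concrete, here is a back-of-the-envelope budget check. The whitespace split below is a simplification I am assuming for illustration; real LLM tokenizers (BPE or SentencePiece) break words into sub-words, so actual counts differ, and the sample sentences are invented.

```python
# Rough token-budget check against the pre-training threshold cited
# above. Whitespace splitting is a stand-in for a real sub-word
# tokenizer; the Afrikaans sample sentences are illustrative only.

def approx_token_count(corpus):
    """Approximate token count by splitting each document on whitespace."""
    return sum(len(doc.split()) for doc in corpus)

PRETRAIN_BUDGET = 500_000_000  # rough minimum tokens for pre-training

corpus = [
    "Dit is 'n klein voorbeeld van 'n Afrikaanse sin.",
    "Nog 'n sin om die telling te demonstreer.",
]
tokens = approx_token_count(corpus)
print(tokens, tokens >= PRETRAIN_BUDGET)
```

Scaled up, a language needs on the order of hundreds of millions of running words of digitized text to clear this bar – which is exactly why so few languages qualify today.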

A third option is to use synthetic data to complement existing data sets. There are experiments underway to collect all data available for medium-resourced languages, such as Afrikaans, and then to augment them with synthetic data. Synthetic data may be less effective for other applications, which depend on real ground-truth data for accurate results. Regardless, collecting raw, ground-truthed data will always be necessary to evaluate synthetic data.
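The augmentation-plus-evaluation split described above can be sketched as follows. The template-generated sentences and the tiny "real" corpus are invented for illustration; real experiments would generate synthetic text with a larger model rather than templates, but the key discipline is the same: synthetic data goes into training, while evaluation stays grounded in real data.

```python
import random

# Sketch: augment a small real dataset with template-generated
# synthetic sentences, while reserving real sentences for evaluation.
# The corpus and templates are illustrative, not from a real dataset.

real_sentences = [
    "The clinic opens at eight.",
    "The market closes on Sunday.",
    "The school needs more books.",
    "The farmer sells fresh maize.",
]

def make_synthetic(n, seed=0):
    """Generate n synthetic sentences from simple templates (seeded
    for reproducibility)."""
    rng = random.Random(seed)
    subjects = ["The nurse", "The teacher", "The driver"]
    verbs = ["visits", "helps", "calls"]
    objects = ["the village", "the family", "the office"]
    return [
        f"{rng.choice(subjects)} {rng.choice(verbs)} {rng.choice(objects)}."
        for _ in range(n)
    ]

# Evaluation stays purely real; training mixes real and synthetic.
eval_set = real_sentences[:2]
train_set = real_sentences[2:] + make_synthetic(10)

print(len(train_set), len(eval_set))
```

Keeping the evaluation set free of synthetic text is what makes it possible to tell whether the augmentation actually helped.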

In the future, foundational models may be able to create smaller models for specific purposes, pulling only the relevant data to address the bias inherent to large and fine-tuned models.

Efforts to evolve GenAI are helping more people shape – and benefit from – the cutting-edge technology.

GenAI holds real promise. But without concerted efforts to address the risks while capitalizing on the opportunities, it could simply reinforce or exacerbate existing inequalities.

Looking ahead, participants expressed optimism about localization efforts, suggesting that the language barrier could potentially disappear in the next 10 to 15 years. The global community can make this possibility a reality by supporting efforts to test, refine, and scale lower-cost approaches to localizing the models underlying GenAI.

Learn more about the Digital Donors Exchange, including the other topics covered this year.